Eric D. Brown, D.Sc.

Data Science | Entrepreneurship | ..and sometimes Photography


Foto Friday – Rocky Mountain National Park

Late last year I had the opportunity to spend a week in Rocky Mountain National Park (RMNP). Strangely, I’d never actually been to RMNP, although I’ve been just about everywhere else around it.

The trip was tacked onto a conference in nearby Denver, so I didn’t get as much time in the park and surrounding areas as I’d wanted, but I did spend every morning in the park for sunrise – and loved every second of it. I had a few locations pre-planned from my trip research and got a couple of really good sunrise shots, but I didn’t get as many opportunities to photograph Elk as I wanted. That said, I did get surprising access to multiple Moose during the trip, as well as a few Pika.

Before we get into the trip photos, let me share the gear I used on the trip. If you want to know more about the gear, let me know and I can share my thoughts.

Now, onto the photos. If you would like to purchase a copy (or copies) of any of these photos, check out my portfolio site.

Sunrise and Fall Colors

Sunrise over Sprague Lake in Rocky Mountain National Park.

Red Sunrise

Sometimes, a quick ‘snap’ of the camera turns into something special. While I was walking around the lake after sunrise, I grabbed this quick snap, which turned out much better than expected.

Black & White Lake

Sprague Lake in Rocky Mountain National Park with a black and white treatment

Rocky Mountain Pika

While in Rocky Mountain National Park, I knew I wanted to find some Pikas. I was lucky and found a perfect habitat for them without much hiking. This is the outcome of my first visit.

The colors of sunrise

While wandering around Rocky Mountain National Park (RMNP), I found this spot and thought it’d be a good place for a sunrise photo. There weren’t many clouds that morning, but some fog rolled in while the sun was rising. The fog, plus the few clouds that did catch some color, adds some interest to this photograph.

Moose in the Morning

While at Rocky Mountain National Park, I had the chance to photograph a few moose. While walking down the road toward where a lot of folks said some moose had been spotted, I noticed this Bull Moose standing in the trees perfectly lit by the sunlight.

Moon over the Rockies

I went out to Sprague Lake in RMNP to capture sunrise, hoping that the clouds would stick around. While setting up, I took a couple of shots while the moon was out…and it turned out the moon shots were much better than the sunrise photos (the clouds disappeared before the sun came up).

See more of my photography here.

The Data Mining Trap

In a post titled Data Mining – A Cautionary Tale, I make the case that data mining can be dangerous, using the story of Cornell’s Brian Wansink, who has had multiple papers retracted due to data mining methods that aren’t quite ethical (or even correct).

Recently, Gary Smith over at Wired wrote an article called The Exaggerated Promise of So-Called Unbiased Data Mining with another good example of the danger of data mining.

In the article, Gary writes of a time that noted physicist and Nobel Laureate Richard Feynman gave his class an exercise: determine the probability of seeing a specific license plate in the parking lot on the way into class (he gave them a specific example of a plate). The students worked on the problem and determined that the probability was less than 1 in 17 million that Feynman would see that specific license plate.

According to Smith, what Feynman didn’t tell the students was that he had seen the specific license plate that morning in the parking lot before coming to class, so the probability was actually 1. Smith calls this the ‘Feynman Trap.’
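To make the arithmetic concrete, here’s a tiny sketch of the two probabilities involved. The plate format (three letters followed by three digits) is my assumption for illustration, not part of the original story, but it lands right around the 1-in-17-million figure:

```python
# A rough sketch of the 'Feynman Trap' arithmetic.
# Assumption: plates are 3 letters + 3 digits (illustrative only).
letters, digits = 26, 10
possible_plates = letters ** 3 * digits ** 3   # 17,576,000 possible plates

# Probability of predicting one specific plate *before* looking at the lot
prior = 1 / possible_plates
print(f"Possible plates: {possible_plates:,}")
print(f"P(specific plate, predicted in advance) = 1 in {possible_plates:,} ({prior:.2e})")

# Feynman had already seen the plate that morning, so the 'prediction'
# was guaranteed to come true -- the hypothesis was chosen after the data.
print("P(specific plate | already observed it) = 1.0")
```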

Whether this story is true – I don’t recall ever reading it from Feynman directly (although he does have a quote about license plates) – it’s a very good description of one of the dangers of data mining: knowing what the answer will be before starting the work. In other words, bias.

Bias is everywhere in data science. Some say there are 8 types of bias (I’m not sure I completely agree with 8 as the number, but it’s as good a place to start as any). The key is knowing that bias exists, how it shows up and how to manage it. You have to manage your own bias as well as any bias that might be inherent in the data you are analyzing. Bias is hard to overcome, but knowing it exists makes it easier to manage.

The Data Mining Trap

The ‘Feynman Trap’ (i.e., bias) is a really good thing to keep in mind whenever you do any data analysis. Thinking back to the story shared in Data Mining – A Cautionary Tale, Dr. Wansink was absolutely biased in just about everything he did in the research that was retracted. He had an answer he wanted to find and then found the data to support that answer.

There’s the trap. Rather than going into data analysis with questions and looking for data to help you find answers, you go into it with answers and try to find patterns to support your answer.

Don’t fall into the data mining trap. Keep an open mind, manage your bias and look for the answers. Also, there’s nothing wrong with finding other questions (and answers) while data mining but keep that bias in check and you’ll be on the right path to avoiding the data mining trap.

Photo by James & Carol Lee on Unsplash

Foto Friday – On the way to Zion National Park

I call this one “On the way to Zion”.

Captured with Canon 5D and Canon 17-40 handheld on the way into Zion National Park.

See more photos at my dedicated Photography website. If you like my photography, feel free to support my addiction habit by purchasing a copy for your wall and/or visiting Amazon (affiliate link) to purchase new or used photographic gear.

 

Purchase a copy for your wall.

An image of Zion national park

While in Zion a few years ago, I stopped by the side of the road to grab a quick snapshot. I didn’t do anything with this at the time but now looking back at it, I really like it. Lots of things to catch your eye.

This one skill will make you a data science rockstar

Want to be a data science rockstar? Of course you do! Sorry for the clickbait headline, but I wanted to reach as many people as I can with this important piece of information.

Want to know what the ‘one skill’ is?

It isn’t Python or R or Spark or some other new technology or platform. It isn’t the latest machine learning methods or algorithms. It isn’t being able to write AI algorithms from scratch or analyze terabytes of data in minutes.

While those are important – very important – they aren’t THE skill. In fact, it isn’t a technical skill at all.

The one skill that will make you a data science rockstar is a so-called ‘soft skill’. The ability to communicate is what will set you apart from your peers and make you stand out in an increasingly crowded field of data scientists.

Why do I need to communicate to be a data science rockstar?

You can be the smartest person in the world when it comes to creating some wild machine learning systems to build recommendation engines, but if you can’t communicate the ‘strategy’ behind the system, you’re going to have a hard time.

If you’re able to find some phenomenal patterns in data that have the potential to deliver a multiple-X increase in revenue but can’t communicate the ‘strategy’ behind your approach, your potential is going to go unrealized.

What do I mean by ‘strategy’? In addition to the standard information (error rates/metrics, etc.), you need to be able to hit the key ‘W’ points (what, why, when, where and who) when you communicate your output/results. You need to be able to clearly define what you did, why you did it, when your approach works (and doesn’t work), where your data came from and who will be affected by what you’ve done. If you can’t answer these questions succinctly, and in a manner that a layperson can understand, you’re failing as a data scientist.

Two real world examples – one rockstar, one not-rockstar

I have two recent examples to help highlight the difference between a data science rockstar (i.e., someone who communicates well) and someone who isn’t quite there. I’ll give you the background on both and let you make up your own mind about which person you’d hire as your next data scientist. Both of these people work at the same organization.

Person 1:

She’s been a data scientist for 4 years. She’s got a wide swath of experience in data exploration, feature engineering, machine learning and data management. She’s had multiple projects over her career that required a deep dive into large datasets, and she’s had to use different systems, platforms and languages during her analysis. For each project she works on, she keeps a running notebook with commentary, ideas, changes and reasons for doing what she’s doing – she’s a scientist, after all. When she provides updates to team members and management, she provides multiple layers of detail that can be read or skipped depending on the reader’s level of interest. She provides a thorough write-up of all her work, with detailed notes about why things are being done the way they are done and how potential changes might affect the outcome of her work. For project ‘wrap-up’ documentation, she delivers an executive summary with visualizations that succinctly describes the project, the work she did, why she did what she did and what she thinks could be done to improve it. In addition to the executive summary, she provides a thorough write-up that describes the entire process, with multiple appendices and explanatory notes for those who want to dive deeply into the project. When people are putting together project teams, her name is the first to come up.

Person 2:

He’s been a data scientist for 4 years (about 1 month longer than Person 1). His background is very technical, and he is the ‘go-to’ person for algorithms and programming languages within the team. He’s well thought of and can do just about anything that is thrown over the wall at him. He’s quite successful and is sought after for advice by people all over the company. When he works on projects he sort of ‘wings it’ (his words) and keeps few notes about what he’s done and why he’s chosen the things he has chosen. For example, if you ask him why he chose Random Forests instead of Support Vector Machines on a project, he’ll tell you ‘because it worked better,’ but he can’t explain what ‘better’ means. Now, there aren’t many people who would argue against his choices on projects, and his work is rarely questioned. He’s good at what he does, and nobody at the company questions his technical skills, but they always question ‘what is he doing?’ and ‘what did he do?’ during/after projects. For documentation and presentation of results, he puts together the basic report that is expected, with the appropriate information, but people always have questions and are always ‘bothering him’ (again…his words). When new projects are being considered, he’s usually last in line for inclusion because there’s ‘just something about working with him’ (actual words from his co-workers).

Who would you choose?

I’m assuming you know which of the two is the data science rockstar. While Person 2 is more advanced technically, his communication skills lag behind Person 1’s. Person 1 is the one everyone goes to for delivering the ‘best’ outcomes from data science in their company. Communication is the difference. Person 1 is not only able to do the technical work but also to share the outcomes in a way that the organization can easily understand.

If you want to be a data science rockstar, you need to learn to communicate. It’s the ‘one skill’ that can move you into the realm of ‘top data scientists’ and away from the average data scientists who focus all of their personal development efforts on learning another algorithm or another language.

By the way, I’ve written about this before here and here so jump over and read a few more thoughts on the topic if you have time.

Photo by Ben Sweet on Unsplash

Data Mining – A Cautionary Tale

For those of you that might be new to data, keep this small (but extremely important) thing in mind – beware data mining.

What is data mining? Data mining is the process of discovering information and patterns in data. It’s the first step in the Data -> Information -> Knowledge -> Wisdom conversion process. Data mining is extremely important – but it can cause you a lot of problems if you aren’t aware of the issues that can arise from it.

First, data mining can give you the answer you’re looking for…regardless of whether that answer is even correct. Many people treat data mining as an iterative ‘loop’ that lets you keep mining until you find data that supports the hypothesis you’re trying to prove (or disprove). A great example of this is the ‘food science star’ Brian Wansink at Cornell. Dr. Wansink spent years in the spotlight as head of Cornell’s Food & Brand Lab, as well as heading up the US Dietary Guidelines committee that influenced public policy around food and diets in the United States.

Over the last few years, Wansink’s ‘star’ has been fading as other researchers began investigating his work after he posted an article about a graduate researcher who ‘never said no.’ As part of that post (and the subsequent investigation), emails were released that had some interesting commentary around ‘data mining’ that I thought was worth sharing. From Here’s How Cornell Scientist Brian Wansink Turned Shoddy Data Into Viral Studies About How We Eat:

When Siğirci started working with him, she was assigned to analyze a dataset from an experiment that had been carried out at an Italian restaurant. Some customers paid $8 for the buffet, others half price. Afterward, they all filled out a questionnaire about who they were and how they felt about what they’d eaten.

Somewhere in those survey results, the professor was convinced, there had to be a meaningful relationship between the discount and the diners. But he wasn’t satisfied by Siğirci’s initial review of the data.

“I don’t think I’ve ever done an interesting study where the data ‘came out’ the first time I looked at it,” he told her over email.

Emphasis mine.

Since the investigation began, Wansink has had 15 articles retracted from peer-reviewed journals and many more are being reviewed.   Wansink and colleagues were continuously looking through data trying to find a way to ‘sort’ the data to match what they wanted the data to say.

That’s the danger of data mining. You keep working your data until you find an answer you like and ignore the answers you don’t like.

Don’t get me wrong – data mining is absolutely a good thing when done right. You should go into your data with a hypothesis in mind, look for patterns and then either accept or reject your hypothesis based on the analysis. There’s nothing wrong with then starting over with a new hypothesis, or with finding patterns that help you develop a new hypothesis, but your data and your analysis have to lead you down the road to a valid outcome.

What Wansink is accused of doing is something called ‘p-hacking,’ where a researcher keeps looking for a p-value of 0.05 or less (roughly, a 95% confidence level), which allows them to reject the null hypothesis. P-hacking is the art of continuing to sort and manipulate your data until you find the data points that give you a p-value of 0.05 or less. For example, let’s assume you have a dataset of 500 rows and 4 columns. You run some analysis – for this example, a basic regression – and get a p-value of 0.2. That’s not great, as it suggests weak evidence for rejecting the null, but it does give you insight into the dataset. An ethical researcher / data scientist will take what they learned from this analysis and take another look at their data. An unethical researcher / data scientist will massage the data to make the p-value look better – perhaps by making an arbitrary decision to drop any rows with readings over a certain value and re-running the analysis…and bam…you have a p-value of 0.05. That’s p-hacking and poor data mining.

This is where it gets tricky. There could be a very valid reason for removing those rows. Perhaps it was ‘bad data,’ or maybe it wasn’t relevant (e.g., the remaining rows have readings less than 1 and the rows you removed have readings of 10 million), but you need to be able to defend the manipulation of the data, and unethical researchers generally can’t do that.
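To make the row-dropping pattern concrete, here’s a minimal sketch using made-up data (two columns that are unrelated by construction) and a deliberately unethical loop that quietly deletes whichever row most ‘improves’ the p-value until the result looks significant. This is what not to do:

```python
# A sketch of the row-dropping flavor of p-hacking -- illustration only.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)   # hypothetical predictor
y = rng.normal(size=n)   # outcome, independent of x by construction

print(f"Honest p-value on all {n} rows: {linregress(x, y).pvalue:.3f}")

# The unethical loop: delete whichever single row most improves the p-value,
# and keep going until the result crosses the 'significance' line.
xs, ys = x.copy(), y.copy()
while linregress(xs, ys).pvalue >= 0.05 and len(xs) > 300:
    pvals = [linregress(np.delete(xs, i), np.delete(ys, i)).pvalue
             for i in range(len(xs))]
    drop = int(np.argmin(pvals))
    xs, ys = np.delete(xs, drop), np.delete(ys, drop)

print(f"After quietly dropping {n - len(xs)} rows: p-value = "
      f"{linregress(xs, ys).pvalue:.3f}")
```

On noise like this, a surprisingly small number of deletions is usually enough to manufacture ‘significance,’ which is exactly why undocumented row removal should raise eyebrows.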

Another ‘gotcha’ can be found in the Wansink story here related to p-hacking and over-analysis.

But for years, Wansink’s inbox has been filled with chatter that, according to independent statisticians, is blatant p-hacking.

“Pattern doesn’t look good,” Payne of New Mexico State wrote to Wansink and David Just, another Cornell professor, in April 2009, after what Payne called a “marathon” data-crunching session for an experiment about eating and TV-watching.

“I also ran — i am not kidding — 400 strategic mediation analyses to no avail…” Payne wrote. In other words, testing 400 variables to find one that might explain the relationship between the experiment and the outcomes. “The last thing to try — but I shutter to think of it — is trying to mess around with the mood variables. Ideas…suggestions?”

Two days later, Payne was back with promising news: By focusing on the relationship between two variables in particular, he wrote, “we get exactly what we need.” (The study does not appear to have been published.)

Don’t do that. That’s bad data mining and bad data science.  If you have to run an analysis 400 times to find a couple of variables that give you a good p-value, you are doing things wrong.
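Here’s a quick sketch of why those 400 analyses were almost guaranteed to ‘find’ something. The data below is pure noise (made up for illustration), yet with a 0.05 threshold roughly 5% of 400 unrelated variables will look significant by chance alone:

```python
# Why 400 analyses on the same data will 'find' something -- illustration only.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
n_rows, n_candidates = 100, 400

outcome = rng.normal(size=n_rows)                     # the thing being 'explained'
candidates = rng.normal(size=(n_candidates, n_rows))  # 400 unrelated variables

pvals = np.array([linregress(c, outcome).pvalue for c in candidates])
print(f"'Significant' variables found by chance: {(pvals < 0.05).sum()} of {n_candidates}")
print(f"Best (most misleading) p-value: {pvals.min():.4f}")
```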

Data mining is absolutely a valid approach to data. Everyone does it, but not everyone does it right. Be careful of massaging the data to fit your needs and get the answer you want. Let your data tell you how it wants to be handled and what answers it’s going to give.

Foto Friday – Red Sunrise, Sprague Lake

Sunrise over Sprague Lake in Rocky Mountain National Park, Estes Park Colorado.

Made with Sony A7rIII and Sony 16-35 2.8 GM Lens. Click the photo to be taken to a larger version on 500px.

See more photos at my dedicated Photography website. If you like my photography, feel free to support my addiction habit by purchasing a copy for your wall and/or visiting Amazon (affiliate link) to purchase new or used photographic gear.

Purchase a copy for your wall.

a photo of a sunrise over Sprague Lake, Colorado

 


If you'd like to receive updates when new posts are published, sign up for my mailing list. I won't sell or share your email.