In a post titled Data Mining – A Cautionary Tale, I argued that data mining can be dangerous, using the story of Cornell’s Brian Wansink, who has had multiple papers retracted because of data mining methods that aren’t quite ethical (or even correct).
Recently, Gary Smith over at Wired wrote an article called The Exaggerated Promise of so-called Unbiased Data Mining with another good example of the danger of data mining.
In the article, Gary writes of a time that noted physicist and Nobel Laureate Richard Feynman gave his class an exercise: determine the probability of seeing a specific license plate in the parking lot on the way into class (he gave them an example plate to work with). The students worked on the problem and determined that the probability of Feynman seeing that particular plate was less than 1 in 17 million.
According to Smith, what Feynman didn’t tell the students was that he had seen the specific license plate that morning in the parking lot before coming to class, so the probability was actually 1. Smith calls this the ‘Feynman Trap.’
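To make the arithmetic concrete, here is a minimal sketch of the two probabilities in the story. The plate format (three letters followed by three digits) is my assumption, not something stated in the story as retold here; it simply happens to give a number just under 1 in 17 million. The sketch also assumes every plate is equally likely and that exactly one plate is observed. The function name `count_plates` is made up for illustration.

```python
from fractions import Fraction

# Assumption: a plate is three letters followed by three digits.
LETTERS = 26
DIGITS = 10


def count_plates(n_letters=3, n_digits=3):
    """Number of distinct plates under the assumed format."""
    return LETTERS ** n_letters * DIGITS ** n_digits


total = count_plates()

# Probability of naming one plate in advance and then seeing exactly that one,
# assuming all plates are equally likely.
before_looking = Fraction(1, total)

# Probability of "seeing" a plate you have already seen: it is certain.
after_looking = Fraction(1, 1)

print(total)           # 17576000
print(before_looking)  # 1/17576000
print(after_looking)   # 1
```

The calculation itself is the same in both cases. What changes is when the plate was chosen: before looking, or after.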
Whether or not this story is true (I don’t recall ever reading it from Feynman directly, although he does have a quote about license plates), it’s a very good description of one of the dangers of data mining: knowing what the answer will be before starting the work. In other words, bias.
Bias is everywhere in data science. Some say there are 8 types of bias (I’m not sure I completely agree with 8 as the number, but it’s as good a place to start as anywhere else). The key is knowing that bias exists, how it exists, and how to manage it. You have to manage your own bias as well as any bias that might be inherent in the data you are analyzing. Bias is hard to overcome, but knowing it exists makes it easier to manage.
The Data Mining Trap
The ‘Feynman Trap’ (i.e., bias) is a really good thing to keep in mind whenever you do any data analysis. Thinking back to the story about Dr. Wansink shared in Data Mining – A Cautionary Tale, he was absolutely biased in just about everything he did in the research that was retracted. He had an answer that he wanted to find and then found the data to support that answer.
There’s the trap. Rather than going into data analysis with questions and looking for data to help you find answers, you go into it with an answer and try to find patterns that support it.
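A small simulation shows how easy it is to find a pattern when you go looking for one. This is only a sketch under my own assumptions: the data is pure random noise, the sizes (30 rows, 1,000 candidate features) are arbitrary, and the function names are made up for illustration. Nothing here comes from Smith's article or the Wansink papers.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mine_noise(n_rows=30, n_features=1000, seed=0):
    """Correlate many random features with a random target.

    Returns (|r| of the first feature, largest |r| over all features).
    Every column is independent noise, so any strong correlation is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    strengths = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        strengths.append(abs(pearson(feature, target)))
    return strengths[0], max(strengths)


first, best = mine_noise()
print(f"one feature chosen in advance: |r| = {first:.2f}")
print(f"best feature found by mining:  |r| = {best:.2f}")
```

The feature chosen before looking at the data is an honest test. The "best" feature was picked because it matched, which is the same move as computing the odds of a license plate you have already seen.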
Don’t fall into the data mining trap. Keep an open mind, manage your bias and look for the answers. There’s nothing wrong with finding other questions (and answers) while data mining, but keep that bias in check and you’ll be on the right path to avoiding the trap.