Data Mining - A Cautionary Tale

For those of you who might be new to data, keep this small (but extremely important) thing in mind – beware data mining.

What is data mining?  Data mining is the process of discovering information and patterns in data.  It is the first step in the Data -> Information -> Knowledge -> Wisdom conversion process.  Data mining is extremely important – but it can cause you a lot of problems if you aren't aware of the issues that can arise from it.

First, data mining can give you the answer you're looking for…regardless of whether that answer is even correct.  Many people treat data mining as an iterative 'loop' that lets you keep mining until you find the data that supports the hypothesis you're trying to prove (or disprove).  A great example of this is the 'food science star' Brian Wansink at Cornell. Dr. Wansink spent years in the spotlight as head of Cornell's Food & Brand Lab as well as heading up the US Dietary Guidelines committee, which influenced public policy around food and diet in the United States.

Over the last few years, Wansink's 'star' has been fading as other researchers began investigating his work after he posted an article about a graduate researcher who 'never said no.' As part of that post (and the subsequent investigation), emails were released that contained some interesting commentary around 'data mining' that I thought was worth sharing. From Here's How Cornell Scientist Brian Wansink Turned Shoddy Data Into Viral Studies About How We Eat:

When Siğirci started working with him, she was assigned to analyze a dataset from an experiment that had been carried out at an Italian restaurant. Some customers paid $8 for the buffet, others half price. Afterward, they all filled out a questionnaire about who they were and how they felt about what they’d eaten.

Somewhere in those survey results, the professor was convinced, there had to be a meaningful relationship between the discount and the diners. But he wasn’t satisfied by Siğirci’s initial review of the data.

“I don’t think I’ve ever done an interesting study where the data ‘came out’ the first time I looked at it,” he told her over email.

Emphasis mine.

Since the investigation began, Wansink has had 15 articles retracted from peer-reviewed journals, and many more are being reviewed.   Wansink and his colleagues were continuously looking through data, trying to find a way to 'sort' the data to match what they wanted it to say.

That’s the danger of data mining. You keep working your data until you find an answer you like and ignore the answers you don’t like.

Don’t get me wrong – data mining is absolutely a good thing when done right.  You should go into your data with a hypothesis in mind, look for patterns, and then either accept or reject your hypothesis based on the analysis.  There’s nothing wrong with then starting over with a new hypothesis, or finding patterns that help you develop a new one, but your data and your analysis have to lead you down the road to a valid outcome.

What Wansink is accused of doing is something called ‘p-hacking’, where a researcher hunts for a ‘p-value’ of 0.05 or less (corresponding to a 95% confidence level), which allows you to reject the null hypothesis.  P-hacking is the art of continuing to sort / manipulate your data until you find the data points that give you a p-value of 0.05 or less.  For example, let’s assume that you have a dataset of 500 rows with 4 columns.  You run some analysis –  for this example we’ll say a basic regression analysis – and you get a p-value of 0.2. That’s not great, as it suggests weak evidence against the null, but it does give you insight into the dataset.   An ethical researcher / data scientist will take what they learned from this analysis and take another look at their data.  An unethical researcher / data scientist will massage the data to get their p-value to look better. Perhaps they make an arbitrary decision to drop any rows with readings over a certain value and re-run the analysis…and bam…a p-value of 0.05. That’s p-hacking and poor data mining.
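To make that example concrete, here's a small, hypothetical Python sketch (the variable names, the cutoff search, and the pure-Python permutation test are my own illustration, not anyone's actual analysis). The predictor and the outcome are pure noise, so there is nothing real to find – yet searching over arbitrary "drop rows above this value" cutoffs and keeping the best result can only ever push the reported p-value down:

```python
import random

random.seed(0)

# Hypothetical dataset: one noisy predictor, one outcome, and by
# construction no real relationship between them.
n = 200
x = [random.gauss(50, 15) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]  # pure noise

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def permutation_p(xs, ys, shuffles=300):
    """Two-sided p-value for the correlation via label shuffling."""
    observed = abs(correlation(xs, ys))
    ys = list(ys)  # work on a copy so the caller's data is untouched
    hits = 0
    for _ in range(shuffles):
        random.shuffle(ys)
        if abs(correlation(xs, ys)) >= observed:
            hits += 1
    return hits / shuffles

honest_p = permutation_p(x, y)

# The "p-hack": try every arbitrary cutoff for dropping high-x rows
# and quietly keep whichever subset yields the smallest p-value.
best_p = honest_p
for cutoff in range(30, 80, 5):
    kept = [(a, b) for a, b in zip(x, y) if a < cutoff]
    if len(kept) < 30:
        continue  # too few rows left to bother testing
    xs, ys = zip(*kept)
    best_p = min(best_p, permutation_p(xs, ys))

# best_p can never exceed honest_p – the search only ever "helps"
print(f"honest p = {honest_p:.3f}, hacked p = {best_p:.3f}")
```

The point of the sketch is that the cutoff was chosen *because* it improved the p-value, not for any defensible reason – which is exactly the behavior described above.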

This is where it gets tricky. There could be a very valid reason for removing those rows of data. Perhaps it was ‘bad data’ or maybe it wasn’t relevant (e.g., the remaining rows have a reading less than 1 and the rows you removed have readings of 10 million), but you need to be able to defend the manipulation of the data, and unethical researchers will generally not be able to do that.

Another ‘gotcha’ in the Wansink story relates to p-hacking and over-analysis.

But for years, Wansink’s inbox has been filled with chatter that, according to independent statisticians, is blatant p-hacking.

“Pattern doesn’t look good,” Payne of New Mexico State wrote to Wansink and David Just, another Cornell professor, in April 2009, after what Payne called a “marathon” data-crunching session for an experiment about eating and TV-watching.

“I also ran — i am not kidding — 400 strategic mediation analyses to no avail…” Payne wrote. In other words, testing 400 variables to find one that might explain the relationship between the experiment and the outcomes. “The last thing to try — but I shutter to think of it — is trying to mess around with the mood variables. Ideas…suggestions?”

Two days later, Payne was back with promising news: By focusing on the relationship between two variables in particular, he wrote, “we get exactly what we need.” (The study does not appear to have been published.)

Don’t do that. That’s bad data mining and bad data science.  If you have to run an analysis 400 times to find a couple of variables that give you a good p-value, you are doing things wrong.
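The 400-analyses problem is easy to simulate. In this hypothetical sketch (my own illustration, not the actual Payne/Wansink analysis), every one of 400 tests is run on pure noise, so every null hypothesis is true by construction – yet at a 0.05 threshold you should still expect roughly 400 × 0.05 = 20 "significant" results:

```python
import random

random.seed(42)

def permutation_p_value(group_a, group_b, n_shuffles=200):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    hits = 0
    for _ in range(n_shuffles):
        random.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_shuffles

# 400 "strategic" analyses on pure noise: both groups are drawn from
# the same distribution, so every apparent effect is a false positive.
false_positives = 0
for _ in range(400):
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    if permutation_p_value(a, b) < 0.05:
        false_positives += 1

# roughly 400 * 0.05 = 20 spurious "findings", from data with no signal
print(false_positives)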

Data mining is absolutely a valid approach to data. Everyone does it, but not everyone does it right.  Be careful of massaging the data to fit your needs and get the answer you want. Let your data tell you how it wants to be handled and what answers it’s going to give.