The Data Mining Trap


In a post titled Data Mining – A Cautionary Tale, I shared the story of Cornell’s Brian Wansink – who has had multiple papers retracted due to data mining methods that weren’t quite ethical (or even correct) – as an example of how dangerous data mining can be.

Recently, Gary Smith over at Wired wrote an article called The Exaggerated Promise of So-Called Unbiased Data Mining with another good example of the danger of data mining.

In the article, Gary writes of a time that noted physicist and Nobel laureate Richard Feynman gave his class an exercise: determine the probability of seeing a specific license plate (he gave them a particular plate) in the parking lot on the way into class. The students worked the problem and determined that the probability of Feynman seeing that specific license plate was less than 1 in 17 million.
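Smith’s article doesn’t spell out the arithmetic, but the figure is easy to reconstruct if we assume a plate format of three letters followed by three digits (my assumption, purely for illustration):

```python
# Back-of-the-envelope plate count (assumes 3 letters + 3 digits per plate).
letter_combos = 26 ** 3          # 17,576 three-letter combinations
digit_combos = 10 ** 3           # 1,000 three-digit combinations
total_plates = letter_combos * digit_combos

print(f"{total_plates:,} possible plates")             # 17,576,000
print(f"P(one specific plate) = 1 in {total_plates:,}")
```

One specific plate out of roughly 17.6 million possibilities: less than 1 in 17 million, just as the students calculated.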

According to Smith, what Feynman didn’t tell the students was that he had seen the specific license plate that morning in the parking lot before coming to class, so the probability was actually 1. Smith calls this the ‘Feynman Trap.’

Whether or not this story is true – I don’t recall ever reading it from Feynman directly (although he does have a well-known quote about license plates) – it’s a very good description of one of the dangers of data mining: knowing what the answer will be before starting the work. In other words, bias.

Bias is everywhere in data science. Some say there are 8 types of bias (I’m not sure I completely agree with 8 as the number, but it’s as good a place to start as any). The key is knowing that bias exists, how it manifests, and how to manage it. You have to manage your own bias as well as any bias that might be inherent in the data you are analyzing. Bias is hard to overcome, but knowing it exists makes it easier to manage.

The Data Mining Trap

The ‘Feynman Trap’ (i.e., bias) is a really good thing to keep in mind whenever you do any data analysis. Thinking back to the story shared in Data Mining – A Cautionary Tale, Dr. Wansink was absolutely biased in just about everything he did in the research that was retracted. He had an answer that he wanted to find and then found the data to support that answer.

There’s the trap. Rather than going into data analysis with questions and looking for data to help you find answers, you go into it with answers and try to find patterns to support your answer.

Don’t fall into the data mining trap. Keep an open mind, manage your bias, and look for the answers. There’s nothing wrong with finding other questions (and answers) while data mining, but keep that bias in check and you’ll be on the right path to avoiding the data mining trap.

Photo by James & Carol Lee on Unsplash

This one skill will make you a data science rockstar


Want to be a data science rockstar? Of course you do! Sorry for the clickbait headline, but I wanted to reach as many people as I can with this important piece of information.

Want to know what the ‘one skill’ is?

It isn’t Python or R or Spark or some other new technology or platform. It isn’t the latest machine learning methods or algorithms. It isn’t being able to write AI algorithms from scratch or analyze terabytes of data in minutes.

While those are important – very important – they aren’t THE skill. In fact, it isn’t a technical skill at all.

The one skill that will make you a data science rockstar is a so-called ‘soft skill’. The ability to communicate is what will set you apart from your peers and make you stand out in an increasingly crowded field of data scientists.

Why do I need to communicate to be a data science rockstar?

You can be the smartest person in the world when it comes to creating some wild machine learning systems to build recommendation engines, but if you can’t communicate the ‘strategy’ behind the system, you’re going to have a hard time.

If you’re able to find some phenomenal patterns in data that have the potential to deliver a multiple-X increase in revenue but can’t communicate the ‘strategy’ behind your approach, your potential is going to go unrealized.

What do I mean by ‘strategy’? In addition to the standard information (error rates/metrics, etc.), you need to be able to hit the key ‘W’ points (what, why, when, where and who) when you communicate your output/results. You need to be able to clearly define what you did, why you did it, when your approach works (and doesn’t work), where your data came from, and who will be affected by what you’ve done. If you can’t answer these questions succinctly and in a manner that a layperson can understand, you’re failing as a data scientist.

Two real-world examples – one rockstar, one not

I have two recent examples to help highlight the difference between a data science rockstar (i.e., someone who communicates well) and someone who is not so much of one. I’ll give you the background on both and let you make up your own mind about which person you’d hire as your next data scientist. Both of these people work at the same organization.

Person 1:

She’s been a data scientist for 4 years. She’s got a wide swath of experience in data exploration, feature engineering, machine learning and data management. She’s had multiple projects over her career that required a deep dive into large datasets, and she’s had to use different systems, platforms and languages during her analysis. For each project she works on, she keeps a running notebook with commentary, ideas, changes and reasons for doing what she’s doing – she’s a scientist, after all. When she provides updates to team members and management, she provides multiple layers of detail that can be read or skipped depending on the reader’s level of interest. She provides a thorough write-up of all her work, with detailed notes about why things are done the way they are done and how potential changes might affect the outcome of her work. For project ‘wrap-up’ documentation, she delivers an executive summary with visualizations that succinctly describes the project, the work she did, why she did what she did, and what she thinks could be done to improve things. In addition to the executive summary, she provides a thorough write-up that describes the entire process, with multiple appendices and explanatory statements for those people who want to dive deeply into the project. When people are choosing team members for their projects, her name is the first out of their mouths.

Person 2:

He’s been a data scientist for 4 years (about a month longer than Person 1). His background is very technical, and he is the ‘go-to’ person for algorithms and programming languages within the team. He’s well thought of and can do just about anything that is thrown over the wall at him. He’s quite successful and is sought out for advice by people all over the company. When he works on projects, he sort of ‘wings it’ (his words) and keeps few notes about what he’s done and why he’s made the choices he’s made. For example, if you ask him why he chose Random Forests instead of Support Vector Machines on a project, he’ll tell you ‘because it worked better,’ but he can’t explain what ‘better’ means. Now, there aren’t many people who would argue against his choices on projects, and his work is rarely questioned. He’s good at what he does, and nobody at the company questions his technical skills, but they always question ‘what is he doing?’ and ‘what did he do?’ during/after projects. For documentation and presentation of results, he puts together the basic report that is expected, with the appropriate information, but people always have questions and are always ‘bothering him’ (again… his words). When new projects are being considered, he’s usually last in line for inclusion because there’s ‘just something about working with him’ (actual words from his co-workers).
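As an aside, ‘it worked better’ is easy to make concrete. Here’s a minimal sketch (synthetic data and scikit-learn – a hypothetical illustration, not Person 2’s actual work) of how a named, cross-validated metric turns ‘better’ into something you can communicate:

```python
# Minimal sketch: justify a model choice with an explicit, named metric.
# Hypothetical example on synthetic data, not any real project.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

for name, model in [("Random Forest", RandomForestClassifier(random_state=42)),
                    ("SVM", SVC())]:
    # 5-fold cross-validated ROC AUC -- now "better" has a definition
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With output like that in your notes, ‘it worked better’ becomes ‘it scored X on this metric under cross-validation’ – a sentence anyone on the team can repeat and defend.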

Who would you choose?

I’m assuming you know which of the two is the data science rockstar. While Person 2 is technically more advanced than Person 1, his communication skills are a bit behind hers. Person 1 is the one everyone goes to for delivering the ‘best’ outcomes from data science in their company. Communication is the difference. Person 1 is not only able to do the technical work but also to share the outcomes in a way that the organization can easily understand.

If you want to be a data science rockstar, you need to learn to communicate. It’s the ‘one skill’ that can move you into the realm of ‘top data scientists’ and away from the average data scientists who are focusing all of their personal development efforts on learning another algorithm or another language.

By the way, I’ve written about this before here and here so jump over and read a few more thoughts on the topic if you have time.

Photo by Ben Sweet on Unsplash

Data Mining – A Cautionary Tale

Beware Data Mining

For those of you who might be new to data, keep this small (but extremely important) thing in mind – beware data mining.

What is data mining? Data mining is the process of discovering information and patterns in data. It is the first step in the Data -> Information -> Knowledge -> Wisdom conversion process. Data mining is extremely important – but it can cause you a lot of problems if you aren’t aware of some of the issues that can arise.

First, data mining can give you the answer you’re looking for… regardless of whether that answer is even correct. Many people treat data mining as an iterative ‘loop’ that lets you keep mining until you find data that supports the hypothesis you’re trying to prove (or disprove). A great example of this is the ‘food science star’ Brian Wansink of Cornell. Dr. Wansink spent years in the spotlight as head of Cornell’s Food & Brand Lab, as well as heading up the US Dietary Guidelines committee that influenced public policy around food and diets in the United States.

Over the last few years, Wansink’s ‘star’ has been fading as other researchers began investigating his work after he posted an article about a graduate researcher who ‘never said no.’ As part of that post (and the subsequent investigation), emails were released that contained some interesting commentary around ‘data mining’ that I thought was worth sharing. From Here’s How Cornell Scientist Brian Wansink Turned Shoddy Data Into Viral Studies About How We Eat:

When Siğirci started working with him, she was assigned to analyze a dataset from an experiment that had been carried out at an Italian restaurant. Some customers paid $8 for the buffet, others half price. Afterward, they all filled out a questionnaire about who they were and how they felt about what they’d eaten.

Somewhere in those survey results, the professor was convinced, there had to be a meaningful relationship between the discount and the diners. But he wasn’t satisfied by Siğirci’s initial review of the data.

“I don’t think I’ve ever done an interesting study where the data ‘came out’ the first time I looked at it,” he told her over email.

Emphasis mine.

Since the investigation began, Wansink has had 15 articles retracted from peer-reviewed journals, and many more are being reviewed. Wansink and colleagues were continuously looking through data, trying to find a way to ‘sort’ the data to match what they wanted it to say.

That’s the danger of data mining. You keep working your data until you find an answer you like and ignore the answers you don’t like.

Don’t get me wrong – data mining is absolutely a good thing when done right. You should go into your data with a hypothesis in mind, look for patterns, and then either accept or reject your hypothesis based on the analysis. There’s nothing wrong with then starting over with a new hypothesis, or with finding patterns that help you develop a new hypothesis, but your data and your analysis have to lead you down the road to a valid outcome.

What Wansink is accused of doing is something called ‘p-hacking’, where a researcher keeps working the data until it yields a p-value of 0.05 or less (the conventional threshold for rejecting the null hypothesis at the 95% confidence level). P-hacking is the art of continuing to sort / manipulate your data until you find the data points that give you a p-value of 0.05 or less. For example, let’s assume that you have a dataset of 500 rows with 4 columns. You run some analysis – for this example, a basic regression analysis – and you get a p-value of 0.2. That’s not great, as it suggests weak evidence for rejecting the null, but it does give you insight into the dataset. An ethical researcher / data scientist will take what they learned from this analysis and look at their data again. An unethical researcher / data scientist will massage the data to make the p-value look better – perhaps making an arbitrary decision to drop any rows with readings over a certain value and re-running the analysis… and bam… you have a p-value of 0.05. That’s p-hacking, and poor data mining.
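To see how easy that kind of ‘massaging’ is, here’s a small simulation (entirely synthetic data, my own illustration – not anything from the Wansink case): the two variables are pure noise, yet by repeatedly dropping the rows that hurt the correlation you can walk the p-value under 0.05.

```python
# Illustration of p-hacking on pure noise (synthetic data, my own sketch).
# There is NO real relationship here -- we manufacture one by dropping rows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)   # independent of x by construction

r, p = stats.pearsonr(x, y)
print(f"honest p-value: {p:.3f}")

# "Massage" the data: repeatedly drop the single point that most hurts
# the correlation, then re-test -- exactly the loop an unethical analyst runs.
while p > 0.05 and len(x) > 50:
    off_diagonal = np.abs((x - x.mean()) / x.std() - (y - y.mean()) / y.std())
    worst = np.argmax(off_diagonal)   # point least consistent with a trend
    x, y = np.delete(x, worst), np.delete(y, worst)
    r, p = stats.pearsonr(x, y)

print(f"after dropping {500 - len(x)} rows: p = {p:.4f}")  # typically < 0.05
```

Every individual deletion can be dressed up as ‘outlier removal’, which is exactly why an undocumented, answer-driven cleaning loop is so dangerous.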

This is where it gets tricky. There could be a very valid reason for removing those rows of data. Perhaps it was ‘bad data’, or maybe it wasn’t relevant (e.g., the remaining rows have readings less than 1 and the rows you removed have readings of 10 million), but you need to be able to defend the manipulation of the data, and unethical researchers generally won’t be able to do that.

Another ‘gotcha’ related to p-hacking and over-analysis can be found in the Wansink story here.

But for years, Wansink’s inbox has been filled with chatter that, according to independent statisticians, is blatant p-hacking.

“Pattern doesn’t look good,” Payne of New Mexico State wrote to Wansink and David Just, another Cornell professor, in April 2009, after what Payne called a “marathon” data-crunching session for an experiment about eating and TV-watching.

“I also ran — i am not kidding — 400 strategic mediation analyses to no avail…” Payne wrote. In other words, testing 400 variables to find one that might explain the relationship between the experiment and the outcomes. “The last thing to try — but I shutter to think of it — is trying to mess around with the mood variables. Ideas…suggestions?”

Two days later, Payne was back with promising news: By focusing on the relationship between two variables in particular, he wrote, “we get exactly what we need.” (The study does not appear to have been published.)

Don’t do that. That’s bad data mining and bad data science.  If you have to run an analysis 400 times to find a couple of variables that give you a good p-value, you are doing things wrong.
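The 400-analyses problem is easy to demonstrate with a simulation (again, purely synthetic noise, my own sketch): test enough unrelated variables at the 0.05 threshold and roughly 5% of them will come up ‘significant’ by chance alone.

```python
# Why 400 tests on noise "find" something: ~5% clear p < 0.05 by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
outcome = rng.normal(size=200)   # the experimental outcome (pure noise)

false_positives = sum(
    stats.pearsonr(rng.normal(size=200), outcome)[1] < 0.05
    for _ in range(400)   # 400 unrelated "mediator" variables
)
print(f"'significant' results from pure noise: {false_positives} of 400")
# Expect roughly 20 -- plenty of ammunition for a paper that shouldn't exist.
```

That is why hunting through hundreds of variables for ‘exactly what we need’ isn’t analysis – it’s guaranteed to produce something, whether or not anything is there.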

Data mining is absolutely a valid approach to data. Everyone does it, but not everyone does it right. Be careful of massaging the data to fit your needs and get the answer you want. Let your data tell you how it wants to be handled and what answers it’s going to give.

You Need a Chief Data Officer. Here’s Why.


Big data has moved from buzzword to being a part of everyday life within enterprise organizations. An IDG survey reports that 75% of enterprise organizations have deployed or plan to deploy big data projects. The challenge now is capturing strategic value from that data and delivering high-impact business outcomes. That’s where a Chief Data Officer (CDO) enters the picture. While CDOs have been hired in the past to manage data governance and data management, their role is transitioning into one focused on how best to organize and use data as a strategic asset within organizations.

Gartner estimates that 90% of large global organizations will have a CDO by 2019. Given that estimate, it’s important for CIOs and the rest of the C-suite to understand how a CDO can deliver maximum impact for data-driven transformation. CDOs often don’t have the resources, budget, or authority to drive digital transformation on their own, so the CDO needs to help the CIO drive transformation via collaboration and evangelism.

“The CDO should not just be part of the org chart, but also have an active hand in launching new data initiatives,” Patricia Skarulis, SVP & CIO of Memorial Sloan Kettering Cancer Center, said at the recent CIO Perspectives conference in New York.

Chief Data Officer – What, when, how

A few months ago, I was involved in a conversation with the leadership team of a large organization. This conversation revolved around whether they needed to hire a Chief Data Officer and, if they did, what that individual’s role should be. It’s always difficult creating a new role, especially one like the CDO whose oversight spans multiple departments. In order to create this role (and have the person succeed), the leadership team felt they needed to clearly articulate the specific responsibilities and understand the “what, when, and how” aspects of the position.

The “when” was an easy answer: Now.

The “what” and the “how” are a bit more complex, but we can provide some generalizations of what the CDO should be focused on and how they should go about their role.

First, as I’ve said, the CDO needs to be a collaborator and communicator to help align the business and technology teams in a common vision for their data strategies and platforms, to drive digital transformation and meet business objectives.

In addition to the strategic vision, the CDO needs to work closely with the CIO to create and maintain a data-driven culture throughout the organization. This data-driven culture is an absolute requirement in order to support the changes brought on by digital transformation today and into the future.

“My role as Chief Data Officer has evolved to govern data, curate data, and convince subject matter experts that the data belongs to the business and not [individual] departments,” Stu Gardos, CDO at Memorial Sloan Kettering Cancer Center, said at the CIO Perspectives conference.

Lastly, the CDO needs to work with the CIO and the IT team to implement proper data management and data governance systems and processes to ensure data is trustworthy, reliable, and available for analysis across the organization. That said, the CDO can’t get bogged down in technology and systems but should keep their focus on the people and processes as it is their role to understand and drive the business value with the use of data.

In the meeting I mentioned earlier, I was asked what a successful Chief Data Officer looks like. It’s clear that a successful CDO crosses the divide between business and technology and institutes data as trusted currency that is used to drive revenue and transform the business.

Originally published on CIO.com.

Beware the Models


“But… all of our models have accuracies above 90%… our system should be working perfectly!”

Those were the words spoken by the CEO of a mid-sized manufacturing company. These comments were made during a conversation about their various forecasting models and the poor performance of those models.

This CEO had spent about a million dollars over the previous few years with a consulting company that had been tasked with creating new methods and models for forecasting sales and manufacturing. Over the previous decade, the company had done very well for itself using a very manual, instinct-driven process to forecast sales and the manufacturing capacity needed to ensure sales targets were met.

About three years ago, the CEO decided they needed to take advantage of the large amount of data available within the organization to help manage the organization’s various departments and businesses.

As part of this initiative, a consultant from a well-known consulting organization was brought in to help build new forecasting models. These models were developed with many different datasets from across the organization and – on paper – they looked really good. The presentations of these models included the ‘right’ statistical measures to show that they provided anywhere from 90% to 95% accuracy.

But the models, their descriptions and the nearly 300 pages of documentation about how they would help the company make many millions of dollars over the coming years weren’t doing what they were designed to do. The results of the models were far from the reality of the organization’s real-world sales and manufacturing processes.

Due to the large divergence between model and reality, the CEO wanted an independent review of the models to determine what wasn’t working and why.  He reached out to me and asked for my help.

You may be hoping that I’m about to tell you what a terrible job the large, well-known consultants did. We all like to see the big, expensive, successful consulting companies thrown under the bus, right?

But…that’s not what this story is about.

The moral of this story? Just because you build a model with better than average accuracy (or even one with great accuracy), there’s no telling what that model will do once it meets the real world. Sometimes, models just don’t work. Or…they stop working. Even worse, sometimes they work wonderfully for a little while only to fail miserably some time in the near future.

Why is this?

There could be a variety of reasons. Here are a few that I see often:

  • It could be from data mining and building a model based on a biased view of the data.
  • It could be poor data management that allows poor-quality data into the modeling process. Building models on poor-quality data produces poor-quality models that can still report good accuracy (measured against that same poor input data).
  • It could be a poor understanding of the modeling process. There are a lot of ‘data scientists’ out there today who have very little understanding of what the data analysis and modeling process should look like.
  • It could be – and this is worth repeating – that sometimes models just don’t work. You can do everything right and the model still can’t perform in the real world (see the sketch below).
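To make that last point concrete, here’s a tiny, fully synthetic sketch (my own illustration, not the consultants’ models): a model with excellent held-out accuracy that collapses once the real-world relationship drifts away from the one it was trained on.

```python
# Synthetic illustration: great test accuracy, then drift breaks the model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Training-era data: the outcome is driven by the first two features.
X = rng.normal(size=(5000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out test accuracy: {model.score(X_te, y_te):.1%}")   # ~99%

# "Real world" a year later: the driver has shifted to a different feature.
X_new = rng.normal(size=(5000, 5))
y_new = (X_new[:, 2] > 0).astype(int)
print(f"post-drift accuracy:    {model.score(X_new, y_new):.1%}")  # ~50%
```

Every number in the original validation report was correct; the world the model was deployed into simply stopped matching the world it was trained on.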

Beware the models. Just because they look good on paper doesn’t mean they will be perfect (or even average) in the real world.  Remember to ask yourself (and your data / modeling teams) – are your models good enough?

Modeling is both an art and a science. You can do everything right and still get models that will make you say ‘meh’ (or even !&%#^@$). That said, as long as the modeling process is approached correctly and the ‘science’ in data science isn’t forgotten, the outcome of analysis / modeling initiatives should at least provide some insight into the processes, systems and data management capabilities within an organization.

 

Big Data Roadmap – A roadmap for success with big data


I’m regularly asked how to get started with big data. My response is always the same: I give them my big data roadmap for success. Most organizations want to jump in and do something ‘cool’ with big data. They want a project that brings in new revenue or adds some new / cool service or product, but I always point them to this roadmap and say ‘start here’.

The big data roadmap for success starts with the following initiatives:

  • Data Quality / Data Management systems (if you don’t have these in place, that should be the absolute first thing you do)
  • Build a data lake (and utilize it)
  • Create self-service reporting and analytical systems / processes.
  • Bring your data into the line-of-business.

These are fairly broad types of initiatives, but they are general enough for any organization to be able to find some value.

Data Management / Data Quality / Data Governance

First of all, if you don’t have proper data management / data quality / data governance, fix that. Don’t do anything else until you can say with absolute certainty that you know where your data has been, who has touched it and where it is today. Without this first step, you are playing with fire when it comes to your data. If you aren’t sure how good your data is, there’s no way to really understand how good the output of your data initiative(s) will be.

Build a data lake (and utilize it)

I cringe anytime I (or anyone else) say/write ‘data lake’, because it reminds me too much of the data warehouse craze that took CIOs and IT departments by storm a number of years ago. That said, data lakes are valuable (just like data warehouses were/are valuable), but it isn’t enough to just build a data lake… you need to utilize it. Rather than being just a large data store, a data lake should store data and give your team(s) the ability to find and use the data in the lake.
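As a rough sketch of what ‘utilize it’ can mean in practice – the lake path, table name and schema below are all hypothetical placeholders, and I’m assuming a Parquet-on-object-storage lake with PySpark available:

```python
# Minimal sketch of *using* a lake, not just filling it.
# Paths, table names and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-usage").getOrCreate()

# Read raw files straight out of the lake...
orders = spark.read.parquet("s3://company-data-lake/sales/orders/")

# ...and register them so analysts can find and query the data by name,
# instead of needing to know where the files physically live.
orders.createOrReplaceTempView("orders")
monthly = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           sum(amount)                     AS revenue
    FROM orders
    GROUP BY 1
    ORDER BY 1
""")
monthly.show()
```

The point isn’t the tooling; it’s that the lake exposes data people can actually discover and query, rather than a pile of files only one team understands.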

Create self-service reporting and analytical systems / processes.

Combined with the next initiative or implemented separately, developing self-service access and reporting for your data is something that can free up your IT and analytics staff. Your organization will be much more efficient if any member of the team can build and run a report rather than waiting for a custom report to be created and executed for them. This type of project might feel a bit like ‘dashboards’, but it should be much more than that – your people should be able to get into the data, see it, manipulate it and then build a report or visualization based on those manipulations (a hypothetical example follows below). Of course, you need a good data governance process in place to ensure that the right people can see the right data.
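As one concrete (and entirely hypothetical) flavor of self-service, a tool like DuckDB lets an analyst query governed extracts directly instead of waiting on a custom report; the paths and columns here are placeholders:

```python
# Hypothetical self-service query: an analyst explores governed extracts
# directly, no custom-report request needed. Paths/columns are placeholders.
import duckdb

result = duckdb.sql("""
    SELECT region, count(*) AS orders, sum(amount) AS revenue
    FROM read_parquet('/data/extracts/sales/*.parquet')
    GROUP BY region
    ORDER BY revenue DESC
""").df()

print(result)   # a DataFrame the analyst can chart or pivot themselves
```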

Bring your data into the Line of Business

This particular initiative can (and probably should) be combined with the previous one (self-service), but it still makes sense to focus on it in its own right. By bringing your data into the line of business, you are getting it closer to the people who best understand the data and its context. By bringing data into the line of business (and providing the ability to easily access and utilize that data), you are exponentially growing the analytical capabilities of your organization.

Big Data Roadmap – a guarantee?

There are no guarantees in life, but I can tell you that if you follow this roadmap, you will have a much better chance at success than if you don’t. The key here is to ensure that your ‘data in’ isn’t garbage (hence the data governance and data lake aspects) and that you get as much data as you can into the hands of the people who understand the context of that data.

This big data roadmap won’t guarantee success, but it will get you further up the road toward success than you would have been without it.