The Data Mining Trap


In a post titled Data Mining – A Cautionary Tale, I argue that data mining can be dangerous by recounting the story of Cornell’s Brian Wansink, who has had multiple papers retracted due to various data mining methods that aren’t quite ethical (or even correct).

Recently, Gary Smith over at Wired wrote an article called The Exaggerated Promise of So-Called Unbiased Data Mining with another good example of the danger of data mining.

In the article, Smith writes of a time when noted physicist and Nobel laureate Richard Feynman gave his class an exercise: determine the probability of seeing a specific license plate in the parking lot on the way into class (he gave them a specific example of a license plate). The students worked on the problem and determined that the probability was less than 1 in 17 million that Feynman would see that specific license plate.
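As an aside, a figure like ‘1 in 17 million’ falls out of simple counting if you assume a plate format of three letters followed by three digits (my assumption – the story doesn’t specify the format):

```python
# Chance of seeing one pre-chosen license plate, assuming (my assumption,
# not stated in the story) plates of 3 letters followed by 3 digits.
LETTERS = 26
DIGITS = 10

total_plates = LETTERS**3 * DIGITS**3  # every possible 3-letter/3-digit plate
probability = 1 / total_plates

print(f"{total_plates:,} possible plates")  # 17,576,000
print(f"probability of any one plate: {probability:.2e}")
```

Which is exactly why the ‘probability’ collapses to 1 once you’ve already seen the plate – it was never pre-chosen at all.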

According to Smith, what Feynman didn’t tell the students was that he had seen the specific license plate that morning in the parking lot before coming to class, so the probability was actually 1. Smith calls this the ‘Feynman Trap.’

Whether or not this story is true (I don’t recall ever reading it from Feynman directly, although he does have a quote about license plates), it’s a very good description of one of the dangers of data mining: knowing what the answer will be before starting the work. In other words, bias.

Bias is everywhere in data science. Some say there are 8 types of bias (I’m not sure I completely agree with 8 as the number, but it’s as good a place to start as any). The key is knowing that bias exists, how it shows up and how to manage it. You have to manage your own bias as well as any bias that might be inherent in the data you are analyzing. Bias is hard to overcome, but knowing it exists makes it easier to manage.

The Data Mining Trap

The ‘Feynman Trap’ (i.e., bias) is a really good thing to keep in mind whenever you do any data analysis. Thinking back to the story shared in Data Mining – A Cautionary Tale, Dr. Wansink was absolutely biased in just about everything he did in the research that was retracted. He had an answer that he wanted to find and then found the data to support that answer.

There’s the trap. Rather than going into data analysis with questions and looking for data to help you find answers, you go into it with answers and try to find patterns to support your answer.

Don’t fall into the data mining trap. Keep an open mind, manage your bias and look for the answers. There’s nothing wrong with finding other questions (and answers) while data mining, but keep that bias in check and you’ll be on the right path to avoiding the data mining trap.

Photo by James & Carol Lee on Unsplash

This one skill will make you a data science rockstar


Want to be a data science rockstar? Of course you do! Sorry for the clickbait headline, but I wanted to reach as many people as I can with this important piece of information.

Want to know what the ‘one skill’ is?

It isn’t Python or R or Spark or some other new technology or platform. It isn’t the latest machine learning methods or algorithms. It isn’t being able to write AI algorithms from scratch or analyze terabytes of data in minutes.

While those are important – very important – they aren’t THE skill. In fact, it isn’t a technical skill at all.

The one skill that will make you a data science rockstar is a so-called ‘soft skill’. The ability to communicate is what will set you apart from your peers and make you stand out in an increasingly crowded field of data scientists.

Why do I need to communicate to be a data science rockstar?

You can be the smartest person in the world when it comes to creating some wild machine learning systems to build recommendation engines, but if you can’t communicate the ‘strategy’ behind the system, you’re going to have a hard time.

If you’re able to find some phenomenal patterns in data that have the potential to deliver a multiple-X increase in revenue but can’t communicate the ‘strategy’ behind your approach, your potential is going to be unrealized.

What do I mean by ‘strategy’? In addition to the standard information (error rates/metrics, etc.), you need to be able to hit the key ‘W’ points (‘what, why, when, where and who’) when you communicate your output/results. You need to be able to clearly define what you did, why you did it, when your approach works (and doesn’t work), where your data came from and who will be affected by what you’ve done. If you can’t answer these questions succinctly and in a manner that a layperson can understand, you’re failing as a data scientist.

Two real world examples – one rockstar, one not-rockstar

I have two recent examples to help highlight the difference between a data science rockstar (i.e., someone who communicates well) and someone who isn’t quite one. I’ll give you the background on both and let you make up your own mind about which person you’d hire as your next data scientist. Both of these people work at the same organization.

Person 1:

She’s been a data scientist for 4 years. She’s got a wide swath of experience in data exploration, feature engineering, machine learning and data management. She’s had multiple projects over her career that required a deep dive into large datasets, and she’s had to use different systems, platforms and languages during her analysis. For each project she works on, she keeps a running notebook with commentary, ideas, changes and reasons for doing what she’s doing – she’s a scientist, after all. When she provides updates to team members and management, she provides multiple layers of detail that can be read or skipped depending on the reader’s level of interest. She provides a thorough writeup of all her work with detailed notes about why things are being done the way they are done and how potential changes might affect the outcome of her work. For project ‘wrap-up’ documentation, she delivers an executive summary with many visualizations that succinctly describes the project, the work she did, why she did what she did and what she thinks could be done to improve things. In addition to the executive summary, she provides a thorough write-up that describes the entire process, with multiple appendices and explanatory statements for those people who want to dive deeply into the project. When people are selecting team members for their projects, her name is the first to come out of their mouths.

Person 2:

He’s been a data scientist for 4 years (about 1 month longer than Person 1). His background is very technical, and he is the ‘go-to’ person for algorithms and programming languages within the team. He’s well thought of and can do just about anything that is thrown over the wall at him. He’s quite successful and is sought out for advice by people all over the company. When he works on projects, he sort of ‘wings it’ (his words) and keeps few notes about what he’s done and why he’s made the choices he has. For example, if you ask him why he chose Random Forests instead of Support Vector Machines on a project, he’ll tell you ‘because it worked better,’ but he can’t explain what ‘better’ means. Now, there aren’t many people who would argue against his choices on projects, and his work is rarely questioned. He’s good at what he does and nobody at the company questions his technical skills, but they always question ‘what is he doing?’ and ‘what did he do?’ during/after projects. For documentation and presentation of results, he puts together the basic report that is expected, with the appropriate information, but people always have questions and are always ‘bothering him’ (again… his words). When new projects are being considered, he’s usually last in line for inclusion because there’s ‘just something about working with him’ (actual words from his co-workers).
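As an aside, ‘it worked better’ is easy to pin down. A minimal sketch (my construction – synthetic data and default model settings, not anyone’s real project) of comparing the two model families with scikit-learn cross-validation, so that ‘better’ means a concrete, repeatable number:

```python
# Comparing two models with cross-validation so "better" has a definition:
# here, mean 5-fold accuracy on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data standing in for a real project dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

results = {}
for name, model in [("Random Forest", RandomForestClassifier(random_state=0)),
                    ("SVM", SVC())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean 5-fold accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A few lines like these in a project notebook – the metric, the validation scheme, the scores – are exactly the kind of record Person 1 keeps and Person 2 skips.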

Who would you choose?

I’m assuming you know which of the two is the data science rockstar. While Person 2 is technically more advanced than Person 1, his communication skills are a bit behind hers. Person 1 is the one everyone goes to for delivering the ‘best’ outcomes from data science in their company. Communication is the difference. Person 1 is not only able to do the technical work but also to share the outcomes in a way that the organization can easily understand.

If you want to be a data science rockstar, you need to learn to communicate. It’s the ‘one skill’ that can move you into the realm of ‘top data scientists’ and away from the average data scientists who are focusing all of their personal development efforts on learning another algorithm or another language.

By the way, I’ve written about this before here and here so jump over and read a few more thoughts on the topic if you have time.

Photo by Ben Sweet on Unsplash

Beware the Models


“But… all of our models have accuracies above 90%… our system should be working perfectly!”

Those were the words spoken by the CEO of a mid-sized manufacturing company. These comments were made during a conversation about their various forecasting models and the poor performance of those models.

This CEO had spent about a million dollars over the last few years with a consulting company that had been tasked with creating new methods and models for forecasting sales and manufacturing. Over the previous decade, the company had done very well for itself using a very manual, instinct-driven process to forecast sales and the manufacturing processes needed to ensure sales targets were met.

About three years ago, the CEO decided they needed to take advantage of the large amount of data available within the organization to help manage the organization’s various departments and businesses.

As part of this initiative, a consultant from a well-known consulting organization was brought in to help build new forecasting models. These models were developed with many different data sets from across the organization and – on paper – they looked really good. The presentations of these models included the ‘right’ statistical measures to show that they provided accuracies anywhere from 90% to 95%.

The models, their descriptions and the nearly 300 pages of documentation about how these new models would help the company make many millions of dollars over the coming years weren’t doing what they were designed to do. The results of the models were far from the reality of this organization’s real-world sales and manufacturing processes.

Due to the large divergence between model and reality, the CEO wanted an independent review of the models to determine what wasn’t working and why.  He reached out to me and asked for my help.

You may be hoping that I’m about to tell you what a terrible job the large, well known consultants did.  We all like to see the big, expensive, successful consulting companies thrown under the bus, right?

But…that’s not what this story is about.

The moral of this story? Just because you build a model with better than average accuracy (or even one with great accuracy), there’s no telling what that model will do once it meets the real world. Sometimes, models just don’t work. Or…they stop working. Even worse, sometimes they work wonderfully for a little while only to fail miserably some time in the near future.

Why is this?

There could be a variety of reasons. Here are a few that I see often:

  • It could be from data mining and building a model based on a biased view of the data.
  • It could be poor data management that allows poor quality data into the modeling process. Building models with poor quality data creates poor quality models that can still show good accuracy (measured against that same poor input data).
  • It could be a poor understanding of the modeling process. There are a lot of ‘data scientists’ out there today who have very little understanding of what the data analysis and modeling process should look like.
  • It could be – and this is worth repeating – sometimes models just don’t work. You can do everything right and the model just can’t perform in the real world.
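The failure mode in that last bullet can be shown with a toy example (entirely synthetic, my construction): a simple threshold model that validates almost perfectly, then degrades badly once the incoming data drifts away from the training distribution:

```python
# A "good" model failing in the wild: a threshold classifier trained on one
# distribution, then evaluated after the input data drifts. Entirely synthetic.
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, shift=0.0):
    # Class is 1 when the underlying signal is positive; the observed
    # feature is the signal plus an optional drift term.
    signal = rng.normal(0, 1, n)
    return signal + shift, (signal > 0).astype(int)

X_train, y_train = make_data(1000)
threshold = 0.0  # boundary learned on training data: predict 1 when x > 0

def accuracy(X, y):
    return ((X > threshold).astype(int) == y).mean()

# Same underlying process, but the observed inputs have shifted.
X_drift, y_drift = make_data(1000, shift=1.5)

print(f"validation accuracy: {accuracy(X_train, y_train):.2f}")
print(f"after drift:         {accuracy(X_drift, y_drift):.2f}")
```

The model didn’t change and nobody made a mistake – the world the model sees changed, which is why 90%+ accuracy ‘on paper’ guarantees nothing in production.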

Beware the models. Just because they look good on paper doesn’t mean they will be perfect (or even average) in the real world.  Remember to ask yourself (and your data / modeling teams) – are your models good enough?

Modeling is both an art and a science. You can do everything right and still get models that will make you say ‘meh’ (or something far less printable). That said, as long as the modeling process is approached correctly and the ‘science’ in data science isn’t forgotten, the outcome of analysis/modeling initiatives should at least provide some insight into the processes, systems and data management capabilities within an organization.


When it comes to big data, think these three words: analyze; contextualize; internalize


If you don’t know, I’m a bit of a data nerd. I’ve been writing about big data, data science, machine learning and other ‘new’ stuff for years. I believe in data science and I believe in big data. I’m a fan of machine learning, too (though I think you probably don’t need it for the majority of problems that the majority of organizations run across).

So… with that in mind… let me say this: big data and data science are nothing new. Everyone is talking about big data, machine learning, artificial intelligence and data science like these things are ‘brand new’ to the world, but they aren’t. All of these ‘buzzword bingo’ candidates have been around for years – think 50+ years in one form or another. It’s wonderful to see the buzz around them these days, since we finally have the computing power to actually implement some of these ideas in a much more scalable way.

That said… don’t let ‘scalable’ fool you into thinking that all you need to do is ‘scale’ and things will be hunky-dory. The ability to scale to handle larger problems and larger data-sets is extremely important, but without the very basics of data science and applied statistics, all of your big data / machine learning / AI projects aren’t going to be as valuable to you or your organization as you might hope.

According to IBM, we now generate 2.5 quintillion bytes of data per day. What are we doing with all that data? Surely it isn’t all being used by good data scientists to build new models, generate revenue and deliver actionable insights to organizations? I know for a fact it isn’t, although there are plenty of companies that are taking advantage of that data (think Google and Facebook). I once wrote that ‘today we are drowning in data and starved for information’ (which was a small change to John Naisbitt’s 1982 masterpiece Megatrends, in which he wrote ‘we are drowning in information and starved for knowledge’).

We are working with enormous data-sets today, and there’s no reason to think these data-sets won’t continue to get larger. But the size of your data isn’t necessarily what you should be worried about. Beyond the important basics (data quality, data governance, etc.) – which, by the way, have very little to do with data ‘size’ – the next most important aspect of any data project is the ability to analyze data and create some form of knowledge from that data.

When I talk to companies about data projects, they generally want to talk about technologies and platforms first, but that’s the wrong first step. Those discussions are needed, but I always tell them not to get hung up on Spark, Hadoop, MapReduce or other technologies/approaches. I push them to talk about whether they and their organization have the right skills to analyze, contextualize and internalize whatever data they may have. By having the ability to analyze, contextualize and internalize, you add meaning to data, which is how you move from data to knowledge.

To do this work, organizations need to ensure they have people with statistical skills as well as development skills who can take whatever data is available and infer something from it. We need these types of skills more than we need the ability to spin up Hadoop clusters. I know 25 people that I can call tomorrow to turn up some big data infrastructure that could handle the largest of the large data-sets… but I only know a handful of people that I would feel comfortable calling and asking to ‘find the insights in this data-set’ and trust that they have all the skills (technical, statistical AND soft skills) to do the job right.

Don’t forget, there IS a science to big data (ahem…it IS called data science after all). This science is needed to work your way up the ‘data -> information -> knowledge’ ladder. By adding context to your data, you create information. By adding meaning to your information, you create knowledge. Technology is an enabler for data scientists to add context and meaning, but it is still up to the individual to do the hard work.

Don’t get me wrong, the technical skills for these types of systems are important. Data scientists need to be able to code and use whatever systems are available to them, but the real work and the value come from creating information and knowledge from data. You don’t work your way up the ‘data -> information -> knowledge’ ladder without being able to understand and contextualize data, and technology (generally) can’t do those very important steps for you (although with artificial intelligence, we may get there someday).

Stop thinking about the technologies and buzzwords. Don’t think ‘Spark’, ‘Python’, ‘SAS’ or ‘Hadoop’… think ‘analyze’ and ‘contextualize’. Rather than chasing new platforms, chase new ways to ‘internalize’ data. Unless you and your team can find ways to analyze, contextualize and internalize data, your ability to make a real business impact with big data will be in jeopardy.

Data Quality – The most important data dimension?


In a recent article I wrote over on CIO.com titled Want to Speed Up Your Digital Transformation Initiatives? Take a Look at Your Data, I discuss the importance of data quality and data management in an organization’s digital transformation efforts. That article can be summarized with the closing paragraph (but feel free to go read the full version):

To speed up your transformation projects and initiatives, you need to take a long, hard look at your data. Good data management and governance practices will put you a step ahead of companies that don’t yet view their data as a strategic asset.

I wanted to highlight this because it continues to be the biggest issue I find when working with clients today. Many organizations have people who are interested in data, and they are finding the budget to get their teams up to speed on data analytics and data science… but they are still missing the boat on the basics of good data management and data quality.

What is data quality?

Informatica defines data quality in the following manner:

Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system. … To be of high quality, data must be consistent and unambiguous. Data quality issues are often the result of database merges or systems/cloud integration processes in which data fields that should be compatible are not due to schema or format inconsistencies

Emphasis mine.

Not a bad definition. My definition of data quality is:

Data quality is simultaneously a measurement and a state of your data. It describes the consistency, availability, reliability, usability, relevancy, security and auditability of your data.

Now, some may argue that this definition covers data management and data governance more than data quality… and they may be correct… but I’ve found that most people who aren’t ‘data people’ get really confused (and bored) when you start throwing lots of different terms at them, so I try to cover as much of the master data management world as I can under ‘data quality’. I’ve found it’s more relatable to most folks when you talk about ‘data quality’ vs. ‘data governance’, etc.
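Several of those dimensions (consistency, usability, reliability) can be checked programmatically before any analysis starts. A minimal sketch with pandas – the dataset, column names and rules are all invented for illustration:

```python
# A minimal data-quality report: missing values, duplicate rows, and two
# simple consistency rules. The dataset and rules are invented examples.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "region": ["east", "West", "West", None, "east"],
    "revenue": [100.0, -5.0, -5.0, 200.0, 150.0],
})

report = {
    "missing_values": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    # Rule: region labels should be lowercase.
    "inconsistent_region_case": int(
        (df["region"].dropna() != df["region"].dropna().str.lower()).sum()
    ),
    # Rule: revenue should never be negative.
    "negative_revenue": int((df["revenue"] < 0).sum()),
}

for check, count in report.items():
    print(f"{check}: {count}")
```

Even a tiny report like this surfaces exactly the schema and format inconsistencies the Informatica definition warns about, before they poison an analysis.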

Data quality in the real world

Last month, I spoke to the CEO and CIO of a medium-sized company about a new data initiative they are planning. The project is a great idea for them and should lead to some real growth in both revenue and data sophistication. While I won’t go into the specifics, they are looking to spend a little over $5 million in the next two years to bring data to the forefront of their entire decision-making process.

While listening to their pitch (yes… they were pitching me… I’m not used to that), I asked one of my ‘go-to’ questions related to data quality: “Can you tell me about your data quality processes/systems?” They asked me to explain what I meant by data quality. I provided my definition and spent a few minutes discussing the need for data quality. We spoke for an hour about data management, data quality and data governance. We discussed how each of these would ‘fit’ into their data initiative(s) and what additional steps they need to take before they go full-speed into the data world.

Earlier today I had a follow-up conversation with the CEO. She told me that they are moving forward with their data initiative with one fairly large change – the first step is now implementing proper data management/quality processes and systems. Thankfully for this organization, both the CEO and CIO are smart enough to realize how important data quality is, and how important quality data is for trusting the analysis that comes from it.

As I said in the CIO.com article: ‘Good data management and governance practices will put you a step ahead of companies that don’t yet view their data as a strategic asset.’ This CEO/CIO pair definitely see data as a strategic asset and are willing to do what it takes to make quality, governance and data management a part of their organization.

Don’t forget the “Science” in Data Science


Just a reminder to everyone out there: this isn’t data magic… it is data science. The word ‘science’ is included there for a reason.

I would LOVE for magic to be involved in data analytics. I could then whip up a couple of spells, say ‘abracadabra’ and have my data tell me something meaningful. But that’s not how it works. You can say fancy incantations all day long, but your data is going to be meaningless until you do some work on it.

This ‘work’ that you need to do involves lots of very unglamorous activities. Lots of data munging and manipulation. Lots of trial and error and a whole lot of “well that didn’t work!”

Data science requires a systematic approach to collecting, cleaning, storing and analyzing data. Without ‘science’, you don’t have anything but a lot of data.

Let’s take a look at what the word ‘science’ means. Dictionary.com defines “science” as:

  • a branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws
  • systematic knowledge of the physical or material world gained through observation and experimentation.
  • any of the branches of natural or physical science.
  • systematized knowledge in general.
  • knowledge, as of facts or principles; knowledge gained by systematic study.
  • a particular branch of knowledge.
  • skill, especially reflecting a precise application of facts or principles

You’ll notice that the word ‘magic’ isn’t included anywhere in those definitions, but the word ‘systematic’ shows up a few times. While we’re at it, let’s take a look at a definition of data science (from Wikipedia):

an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured

Again…nothing about ‘abra cadabra’ in there.

If you want to ‘do’ data science correctly, you have to do the hard work. You have to follow some form of systematic process to get your data, clean your data, understand your data and then use that data to test out some hypotheses.
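That last step – testing hypotheses – is where the ‘systematic’ part shows up most clearly. A minimal sketch (with invented data) of a permutation test, which asks whether an observed difference between two groups could plausibly be produced by random relabeling alone:

```python
# A permutation test: is the observed difference in group means larger than
# what random relabeling of the same values would produce? Invented data.
import random

random.seed(0)

group_a = [12.1, 11.8, 12.5, 12.9, 11.6, 12.3]
group_b = [11.2, 11.0, 11.5, 10.9, 11.4, 11.1]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_a) - mean(group_b)
combined = group_a + group_b

n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(combined)  # randomly relabel which values belong to which group
    diff = mean(combined[:len(group_a)]) - mean(combined[len(group_a):])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_perm  # fraction of relabelings at least as extreme
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

If the p-value is tiny, the difference is unlikely to be noise; if it isn’t, no amount of ‘abracadabra’ will make the finding meaningful.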

Doing data science without ‘science’ is nothing more than throwing darts at a dart board and thinking the results are meaningful.