Just a reminder to everyone out there: this isn't Data Magic…it's Data Science. The word 'science' is in there for a reason.
I would LOVE for magic to be involved in data analytics. I could then whip up a couple of spells, say 'abra cadabra' and have my data tell me something meaningful. But that's not how it works. You can recite fancy incantations all day long, but your data is going to be meaningless until you do some work on it.
This ‘work’ that you need to do involves lots of very unglamorous activities. Lots of data munging and manipulation. Lots of trial and error and a whole lot of “well that didn’t work!”
Data science requires a systematic approach to collecting, cleaning, storing and analyzing data. Without the 'science', you don't have anything but a lot of data.
You'll notice that the word 'magic' isn't included anywhere in that description, but the word 'systematic' is. While we're at it, let's take a look at a definition of data science (from Wikipedia):
an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured
Again…nothing about ‘abra cadabra’ in there.
If you want to 'do' data science correctly, you have to do the hard work. You have to follow some form of systematic process to get your data, clean your data, understand your data and then use that data to test some hypotheses.
Doing data science without ‘science’ is nothing more than throwing darts at a dart board and thinking the results are meaningful.
Statistically speaking, you and/or your company really don’t need machine learning.
By 'statistically speaking', I mean that most companies today have absolutely no need for machine learning (ML). The majority of problems that companies want to throw at machine learning are fairly straightforward problems that can be 'solved' with some form of regression. They may not be the simple linear regression of your Algebra 1 class, but they are regression problems nonetheless. Robin Hanson summed up these thoughts recently when he tweeted the following:
Good CS expert says: Most firms that think they want advanced AI/ML really just need linear regression on cleaned-up data.
Of particular note is the ‘cleaned-up data’ piece. That’s huge and something that many companies forget (or ignore) when working with their data. Without proper data quality, data governance and data management processes / systems, you’ll most likely fall into the Garbage in / Garbage out trap that has befallen many data projects.
Now, I'm not a data management / data quality guru. Far from it. For that, you want people like Jim Harris and Dan Power, but I know enough about the topic to know what bad (or non-existent) data management looks like – and I see it often in organizations. In my experience working with organizations wanting to kick off new data projects (and most today are talking about machine learning and deep learning), the first question I always ask is "tell me about your data management processes." If they can't adequately describe those processes, they aren't ready for machine learning. Over the last five years, I'd guess that 75% of the time the response to my data management query is "well, we have some of our data stored in a database and other data stored on file shares with proper permissions." That isn't data management…it's data storage.
If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s).
What if you have good data management?
A small minority of the organizations I've worked with do have proper master data management processes in place. They really understand how important quality, governance and management are to good data and good analysis. If your company understands this, congratulations…you're a few steps ahead of many others.
Let me caution you, though. Just because you have good, clean data doesn't mean you can or should jump into machine learning. Of course you can jump in, I guess, but you most likely don't need to.
Out of all the companies I've worked with over the last five years, I'd say about 90% of the problems that were initially tagged for machine learning were solved with some fairly standard regression approaches. It always seems to come as a surprise to clients when I recommend simple regression to solve a 'complex' problem when they had their heart set on building out multiple machine learning (ML) / deep learning (DL) models. I always tell them that they could go the machine learning route – and there may be some value in that approach – but wouldn't it be nice to know what basic modeling / regression can do for you first, so you can tell whether ML / DL is actually doing anything better?
But…I want to use machine learning!
Go right ahead. There’s nothing stopping you from diving into the deep end of ML / DL. There is a time and a place for machine learning…just don’t go running full-speed toward machine learning before you have a good grasp of your data and what ‘legacy’ approaches can do for the problems you are trying to solve.
I hate to be the bearer of bad news….but your data project is going to fail.
Maybe not the one you’re working on today. Maybe not the one you’re starting next month. Heck, maybe not the one you don’t even know about yet…but at some point in the future – if you stay in the data world long enough – your data project is going to fail.
There are many ways your data project could fail. Martin Goodson shares his thoughts on Ten Ways your project could fail, and I've seen failures driven by each of Martin's "ten ways" during my career. The most spectacular failures have come from the lack of a clear strategy for data projects.
It should be common sense in the business world that if you don't have a strategy and a plan to execute that strategy, you are going to have a hard time. When I use the word 'strategy', I don't just mean some over-arching plan that somebody has written up because they think 'data is the new oil' and by 'doing' data projects they'll somehow magically make the business bigger / better / richer / stronger / etc.
Data projects are just like any other project. Imagine you need to move your data center…you wouldn’t just start unplugging servers and loading them into your car to drive to the new data center, would you?
Would you go and spend $20 million to hire a brand new sales team without building a thorough strategic plan for how that sales team will do what they need to do? You wouldn't hire the people, on-board them and then say 'start making phone calls' without planning sales territories, building 'go to market' plans and building other tactical plans to outline how the team will execute on your strategy, would you? Scratch that…I know some companies that have done exactly that (and they failed miserably).
Data projects require just as much strategic thinking and planning as any other type of project. Just because your CEO (or CIO or CMO or …) read an article about machine learning doesn’t mean you should run out and start spending money on machine learning. Most of you are probably nodding along with me. I can hear you thinking “this is common sense….tell me something I don’t know.” But let me tell you…in my experience, it isn’t common sense because I see it happen all the time with my clients.
So we agree that if you don't have a strategy, your data project is going to fail, right? Does that mean if you do the strategic planning process correctly, you'll be swimming in the deep end of data success in the future? Maybe. Maybe not. The strategic plan isn't everything. If planning well guaranteed success, then every company that ever hired McKinsey would be #1 in its industry with no hope of ever being surpassed by its competitors.
After you've spent some time on the strategy of your data project(s), you've got to spend time on the execution phase. This is where having the right people and the right systems / technologies in place to 'do' the data work comes into play. Again, every one of you is probably nodding right now and thinking something like "sure you need those things!" But this is another area where companies fall down time and time again. They kick off data projects without having the right people analyzing the data and the right people / systems supporting the projects.
Take a look at Martin's "Ten Ways" again, specifically #3. I watched a project get derailed because the VP of IT wouldn't approve the installation of RStudio and other tools onto each team member's computer. That team spent three weeks waiting to get the necessary tools installed on their machines before they could start diving into any data. This is an extreme case of course, but things like this happen regularly in my experience.
Hiring the best people and building / buying the best systems aren't enough either. You need a good 'data culture', meaning you have to have people who understand data and how to use it. Additionally, your organization needs to understand the dichotomy of data – it is both important and not important at the same time. Yes, data is important and yes, data projects are important, but without all the other things combined (people, strategy, systems, process, etc.), data is just data. Data is meaningless unless you convert it to information (and then convert it yet again into knowledge). To do that conversion, you need a company culture that gives people the freedom to turn data into information / knowledge.
So…you think you have the best strategy, people, systems, process and culture, yes? You think you've done everything right and your data projects are set up for success. I hate to tell you, but at some point your data project is still going to fail. The difference: with the right strategy, people, systems, process and culture in place, you aren't guaranteed success, but you will be in a much better position to recover from that failure.
In a recent speech, John Costello, former president of Dunkin' Donuts, is reported to have said "Big data is not a strategy…". Well…let me say that big data isn't the answer, either.
I wish I had said that sometime in the past few years. I think I've said similar things, but I haven't come right out and said those exact words (that I can recall). Again, I wish I had.
I hear people talking (and writing) about big data today. There are some folks out there that take a very common sense approach to big data, but quite a few have gone ‘ga ga’ over big data.
Blogs and articles are written describing the utopia that big data can bring to an organization. They talk about how great big data is and what it can bring. For the most part, these people are right. Big data can bring great returns on the investments in technology, systems and people…but big data isn't the answer. Big data isn't about finding answers…big data is all about finding more questions.
Big data isn't a strategy and it surely isn't the answer. Big data is just one more tool in the toolbox that an organization can use to improve.
How many of you would go skydiving without learning the precautions and safety measures necessary to keep you alive? How many of you would let your kid drive your car without first teaching them the basics of driving? While not as life-and-death as those questions, data preparation is just as important to proper data analytics as learning the basics of driving before getting behind the wheel.
Experienced data scientists will tell you data prep is (almost) everything and is where they spend the majority of their time. Blue Hill Research reports that data analysts spend at least 2 hours per day on data preparation activities. At 2 hours per day, Blue Hill estimates that it costs about $22,000 per year per data analyst to prepare data for use in data analytics activities.
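It's worth sanity-checking those figures. The working-days and hourly-cost numbers below are my assumptions, not Blue Hill's, but they show the arithmetic that gets you to roughly $22,000 per analyst per year:

```python
# Back-of-the-envelope check of the Blue Hill estimate.
hours_per_day = 2          # from the report: 2 hours/day on data prep
workdays_per_year = 250    # assumption: a typical working year
hourly_cost = 44           # assumption: implied loaded cost of analyst time

annual_prep_hours = hours_per_day * workdays_per_year   # 500 hours/year
annual_prep_cost = annual_prep_hours * hourly_cost      # $22,000/year
```

Whatever the exact hourly rate at your company, 500 hours a year per analyst is a lot of time spent before any analysis happens.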
One of the reasons that prep takes up so much time is that it is generally a very manual process. You can throw tons of technology and systems at your data, but the front-end of the data analytics workflow is still largely done by hand. There are automated tools available to help with data preparation, but this step in the process remains mostly manual.
Data preparation is important. But…what exactly is it?
The Importance of Data Preparation
Data prep is really nothing more than making sure your data meets the needs of your plans for that data. Data needs to be high quality, describable, in a format that is easily used in future analysis and accompanied by some context.
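To show what that looks like in practice, here is a minimal data-preparation sketch in plain Python. The records and rules are hypothetical; the steps (normalizing values, casting types, dropping incomplete rows) are the kind of unglamorous work data prep actually involves.

```python
# Hypothetical raw export: inconsistent whitespace, strings instead of
# numbers, and a row with a missing value.
raw_records = [
    {"region": " East ", "revenue": "1200"},
    {"region": "West", "revenue": "950"},
    {"region": "East", "revenue": ""},  # missing revenue -> dropped
]

def prepare(records):
    """Normalize text fields, cast revenue to float, drop incomplete rows."""
    cleaned = []
    for row in records:
        region = row["region"].strip()   # normalize stray whitespace
        revenue = row["revenue"].strip()
        if not revenue:                  # skip rows missing revenue
            continue
        cleaned.append({"region": region, "revenue": float(revenue)})
    return cleaned

clean = prepare(raw_records)
```

Real pipelines add many more rules (deduplication, date parsing, outlier checks), but they are all variations on this same pattern.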
There are tons of ways to 'do' data preparation. You can use databases, scripts, data management systems or just plain old Excel. In fact, according to Blue Hill, 78% of analysts use Excel for the majority of their data preparation work. Interestingly, 89% of those same analysts claim that they use Excel for the majority of their entire data analytics workflow.
As I mentioned before, there are some tools / systems out there today to help with data prep, but they are still in their infancy. One of these companies, Paxata, is doing some very interesting stuff with data preparation, but I think we are a few years off before these types of tools become widespread.
Data preparation is integral to successful data analytics projects. To do it right takes a considerable amount of time and can often consume the majority of a data analyst's day. Whether you use Excel, databases or a fancy system to help you with data prep, just remember how important it is.
If you don't prepare your data correctly, your data analytics may fail miserably. The old saying of "garbage in, garbage out" definitely applies here.
How focused are you on data preparation within your organization?
You've collected tons of data. You've got terabytes and terabytes of data. You are happy because you've got data. But what are you going to do with that data? You'll analyze it, of course. But how are you going to analyze it and what are you going to do with that analysis? How does data analytics come into play?
Will you use your data to predict service outages or will you use your data to describe those service outages? Your answer to the ‘how’ and the ‘what’ questions are important to the success of your big data initiatives.
Two different approaches to Data Analytics
There are two basic approaches to data analytics – descriptive and prescriptive. Some folks out there might add a third type called 'predictive', but I feel like predictive and prescriptive build on one another (prescriptive requires predictive) – so I tend to lump prescriptive and predictive analytics together while others keep them separated.
Let’s dig into the two different types of analytics.
Descriptive analytics are pretty much what they sound like: you 'describe' and summarize the data. Using aggregation, filtering and simple or complex statistical methods, data is described with counts, means, sums, percentages, min / max values and other descriptive values to help you (and others) understand it. Descriptive analytics can tell you what has happened or what is happening now.
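Descriptive analytics in miniature might look like this. The daily service-outage counts below are made up, but the summary values are exactly the counts, sums, means and min / max figures described above:

```python
import statistics

# Hypothetical data: service outages per day over one week.
outages = [3, 1, 4, 1, 5, 9, 2]

summary = {
    "count": len(outages),              # how many observations
    "total": sum(outages),              # total outages for the week
    "mean": statistics.mean(outages),   # average per day
    "min": min(outages),                # best day
    "max": max(outages),                # worst day
}
# summary tells you what has happened; it makes no claim about what's next
```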
Prescriptive analytics are based on modeling data to understand what could happen and, eventually, to recommend what the next step should be based on previous steps taken. Using data modeling, machine learning and complex statistical methods, analysts can build models to forecast possible outcomes (e.g., forecasting inventory levels in a store). From that model, additional data can be fed back in (i.e., a feedback loop) to build a prescriptive model that helps users determine what to do given a particular forecast and/or action that occurs. Prescriptive analytics can help you understand what might happen as well as help you decide how to react.
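Here is a toy predictive-to-prescriptive pipeline along the lines of the inventory example above. All the numbers, the reorder threshold and the decision rule are hypothetical; the point is the shape of the pipeline: fit a trend, forecast, then layer a decision on top of the forecast.

```python
# Hypothetical weekly inventory levels, trending downward.
weeks = [1, 2, 3, 4, 5]
inventory = [100, 92, 85, 77, 70]

# Predictive step: fit a least-squares trend line by hand.
n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(inventory) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, inventory))
    / sum((x - mean_x) ** 2 for x in weeks)
)
intercept = mean_y - slope * mean_x

forecast = intercept + slope * 6      # projected stock level for week 6

# Prescriptive step: a decision rule on top of the forecast.
REORDER_POINT = 75                    # assumption: a business-set threshold
action = "reorder" if forecast < REORDER_POINT else "hold"
```

Real prescriptive systems use far richer models and feedback loops, but the structure is the same: the forecast alone is predictive; the recommended action is what makes it prescriptive.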
Both approaches to data analytics are important. You must use descriptive analytics to understand your data. To make that data useful, you should use prescriptive (and/or predictive) analytics to understand how changes within your dataset can change your business.
To use the 'data -> information -> knowledge' construct, descriptive analytics gets you some information, while prescriptive (and/or predictive) analytics gets you into the realm of knowledge.
Are you Descriptive or Prescriptive?
In my experience, most people today are stuck in the descriptive analytics realm. They are filtering, measuring and analyzing terabytes of data. They understand their data better than anyone ever has, and they can point to new measures and knowledge gained from this analysis. That said, they are missing out on quite a lot of value by not diving into prescriptive (and/or predictive) analytical approaches.
When I run across 'data scientists' (using the term liberally), I always ask about modeling, forecasting and decision-support techniques. Many (most) look at me like I'm crazy. They then steer the conversation back toward the things they know: filtering, averaging, analyzing, describing data. They can tell me things like 'average social shares' and 'click-through rates', but not much more than that. These are absolutely necessary and good pieces of information for an organization to have, but until they are put into action and used for something, they won't turn into 'knowledge.'
Prescriptive analytics is much more involved than descriptive analytics. Just about anyone can do descriptive analytics with the right tools. Prescriptive analytics is where you find the real, tried-and-true data scientists. They are the ones building models, testing those models and then putting those models into use within an organization to help drive business activity.
If you are ‘doing’ big data and are only doing descriptive analytics, you aren’t seeing the entire value of big data and data analytics. You need to find a way to move into prescriptive (and/or predictive) analytics quickly.