Beware the Models

“But…all of our models have accuracies above 90%…our system should be working perfectly!”

Those were the words spoken by the CEO of a mid-sized manufacturing company during a conversation about their various forecasting models and the poor performance of those models.

This CEO had spent about a million dollars over the last few years with a consulting company that had been tasked with creating new methods and models for forecasting sales and manufacturing. Over the previous decade, the company had done very well for itself using a very manual, instinct-driven process to forecast sales and the manufacturing processes needed to ensure sales targets were met.

About three years ago, the CEO decided they needed to take advantage of the large amount of data available within the organization to help manage the organization’s various departments and businesses.

As part of this initiative, a consultant from a well-known consulting organization was brought in to help build new forecasting models. These models were developed with many different data sets from across the organization and – on paper – they looked really good. The presentations of these models included the ‘right’ statistical measures to show accuracies anywhere from 90% to 95%.

The models, their descriptions and the nearly 300 pages of documentation about how these new models would help the company make many millions of dollars over the coming years weren’t doing what they were designed to do. The results of the models were far removed from the reality of this organization’s real-world sales and manufacturing processes.

Due to the large divergence between model and reality, the CEO wanted an independent review of the models to determine what wasn’t working and why.  He reached out to me and asked for my help.

You may be hoping that I’m about to tell you what a terrible job the large, well known consultants did.  We all like to see the big, expensive, successful consulting companies thrown under the bus, right?

But…that’s not what this story is about.

The moral of this story? Just because you build a model with better than average accuracy (or even one with great accuracy), there’s no telling what that model will do once it meets the real world. Sometimes, models just don’t work. Or…they stop working. Even worse, sometimes they work wonderfully for a little while only to fail miserably some time in the near future.

Why is this?

There could be a variety of reasons. Here are a few that I see often:

  • It could be data mining: building a model based on a biased view of the data.
  • It could be poor data management that allows poor-quality data into the modeling process. Models built on poor-quality data are poor-quality models, even if their accuracy (measured against that same poor input data) looks good.
  • It could be a poor understanding of the modeling process. There are a lot of ‘data scientists’ out there today who have very little understanding of what the data analysis and modeling process should look like.
  • It could be – and this is worth repeating – that sometimes models just don’t work. You can do everything right and the model still can’t perform in the real world.
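To make that last point a bit more concrete, here is a minimal sketch (entirely hypothetical, fabricated sales data) of a check I wish more forecasting teams ran: score the model on the data it was trained on, then score it again on a later, held-out period. A large gap between the two numbers is often the first sign that a "95% accurate" model won't survive contact with the real world.

```python
# A minimal sketch (hypothetical data) showing why in-sample "accuracy" can mislead.
# We train a flexible model on the first 4 years of monthly sales, then score it
# on the final year. A big gap between the two scores is a warning sign.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(42)

# Fake monthly sales: trend + seasonality + noise (a stand-in for real data)
months = np.arange(60)
sales = 100 + 2 * months + 15 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 10, 60)

X = months.reshape(-1, 1)
train, test = slice(0, 48), slice(48, 60)   # last 12 months held out, out-of-time

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[train], sales[train])

in_sample = 1 - mean_absolute_percentage_error(sales[train], model.predict(X[train]))
out_of_time = 1 - mean_absolute_percentage_error(sales[test], model.predict(X[test]))

print(f"'Accuracy' on data the model has already seen: {in_sample:.1%}")
print(f"'Accuracy' on the next 12 months:              {out_of_time:.1%}")
```

The exact numbers don't matter; the pattern does. The model looks terrific on the data it was built from and noticeably worse on the months it has never seen, which is exactly the gap between 'the presentation' and 'reality' described above.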

Beware the models. Just because they look good on paper doesn’t mean they will be perfect (or even average) in the real world.  Remember to ask yourself (and your data / modeling teams) – are your models good enough?

Modeling is both an art and a science. You can do everything right and still get models that will make you say ‘meh’ (or something far less polite). That said, as long as the modeling process is approached correctly and the ‘science’ in data science isn’t forgotten, the outcome of analysis / modeling initiatives should at least provide some insight into the processes, systems and data management capabilities within an organization.

 

Big Data Roadmap – A roadmap for success with big data

I’m regularly asked about how to get started with big data. My response is always the same: I give them my big data roadmap for success. Most organizations want to jump in and do something ‘cool’ with big data. They want to do a project that brings in new revenue or adds some new / cool service or product, but I always point them to this roadmap and say ‘start here’.

The big data roadmap for success starts with the following initiatives:

  • Data Quality / Data Management systems (if you don’t have these in place, that should be the absolute first thing you do)
  • Build a data lake (and utilize it)
  • Create self-service reporting and analytical systems / processes.
  • Bring your data into the line-of-business.

These are fairly broad initiatives, but they are general enough for any organization to find some value in them.

Data Management / Data Quality / Data Governance

First of all, if you don’t have proper data management / data quality / data governance, fix that. Don’t do anything else until you can say with absolute certainty that you know where your data has been, who has touched your data and where that data is today. Without this first step, you are playing with fire when it comes to your data. If you aren’t sure how good your data is, there’s no way to really understand how good the output is of whatever data initiative(s) you undertake.
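Proper data governance is a process and an organizational commitment, not a script. Still, as a small illustration of what "knowing your data" can look like in day-to-day practice, here is a hedged sketch (hypothetical file and column names) of the kind of automated checks that can sit at the front of a data pipeline.

```python
# A minimal sketch of automated data quality checks (hypothetical file/column names).
# This is not data governance by itself; it just illustrates verifying data before
# it ever reaches a model or a report.
import pandas as pd

df = pd.read_csv("monthly_sales.csv", parse_dates=["order_date"])  # hypothetical source

checks = {
    "rows present":            len(df) > 0,
    "no duplicate order ids":  df["order_id"].is_unique,
    "no missing sale amounts": df["sale_amount"].notna().all(),
    "amounts are positive":    (df["sale_amount"] > 0).all(),
    "dates not in the future": (df["order_date"] <= pd.Timestamp.today()).all(),
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
print("All basic data quality checks passed.")
```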

Build a data lake (and utilize it)

I cringe any time I (or anyone else) say/write ‘data lake’ because it reminds me too much of the data warehouse craze that took CIOs and IT departments by storm a number of years ago. That said, data lakes are valuable (just like data warehouses were/are valuable), but it isn’t enough to just build a data lake…you need to utilize it. Rather than just being a large data store, a data lake should store data and give your team(s) the ability to find and use the data in the lake.

Create self-service reporting and analytical systems / processes.

Combined with the initiative below or implemented separately, developing self-service access to and reporting on your data can free up your IT and analytics staff. Your organization will be much more efficient if any member of the team can build and run a report rather than waiting for a custom report to be created and executed for them. This type of project might feel a bit like ‘dashboards’ but it should be much more than that – your people should be able to get into the data, see the data, manipulate the data and then build a report or visualization based on those manipulations. Of course, you need a good data governance process in place to ensure that the right people can see the right data.

Bring your data into the Line of Business

This particular initiative can be (and probably should be) combined with the previous one (self-service), but it still makes sense to focus on it in its own right. By bringing your data into the line of business, you are getting it closer to the people who best understand the data and the context of the data. By bringing data into the line of business (and providing the ability to easily access and utilize that data), you are exponentially growing the data analysis capabilities of your organization.

Big Data Roadmap – a guarantee?

There are no guarantees in life, but I can tell you that if you follow this roadmap you will have a much better chance at success than if you don’t. The key here is to ensure that your ‘data in’ isn’t garbage (hence the data governance and data lake aspects) and that you get as much data as you can into the hands of the people who understand the context of that data.

This big data roadmap won’t guarantee success, but it will get you further up the road toward success than you would have been without it.

 

Are your machine learning models good enough?

Imagine you’re the CEO of XYZ Widget company. Your Chief Marketing Officer (CMO), Chief Data Officer (CDO) and Chief Operations Officer (COO) just finished their quarterly presentations highlighting the successes of the various machine learning projects that have been in the works. After the presentations were complete, you begin to wonder – ‘are these machine learning models good enough?’

You’ve invested a significant portion of your annual budget on big data and machine learning projects and based on what your CMO and CDO tell you, things are looking really good. For example, your production and revenue forecasting projects are both delivering some very promising results with recent forecasts being within 2% of actual numbers.

You don’t really understand any of the machine learning stuff though. It seems like magic to you but you trust that the people doing the work understand it and are doing things ‘right’. That said, you have a feeling deep down that something isn’t quite right.  Sure, things look good but just like magic – the output of these machine learning initiatives could just be an illusion.

Are these machine learning models good enough? — Getting past the illusion

While machine learning, deep learning and big data can provide an enormous amount of value to an organization, there is ample opportunity to mess things up dramatically. There are plenty of times where small errors (and even massive errors) can be introduced into the process. For example, during the data munging / exploration phase, a simple error can introduce changes in the data, which could cause massive changes in the results of any modeling.
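As a hedged illustration of how a simple preparation step can distort results, here is a minimal, entirely synthetic sketch of one classic munging mistake: selecting "useful" features using all of the data (including the rows later used for testing). The leaky version quietly borrows information from the evaluation data, and pure noise suddenly looks predictive.

```python
# A minimal sketch of how one data-prep mistake can distort results:
# selecting features using ALL rows (train + test) leaks information,
# so pure random noise suddenly looks predictive. Synthetic data throughout.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))      # 5,000 features of pure noise
y = rng.integers(0, 2, size=200)      # random labels: there is nothing to learn

# WRONG: pick the 20 "best" features using the full dataset, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# RIGHT: do the feature selection inside each cross-validation fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Accuracy with leaky preparation: {leaky_score:.0%}")  # inflated well above chance
print(f"Accuracy done properly:          {honest_score:.0%}") # roughly a coin flip
```

Nothing about the model changed between the two runs; only the order of the preparation steps did. That is the kind of small, easy-to-miss error the paragraph above is warning about.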

Additionally, bias can easily be introduced to the process (either on purpose or by accident). This bias can push the results to tell a story that people want the data / models to tell. It is very easy to fall into the “let’s use statistics to support our view” trap. Rather than looking for data and/or outputs to support your view (and hence building an illusion), your machine learning initiatives (and any other data projects) should be as bias free as possible.

When done right, there’s very little ‘illusion’ in machine learning. The results are the results, just like the data is the data. You either find answers to your questions (and hopefully find more questions) or you don’t. The results may not be what you wanted to see, but they are what they are…and this is exactly why you need to be able to trust the process that was used to find those results. You need to understand if (and where) bias was introduced. You need to understand the process in general.

Can your team describe how the data was gathered and cleaned? Were the models used in the process optimized and/or overfit? Can your team explain their rationale for doing what they did? Your forecasting models are within 2% of actual numbers in recent months, but that doesn’t mean your models are well built and will hold up over time…it could just mean they are overfit and are doing well with numbers very similar to what you’ve already given your machine learning algorithm. What do your models really show for things like R-Squared and Mean Absolute Error (MAE)? Do you understand why R-Squared and MAE are important? If not, your teams need to make sure these are explained in general terms, along with why they matter (the short sketch below shows what each one actually measures).
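To ground those two terms, here is a minimal sketch with made-up forecast numbers: MAE is the average size of the misses, expressed in the original units, and R-Squared is the share of the variation in the actuals that the forecast explains (1.0 is perfect; 0 is no better than always predicting the average).

```python
# A minimal sketch of the two metrics mentioned above, using made-up forecast numbers.
from sklearn.metrics import mean_absolute_error, r2_score

actual_units   = [1020, 980, 1150, 1275, 1100, 990]   # what really happened
forecast_units = [1000, 1010, 1120, 1200, 1180, 1020] # what the model predicted

mae = mean_absolute_error(actual_units, forecast_units)  # average miss, in units
r2  = r2_score(actual_units, forecast_units)             # 1.0 is perfect, 0 is "predict the mean"

print(f"MAE: {mae:.0f} units per period")
print(f"R-Squared: {r2:.2f}")
```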

You don’t have to become an expert

It takes time and a willingness to ‘get your hands dirty’ to get anywhere close to being an expert in machine learning. Most business leaders don’t need to become experts, but if you spend a little time understanding the basics and the process that your team follows, it might help remove the ‘magic’ aspect associated with machine learning.

My suggestion is to spend some time talking to your team(s) about the following topics to get a basic understanding of the three main steps / processes in machine learning. Below, I’ve outlined the three main areas and included some questions for you to consider. Note: this isn’t a definitive list of questions / areas, but it’ll get you started.

Data Gathering / Preparation / Cleaning

  • How was the data gathered?
  • What data quality measures / methods were undertaken to ensure the data’s accuracy and provenance?
  • What steps were taken to clean / prepare the data?
  • How is new data being gathered / cleaned / prepared for inclusion in existing / new models?
  • Who has access to the data?

Modeling

  • Why was the model (or models) chosen?
  • Were other models considered? If so, why weren’t they used?
  • Did you ‘build your own’ or use existing libraries to build the model?
  • Were the proper data preparation steps taken for the model(s) selected?

Evaluation & Interpretation of Results

  • How do you know the model is ‘good enough’?
  • When and why did you stop iterating on the model / data?
  • What accuracy measures are you using for the model(s)?
  • Are we sure the model isn’t overfitting the data? How do we know? (See the sketch after this list.)
  • Why were the visualizations that are presented chosen? (Note: the use or non-use of certain visualizations can be a tip-off that something isn’t right about the data / model.)
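For the overfitting question in particular, one quick check worth asking your team about is comparing the model’s score on the data it was trained on with its score under cross-validation. A minimal, hypothetical sketch:

```python
# A minimal sketch of one overfitting check: compare the training score with
# the cross-validated score. A large gap suggests the model has memorized the
# training data rather than learned something general. Synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X[:, 0] * 3 + rng.normal(scale=2, size=300)   # one real signal plus noise

model = DecisionTreeRegressor()                   # unconstrained tree: prone to overfit
model.fit(X, y)

train_r2 = model.score(X, y)                                      # scored on seen data
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()   # scored on held-out folds

print(f"R-Squared on training data:       {train_r2:.2f}")  # near 1.00
print(f"R-Squared under cross-validation: {cv_r2:.2f}")     # noticeably lower
```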

Again – these aren’t meant to be a definitive list of questions / topical areas for you to consider, but they should get you started asking good questions of your team. I particularly love to ask the ‘How do you know the model is good enough?’ question because it sheds a lot of light on the entire process and the mental approach to the problem.

Are these machine learning models good enough?

The answers to the above questions should help you get a better feel for how your team(s) approached the issue at hand and help you (and the rest of your leadership team) understand the approach to data preparation, modeling and evaluation in your machine learning initiatives.

The above questions and answers might not specifically answer the ‘are your machine learning models good enough’ question, but they will get you and your team(s) to a point where they are constantly thinking about whether ‘good enough’ is enough. Sometimes it is; other times it isn’t. That’s why you need to understand a bit more about the process to understand whether good enough is good enough.

Of course, if you need help trying to understand all this stuff…you can always hire me to help. Give me a call or drop me an email and let’s discuss your needs.

Deep learning – when should it be used?

“When should I use deep learning?”

I get asked that question constantly.

The answer to this question is both complicated and simple at the same time.

The answer I usually give is something along the lines of ‘if you have a lot of data and an interesting / challenging problem, then you should try out deep learning’.

How much is ‘a lot of data’? That’s the complicated part.

Let’s use some examples to try to clarify things.

  • If you have 5 years of monthly sales data and want to use deep learning to build a forecaster, you’ll most likely be wasting your time. Deep learning will work technically, but it generally won’t give you much better results than simpler machine learning or even simpler regression techniques.
  • If you have 20 years of real estate sales data with multiple features (e.g., square footage of the house, location, comparables, etc.) and want to try to predict sales prices within a neighborhood/state/country, then deep learning is definitely an approach to take. This is a wonderful use case for deep learning.
  • If you want to build a forecaster to help develop a budget for your organization, maybe deep learning is a good approach…and maybe it isn’t.
  • If you want to build a “Hotdog Not Hotdog” app, deep learning is the right approach.
  • If you want to forecast how many widgets you’ll need to build next year with the previous 10 years of data, I’d recommend going with regression first and then moving into some basic machine learning techniques. Deep learning (e.g., neural networks) could work here, but it might not make a lot of sense depending on the size of the data.
  • If you want to predict movements in the stock market using the last 100 years of stock market data combined with hundreds of technical and/or fundamental indicators, deep learning could be a good approach, but a good machine learning methodology would work as well. Just be careful of data mining and other biases you can introduce when working with time series data like the markets.

Ok. So I didn’t exactly clarify things, but hopefully you get the point.

Andrew Ng uses this graphic to highlight where deep learning makes sense (from his Deeplearning.ai Coursera course):

[Figure: Deep Learning vs other approaches – model performance as the amount of data grows]
The various lines on the chart are different approaches (regression, machine learning, deep learning) with the ‘standard’ approaches of regression and machine learning shown in red/orange.  You can see from the left of the chart that these types of approaches are similar performance-wise with small data sets.

Deep learning really begins to diverge in performance when your data set starts to get sufficiently large. The ‘problem’ here is that ‘sufficiently large’ is hard to define. That’s why I usually tell people to start with the basics first: try out regression, then move to machine learning (Random Forest, SVMs, etc.) and then – once you have a feel for your data AND the performance of your approach isn’t delivering the results you expected – try out deep learning.
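As a hedged illustration of that ‘basics first’ progression (hypothetical, synthetic housing-style data and made-up feature names), the sketch below fits a plain linear regression and then a Random Forest on the same data. If the simpler models already deliver acceptable numbers, deep learning may not earn its added complexity.

```python
# A minimal sketch of the "basics first" progression: try linear regression,
# then a standard machine learning model, and only reach for deep learning if
# neither delivers the performance you need. Synthetic housing-style data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
square_feet = rng.uniform(600, 4000, n)
bedrooms = rng.integers(1, 6, n)
age_years = rng.uniform(0, 80, n)
price = 50_000 + 150 * square_feet + 10_000 * bedrooms - 500 * age_years + rng.normal(0, 25_000, n)

X = np.column_stack([square_feet, bedrooms, age_years])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

for name, model in [("Linear regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    model.fit(X_train, y_train)
    print(f"{name}: R-Squared on held-out data = {model.score(X_test, y_test):.2f}")
```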

That said, there are obviously times when deep learning makes sense initially. When you are looking at things like machine vision, natural language processing, autonomous driving, or real-time text translation, you want to investigate deep learning right away. Additionally, you can use deep learning approaches for any problem you want to, but the performance is best when you have a large data set.

So…when should you consider deep learning? It depends on the challenge you are trying to solve. Sorry…there’s no ‘easy’ answer to the question.

 

When it comes to big data, think these three words: analyze; contextualize; internalize

If you don’t know, I’m a bit of a data nerd. I’ve been writing about big data, data science, machine learning and other ‘new’ stuff for years. I believe in data science and I believe in big data. I’m a fan of machine learning (though I think you probably don’t need it for the majority of problems that most organizations run across).

So…with that in mind…let me say this: big data and data science are nothing new. Everyone is talking about big data, machine learning, artificial intelligence and data science like these things are ‘brand new’ to the world, but they aren’t. All of these ‘buzzword bingo’ candidates have been around for years…think 50+ years in one form or another. It’s wonderful to see the buzz around them these days since we finally have the computing power to actually implement some of these ideas in a much more scalable way.

That said…don’t let scalable fool you into thinking that all you need to do is ‘scale’ and things will be hunky-dory.  The ability to scale to handle larger problems and larger data-sets is extremely important, but without the very basics of data science and applied statistics, all your big data / machine learning / AI projects aren’t going to be as valuable to you / your organization as you might hope.

According to IBM, we now generate 2.5 quintillion bytes of data per day. What are we doing with all that data? Surely it isn’t all being used by good data scientists to build new models, generate revenue and deliver actionable insights to organizations? I know for a fact it isn’t, although there are plenty of companies that are taking advantage of that data (think Google and Facebook). I once wrote that ‘today we are drowning in data and starved for information’ (a small change to John Naisbitt’s 1982 masterpiece Megatrends, in which he wrote ‘we are drowning in information and starved for knowledge’).

We are working with enormous data sets today, and there’s no reason to think these data sets won’t continue to get larger. But the size of your data isn’t necessarily what you should be worried about. Beyond the important basics (data quality, data governance, etc.) – which, by the way, have very little to do with data ‘size’ – the next most important aspect of any data project is the ability to analyze data and create some form of knowledge from that data.

When I talk to companies about data projects, they generally want to talk about technologies and platforms first, but that’s the wrong first step. Those discussions are needed, but I always tell them not to get hung up on Spark, Hadoop, MapReduce or other technologies / approaches. I push them to talk about whether they and their organization have the right skills to analyze, contextualize and internalize whatever data they may have. By having the ability to analyze, contextualize and internalize, you add meaning to data, which is how you move from data to knowledge.

To do this work, organizations need people with statistical skills as well as development skills who can take whatever data you have and infer something from it. We need these types of skills more so than we need the ability to spin up Hadoop clusters. I know 25 people I could call tomorrow to turn up big data infrastructure that could handle the largest of the large data sets…but I only know a handful of people I would feel comfortable calling and asking to “find the insights in this data set”, trusting that they have all the skills (technical, statistical AND soft skills) to do the job right.

Don’t forget, there IS a science to big data (ahem…it IS called data science after all). This science is needed to work your way up the ‘data -> information -> knowledge’ ladder. By adding context to your data, you create information. By adding meaning to your information, you create knowledge. Technology is an enabler for data scientists to add context and meaning, but it is still up to the individual to do the hard work.

Don’t get me wrong, the technical skills for these types of systems are important. Data scientists need to be able to code and use whatever systems are available to them, but the real work and the value come from creating information and knowledge from data. That said, you can’t work your way up the ‘data -> information -> knowledge’ ladder without being able to understand and contextualize data, and technology (generally) can’t do those very important steps for you (although with artificial intelligence, we may get there someday).

Stop thinking about the technologies and buzzwords.  Don’t think ‘Spark’, ‘python’, ‘SAS’ or ‘Hadoop’…think ‘analyze’ and ‘contextualize.’ Rather than chasing new platforms, chase new ways to ‘internalize’ data. Unless you and your team can find ways to analyze, contextualize and internalize data, your ability to make a real business impact with big data will be in jeopardy.

Data and Culture go hand in hand

A few weeks ago, I spent an afternoon talking to the CEO of a mid-sized services company. He’s interested in ‘big data’ and is interviewing consultants / companies to help his organization ‘take advantage of their data’. In preparation for this meeting, I had spent the previous weeks talking to various managers throughout the company to get a good sense of how the organization uses and embraces data. I wanted to see how well data and culture mixed at this company.

Our conversation started out like these conversations always do. He started asking me about big data, how big data can help companies and what big data would mean to their organization. As I always do, I tried to provide a very direct, non-sales-focused message to the CEO about the pros/cons of big data, data science and what it means to be a data-informed organization.

This particular CEO stopped me when I started talking about being ‘data-informed’. He described his organization as being a ‘data-driven company!’ (the exclamation was implied in the forcefulness of his comment). He then spent the next 15 minutes describing his organization’s embrace of data. He described how they’ve been using data for years to make decisions and that he’d put his organization up against any other when it comes to being data-driven. He showed me sales literature that touts their data-driven culture and described how they were one of the first companies in their space to really use data to drive their business.

After this CEO finished exclaiming the virtues of his data-driven organization, I made the following comment (paraphrasing of course…but this is the gist of the comment):

“You say this is a data-driven organization…but the culture of this organization is not one that I would call data-driven at all. Every one of your managers tells me most decisions in the organization are made by ‘gut feel’. They tell me that data is everywhere and is used in making decisions, but only after the decision has been made. Data is used to support a decision rather than to inform it. There’s a big difference between that and being a data-informed (let alone a data-driven) organization.”

After what felt like much more than the few seconds it was, the CEO smiled and asked me to help him understand ‘just what in the hell I was talking about’.

What am I talking about?

I’m talking about the need to view data as more than just a supporting actor in the theatrical play that is your business. Data must go hand-in-hand with every initiative your organization undertakes. There are some folks out there who argue that you need to build a data-driven culture, but that’s a hard thing to sell to most people, simply because they don’t really understand what a ‘data-driven’ culture is.

So…what is a ‘data-driven culture’? If you ask 34 experts on the subject, you’ll get 34 different explanations. I suspect if you ask another 100 experts, you’ll get 100 additional answers. Rather than trying to be a data-driven culture, it’s much better to integrate the idea of data into every aspect of your culture. Rather than trying to create a new culture that nobody really understands (or can define), work on tweaking the culture you have into one that embraces data and the intelligent use of data.

This is what happens when you start moving toward being a data-informed organization. Rather than using data to provide reasons for decisions you’ve already made, you need to incorporate data into your decision-making process. Data needs to be used by your people (an important point…don’t forget about the people) to make decisions. Data needs to be a part of every activity in the organization, and it needs to be available to anyone within the organization. This is where a good data governance / data management system/process comes into play.

During my meeting with the CEO, I spent about two hours walking through the topics of data and culture. We touched on many different topics in our conversation but always seemed to come back around to him not understanding how his organization isn’t “data-driven”. He truly believed that he was doing the right things a company needs to do to be ‘data-driven’. I couldn’t argue that he wasn’t doing the right things, but I did point out that data was treated as an afterthought in every conversation I had with his leadership team.

Data and culture go hand in hand

Since that meeting, the CEO has called me a few times and we’ve talked through some plans for bringing data to the forefront of his organization. This type of work is quite different than the ‘big data’ work that the CEO had originally wanted to talk about. There’s no reason not to continue down the path of implementing the right systems, processes and people to build a great data science team within the company, but to get the most from this work, it’s best to also take a stab at tweaking your culture to ensure data is embraced and not just tolerated.

A culture that embraces data is one that ensures data is available from the CEO down to the most junior of employees.  This requires not only cultural change but also systematic changes to ensure you have proper data governance and data management in place.

Data science, big data and the whole world those words entail are much more than just something you install and use. It’s a shift from a culture focused on making decisions by gut feel and using data to back those decisions up, to one that intuitively uses data throughout the decision-making process, including starting with data to find new factors to make decisions on.

What about your organization? Do data and culture go hand in hand, or are you trying to force data into a culture that doesn’t understand or embrace it?