Machine learning risks are real. Do you know what they are?

Machine Learning Risks are real and can be very dangerous if not managed / mitigated.

Everyone wants to ‘do’ machine learning and lots of people are talking about it, blogging about it and selling services and products to help with it. I get it…machine learning can bring a lot of value to an organization – but only if that organization knows the associated risks.

Deloitte splits machine learning risks into 3 main categories: Data, Design & Output. This isn’t a bad categorization scheme, but I like to add an additional bucket in order to make a more nuanced argument about machine learning risks.

My ‘big’ machine learning risks fall into these four categories:

Bias – Bias can be introduced in many ways and can cause models to be wildly inaccurate.
Data – Not having enough data and/or having bad data can bring enormous risk to any modeling process, but really comes into play with machine learning.
Lack of Model Variability (aka over-optimization) – You’ve built a model. It works great. You are a genius…or are you?
Output interpretation – Just like any other type of modeling exercise, how you use and interpret the model can be a huge risk.

In the remainder of this article, I spend a little bit of time talking about each of these categories of machine learning risks.

Machine Learning Risks

Bias

One of the things that naive people argue as a benefit of machine learning is that it will be an unbiased decision maker / helper / facilitator. This couldn’t be further from the truth.

Machine learning models are built by people. People have biases whether they realize it or not. Bias exists and will be built into a model. Just realize that bias is there and try to manage the process to minimize that bias.

Cathy O’Neil argues this very well in her book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Now, I’m not a huge fan of the book (the book is a bit too politically bent and there are too many uses of the words ‘fair’ and ‘unfair’…who’s to judge what is fair?) but there are some very good arguments about bias that are worth the time to read.

In addition to the bias that might be introduced by people, data can be biased as well. Bias that’s introduced via data is more dangerous because it’s much harder to ‘see’, but it is easier to manage.

For example, assume you are building a model to understand and manage mortgage delinquencies. You grab some credit scoring data and build a model that predicts that people with good credit scores and a long history of mortgage payments are less likely to default. Makes sense, right? But…what if a portion of those people with good credit scores had mortgages that were supported in some form by tax breaks or other benefits, and those benefits expire tomorrow? What happens to your model if those tax breaks go away? I’d put money on the fact that your model isn’t going to be able to predict the increase in defaults that is probably going to happen.
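Here’s a rough sketch of that mortgage scenario in Python. Every number in it is made up for illustration (the group sizes, the default rates and the jump after the benefit expires are all assumptions), but it shows how a model that only ever saw credit scores carries yesterday’s default rate into a world where the hidden tax break is gone.

```python
# Hypothetical good-credit borrowers, split by a factor the model never saw.
borrowers = {"with_benefit": 4000, "without_benefit": 6000}

# Assumed default rates while the tax benefit is in place (the training data).
default_rate_today = {"with_benefit": 0.02, "without_benefit": 0.02}

# Assumed default rates once the benefit expires (made-up jump for one group).
default_rate_after_expiry = {"with_benefit": 0.08, "without_benefit": 0.02}


def expected_defaults(counts, rates):
    """Expected number of defaults given group sizes and per-group rates."""
    return sum(counts[group] * rates[group] for group in counts)


total_borrowers = sum(borrowers.values())

# A model trained on credit score alone only learns the blended historical rate.
learned_rate = expected_defaults(borrowers, default_rate_today) / total_borrowers

predicted = learned_rate * total_borrowers
actual = expected_defaults(borrowers, default_rate_after_expiry)

print(f"model predicts about {predicted:.0f} defaults")
print(f"after the benefit expires, expect about {actual:.0f}")
```

The model isn’t ‘wrong’ about the data it saw. The data just never told it that the benefit existed, which is exactly why you need domain experts looking over your shoulder.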

Data bias is dangerous and needs to be carefully managed. You need domain experts and good data management processes (which we’ll talk about shortly) to overcome bias in your machine learning processes.

From the mortgage example above, you can (hopefully) imagine how big of a risk bias can be for machine learning. Managing bias is a very large part of managing machine learning risks.

Data

The second risk area to consider for machine learning is the data used to build the original models as well as the data used once the model is in production. I talked a bit about data bias above but there are plenty of other issues that can be introduced via data. With data, you can have many different risks including:

Data Quality (e.g., Bad data) – do you know where your data has been, who has touched it and what the ‘pedigree’ of your data is? If you don’t, you might not have the data that you think you do.
Not enough data – you can build a great model on a small amount of data but that model isn’t going to be a very good model long-term unless all your future data looks exactly like the small amount of data you used to build it. When building models (whether they are machine learning models or ‘standard’ models), you want as much data as you can get.
Homogeneous data – similar to the ‘not enough data’ risk above, this risk comes from a lack of data – not necessarily a lack in the amount of data but a lack of variability in the data. For example, if you want to forecast home prices in a city, you probably want to get as many different data sets as you can find to build these models. Don’t use just one data set from the local tax office…who knows how accurate that data is. Find a couple of different data sets with many different types of demographic data points and then spend time doing some feature engineering to find the best model inputs for accurate outputs.
Fake Data – this really belongs in the ‘bad data’ risk, but I wanted to highlight it separately because it can be (and has been) a very large issue. For example, assume you are trying to forecast revenue and growth numbers for a large multi-national organization that has offices in North America, South America and Asia. You’ve pulled together a great deal of data including economic forecasts and built what looks to be a great model. Your organization begins planning its future business based on the outcome of this model and uses the model to help make decisions going forward. How sure are you that the economic data is real?
Data Compliance issues – You have some data…but can you (or should you) use it? Simple question but one that vexes many data scientists – and one that doesn’t have an easy answer.
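A few of the data risks above can be checked mechanically before any modeling starts. The sketch below is a minimal, hypothetical audit: the function name `audit_rows`, the `min_rows` threshold and the sample home-price records are all my own assumptions for illustration. It counts rows, exact duplicates and missing values, and flags columns that never vary (a cheap hint that your data might be too homogeneous). It won’t catch fake data or compliance problems; those still need people.

```python
def audit_rows(rows, min_rows=1000):
    """Return a small report of basic data-quality warning signs.

    `rows` is a list of dicts with hashable values; `min_rows` is an
    arbitrary, assumed threshold for "not enough data".
    """
    columns = sorted({key for row in rows for key in row})
    report = {"row_count": len(rows), "too_few_rows": len(rows) < min_rows}

    # Count exact duplicate rows (same value in every column).
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple((col, row.get(col)) for col in columns)
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    report["duplicate_rows"] = duplicates

    # Count missing values per column (None or an absent key).
    report["missing"] = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }

    # Columns with at most one distinct non-missing value carry no variability.
    report["constant_columns"] = [
        col
        for col in columns
        if len({row.get(col) for row in rows if row.get(col) is not None}) <= 1
    ]
    return report


# Made-up home-price records, all from a single source.
sample = [
    {"price": 250000, "zip": "78701", "source": "tax_office"},
    {"price": 250000, "zip": "78701", "source": "tax_office"},
    {"price": None, "zip": "78702", "source": "tax_office"},
    {"price": 410000, "zip": "78703", "source": "tax_office"},
]
report = audit_rows(sample)
print(report)
```

In this toy sample, the `source` column gets flagged as constant, which is the ‘just one data set from the local tax office’ problem showing up in code.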

Lack of Model Variability (aka over-optimization)

You spend weeks building a model. You train it and train it and train it. You optimize it and get an outstanding measure for accuracy. You’re going to be famous. Then…the real data starts hitting the model. Your accuracy goes into the toilet. Your model is worthless.

What happened? You over-optimized. I see this all the time in the financial markets when people try to build a strategy to invest in the stock market. They build a model strategy and then tweak inputs and variables until they get some outrageous accuracy numbers that would make them millionaires in a few months. But that rarely (never?) happens.

What happens is this – an investing strategy (e.g., model) is built using a particular set of data. The inputs are tweaked to give the absolute best output without regard to variability of data (e.g., new data is never introduced). When the investing strategy is then applied to new, real world data, it doesn’t perform anywhere near as well as it did on the old tested data. The dreams of being a millionaire quickly fade as the investor watches their investing account value dwindle.
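You can watch this happen with a toy experiment. The sketch below is my own illustration, not a real trading system: the ‘market’ is a seeded coin flip (so no strategy has a genuine edge), and a ‘strategy’ is just a lookup table that predicts the next move from the previous three. I try all 256 possible tables, keep the one that looks best on a small in-sample history, and then score that same table on fresh data.

```python
import itertools
import random


def make_moves(n, rng):
    """n random market moves: True = up day, False = down day (fair coin)."""
    return [rng.random() < 0.5 for _ in range(n)]


def accuracy(table, moves, lookback=3):
    """Fraction of days where the table's prediction matches the actual move."""
    hits = 0
    total = 0
    for i in range(lookback, len(moves)):
        pattern = tuple(moves[i - lookback:i])
        hits += table[pattern] == moves[i]
        total += 1
    return hits / total


rng = random.Random(42)  # fixed seed so the run is repeatable
in_sample = make_moves(100, rng)       # the data the strategy is tuned on
out_of_sample = make_moves(5000, rng)  # new data the strategy never saw

# Every possible rule mapping the last 3 moves to a prediction: 2 ** 8 = 256.
patterns = list(itertools.product([False, True], repeat=3))
candidates = [
    dict(zip(patterns, predictions))
    for predictions in itertools.product([False, True], repeat=len(patterns))
]

# "Over-optimize": keep whichever rule scored best on the in-sample data.
best = max(candidates, key=lambda table: accuracy(table, in_sample))
in_acc = accuracy(best, in_sample)
out_acc = accuracy(best, out_of_sample)

print(f"in-sample accuracy:     {in_acc:.1%}")
print(f"out-of-sample accuracy: {out_acc:.1%}")
```

The winning rule looks like it has found something on the in-sample data, but since the moves are coin flips, all it has found is noise, and on new data it should hover around 50%.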

In the world of investing, this over-optimization can be managed with various performance measures and using a method called walk-forward optimization to try to get as much data in as many different timeframes as possible into the model.
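To give a feel for walk-forward, here is a minimal sketch, assuming the usual description of the method: optimize on one window of history, test on the window that immediately follows it, then slide everything forward and repeat. The function name `walk_forward_splits` and its parameters are hypothetical, and the sketch only produces the index windows; re-optimizing the strategy on each training window is left to you.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that roll forward in time.

    Each test window starts right after its training window, and the whole
    pair advances by `test_size` each step, so test windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


# Example: 10 time-ordered observations, train on 4, test on the next 2.
splits = [
    (list(train), list(test))
    for train, test in walk_forward_splits(n_samples=10, train_size=4, test_size=2)
]
for train, test in splits:
    print(f"train on {train} -> test on {test}")
```

The important property is that every test window sits strictly after the data used to tune the model, so the model is always judged on data it hasn’t seen.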

Similar approaches should be taken in other model building exercises. Don’t over-optimize. Make sure the data you are feeding your machine learning models is varied across data types, timeframes, demographic data sets and as many other forms of variability as you can find.

Some folks might call ‘lack of model variability’ by another name: Generalization Error. Regardless of what you call this risk…it’s a risk that exists and should be carefully managed throughout your machine learning modeling processes.

Output Interpretation

* xkcd on Machine Learning *

You spend a lot of time making sure you have good data, the right data and as much data as you can get. You do everything right and build a really good machine learning model and process. Then, your boss takes a look at it and interprets the results in a way that is so far from accurate that it makes your head spin.

This happens all the time. Model output is misinterpreted, used incorrectly and/or the assumptions that were used to build the machine learning model are ignored or misunderstood. A model provides estimates and guidance but it’s up to us to interpret the results and ensure the models are used appropriately.

Here’s an example that I ran across recently. This is a silly one and might be hard to believe – but it’s a good example to use. An organization had one of their data scientists build a machine learning model to help with sales forecasting. The model was built on the assumption that all data would be rolled up to quarterly data for modeling and reporting purposes. While I’m not a fan of down-sampling data from high to low granularity, it made sense for this particular modeling exercise.

This particular model was built on quarterly data with a fairly good mean error rate and good variance measures. Looking at all the statistics, it was a good model. The output of the model was provided to the VP of Sales who immediately got angry. He called up the manager of the data scientist and read her the riot act. He told her the reports were off by a factor of anywhere from 5 to 10 times what they should be. He was furious and shot off an email to the data team, the sales team and the leadership team decrying the ‘fancy’ forecasting techniques, declaring that the model was forecasting 10x growth over the next year and “had to be wrong!”

Turns out he had missed that the output was showing quarterly sales revenue instead of weekly revenue like he was used to seeing.
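To make that kind of unit mismatch concrete, here’s a small sketch with made-up numbers (these are not the figures from that organization). It rolls hypothetical weekly revenue up into quarterly totals, assigning each week to the quarter its start date falls in, which is a simplifying assumption on my part. The function names are mine as well.

```python
from collections import defaultdict
from datetime import date, timedelta


def quarter_label(day):
    """Label a date with its calendar quarter, e.g. 2024-Q1."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def roll_up_to_quarters(weekly_revenue):
    """Sum (week_start, revenue) pairs into totals per calendar quarter."""
    totals = defaultdict(float)
    for week_start, revenue in weekly_revenue:
        totals[quarter_label(week_start)] += revenue
    return dict(totals)


# Hypothetical: 26 weeks of flat 100,000 revenue, starting Monday 2024-01-01.
weekly = [
    (date(2024, 1, 1) + timedelta(weeks=i), 100_000.0) for i in range(26)
]
quarterly = roll_up_to_quarters(weekly)

for label, total in quarterly.items():
    # State the period explicitly so nobody reads this as a weekly number.
    print(f"{label}: {total:,.0f} (quarterly total, not weekly)")
```

Nothing grew in this example. Sales are perfectly flat, but each quarterly number is many times larger than any weekly number simply because it covers more weeks. Putting the period right on the report is a cheap way to avoid an angry email.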

Again – this is a simplistic example but hopefully it makes sense that you need to understand how a model was built, what assumptions were made and what the output is telling you before you start your interpretation of the output.

One more thing about output interpretation…a good data scientist needs to be just as good at presenting outputs, reporting on findings and communicating as they are at data manipulation and model building.

Finishing things up…

This has been a long one…thanks for reading to here. Hopefully it’s been informative. Before we finish up completely, you might be asking something along the lines of ‘what other machine learning risks exist?’

If you ask 100 data scientists, you’ll probably get as many different answers about what the ‘big’ risks are – but I’d bet that if you sit down and categorize them all, the majority of them would fall into these four categories. There may be some outliers (and I’d love to add those outliers to my list if you have some to share).

What can you do as a CxO looking at machine learning / deep learning / AI to help mitigate these machine learning risks? Like my friend Gene De Libero says: ‘Test, learn, repeat (bruises from bumping into furniture in the dark are OK).’

Go slow and go small. Learn about your data and your business’s capabilities when it comes to data and data science. I know everyone ‘needs’ to be doing machine learning / AI but you really don’t need to throw caution to the wind. Take your time to understand the risks inherent in the process and find ways to mitigate the machine learning risks and challenges.

I can help mitigate those risks. Feel free to contact me to see how I might be able to help manage machine learning risks within your project / organization.