Machine learning risks are real. Do you know what they are? 

Machine Learning Risks

Machine learning risks are real and can be very dangerous if not managed / mitigated.

Everyone wants to ‘do’ machine learning and lots of people are talking about it, blogging about it and selling services and products to help with it. I get it…machine learning can bring a lot of value to an organization – but only if that organization knows the associated risks.

Deloitte splits machine learning risks into three main categories: Data, Design & Output. This isn’t a bad categorization scheme, but I like to add an additional bucket in order to make a more nuanced argument about machine learning risks.

My list of ‘big’ machine learning risks falls into these four categories:

  1. Bias – Bias can be introduced in many ways and can cause models to be wildly inaccurate.
  2. Data – Not having enough data and/or having bad data can bring enormous risk to any modeling process, but really comes into play with machine learning.
  3. Lack of Model Variability (aka over-optimization) – You’ve built a model. It works great.  You are a genius…or are you?
  4. Output interpretation – Just like any other type of modeling exercise, how you use and interpret the model can be a huge risk.

In the remainder of this article, I spend a little bit of time talking about each of these categories of machine learning risks.


Bias

One of the things that naive people argue as a benefit of machine learning is that it will be an unbiased decision maker / helper / facilitator. This couldn’t be further from the truth.

Machine learning models are built by people. People have biases whether they realize it or not. Bias exists and will be built into a model. Just realize that bias is there and try to manage the process to minimize that bias.

Cathy O’Neil argues this very well in her book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Now, I’m not a huge fan of the book (it is a bit too politically bent and leans too heavily on the words ‘fair’ and ‘unfair’…who’s to judge what is fair?) but it makes some very good arguments about bias that are worth the time to read.

In addition to the bias that might be introduced by people, data can be biased as well. Bias that’s introduced via data is more dangerous because it’s much harder to ‘see’ – though once you find it, it is easier to manage.

For example, assume you are building a model to understand and manage mortgage delinquencies. You grab some credit scoring data and build a model that predicts that people with good credit scores and a long history of mortgage payments are less likely to default. Makes sense, right? But…what if a portion of those people with good credit scores had mortgages supported in some form by tax breaks or other benefits, and those benefits expire tomorrow? What happens to your model when those tax breaks go away? I’d put money on your model failing to predict the jump in defaults that is likely to follow.
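To make that concrete, here’s a toy sketch (scikit-learn, with entirely made-up numbers) of how a model trained on the pre-change history stays confidently wrong once the world changes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Made-up history: with the tax breaks in place, high-score borrowers rarely default
score = rng.uniform(300, 850, size=2000)
default = (rng.uniform(size=2000) < np.where(score > 700, 0.02, 0.20)).astype(int)

model = LogisticRegression().fit(score.reshape(-1, 1), default)

# The model's learned belief about a 780-score borrower: a very low default risk
print(model.predict_proba([[780.0]])[0, 1])

# If the tax break expires and the real default rate for that group jumps to,
# say, 15%, nothing in the training data lets the model see it coming.
```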

Data bias is dangerous and needs to be carefully managed. You need domain experts and good data management processes (which we’ll talk about shortly) to overcome bias in your machine learning processes.

From the mortgage example above, you can (hopefully) imagine how big a risk bias can be for machine learning. Managing bias is a very large part of managing machine learning risks.

Data

The second risk area to consider for machine learning is the data used to build the original models, as well as the data used once the model is in production. I talked a bit about data bias above, but there are plenty of other issues that can be introduced via data, including the following (a small data-audit sketch in pandas follows the list):

  • Data Quality (e.g., bad data) – do you know where your data has been, who has touched it and what its ‘pedigree’ is? If you don’t, you might not have the data you think you do.
  • Not enough data – you can build a great model on a small amount of data, but that model isn’t going to hold up long-term unless all your future data looks exactly like the small sample you built it on. When building models (whether they are machine learning models or ‘standard’ models), you want as much data as you can get.
  • Homogeneous data – similar to the ‘not enough data’ risk above, this risk comes from a lack of data – not necessarily a lack of volume, but a lack of variability. For example, if you want to forecast home prices in a city, you probably want to get as many different data sets as you can find to build those models. Don’t use just one data set from the local tax office…who knows how accurate that data is. Find a couple of different data sets with many different types of demographic data points, then spend time doing some feature engineering to find the best model inputs for accurate outputs.
  • Fake Data – this really belongs in the ‘bad data’ risk, but I wanted to highlight it separately because it can be (and has been) a very large issue. For example, assume you are trying to forecast revenue and growth numbers for a large multi-national organization that has offices in North America, South America and Asia. You’ve pulled together a great deal of data, including economic forecasts, and built what looks to be a great model. Your organization begins planning its future business based on the outcome of this model and uses the model to help make decisions going forward. How sure are you that the economic data is real?
  • Data Compliance issues – You have some data…but can you (or should you) use it?  Simple question but one that vexes many data scientists – and one that doesn’t have an easy answer.
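None of these risks is exotic, and a quick audit catches many of them before modeling starts. A minimal sketch in pandas – the file name and checks are hypothetical, so adapt them to your own data:

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: quick checks against the risks listed above."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),  # bad / incomplete data
        "n_unique": df.nunique(),                          # homogeneity check
    })

df = pd.read_csv("home_prices.csv")  # hypothetical data set
print(data_quality_report(df))
print("duplicate rows:", df.duplicated().sum())
print(df.describe())  # eyeball the ranges for implausible (possibly fake) values
```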

Lack of Model Variability (aka over-optimization)

You spend weeks building a model. You train it and train it and train it. You optimize it and get an outstanding measure for accuracy. You’re going to be famous.  Then…the real data starts hitting the model.  Your accuracy goes into the toilet.  Your model is worthless.

What happened? You over-optimized.  I see this all the time in the financial markets when people try to build a strategy to invest in the stock market. They build a model strategy and then tweak inputs and variables until they get some outrageous accuracy numbers that would make them millionaires in a few months.  But that rarely (never?) happens.

What happens is this – an investing strategy (i.e., a model) is built using a particular set of data. The inputs are tweaked to give the absolute best output without regard to the variability of the data (i.e., new data is never introduced). When the investing strategy is then applied to new, real-world data, it doesn’t perform anywhere near as well as it did on the old, tested data. The dreams of being a millionaire quickly fade as the investor watches their account value dwindle.

In the world of investing, this over-optimization can be managed with various performance measures and a method called walk-forward optimization, which works to expose the model to as much data, across as many different timeframes, as possible.
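Outside of investing, the same idea shows up as time-ordered cross-validation. A minimal sketch using scikit-learn’s TimeSeriesSplit on synthetic data – always fit on the past and score on the ‘future’:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in data, ordered oldest to newest
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=500)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])  # fit on the past...
    preds = model.predict(X[test_idx])               # ...predict the 'future'
    scores.append(mean_absolute_error(y[test_idx], preds))

# Wildly different errors across windows are a hint of over-optimization
print([round(s, 3) for s in scores])
```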

Similar approaches should be taken in other model building exercises. Don’t over-optimize. Make sure the data you are feeding your machine learning models is varied across data types, timeframes, demographics and as many other forms of variability as you can find.

Some folks might call ‘lack of model variability’ by another name – generalization error. Regardless of what you call this risk…it’s a risk that exists and should be carefully managed throughout your machine learning modeling processes.

Output Interpretation

You spend a lot of time making sure you have good data, the right data and as much data as you can get. You do everything right and build a really good machine learning model and process. Then your boss takes a look at it and interprets the results in a way that is so far from accurate that it makes your head spin.

This happens all the time. Model output is misinterpreted, used incorrectly and/or the assumptions that were used to build the machine learning model are ignored or misunderstood. A model provides estimates and guidance, but it’s up to us to interpret the results and ensure the models are used appropriately.

Here’s an example that I ran across recently. This is a silly one and might be hard to believe – but it’s a good example to use. An organization had one of their data scientists build a machine learning model to help with sales forecasting. The model was built on the assumption that all data would be rolled up to quarterly figures for modeling and reporting purposes. While I’m not a fan of resampling data from a higher to a lower granularity, it made sense for this particular modeling exercise.

This particular model was built on quarterly data with a fairly good mean error rate and good variance measures. Looking at all the statistics, it was a good model. The output of the model was provided to the VP of Sales, who immediately got angry. He called up the manager of the data scientist and read her the riot act. He told her the reports were off by a factor of anywhere from 5 to 10 times what they should be. He was furious and shot off an email to the data team, the sales team and the leadership team decrying the ‘fancy’ forecasting techniques, declaring that the model forecast 10x growth for the next year and “had to be wrong!”

Turns out he had missed that the output was showing quarterly sales revenue instead of weekly revenue like he was used to seeing.
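A quarter is roughly 13 weeks, so the mix-up is easy to reproduce. A toy pandas illustration (the revenue numbers are invented):

```python
import numpy as np
import pandas as pd

# Two years of invented weekly revenue, hovering around $100K per week
weeks = pd.date_range("2016-01-03", periods=104, freq="W")
weekly = pd.Series(np.random.default_rng(1).normal(100_000, 5_000, 104),
                   index=weeks, name="revenue")

# Roll the weekly figures up to calendar quarters
# ('QE' in pandas 2.2+; older versions use 'Q')
quarterly = weekly.resample("QE").sum()

print(f"average week:    {weekly.mean():>12,.0f}")
print(f"average quarter: {quarterly.mean():>12,.0f}")  # ~13x the weekly figure
```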

Again – this is a simplistic example but hopefully it makes sense that you need to understand how a model was built, what assumptions were made and what the output is telling you before you start your interpretation of the output.

One more thing about output interpretation…a good data scientist is going to be just as good at presenting outputs and reporting on findings as they are at building the models. Communication matters just as much as data manipulation and model building.

Finishing things up…

This has been a long one…thanks for reading this far. Hopefully it’s been informative. Before we finish up completely, you might be asking something along the lines of ‘what other machine learning risks exist?’

Ask 100 data scientists and you’ll probably get as many different answers about what the ‘big’ risks are – but I’d bet that if you sat down and categorized them all, the majority would fall into these four categories. There may be some outliers (and I’d love to add those outliers to my list if you have some to share).

What can you do as a CxO looking at machine learning / deep learning / AI to help mitigate these machine learning risks? Like my friend Gene De Libero says: “Test, learn, repeat (bruises from bumping into furniture in the dark are OK).”

Go slow and go small. Learn about your data and your business’s capabilities when it comes to data and data science. I know everyone ‘needs’ to be doing machine learning / AI, but you really don’t need to throw caution to the wind. Take your time to understand the risks inherent in the process and find ways to mitigate the machine learning risks and challenges.

I can help mitigate those risks. Feel free to contact me to see how I might be able to help manage machine learning risks within your project / organization.

What is the cost of bad data?


How much is bad data costing you? It could be very little – or it could be a great deal. In this article I give an example of what the cost of bad data really is.

A few days ago, I received a nice, well-designed sales/marketing piece in the mail. In it, a local window company warned me of the dangers of old windows and the costs associated with them (higher energy costs, etc.). Note: this was the third such piece I’ve received from this company in about three months.

It was a well thought out piece of sales/marketing material. If I had been thinking about new windows, I most likely would have given them a call.

However…my house is less than a year old, as is nearly every other house in this neighborhood of about a thousand homes – all less than 5 years old. Talking to my neighbors, everyone got a similar sales pitch. I’m not a window salesperson, but I wouldn’t think we are the target market for these types of pitches.

That said, the neighborhood directly beside us is a 20+ year old neighborhood that would be ideal for the pitch. I hope this window company pitched them as well as they pitched me (and I’m assuming they did).

What I suspect happened is that this window company bought a ‘targeted’ list from a list broker that promised ‘accurate and up to date’ listings of homeowners in a zip code. Sure, the list is accurate (I am a homeowner) but it’s not really targeted correctly.

The cost of bad data

I won’t get into the joys of buying lists like this because we all know mistakes get made. There will always be bad data regardless of your data management practices, but a good data governance/management process will help eliminate as much bad data as possible.

Of course, in this example we’re talking about a small business. What do they know about data management? Probably nothing…and most likely they don’t need to know too much but they do need to understand how much bad data is costing them.

Let’s look at the costs for this window company.

I went out to one of those list websites and built a demographic profile to buy lists of homeowners in my zip code. The price was about $3,000 for about 18K addresses. Next, I found a direct-mailing cost estimator website that helped me estimate the cost to mail out the material I had received from the window company. The mailing cost was about $10,000 per campaign (which seems high to me…but what do I know about mailings?). That sounds about right considering it would cost about $8,500 to send out 18,000 letters with standard postage.

I’m going to assume this company got a deal on their mailings and paid $20,000 for the three campaigns in which I received a letter. With the price of the list, that brings the total to $23K – about $0.43 per letter across the 54,000 letters sent, or roughly $1.28 per address over the whole campaign. That doesn’t seem like a lot of money to spend on sales/marketing until you realize how much of it was wasted on homes that don’t need the service.

We have roughly 1,000 homes in our neighborhood. A random sampling of the homeowners tells me 90% of them received more than one mailing from this window company. That gives us 900 homes that should never have been on the list – at roughly $1.28 per address, call it $1,150 of wasted spend.

That’s $1,150 spent trying to sell windows to homes that don’t need them – about 5% of the total spend, which makes sense given that those 900 addresses are 5% of the 18,000-address list.

So…for this small company trying to sell windows, roughly 5% of their budget was wasted marketing their services to homes that couldn’t use them. That’s a meaningful number, even for a small company.
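Here’s that back-of-the-envelope arithmetic spelled out, using the rough estimates above:

```python
list_cost = 3_000       # ~18K addresses from the list broker
mailing_cost = 20_000   # assumed deal for three 18K-piece mailings
addresses = 18_000
campaigns = 3

total = list_cost + mailing_cost              # $23,000
per_letter = total / (addresses * campaigns)  # ~$0.43 per letter
per_address = total / addresses               # ~$1.28 per address, all three mailings

wasted_addresses = 900                        # near-new homes that shouldn't be on the list
wasted = wasted_addresses * per_address       # ~$1,150
print(f"${wasted:,.0f} wasted – {wasted / total:.0%} of the total spend")
```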

Of course, some of you may argue that these costs aren’t entirely wasted because some of the marketing material might have made it into the hands of friends / family, or a homeowner may remember this window company in future years – and you are probably right. But…is the possibility of maybe getting work in the future worth 5% of your marketing budget?

To me it isn’t, especially since that 5% could have been redirected to a higher-potential marketing opportunity.

The cost of bad data is high regardless of what the actual number shows. If you spend $1 because bad data ‘tricked’ you into doing so, that dollar is wasted.

The real question is – what are you doing to understand how good or bad your data is?

What can you DO with Machine Learning?


Everyone’s talking about machine learning (ML) and Artificial Intelligence (AI) these days. If you are a CxO or work in IT or marketing, I’d bet that you hear these terms more than you probably want to. It feels an awful lot like the early days of Big Data or Business Intelligence, or the days when the “Intranet” was first making waves within organizations.

Like most new technologies (ahem…buzzwords), machine learning and AI can seem like solutions looking for problems.  While I would argue there are people / companies looking for problems to throw their experience with AI and machine learning at, there are some viable problems out there for ML/AI. That said, I still stand behind my argument that you probably don’t need machine learning…but every organization should investigate the use of ML/AI.

Rather than buying a solution and then looking for a problem to throw it at (as many vendors / consultants are pushing these days), it’s worthwhile for every company to spend some time looking at a few important areas within their businesses to see if there’s anything ML/AI can do to help.

Below are a few examples I’ve helped organizations with over the last few years.

Areas to start investigating the use of Machine Learning

Improving/Personalizing Customer Service

Customer service is one of those areas where you either immediately think “yes…that’s a perfect place to use ML/AI” or “uh…what?”. Hopefully you fall into the former category, because customer service is an ideal space for implementing machine learning and artificial intelligence to help improve service, better understand your customers and personalize interactions. Why is it an ideal space? Because you have a lot of data – some structured, some unstructured. What better place to start with machine learning than one where you have a long history of data and multiple types of data? It’s a perfect problem for a machine learning solution.

Additionally, the use of AI for things like chatbots can drive a great deal of value for your organization. In a report described by Business Insider, 44% of consumers surveyed stated that they would use chatbots if the experience could be perfected/improved. That’s an impressive number given that these chatbots are automated and people claim to want to speak with ‘real’ humans when contacting an organization.

Fraud Detection and Analysis

You don’t have to be a large credit card company to benefit from machine learning for fraud detection. While those organizations do benefit greatly from implementing ML / AI systems and approaches, any organization with a large enough volume of transactions can use various machine learning approaches to detect fraudulent activities. How much is ‘large enough’? I can’t tell you that…but if you have transactional data covering multiple years, you should have plenty of data to build an anomaly detection algorithm to spot those transactions that are out of the ordinary. Fraud detection isn’t something every organization can benefit from, but it is a large area that lends itself well to machine learning approaches.
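To illustrate what ‘out of the ordinary’ can look like in practice, here’s a minimal sketch using scikit-learn’s IsolationForest on invented transaction features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Invented transaction features: amount, hour of day, days since last transaction
rng = np.random.default_rng(42)
history = np.column_stack([rng.lognormal(3, 0.4, 5000),   # typical amounts ~$20
                           rng.normal(13, 3, 5000),       # mostly daytime
                           rng.exponential(2.0, 5000)])   # days between purchases

model = IsolationForest(contamination=0.01, random_state=42).fit(history)

new_txns = np.array([[22.0, 14.0, 1.5],     # ordinary-looking purchase
                     [9500.0, 3.5, 0.01]])  # huge amount, 3am, rapid-fire
print(model.predict(new_txns))  # 1 = looks normal, -1 = flag for review
```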

Supply Chain Management

Another area ripe for machine learning is the supply chain.  If you sell products and manage logistics, you have a great deal of data just waiting to have machine learning turned loose on it. You can find new efficiencies in your supply chain, find areas that can be improved upon and find new avenues for cost cutting as well as revenue.  The supply chain has a great deal of both structured and unstructured data as well as many different types of data that cover many different types of metadata (e.g., costs, times, production requirements, etc). The large amount of data as well as the various types of data provide an ideal base of data to apply ML techniques to better understand and manage the supply chain.

Measuring marketing ‘reach’ / brand exposure / campaign success

One of the first things that many organizations want to do with machine learning is throw their marketing data at it to ‘do things better’. While I find this fairly naive, I also love the enthusiasm. Marketing, and the data that marketing groups have, is an ideal place for organizations to start investigating the use of ML/AI, as there is generally plenty of data of varying types throughout every marketing organization. Using ML, organizations can get a better feel for who their customers are, how to reach them more quickly and effectively, and how well their campaigns have performed.

Creating a better hiring process

When I was first approached by a client and asked if I could help them ‘improve their hiring process’ using machine learning, I was skeptical. I’ve always been skeptical of most hiring processes and have rarely seen an automated hiring process within HR that I would consider ‘good’. I shared my concerns with them – and they agreed completely – so I agreed to help them build a proof-of-concept system that used machine learning and natural language processing to sift through resumes and filter the ‘best’ ones to the top. Our first attempts were no better than their existing keyword search systems, but we quickly found an approach using keywords combined with other ‘flags’ that could find the types of people this organization liked to hire and move them to the top of the queue.
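The client’s actual system isn’t mine to share, but the general ‘keywords plus flags’ idea can be sketched with scikit-learn’s TF-IDF tooling – the job description, resumes and flags below are all invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_desc = "senior data engineer python spark airflow data modeling"
resumes = [
    "python developer with spark and airflow experience building data pipelines",
    "marketing manager focused on social media campaigns and brand strategy",
]
flags = ["mentoring", "open source"]  # hypothetical extra signals the team cares about

# Keyword similarity between each resume and the job description
tfidf = TfidfVectorizer().fit_transform([job_desc] + resumes)
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()

for resume, score in zip(resumes, scores):
    bonus = 0.1 * sum(flag in resume for flag in flags)  # crude flag boost
    print(round(score + bonus, 2), "-", resume[:45])
```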

Using Machine Learning / AI during the hiring process is still a tricky concept because a human with domain experience will generally find the best candidates for a position, but ML can help filter the candidate pool.

What are you doing with machine learning?

There’s a lot of buzz about machine learning and AI these days. Most of that buzz is because of the real value that can be found with properly implemented machine learning/AI using quality data.

What cool things are you implementing with machine learning?

Beware the Models


“But….all of our models have accuracies above 90%…our system should be working perfectly!”

Those were the words spoken by the CEO of a mid-sized manufacturing company. These comments were made during a conversation about their various forecasting models and the poor performance of those models.

This CEO had spent about a million dollars over the previous few years with a consulting company tasked with creating new methods and models for forecasting sales and manufacturing. Over the previous decade, the company had done very well for itself using a very manual, instinct-driven process to forecast sales and the manufacturing needed to ensure sales targets were met.

About three years ago, the CEO decided they needed to take advantage of the large amount of data available within the organization to help manage the organization’s various departments and businesses.

As part of this initiative, a consultant from a well known consulting organization was brought in to help build new forecasting models. These models were developed with many different data sets from across the organization and – on paper – they looked really good. The presentations of these models included the ‘right’ statistical measures, showing accuracies anywhere from 90% to 95%.

But the models, their descriptions and the nearly 300 pages of documentation about how they would help the company make many millions of dollars over the coming years weren’t doing what they were designed to do. The results of the models were far from the reality of the organization’s real-world sales and manufacturing processes.

Due to the large divergence between model and reality, the CEO wanted an independent review of the models to determine what wasn’t working and why.  He reached out to me and asked for my help.

You may be hoping that I’m about to tell you what a terrible job the large, well known consultants did.  We all like to see the big, expensive, successful consulting companies thrown under the bus, right?

But…that’s not what this story is about.

The moral of this story? Just because you build a model with better than average accuracy (or even one with great accuracy), there’s no telling what that model will do once it meets the real world. Sometimes, models just don’t work. Or…they stop working. Even worse, sometimes they work wonderfully for a little while only to fail miserably some time in the near future.

Why is this?

There could be a variety of reasons. Here are a few that I see often (a quick sanity check for the overfitting case is sketched after the list):

  • It could be data mining – building a model based on a biased view of the data.
  • It could be poor data management that allows poor quality data into the modeling process. Models built on poor quality data can show good accuracy against that same poor input data while being poor quality models in reality.
  • It could be a poor understanding of the modeling process. There are a lot of ‘data scientists’ out there today who have very little understanding of what the data analysis and modeling process should look like.
  • It could be – and this is worth repeating – sometimes models just don’t work. You can do everything right and the model just can’t perform in the real world.
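For the overfitting case, the cheapest sanity check is comparing performance on the training data against performance on data the model has never seen. A minimal sketch with scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: many features, few of them actually informative
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

print(f"train accuracy:    {model.score(X_tr, y_tr):.2f}")
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")
# A large gap between the two is the 'looks great on paper' model in action.
```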

Beware the models. Just because they look good on paper doesn’t mean they will be perfect (or even average) in the real world.  Remember to ask yourself (and your data / modeling teams) – are your models good enough?

Modeling is both an art and a science. You can do everything right and still get models that will make you say ‘meh’ (or something unprintable). That said, as long as the modeling process is approached correctly and the ‘science’ in data science isn’t forgotten, the outcome of analysis / modeling initiatives should at least provide some insight into the processes, systems and data management capabilities within an organization.

 

Big Data Roadmap – A roadmap for success with big data


I’m regularly asked how to get started with big data. My response is always the same: I give people my big data roadmap for success. Most organizations want to jump in and do something ‘cool’ with big data. They want to do a project that brings in new revenue or adds some new / cool service or product, but I always point them to this roadmap and say ‘start here’.

The big data roadmap for success starts with the following initiatives:

  • Data Quality / Data Management systems (if you don’t have these in place, that should be the absolute first thing you do)
  • Build a data lake (and utilize it)
  • Create self-service reporting and analytical systems / processes.
  • Bring your data into the line-of-business.

These are fairly broad types of initiatives, but they are general enough for any organization to be able to find some value.

Data Management / Data Quality / Data Governance

First of all, if you don’t have proper data management / data quality / data governance, fix that. Don’t do anything else until you can say with absolute certainty that you know where your data has been, who has touched your data and where that data is today. Without this first step, you are playing with fire when it comes to your data. If you aren’t sure how good your data is, there’s no way to really understand how good the output is of whatever data initiative(s) you undertake.

Build a data lake (and utilize it)

I cringe anytime I (or anyone else) say/write ‘data lake’ because it reminds me too much of the data warehouse craze that took CIOs and IT departments by storm a number of years ago. That said, data lakes are valuable (just like data warehouses were/are valuable), but it isn’t enough to just build a data lake…you need to utilize it. Rather than just being a large data store, a data lake should store data and give your team(s) the ability to find and use the data in the lake.

Create self-service reporting and analytical systems / processes.

Combined with the next initiative or implemented separately, developing self-service access and reporting for your data can free up your IT and analytics staff. Your organization will be much more efficient if any member of the team can build and run a report rather than waiting for a custom report to be created and executed for them. This type of project might feel a bit like ‘dashboards’, but it should be much more than that – your people should be able to get into the data, see it, manipulate it and then build a report or visualization based on those manipulations. Of course, you need a good data governance process in place to ensure that the right people can see the right data.

Bring your data into the Line of Business

This particular initiative can be (and probably should be) combined with the previous one (self-service), but it still makes sense to focus on it in its own right. By bringing your data into the line of business, you are getting it closer to the people who best understand the data and its context. By doing that (and providing the ability to easily access and utilize said data), you are exponentially growing the data analytical capabilities of your organization.

Big Data Roadmap – a guarantee?

There are no guarantees in life, but I can tell you that if you follow this roadmap you will have a much better chance at success than if you don’t. The key here is to ensure that your ‘data in’ isn’t garbage (hence the data governance and data lake aspects) and that you get as much data as possible into the hands of the people who understand the context of that data.

This big data roadmap won’t guarantee success, but it will get you further up the road toward success than you would have been without it.

 

Are your machine learning models good enough?


Imagine you’re the CEO of XYZ Widget company. Your Chief Marketing Officer (CMO), Chief Data Officer (CDO) and Chief Operations Officer (COO) just finished their quarterly presentations highlighting the successes of the various machine learning projects that have been in the works. After the presentations were complete, you begin to wonder – ‘are these machine learning models good enough?’

You’ve invested a significant portion of your annual budget on big data and machine learning projects and based on what your CMO and CDO tell you, things are looking really good. For example, your production and revenue forecasting projects are both delivering some very promising results with recent forecasts being within 2% of actual numbers.

You don’t really understand any of the machine learning stuff though. It seems like magic to you but you trust that the people doing the work understand it and are doing things ‘right’. That said, you have a feeling deep down that something isn’t quite right.  Sure, things look good but just like magic – the output of these machine learning initiatives could just be an illusion.

Are these machine learning models good enough? — Getting past the illusion

While machine learning, deep learning and big data can provide an enormous amount of value to an organization, there is ample opportunity to mess things up dramatically. There are plenty of places where small errors (and even massive ones) can be introduced into the process. For example, during the data munging / exploration phase, a simple mistake can introduce changes in the data, which could cause massive changes in the results of any modeling.

Additionally, bias can easily be introduced into the process (either on purpose or by accident). This bias can push the results to tell a story that people want the data / models to tell. It is very easy to fall into the ‘let’s use statistics to support our view’ trap. Rather than looking for data and/or outputs to support your view (and hence building an illusion), your machine learning initiatives (and any other data projects) should be as bias-free as possible.

When done right, there’s very little ‘illusion’ in machine learning. The results are the results just like the data is the data.   You either find answers to your questions (and hopefully find more questions) or you don’t.   The results may not be what you wanted to see, but they are what they are…and this is the exact reason you need to be able to trust the process that was used to find those results. You need to understand if (and where) bias was introduced. You need to understand the process in general.

Can your team describe how the data was gathered and cleaned? Were the models used in the process optimized and/or overfit? Can your team explain their rationale for doing what they did? Your forecasting models are within 2% of actual numbers in recent months, but that doesn’t mean your models are well built and will hold up over time…it could just mean they are overfit and doing well on numbers very similar to the ones you’ve already fed your machine learning algorithm. What do your models really show for things like R-Squared and Mean Absolute Error (MAE)? Do you understand why R-Squared and MAE are important? If not, your teams need to explain them in general terms and describe why they matter.
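If it helps demystify those two measures, here’s a minimal example of computing them with scikit-learn (the forecast numbers are invented):

```python
from sklearn.metrics import mean_absolute_error, r2_score

actuals  = [100, 120, 95, 130, 110]  # invented quarterly actuals
forecast = [98, 123, 99, 126, 112]   # the model's numbers for the same quarters

print(r2_score(actuals, forecast))             # share of variance explained; 1.0 is perfect
print(mean_absolute_error(actuals, forecast))  # average miss, in the units you report
```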

You don’t have to become an expert

It takes time and a willingness to ‘get your hands dirty’ to get anywhere close to being an expert in machine learning. Most business leaders don’t need to become experts, but if you spend a little time understanding the basics and the process your team follows, it might help remove the ‘magic’ associated with machine learning.

My suggestion is to spend some time talking to your team(s) about the following topics to get a basic understanding of the three main steps / processes in machine learning. Below, I’ve outlined the three main areas and included some questions for you to consider. Note: this isn’t a definitive list of questions / areas, but it will get you started.

Data Gathering / Preparation / Cleaning

  • How was the data gathered?
  • What data quality measures / methods were undertaken to ensure the data’s accuracy and provenance?
  • What steps were taken to clean / prepare the data?
  • How is new data being gathered / cleaned / prepared for inclusion in existing / new models?
  • Who has access to the data?

Modeling

  • Why was the model (or models) chosen?
  • Were other models considered? If so, why weren’t they used?
  • Did you ‘build your own’ or use existing libraries to build the model?
  • Were the proper data preparation steps taken for the model(s) selected?

Evaluation & Interpretation of Results

  • How do you know the model is ‘good enough’?
  • When and why did you stop iterating on the model / data?
  • What accuracy measures are you using for the model(s)?
  • Are we sure the model isn’t overfitting the data? How do we know?
  • Why were these particular visualizations chosen? (Note: the use or non-use of certain visualizations can be a tip-off that something isn’t right about the data / model.)

Again – these aren’t meant to be a definitive list of questions / topical areas, but they should get you started asking good questions of your team. I particularly love to ask the ‘How do you know the model is good enough?’ question because it sheds a lot of light on the entire process and the mental approach to the problem.

Are these machine learning models good enough?

The answers to the above questions should help you get a better feel for how your team(s) approached the issue at hand and help you (and the rest of your leadership team) understand the approach to data preparation, modeling and evaluation in your machine learning initiatives.

The above questions and answers might not specifically answer the ‘are your machine learning models good enough’ question, but they will get you and your team(s) to a point where they are constantly thinking about whether ‘good enough’ is enough. Sometimes it is…other times it isn’t. That’s why you need to understand a bit more about the process to judge whether good enough is good enough.

Of course, if you need help trying to understand all this stuff…you can always hire me to help. Give me a call or drop me an email and let’s discuss your needs.