I’ve recently written about the risks of machine learning (ML), but with this post I wanted to take a step back and talk about ML in general. I want to talk about the ‘why’ of machine learning and whether you and/or your company should be investigating machine learning. Do you need machine learning? Maybe. Maybe not.
The first question you have to ask yourself (and then answer) is this: Why do you want to be involved with machine learning? What problem(s) are you really trying to solve? Are you trying to forecast revenue for next quarter? You can probably do just fine with standard time series modeling techniques. Are you trying to predict house prices in cities/neighborhoods around the world? Machine learning is probably a good idea.
I use this rule of thumb when talking to clients about machine learning:
If you are trying to forecast something with a small number of values / features – start with standard forecasting / modeling techniques. You can always move on to machine learning after working through the standard approaches.
If you have a complex model / algorithm with many features, then machine learning is something to consider.
The key here is ‘complex’.
Sure, machine learning can be applied to simple problems but there are plenty of other approaches that might be just as good. Take the forecasting revenue example – there are multitudes of time series forecasting techniques you can use to create these forecasts. Even if you have hundreds of product lines, you are most likely using a few ‘features’ to forecast one outcome, which can easily be handled by Holt-Winters, ARIMA and other time-series forecasting techniques. You could throw this same problem at an ML algorithm / method and possibly get slightly better (or worse) results but the amount of time and effort to implement an ML approach may be wasted.
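To give a feel for how little code a ‘standard’ baseline needs, here is a minimal sketch of simple exponential smoothing, a much simpler relative of Holt-Winters (no trend or seasonality terms). The function name, the revenue figures and the alpha value are all made up for illustration; a real project would more likely use a forecasting library than hand-rolled code.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Return a one-step-ahead forecast from simple exponential smoothing.

    alpha controls how quickly older observations are forgotten:
    values near 1 track the latest observation, values near 0 change slowly.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")

    level = series[0]  # start the smoothed level at the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Made-up quarterly revenue figures, for illustration only.
quarterly_revenue = [100.0, 104.0, 108.0, 112.0]
next_quarter = simple_exponential_smoothing(quarterly_revenue, alpha=0.5)
print(next_quarter)
```

If a baseline like this is already good enough for the decision you need to make, that is a strong hint that the time and budget for an ML approach might be better spent elsewhere.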
Where you get the most value from machine learning is when you have a problem that really vexes you. The problem is so complex that you just don’t know where to start. THAT is when you reach for machine learning.
Do you really need machine learning?
There are a LOT of people that will immediately tell you ‘yes!’ when asked if you should be investigating ML. They are also the people that are trying to sell you ML / AI services and/or platforms. They are the people that have jumped on the bandwagon and are chasing the latest buzzwords in the marketplace. In 2 years, those same people will be jumping up and down telling you that you need to implement whatever is at the top of the buzzword queue at the time. They are the same people that were telling you that you needed to implement a data warehouse and business intelligence platforms in the past. Don’t get me wrong – data warehouses and business intelligence have their places but they weren’t right for every organization and/or every problem.
Do you need machine learning? Maybe.
Do you have a complex stream of data that you need to process and turn into knowledge and actionable intelligence? Definitely look into machine learning.
Do you need machine learning? Maybe not.
If you want to ‘do’ machine learning because everyone else is, feel free to investigate it and start building up your skills but don’t throw an enormous budget at it until you know beyond a shadow of a doubt that you need machine learning.
Or you could call me. I can help you figure out if you really need machine learning.
Digital transformation has taken center stage in many organizations. Need convincing?
IDC predicts that two-thirds of the CEOs of Global 2000 companies will have digital transformation at the center of their corporate strategies by the end of 2017.
Four in 10 IT leaders in the Computerworld 2017 Tech Forecast study say more than 50% of their organization has undergone digital transformation.
According to Gartner, CIOs are spending 18% of their budget on digitization efforts and expect to see that number grow to 28% by 2018.
Based on this data (and in my regular talks with CIOs), there’s a high probability that you have an initiative underway to digitize one or more aspects of your organization. You may even be well along the digital transformation path and feeling pretty good about your progress. I don’t want to rain on your digital transformation parade, but before you go any further on your journey, you should take a long, hard look at your data.
Data is the driving force behind every organization today, and thus the driving force behind any digital transformation initiative. Without good, clean, accessible, and trustworthy data, your digital transformation journey may be a slow (and possibly difficult) one. Leveraging data to help speed up your digital transformation initiatives first requires proper data management and governance. Once that’s in place, you can begin to explore ways to open up the data throughout the organization.
Digital transformation is doomed to fail if some (or all) of your data is stored in silos. Those data silos may have worked great for your business in the past by segmenting data for ease of management and accessibility, but they have to be demolished in order to compete and thrive in the digital world. To transform into a truly digital organization, you can no longer allow marketing’s data to remain with marketing and finance data to remain within finance. Not only do these data silos make data management and governance more complex, they also get in the way of the types of analysis that deliver new insights into the business (e.g., analyzing revenue streams by looking at new ways of combining marketing and financial data). Data needs to be accessible using modern data management, data governance and data integration systems (with the proper security protocols in place) in order to make data accurate and usable as a driving force for digital transformation.
Removing data silos is just one aspect of the data management and governance needed for driving digital transformation. Implementing data management and governance systems and processes that allow your data to remain secure while remaining available for analysis is a building block for digital transformation.
In order to speed up your transformation projects and initiatives, you really need to take a long, hard look at your data. If you have good data management and governance throughout your organization, you are one step ahead of those companies that haven’t focused on managing their data as a strategic asset rather than allowing data to be hoarded and live in silos around the organization.
Digital transformation will be one of the key areas of focus for CIOs for some time to come and it just might be the key to remaining competitive in your market, so anything you can do today to help your transformation projects succeed should be immediately considered. Having a good data management and governance plan and system in place should help drastically speed up your digitization initiatives.
Data is the lifeblood of any organization today so it should be easy to understand that security of that data is just as important (if not more important) than the data itself. It seems that data security (or rather the lack thereof) has been in the news regularly over the last few years. The inability of organizations to secure their data has caused millions (if not billions) of dollars in damages from lost revenue in addition to the loss of trust. A machine learning approach will never fully replace a human in the security chain, but it can help IT professionals monitor IT system and data security as well as monitor who accesses data (and how it is used) throughout the organization.
That said, only a small percentage of these same security-conscious people have systems or processes in place that accurately and quickly monitor how secure their data is. In my experience, sensitive data in most organizations is generally secure but isn’t regularly monitored or audited due to the costs and time commitment needed for analyzing access patterns and ensuring there have been no intrusions. In fact, in many organizations, IT professionals would be unable to provide a clear location of sensitive data throughout their organization.
In a Ponemon report titled ‘The State of Data Centric Security’, 57% of survey respondents report that their biggest security risk is not understanding where their sensitive data lives. According to that same report, most IT professionals (79% of respondents) believe that not knowing where their sensitive data lives is a big security concern but only a small majority (51% of respondents) believe that it should be a priority to protect and secure their sensitive data. This gap is problematic and will cause significant issues for organizations.
Data has been – and will continue to be – a large part of most organizations’ digital transformation strategy. That said, this data is also creating new vulnerabilities without the proper security systems and processes in place. Graeme Thompson, CIO of Informatica, argues this point very well in Data Security: Don’t Call an Ambulance for a Sore Throat when he writes:
Just as businesses have evolved toward the cloud, they’re also evolving toward enterprise-wide data access. We recognize the valuable insights and innovations to be gleaned from trading siloed departmental data warehouses for the comprehensive enterprise data lake. Tearing down those silos can cost us a layer of security around specific data sets, but curling up in an information panic room is not the way forward.
Last year, I was speaking with the CISO for a large enterprise organization. The conversation was around how much time they’ve been spending on thinking about and securing their IT systems and their data. This particular CISO has done a very good job of implementing master data management systems and processes to ensure their data is safe, accurate and available. Though he has done an admirable job, he worries that he doesn’t have the manpower or budget to feel comfortable that the organization’s data is as secure as it can be.
With the large amounts of both structured and unstructured data in most organizations, some of the older IT security approaches may not work as well as they might have in the past. My suggestion to this CISO was to spend some time investigating the use of machine learning approaches to data security. Machine learning can provide an organization with a ‘second set’ of eyes and ears that can be focused on data security. Implementing machine learning systems can not only free up team members to focus on other things but – more importantly – these systems can monitor threats and issues at a scale that humans just can’t replicate.
The CISO I mentioned earlier is currently trialing a machine learning security monitoring system for both his IT systems and his various data stores and, even though this system has only been in place for a few months, he’s already begun to see efficiency improvements for security monitoring across the enterprise. As an example, after only a few days of their new machine learning enabled security platform being in place, they were seeing hundreds of issues through their monitoring systems that they hadn’t been able to capture before. From these efficiencies, he’s been able to re-assign one of his IT personnel from full-time security monitoring to a less than full-time role because the monitoring has been capable of raising alerts in real-time without any manual intervention.
In addition to the act of monitoring for intrusions and security issues, these machine learning systems can help IT professionals locate and manage their sensitive data, recommend remediation efforts and actions when issues are found and gain a better understanding of who is accessing and using data across the organization.
Like many other areas within the modern organization, machine learning is changing how companies approach data security and changing data security itself. Machine learning isn’t a panacea for security, but it is a very good tool to have in your security tool box.
A few weeks ago, I wrote about machine learning risks where I described four ‘buckets’ of risk that needed to be understood and mitigated when you have machine learning initiatives. One major risk that I *should* have mentioned explicitly is the risk of accuracy and trust in machine learning. While I tend to throw this risk into the “Lack of model variability” bucket, it probably deserves to be in a bucket all its own or, at the very least, it needs to be discussed.
Accuracy in any type of modeling process is a very nebulous term. You can only build a model to be as accurate as the training data that the model sees. I can over-optimize a model and generate an MAE (Mean Absolute Error) that is outstanding for the model/data. I can then use that outstanding MAE to communicate the impressive accuracy of my model. Based on my impressively accurate model, my company then changes processes to make this model the cornerstone of their strategy…and then everyone realizes the model is almost worthless when ‘real-world’ data is used.
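Here is a deliberately extreme sketch of that trap. The ‘model’ below is a hypothetical stand-in that simply memorizes its training data, so its training MAE looks perfect while its MAE on data it has never seen is poor. The class name and all the numbers are invented for illustration; real over-optimization is subtler, but the pattern is the same.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


class MemorizingModel:
    """An intentionally over-fitted model: it remembers every training pair."""

    def fit(self, xs, ys):
        self.lookup = dict(zip(xs, ys))
        self.fallback = sum(ys) / len(ys)  # used for inputs it has never seen
        return self

    def predict(self, xs):
        return [self.lookup.get(x, self.fallback) for x in xs]


# Invented data: the target simply grows with x.
train_x, train_y = [1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0]
new_x, new_y = [5, 6], [50.0, 60.0]

model = MemorizingModel().fit(train_x, train_y)
train_mae = mean_absolute_error(train_y, model.predict(train_x))
holdout_mae = mean_absolute_error(new_y, model.predict(new_x))
print(f"training MAE: {train_mae}, new-data MAE: {holdout_mae}")
```

The only number worth communicating to the business is the one measured on data the model did not see while it was being built.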
New (and experienced) data scientists need to truly understand what it means to have an accurate model. If you go out there and surf around the web you’ll see a lot of people that are new to the machine learning / deep learning world who have taken a few courses and thrown up a few projects on their github repository and call themselves a ‘data scientist’. Nothing wrong with that – everyone has to start somewhere – but the people that tend to do well as data scientists understand the theory, process and mathematics of modeling just as much as (or more than) the ability to code up a few machine learning models.
Modeling (which is really what you are doing with machine learning / deep learning) is much more difficult than many people realize. Sometimes, building a model that delivers 55% accuracy can deliver much more value to a person/organization than one that has been over-optimized to deliver 90% accuracy.
As an example, look at the world of investing. There are very famous traders and investors who have models that are ‘accurate’ less than half the time yet they make millions (and billions) off of those models (namely because risk management is a large part of their approach to the markets). This may not be a good analogy to use for a manufacturing company trying to use machine learning to forecast demand over the next quarter but the process these investors take in building their models is absolutely the same as the steps needed to build accurate and trustworthy models.
Accuracy and Trust in Machine Learning
If you’ve built models in the past, do you absolutely trust that they will perform in the future as well as they’ve performed when trained using your historical data?
Accuracy and trust in machine learning should go hand in hand. If you tell me your model has ‘good’ MAE (or RMSE or MAPE or whatever measure you use), then I need you to also tell me why you chose that measure and what variances you’ve seen in errors. Additionally, I’d want you to tell me how you built that model. How big was your training dataset? Did you do any type of walk-forward testing (in the case of time series modeling)? What have you done about bias in your data?
The real issue in the accuracy and trust debate isn’t with the technical skills of the data scientist, to be honest. A good data scientist will know this stuff inside and out from a technical standpoint. The real issue is in the communication ability of the data scientist and the people she is talking to. An MAE of 3.5 might be good or it might be bad, and non-technical people / non-data scientists would have no clue how to interpret that value. The data scientist will need to be very specific about what that value means from an accuracy standpoint and what that might mean when this model is put into production.
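One way to make that conversation easier is to report the error next to the typical size of the thing being forecast. The sketch below is just one possible framing, with a hypothetical helper name and invented numbers: the same MAE of 3.5 is a 35% miss on a series that hovers around 10 and a 0.35% miss on a series that hovers around 1,000.

```python
def mae_in_context(actual, predicted):
    """Return the MAE together with its size relative to the typical actual value."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    typical_value = sum(abs(a) for a in actual) / len(actual)
    if typical_value == 0:
        raise ValueError("actual values are all zero; relative error is undefined")
    return {"mae": mae, "share_of_typical_value": mae / typical_value}


# Invented examples: identical MAE, very different meaning.
small_scale = mae_in_context(
    actual=[10.0, 12.0, 8.0, 10.0],
    predicted=[13.5, 8.5, 11.5, 6.5],
)
large_scale = mae_in_context(
    actual=[1000.0, 1200.0, 800.0, 1000.0],
    predicted=[1003.5, 1196.5, 803.5, 996.5],
)
print(small_scale)
print(large_scale)
```

A statement like ‘we are typically off by about a third of a percent’ is something a business owner can act on; a bare ‘MAE of 3.5’ usually is not.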
Accuracy and trust in machine learning / modeling has been – without question – the biggest challenge that I’ve run across in my career. I can find really good data scientists and coders to build really cool machine learning models. I can find a lot of data to throw at those models. But what I’ve found hardest is helping non-data folks understand the outputs and what those outputs mean (which touches on the Output Interpretation risk I mentioned when I wrote about machine learning risks).
I’ve found that a good portion of my time working with companies on modeling / machine learning is spent analyzing model outputs and helping the business owners understand the accuracy / trust issues.
How do you (or your company) deal with the accuracy vs trust issue in machine learning / modeling?
Machine Learning Risks are real and can be very dangerous if not managed / mitigated.
Everyone wants to ‘do’ machine learning and lots of people are talking about it, blogging about it and selling services and products to help with it. I get it…machine learning can bring a lot of value to an organization – but only if that organization knows the associated risks.
Deloitte splits machine learning risks into 3 main categories: Data, Design & Output. This isn’t a bad categorization scheme, but I like to add an additional bucket in order to make a more nuanced argument about machine learning risks.
My list of ‘big’ machine learning risks falls into these four categories:
Bias – Bias can be introduced in many ways and can cause models to be wildly inaccurate.
Data – Not having enough data and/or having bad data can bring enormous risk to any modeling process, but really comes into play with machine learning.
Lack of Model Variability (aka over-optimization) – You’ve built a model. It works great. You are a genius…or are you?
Output interpretation – Just like any other type of modeling exercise, how you use and interpret the model can be a huge risk.
In the remainder of this article, I spend a little bit of time talking about each of these categories of machine learning risks.
Machine Learning Risks
One of the things that naive people argue as a benefit for machine learning is that it will be an unbiased decision maker / helper / facilitator. This can’t be further from the truth.
Machine learning models are built by people. People have biases whether they realize it or not. Bias exists and will be built into a model. Just realize that bias is there and try to manage the process to minimize that bias.
In addition to the bias that might be introduced by people, data can be biased as well. Bias that’s introduced via data is more dangerous because it’s much harder to ‘see’ but it is easier to manage.
For example, assume you are building a model to understand and manage mortgage delinquencies. You grab some credit scoring data and build a model that predicts that people with good credit scores and a long history of mortgage payments are less likely to default. Makes sense, right? But…what if a portion of those people with good credit scores had mortgages that were supported in some form by tax breaks or other benefits and those benefits expire tomorrow. What happens to your model if those tax breaks go away? I’d put money on the fact that your model isn’t going to be able to predict the increase in the number of people defaulting that is probably going to happen.
Data bias is dangerous and needs to be carefully managed. You need domain experts and good data management processes (which we’ll talk about shortly) to overcome bias in your machine learning processes.
From the mortgage example above, you can (hopefully) imagine how big of a risk bias can be for machine learning. Managing bias is a very large aspect to managing machine learning risks.
The second risk area to consider for machine learning is the data used to build the original models as well as the data used once the model is in production. I talked a bit about data bias above but there are plenty of other issues that can be introduced via data. With data, you can have many different risks including:
Data Quality (e.g., bad data) – do you know where your data has been, who has touched it and what the ‘pedigree’ of your data is? If you don’t, you might not have the data that you think you do.
Not enough data – you can build a great model on a small amount of data but that model isn’t going to be a very good model long-term unless all your future data looks exactly like the small amount of data you used to build it. When building models (whether they are machine learning models or ‘standard’ models), you want as much data as you can get.
Homogeneous data – similar to the ‘not enough data’ risk above, this risk comes from a lack of data – not necessarily a lack in the amount of data but a lack of variability in the data. For example, if you want to forecast home prices in a city, you probably want to get as many different data sets as you can find to build these models. Don’t use just one data set from the local tax office…who knows how accurate that data is. Find a couple of different data sets with many different types of demographic data points and then spend time doing some feature engineering to find the best model inputs for accurate outputs.
Fake Data – this really belongs in the ‘bad data’ risk, but I wanted to highlight it separately because it can be (and has been) a very large issue. For example, assume you are trying to forecast revenue and growth numbers for a large multi-national organization that has offices in North America, South America and Asia. You’ve pulled together a great deal of data including economic forecasts and built what looks to be a great model. Your organization begins planning its future business based on the outcome of this model and uses the model to help make decisions going forward. How sure are you that the economic data is real?
Data Compliance issues – You have some data…but can you (or should you) use it? Simple question but one that vexes many data scientists – and one that doesn’t have an easy answer.
Lack of Model Variability (aka over-optimization)
You spend weeks building a model. You train it and train it and train it. You optimize it and get an outstanding measure for accuracy. You’re going to be famous. Then…the real data starts hitting the model. Your accuracy goes into the toilet. Your model is worthless.
What happened? You over-optimized. I see this all the time in the financial markets when people try to build a strategy to invest in the stock market. They build a model strategy and then tweak inputs and variables until they get some outrageous accuracy numbers that would make them millionaires in a few months. But that rarely (never?) happens.
What happens is this – an investing strategy (e.g., model) is built using a particular set of data. The inputs are tweaked to give the absolute best output without regard to variability of data (e.g., new data is never introduced). When the investing strategy is then applied to new, real world data, it doesn’t perform anywhere near as well as it did on the old tested data. The dreams of being a millionaire quickly fade as the investor watches their investing account value dwindle.
In the world of investing, this over-optimization can be managed with various performance measures and using a method called walk-forward optimization to try to get as much data in as many different timeframes as possible into the model.
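The core idea behind walk-forward testing is simple: only ever score a model on data that comes after the data it was built on, then roll the window forward and repeat. The sketch below shows just that evaluation loop, with an expanding training window and a naive ‘repeat the last value’ forecaster standing in for a real model. Full walk-forward optimization would also re-tune the model’s parameters inside each window, which is left out here. All function names and numbers are assumptions for illustration.

```python
def walk_forward_splits(n_observations, initial_train_size, test_size):
    """Yield (train_indices, test_indices) pairs with an expanding train window."""
    if initial_train_size < 1 or test_size < 1:
        raise ValueError("window sizes must be positive")
    start = initial_train_size
    while start + test_size <= n_observations:
        # Train on everything before `start`, test on the block right after it.
        yield range(0, start), range(start, start + test_size)
        start += test_size


def naive_forecast(train_values, horizon):
    """Stand-in model: repeat the last observed value for every future step."""
    return [train_values[-1]] * horizon


def walk_forward_mae(series, initial_train_size, test_size):
    """Mean absolute error pooled across all walk-forward test windows."""
    errors = []
    splits = walk_forward_splits(len(series), initial_train_size, test_size)
    for train_idx, test_idx in splits:
        train = [series[i] for i in train_idx]
        actual = [series[i] for i in test_idx]
        predicted = naive_forecast(train, len(actual))
        errors.extend(abs(a - p) for a, p in zip(actual, predicted))
    if not errors:
        raise ValueError("series is too short for the requested windows")
    return sum(errors) / len(errors)


# Invented series, for illustration only.
series = [10.0, 12.0, 11.0, 13.0, 14.0, 16.0]
folds = list(walk_forward_splits(len(series), initial_train_size=2, test_size=2))
score = walk_forward_mae(series, initial_train_size=2, test_size=2)
print(folds, score)
```

Because every test window sits strictly after its training window, the score is a much fairer preview of how the model might behave on new data than a single in-sample accuracy figure.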
Similar approaches should be taken in other model building exercises. Don’t over-optimize. Make sure the data you are feeding your machine learning models is varied across data types, timeframes, demographic data sets and as many other forms of variability as you can find.
Some folks might call ‘lack of model variability’ by another name: generalization error. Regardless of what you call this risk…it’s a risk that exists and should be carefully managed throughout your machine learning modeling processes.
You spend a lot of time making sure you have good data, the right data and as much data as you can. You do everything right and build a really good machine learning model and process. Then, your boss takes a look at it and interprets the results in a way that is so far from accurate that it makes your head spin.
This happens all the time. Model output is misinterpreted, used incorrectly and/or the assumptions that were used to build the machine learning model are ignored or misunderstood. A model provides estimates and guidance but it’s up to us to interpret the results and ensure the models are used appropriately.
Here’s an example that I ran across recently. This is a silly one and might be hard to believe – but it’s a good example to use. An organization had one of their data scientists build a machine learning model to help with sales forecasting. The model was built on the assumption that all data would be rolled up to quarterly data for modeling and reporting purposes. While I’m not a fan of down-sampling data from high to low granularity, it made sense for this particular modeling exercise.
This particular model was built on quarterly data with a fairly good mean error rate and good variance measures. Looking at all the statistics, it was a good model. The output of the model was provided to the VP of Sales who immediately got angry. He called up the manager of the data scientist and read her the riot act. He told her the reports were off by a factor of anywhere from 5 to 10 times what they should be. He was furious and shot off an email to the data team, the sales team and the leadership team decrying the ‘fancy’ forecasting techniques, declaring that the model was forecasting 10x growth over the next year and “had to be wrong!”
Turns out he had missed that the output was showing quarterly sales revenue instead of weekly revenue like he was used to seeing.
Again – this is a simplistic example but hopefully it makes sense that you need to understand how a model was built, what assumptions were made and what the output is telling you before you start your interpretation of the output.
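One cheap safeguard against this kind of mix-up is to make the reporting period travel with the numbers instead of living in someone’s head. The sketch below is only an illustration of that idea: the `Forecast` class, the `roll_up` helper, the 13-weeks-per-quarter assumption and the figures are all hypothetical, not what the organization in the story actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Forecast:
    values: List[float]
    period: str  # the unit each value covers, e.g. "week" or "quarter"


def roll_up(weekly: Forecast, weeks_per_quarter: int = 13) -> Forecast:
    """Sum weekly values into quarterly totals and relabel the period."""
    if weekly.period != "week":
        raise ValueError("expected weekly values")
    if len(weekly.values) % weeks_per_quarter != 0:
        raise ValueError("need a whole number of quarters")
    totals = [
        sum(weekly.values[i:i + weeks_per_quarter])
        for i in range(0, len(weekly.values), weeks_per_quarter)
    ]
    return Forecast(values=totals, period="quarter")


# Invented figures: flat weekly revenue of 100 for half a year.
weekly = Forecast(values=[100.0] * 26, period="week")
quarterly = roll_up(weekly)

# Put the period in every label that reaches a report or a chart.
report_line = f"Forecast per {quarterly.period}: {quarterly.values[0]:,.0f}"
print(report_line)
```

Nothing about the underlying business changed between the weekly and quarterly views; only the unit did, which is exactly why the unit belongs on every number you hand to someone else.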
One more thing about output interpretation…a good data scientist is going to be just as good at presenting outputs and reporting on findings as they are at building the models. Data scientists need to be just as good at communicating as they are at data manipulation and model building.
Finishing things up…
This has been a long one…thanks for reading to here. Hopefully it’s been informative. Before we finish up completely, you might be asking something along the lines of ‘what other machine learning risks exist?’
If you ask 100 data scientists, you’ll probably get as many different answers about what the ‘big’ risks are – but I’d bet that if you sit down and categorize them all, the majority of them would fall into these four categories. There may be some outliers (and I’d love to add those outliers to my list if you have some to share).
What can you do as a CxO looking at machine learning / deep learning / AI to help mitigate these machine learning risks? Like my friend Gene De Libero says: ‘Test, learn, repeat (bruises from bumping into furniture in the dark are OK).’
Go slow and go small. Learn about your data and your business’s capabilities when it comes to data and data science. I know everyone ‘needs’ to be doing machine learning / AI but you really don’t need to throw caution to the wind. Take your time to understand the risks inherent in the process and find ways to mitigate the machine learning risks and challenges.
I can help mitigate those risks. Feel free to contact me to see how I might be able to help manage machine learning risks within your project / organization.
Everyone’s talking about machine learning (ML) and Artificial Intelligence (AI) these days. If you are a CxO or work in IT or marketing, I’d bet that you hear these terms more than you probably want to. It feels an awful lot like the early days of Big Data or Business Intelligence or the days when the “Intranet” was first making waves within organizations.
Like most new technologies (ahem…buzzwords), machine learning and AI can seem like solutions looking for problems. While I would argue there are people / companies looking for problems to throw their experience with AI and machine learning at, there are some viable problems out there for ML/AI. That said, I still stand behind my argument that you probably don’t need machine learning…but every organization should investigate the use of ML/AI.
Rather than buy a solution and then look for a problem to throw it at (like many vendors / consultants are pushing these days), it’s worthwhile for every company to spend some time looking at a few important areas within their businesses to see if there’s anything that ML/AI can do to help.
Below are a few examples I’ve helped organizations with over the last few years.
Areas to start investigating the use of Machine Learning
Improving/Personalizing Customer Service
Customer service is one of those areas that you either immediately think “yes…that’s a perfect place to use ML/AI” or “uh…what?”. Hopefully you fall into the former category because customer service is an ideal space for implementing machine learning and artificial intelligence to help improve service, better understand your customers and personalize interactions. Why’s it an ideal space? Because you have a lot of data – some of which is structured and some of which is unstructured. What better place to start with machine learning than a place that you have a long history of data and have multiple types of data? It’s a perfect problem for a machine learning solution.
Additionally, the use of AI for things like chatbots can drive a great deal of value for your organization. In a report described by Business Insider, 44% of consumers surveyed stated that they would use chatbots if the experience could be perfected/improved. That’s an impressive number given that these chatbots are automated and people claim to want to speak with ‘real’ humans when contacting an organization.
Fraud Detection and Analysis
You don’t have to be a large credit card company to benefit from machine learning for fraud detection. While those organizations do benefit greatly from implementing ML / AI systems and approaches, any organization that has large enough volumes of transactions can use various machine learning approaches to detect fraudulent activities. How much is ‘large enough’? I can’t tell you that…but if you have transactional data covering multiple years, you should have plenty of data to build an anomaly detection algorithm to see those transactions that are out of the ordinary. Fraudulent activity detection isn’t something every organization can benefit from, but it is a large area that lends itself well to machine learning approaches.
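To make ‘out of the ordinary’ a little more concrete, here is a minimal statistical sketch rather than a full machine learning system: it flags transaction amounts using a modified z-score built on the median and the median absolute deviation, which is less easily thrown off by the very outliers you are hunting for. The function name, the 3.5 cutoff (a commonly used rule of thumb) and the amounts are assumptions for illustration; a real fraud system would look at far more than the amount.

```python
from statistics import median


def flag_unusual_amounts(amounts, threshold=3.5):
    """Return the indices of amounts whose modified z-score exceeds the threshold."""
    if not amounts:
        return []
    center = median(amounts)
    # Median absolute deviation: a robust measure of spread.
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored this way
    return [
        i for i, amount in enumerate(amounts)
        # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
        if 0.6745 * abs(amount - center) / mad > threshold
    ]


# Invented transaction amounts, for illustration only.
amounts = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 950.0]
suspicious = flag_unusual_amounts(amounts)
print(suspicious)
```

A simple rule like this makes a useful yardstick: if a more elaborate ML approach cannot find anything this rule misses, it may not be earning its keep yet.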
Supply Chain Management
Another area ripe for machine learning is the supply chain. If you sell products and manage logistics, you have a great deal of data just waiting to have machine learning turned loose on it. You can find new efficiencies in your supply chain, find areas that can be improved upon and find new avenues for cost cutting as well as revenue. The supply chain has a great deal of both structured and unstructured data as well as many different types of data that cover many different types of metadata (e.g., costs, times, production requirements, etc). The large amount of data as well as the various types of data provide an ideal base of data to apply ML techniques to better understand and manage the supply chain.
One of the first things that many organizations want to do with machine learning is to throw their marketing data at it to ‘do things better’. While I find this fairly naive, I also love the enthusiasm. Marketing and the data that marketing groups have is an ideal place for organizations to start investigating the use of ML/AI as there is generally plenty of data of varying types throughout every marketing organization. Using ML, organizations can get a better feel for who their customers are, how to reach them quicker and more effectively and how well their campaigns have performed.
Creating a better hiring process
When I was first approached by a client and asked if I could help them ‘improve their hiring process’ using machine learning, I was skeptical. I’ve always been skeptical of most hiring processes and have rarely seen an automated hiring process within HR that I would consider to be ‘good’. I shared my concerns with them – and they agreed completely with me – so I agreed to help them build a proof of concept system that used machine learning and natural language processing to sift through resumes to filter the ‘best’ ones to the top. Our first attempts were no better than their existing keyword search systems but we quickly found an approach using keywords combined with other ‘flags’ that could find those types of people that this organization liked to hire and filter them to the top of the queue.
Using Machine Learning / AI during the hiring process is still a tricky concept because a human with domain experience will generally find the best candidates for a position, but ML can help filter the candidate pool.
What are you doing with machine learning?
There’s a lot of buzz about machine learning and AI these days. Most of that buzz is because of the real value that can be found with properly implemented machine learning/AI using quality data.
What cool things are you implementing with machine learning?
Eric D. Brown, D.Sc. is a technology consultant, investor and entrepreneur with an interest in using technology and data to solve real-world business problems. He currently runs his own consulting practice focused on helping organizations use their data more efficiently. Additionally, he is the Chief Information Officer of Sundial Capital Research, publisher of sentimenTrader.
Eric received his Doctor of Science (D.Sc.) in Information Systems in 2014 with a dissertation titled “Analysis of Twitter Messages for Sentiment and Insight for use in Stock Market Decision Making”. His research interests are currently in the areas of decision support, data science, big data, natural language processing, sentiment analysis and social media analysis. In recent years, he has combined sentiment analysis, natural language processing and big data approaches to build innovative systems and strategies to solve interesting problems. You can read some of his research here: Eric D. Brown on ResearchGate
In addition, he is an entrepreneur that has launched a few companies with the most recent being a company focused on providing data analytics and visualization services to the financial markets.