What would you do if you had so much data about your customers that you could know (almost) everything about them when they contacted you? Better yet, what if you had the ability to instantly know the exact product or service offer, and the right ‘sales’ approach, that would make your customer immediately sit up, take notice and spend money?
Most of you would jump at the chance to have this information about your clients. You may be willing to open up the checkbook for a huge amount of money to make this happen. What if I told you that you don’t need to do much more than get a better grasp on your data and understand how to use that data to build a 360 degree view of your customer?
Granted, you may need to collect a bit more data (and perhaps find new types of data) and you may need to implement some new data management processes and/or systems, but you shouldn’t have to start from scratch – unless you have no data skills, people or processes. For those companies that already have a data strategy and a team of data geeks, building a customer-centric view with data can be extremely rewarding.
Many companies consider themselves ‘customer-centric’ and have built programs and processes in order to ‘focus on the customer.’ They may have done a very good job in this regard, but there’s more that can be done. Most organizations have focused on Customer Relationship Management (CRM) as a way to help drive interactions with clients. While a CRM platform is important and necessary, most of these platforms are nothing more than data repositories that provide very little value to an organization beyond the basics of ‘we talked to this person’ or ‘we sold widget X to that customer.’
Utilizing proper data management and the data lake concept, companies can begin to build a much broader view of their customer base. Using data lakes filled with CRM data along with customer information, social media data, demographics, web activity, wearable data and any other data you can gather about your customers, you (with the help of your data science team) can begin to build long-term relationships based on more than just some basic data.
In addition to better relationships with your customers, a data-centric approach can help you better predict the activities of your customers, thereby helping you better position your marketing and messaging. Rather than hope your messaging is good enough to reach a small percentage of your customer base, the data-centric approach can allow you to take advantage of the knowledge, skills and systems available to you. Additionally, this approach will allow your data team to create personal and individual programs and messaging to help drive marketing and customer service.
In the more sophisticated organizations that have implemented proper data integration and management systems, the amount of time spent sifting through and cleaning data is much lower and, in my experience, more in line with the numbers reported in the 2017 Data Scientist Report by Crowdflower.
That report indicates a better balance between basic data-wrangling activities and more advanced analysis:
51% of time spent on collecting, labeling, cleaning and organizing data
19% of time spent building and modeling data
10% of time spent mining data for patterns
9% of time spent refining algorithms
Closing the Gaps
If we think about this data transformation in terms of person-hours, there’s a big difference between a data scientist spending 80% of their time finding and cleaning their data and a data scientist spending 51% of their time on those same tasks. Closing the gap begins with demolishing the data silos that impede organizations’ ability to extract actionable insights from the data they’re collecting.
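To put that gap into person-hours, here is a quick back-of-the-envelope sketch. The 40-hour week is my assumption, not something from the report, and the function name is made up for illustration.

```python
HOURS_PER_WEEK = 40  # assumption: a standard full-time week


def wrangling_hours(share_percent, hours_per_week=HOURS_PER_WEEK):
    """Hours per week spent finding and cleaning data for a given share of time."""
    return hours_per_week * share_percent / 100


before = wrangling_hours(80)  # the 80% case
after = wrangling_hours(51)   # the 51% case
freed = before - after        # hours handed back to modeling and analysis

print(f"{before:.1f}h -> {after:.1f}h per week, freeing {freed:.1f}h per data scientist")
```

Multiply the freed hours by the size of your data team and the number of weeks in a year to get a rough sense of what the silos are costing you.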
Digital transformation projects have become a focus of many CIOs, with the share of IT budgets devoted to these projects expected to grow from 18% to 28% in 2018. Top-performing businesses are allocating nearly twice as much budget to digital transformation projects – 34% currently, with plans to increase the share even further to 44% by 2018.
CIOs in these more sophisticated organizations – let’s call them data-driven disruptors – have likely had far more success finding ways to manage the exponential growth and pace of data. These CIOs realize the importance of combating SaaS sprawl, among other data management challenges, and have found better ways to connect the many different systems and data stores throughout their organization.
As a CIO, if you can free up your data team(s) from dealing with the basics of data management and let them focus their efforts on the “good stuff” of data analytics (e.g., data modeling, mining, etc.), you’ll begin to see your investments in big data initiatives deliver real, meaningful results.
Big data has moved from buzzword to being a part of everyday life within enterprise organizations. An IDG survey reports that 75% of enterprise organizations have deployed or plan to deploy big data projects. The challenge now is capturing strategic value from that data and delivering high-impact business outcomes. That’s where a Chief Data Officer (CDO) enters the picture. While CDOs have been hired in the past to manage data governance and data management, their role is transitioning into one focused on how to best organize and use data as a strategic asset within organizations.
“The CDO should not just be part of the org chart, but also have an active hand in launching new data initiatives,” Patricia Skarulis, SVP & CIO of Memorial Sloan Kettering Cancer Center, said at the recent CIO Perspectives conference in New York.
Chief Data Officer – What, when, how
A few months ago, I was involved in a conversation with the leadership team of a large organization. This conversation revolved around whether they needed to hire a Chief Data Officer and, if they did, what that individual’s role should be. It’s always difficult creating a new role, especially one like the CDO whose oversight spans multiple departments. In order to create this role (and have the person succeed), the leadership team felt they needed to clearly articulate the specific responsibilities and understand the “what, when, and how” aspects of the position.
The “when” was an easy answer: Now.
The “what” and the “how” are a bit more complex, but we can provide some generalizations of what the CDO should be focused on and how they should go about their role.
First, as I’ve said, the CDO needs to be a collaborator and communicator to help align the business and technology teams in a common vision for their data strategies and platforms, to drive digital transformation and meet business objectives.
In addition to the strategic vision, the CDO needs to work closely with the CIO to create and maintain a data-driven culture throughout the organization. This data-driven culture is an absolute requirement in order to support the changes brought on by digital transformation today and into the future.
“My role as Chief Data Officer has evolved to govern data, curate data, and convince subject matter experts that the data belongs to the business and not [individual] departments,” Stu Gardos, CDO at Memorial Sloan Kettering Cancer Center, said at the CIO Perspectives conference.
Lastly, the CDO needs to work with the CIO and the IT team to implement proper data management and data governance systems and processes to ensure data is trustworthy, reliable, and available for analysis across the organization. That said, the CDO can’t get bogged down in technology and systems but should keep their focus on the people and processes as it is their role to understand and drive the business value with the use of data.
In the meeting I mentioned earlier, I was asked what a successful Chief Data Officer looks like. It’s clear that a successful CDO crosses the divide between business and technology and institutes data as trusted currency that is used to drive revenue and transform the business.
A few weeks ago, I wrote about machine learning risks where I described four ‘buckets’ of risk that needed to be understood and mitigated when you have machine learning initiatives. One major risk that I *should* have mentioned explicitly is the risk of accuracy and trust in machine learning. While I tend to throw this risk into the “Lack of model variability” bucket, it probably deserves to be in a bucket all its own or, at the very least, it needs to be discussed.
Accuracy in any type of modeling process is a very nebulous term. You can only build a model to be as accurate as the training data that the model sees. I can over-optimize a model and generate an MAE (Mean Absolute Error) that is outstanding for the model/data. I can then use that outstanding MAE to communicate the impressive accuracy of my model. Based on my impressively accurate model, my company then changes processes to make this model the cornerstone of their strategy…and then everyone realizes the model is almost worthless when ‘real-world’ data is used.
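Here is a minimal sketch of that trap, using made-up numbers and two toy models of my own choosing: a ‘memorizer’ that replays the closest training answer (my stand-in for an over-optimized model) and a plain straight-line fit. The data, the noise values and the function names are all assumptions for illustration.

```python
def mae(actual, predicted):
    """Mean Absolute Error: the average size of the misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Made-up history: the real relationship is y = 2x, plus noise.
train_x = list(range(10))
train_y = [2 * x + noise for x, noise in zip(train_x, [2, -2] * 5)]

# Made-up 'real-world' data that neither model saw during training.
new_x = [x + 0.25 for x in train_x]
new_y = [2 * x + noise
         for x, noise in zip(new_x, [-2, 1, -1, 2, 0, -2, 1, -1, 2, 0])]


def memorizer(x):
    """Over-optimized model: replay the closest training answer, noise and all."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


slope, intercept = fit_line(train_x, train_y)


def simple_line(x):
    """Plain model: one straight line through the training data."""
    return slope * x + intercept


for name, model in [("memorizer", memorizer), ("simple line", simple_line)]:
    train_mae = mae(train_y, [model(x) for x in train_x])
    new_mae = mae(new_y, [model(x) for x in new_x])
    print(f"{name}: training MAE {train_mae:.2f}, new-data MAE {new_mae:.2f}")
```

On these numbers the memorizer posts a perfect training MAE and then does worse than the simple line on the new data. The outstanding training number was never evidence of an accurate model.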
New (and experienced) data scientists need to truly understand what it means to have an accurate model. If you surf around the web, you’ll see a lot of people who are new to the machine learning / deep learning world, who have taken a few courses and thrown up a few projects on their GitHub repository, and who call themselves a ‘data scientist’. Nothing wrong with that – everyone has to start somewhere – but the people that tend to do well as data scientists understand the theory, process and mathematics of modeling just as much as (or more than) the ability to code up a few machine learning models.
Modeling (which is really what you are doing with machine learning / deep learning) is much more difficult than many people realize. Sometimes, building a model that delivers 55% accuracy can deliver much more value to a person/organization than one that has been over-optimized to deliver 90% accuracy.
As an example, look at the world of investing. There are very famous traders and investors who have models that are ‘accurate’ less than half the time yet they make millions (and billions) off of those models (namely because risk management is a large part of their approach to the markets). This may not be a good analogy to use for a manufacturing company trying to use machine learning to forecast demand over the next quarter, but the process these investors take in building their models is absolutely the same as the one needed to build accurate and trustworthy models.
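The arithmetic behind ‘right less than half the time but still profitable’ is worth a quick sketch. The win rates and payoff sizes below are invented for illustration, and the expectancy formula (average payoff per trade) is my own framing of the risk management point rather than any particular investor’s method.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Average profit per trade, in units of money risked."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


# Right only 40% of the time, but winners are 3x the size of losers.
low_accuracy = expectancy(win_rate=0.40, avg_win=3.0, avg_loss=1.0)

# Right 90% of the time, but the rare loss wipes out ten wins.
high_accuracy = expectancy(win_rate=0.90, avg_win=1.0, avg_loss=10.0)

print(f"40% accurate model: {low_accuracy:+.2f} per trade")
print(f"90% accurate model: {high_accuracy:+.2f} per trade")
```

With these made-up payoffs, the 40% model makes money on average and the 90% model loses it. Accuracy alone says nothing about the cost of being wrong.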
Accuracy and Trust in Machine Learning
If you’ve built models in the past, do you absolutely trust that they will perform in the future as well as they’ve performed when trained using your historical data?
Accuracy and trust in machine learning should go hand in hand. If you tell me your model has ‘good’ MAE (or RMSE or MAPE or whatever measure you use), then I need you to also tell me why you chose that measure and what variances you’ve seen in errors. Additionally, I’d want you to tell me how you built that model. How big was your training dataset? Did you do any type of walk-forward testing (in the case of time series modeling)? What have you done about bias in your data?
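To show why the choice of measure matters, here is a small sketch that reports several error measures side by side for the same forecasts. The numbers and the function name are made up; the formulas are the standard definitions of MAE, RMSE and MAPE.

```python
import math
import statistics


def error_report(actual, predicted):
    """Summarize forecast errors with several measures, not just one."""
    errors = [p - a for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    # MAPE divides by the actual values, so it assumes none of them are zero.
    pct_errors = [abs(e) / abs(a) for e, a in zip(errors, actual)]
    return {
        "mae": statistics.fmean(abs_errors),
        "rmse": math.sqrt(statistics.fmean(squared_errors)),
        "mape_percent": 100 * statistics.fmean(pct_errors),
        "abs_error_stdev": statistics.pstdev(abs_errors),
        "worst_abs_error": max(abs_errors),
    }


# Made-up example: four small misses and one big one.
actual = [100, 102, 98, 101, 99]
predicted = [101, 101, 99, 100, 109]

report = error_report(actual, predicted)
for measure, value in report.items():
    print(f"{measure}: {value:.2f}")
```

The single big miss barely moves MAE but pushes RMSE well above it, and the spread of the errors is larger than the MAE itself. That is exactly the kind of detail I’d want to hear alongside a ‘good’ headline number.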
To be honest, the real issue in the accuracy and trust debate isn’t the technical skill of the data scientist. A good data scientist will know this stuff inside and out from a technical standpoint. The real issue is the communication between the data scientist and the people she is talking to. An MAE of 3.5 might be good or it might be bad, and non-technical people would have no clue how to interpret that value. The data scientist needs to be very specific about what that value means from an accuracy standpoint and what that might mean when the model is put into production.
Accuracy and trust in machine learning / modeling has been – without question – the biggest challenge that I’ve run across in my career. I can find really good data scientists and coders to build really cool machine learning models. I can find a lot of data to throw at those models. But what I’ve found hardest is helping non-data folks understand the outputs and what those outputs mean (which touches on the Output Interpretation risk I mentioned when I wrote about machine learning risks).
I’ve found that a good portion of my time working with companies on modeling / machine learning is spent analyzing model outputs and helping the business owners understand the accuracy / trust issues.
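One way to make a number like ‘MAE of 3.5’ mean something to a non-technical audience is to put it next to the size of the thing being forecast and next to a naive baseline. The sketch below is one possible framing, with invented sales figures and a ‘same as last period’ baseline that I picked for illustration.

```python
def mae_in_context(mae, actuals):
    """Express an MAE relative to the data's typical level and a naive forecast."""
    mean_level = sum(abs(a) for a in actuals) / len(actuals)
    # Naive baseline: forecast each period as 'same as the previous period'.
    naive_misses = [abs(actuals[i] - actuals[i - 1]) for i in range(1, len(actuals))]
    naive_mae = sum(naive_misses) / len(naive_misses)
    return {
        "mae_as_percent_of_mean": 100 * mae / mean_level,
        "mae_vs_naive": mae / naive_mae,  # below 1.0 means better than naive
    }


# Two made-up products with the same MAE of 3.5 units per week.
big_seller = mae_in_context(3.5, [1000, 1020, 990, 1010, 1030])
small_seller = mae_in_context(3.5, [10, 12, 9, 11, 13])

print("big seller:", big_seller)
print("small seller:", small_seller)
```

For the big seller, 3.5 units is a rounding error and far better than the naive forecast. For the small seller, the same 3.5 is roughly a third of typical weekly sales and worse than simply repeating last week’s number.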
How do you (or your company) deal with the accuracy vs trust issue in machine learning / modeling?
Machine Learning Risks are real and can be very dangerous if not managed / mitigated.
Everyone wants to ‘do’ machine learning and lots of people are talking about it, blogging about it and selling services and products to help with it. I get it…machine learning can bring a lot of value to an organization – but only if that organization knows the associated risks.
Deloitte splits machine learning risks into 3 main categories: Data, Design & Output. This isn’t a bad categorization scheme, but I like to add an additional bucket in order to make a more nuanced argument about machine learning risks.
My list of ‘big’ machine learning risks falls into these four categories:
Bias – Bias can be introduced in many ways and can cause models to be wildly inaccurate.
Data – Not having enough data and/or having bad data can bring enormous risk to any modeling process, but really comes into play with machine learning.
Lack of Model Variability (aka over-optimization) – You’ve built a model. It works great. You are a genius…or are you?
Output interpretation – Just like any other type of modeling exercise, how you use and interpret the model can be a huge risk.
In the remainder of this article, I spend a little bit of time talking about each of these categories of machine learning risks.
Machine Learning Risks
Bias
One of the things that naive people argue as a benefit of machine learning is that it will be an unbiased decision maker / helper / facilitator. This couldn’t be further from the truth.
Machine learning models are built by people. People have biases whether they realize it or not. Bias exists and will be built into a model. Just realize that bias is there and try to manage the process to minimize that bias.
In addition to the bias that might be introduced by people, data can be biased as well. Bias that’s introduced via data is more dangerous because it’s much harder to ‘see’ but it is easier to manage.
For example, assume you are building a model to understand and manage mortgage delinquencies. You grab some credit scoring data and build a model that predicts that people with good credit scores and a long history of mortgage payments are less likely to default. Makes sense, right? But…what if a portion of those people with good credit scores had mortgages that were supported in some form by tax breaks or other benefits and those benefits expire tomorrow? What happens to your model if those tax breaks go away? I’d put money on the fact that your model isn’t going to be able to predict the increase in the number of people defaulting that is probably going to happen.
Data bias is dangerous and needs to be carefully managed. You need domain experts and good data management processes (which we’ll talk about shortly) to overcome bias in your machine learning processes.
From the mortgage example above, you can (hopefully) imagine how big a risk bias can be for machine learning. Managing bias is a very large part of managing machine learning risks.
Data
The second risk area to consider for machine learning is the data used to build the original models as well as the data used once the model is in production. I talked a bit about data bias above but there are plenty of other issues that can be introduced via data. With data, you can have many different risks including:
Data Quality (e.g., bad data) – do you know where your data has been, who has touched it and what the ‘pedigree’ of your data is? If you don’t, you might not have the data that you think you do.
Not enough data – you can build a great model on a small amount of data, but that model isn’t going to be a very good model long-term unless all your future data looks exactly like the small amount of data you used to build it. When building models (whether they are machine learning models or ‘standard’ models), you want as much data as you can get.
Homogeneous data – similar to the ‘not enough data’ risk above, this risk comes from a lack of data – not necessarily a lack in the amount of data but a lack of variability in the data. For example, if you want to forecast home prices in a city, you probably want to get as many different data sets as you can find to build these models. Don’t use just one data set from the local tax office…who knows how accurate that data is. Find a couple of different data sets with many different types of demographic data points and then spend time doing some feature engineering to find the best model inputs for accurate outputs.
Fake Data – this really belongs in the ‘bad data’ risk, but I wanted to highlight it separately because it can be (and has been) a very large issue. For example, assume you are trying to forecast revenue and growth numbers for a large multi-national organization that has offices in North America, South America and Asia. You’ve pulled together a great deal of data including economic forecasts and built what looks to be a great model. Your organization begins planning its future business based on the outcome of this model and uses the model to help make decisions going forward. How sure are you that the economic data is real?
Data Compliance issues – You have some data…but can you (or should you) use it? Simple question but one that vexes many data scientists – and one that doesn’t have an easy answer.
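A few of the data risks above can at least be screened for mechanically before any modeling starts. Here is a rough sketch of such a screen; the column names, the sample rows and the 30-row threshold are all made up, and it says nothing about fake data or compliance, which need people rather than code.

```python
def data_quality_report(rows, min_rows=30):
    """Flag basic problems: too little data, duplicates, gaps, no variability.

    `rows` is a list of dicts; `min_rows` is an arbitrary illustrative threshold.
    """
    columns = sorted({column for row in rows for column in row})
    missing = {c: sum(1 for row in rows if row.get(c) is None) for c in columns}
    distinct = {c: len({row.get(c) for row in rows if row.get(c) is not None})
                for c in columns}

    seen, duplicate_rows = set(), 0
    for row in rows:
        fingerprint = tuple(row.get(c) for c in columns)
        if fingerprint in seen:
            duplicate_rows += 1
        seen.add(fingerprint)

    return {
        "row_count": len(rows),
        "too_few_rows": len(rows) < min_rows,          # 'not enough data'
        "duplicate_rows": duplicate_rows,              # 'bad data'
        "missing_values": missing,                     # 'bad data'
        "constant_columns": [c for c in columns if distinct[c] <= 1],  # 'homogeneous data'
    }


# Made-up sample of home records.
sample = [
    {"zip_code": "00000", "home_age_years": 1, "price": 310000},
    {"zip_code": "00000", "home_age_years": 1, "price": 310000},
    {"zip_code": "00000", "home_age_years": None, "price": 295000},
    {"zip_code": "00000", "home_age_years": 3, "price": 320000},
]

quality = data_quality_report(sample)
print(quality)
```

A report like this won’t tell you whether the data is right, only where to start asking questions.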
Lack of Model Variability (aka over-optimization)
You spend weeks building a model. You train it and train it and train it. You optimize it and get an outstanding measure for accuracy. You’re going to be famous. Then…the real data starts hitting the model. Your accuracy goes into the toilet. Your model is worthless.
What happened? You over-optimized. I see this all the time in the financial markets when people try to build a strategy to invest in the stock market. They build a model strategy and then tweak inputs and variables until they get some outrageous accuracy numbers that would make them millionaires in a few months. But that rarely (never?) happens.
What happens is this – an investing strategy (i.e., a model) is built using a particular set of data. The inputs are tweaked to give the absolute best output without regard to the variability of data (i.e., new data is never introduced). When the investing strategy is then applied to new, real world data, it doesn’t perform anywhere near as well as it did on the old tested data. The dreams of being a millionaire quickly fade as the investor watches their investing account value dwindle.
In the world of investing, this over-optimization can be managed with various performance measures and using a method called walk-forward optimization to try to get as much data in as many different timeframes as possible into the model.
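For readers who haven’t seen it, the core of a walk-forward test can be sketched in a few lines: fit on one window of history, score on the window that follows, then roll forward and repeat. This is a simplified illustration with a made-up series and a placeholder ‘model’ (the average of the training window), not a full walk-forward optimization.

```python
def walk_forward_windows(n_points, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that roll forward through time."""
    start = 0
    while start + train_size + test_size <= n_points:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # slide forward by one test window


def mean_forecast(history):
    """Placeholder model: predict the average of the training window."""
    return sum(history) / len(history)


# Made-up series whose behavior changes in the last few points.
series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16, 20, 22]

window_maes = []
for train, test in walk_forward_windows(len(series), train_size=6, test_size=2):
    forecast = mean_forecast([series[i] for i in train])
    misses = [abs(series[i] - forecast) for i in test]
    window_maes.append(sum(misses) / len(misses))

print("MAE per window:", window_maes)
```

The model is only ever scored on data it has not seen, and you get one error figure per window instead of a single flattering number. Here the last window is much worse than the first two, which is the kind of warning a single in-sample score would hide.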
Similar approaches should be taken in other model building exercises. Don’t over-optimize. Make sure the data you are feeding your machine learning models is varied across data types, timeframes, demographic data sets and as many other forms of variability as you can find.
Some folks might call ‘lack of model variability’ by another name – Generalization Error. Regardless of what you call this risk…it’s a risk that exists and should be carefully managed throughout your machine learning modeling processes.
Output Interpretation
You spend a lot of time making sure you have good data, the right data and as much data as you can. You do everything right and build a really good machine learning model and process. Then, your boss takes a look at it and interprets the results in a way that is so far from accurate that it makes your head spin.
This happens all the time. Model output is misinterpreted, used incorrectly and/or the assumptions that were used to build the machine learning model are ignored or misunderstood. A model provides estimates and guidance but it’s up to us to interpret the results and ensure the models are used appropriately.
Here’s an example that I ran across recently. This is a silly one and might be hard to believe – but it’s a good example to use. An organization had one of their data scientists build a machine learning model to help with sales forecasting. The model was built on the assumption that all data would be rolled up to quarterly data for modeling and reporting purposes. While I’m not a fan of rolling data up from high to low granularity (down-sampling), it made sense for this particular modeling exercise.
This particular model was built on quarterly data with a fairly good mean error rate and good variance measures. Looking at all the statistics, it was a good model. The output of the model was provided to the VP of Sales, who immediately got angry. He called up the manager of the data scientist and read her the riot act. He told her the reports were off by a factor of anywhere from 5 to 10 times what they should be. He was furious and shot off an email to the data team, the sales team and the leadership team decrying the ‘fancy’ forecasting techniques, declaring that the model was forecasting 10x growth over the next year and “had to be wrong!”
Turns out he had missed that the output was showing quarterly sales revenue instead of weekly revenue like he was used to seeing.
Again – this is a simplistic example but hopefully it makes sense that you need to understand how a model was built, what assumptions were made and what the output is telling you before you start your interpretation of the output.
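One cheap defense against this kind of mix-up is to never hand over a bare number: carry the reporting period along with the value and show it in the units the reader is used to. Below is a small sketch of that idea. The class, its field names, the 13-weeks-per-quarter conversion and the dollar amount are all my own assumptions for illustration.

```python
from dataclasses import dataclass

WEEKS_PER_QUARTER = 13  # assumption: 52 weeks / 4 quarters


@dataclass(frozen=True)
class Forecast:
    """A forecast value that always travels with its reporting period."""
    amount: float
    period: str  # "week" or "quarter"

    def per_week(self):
        """Convert to a weekly figure so readers can compare like with like."""
        if self.period == "week":
            return self.amount
        if self.period == "quarter":
            return self.amount / WEEKS_PER_QUARTER
        raise ValueError(f"unknown period: {self.period!r}")

    def describe(self):
        return (f"${self.amount:,.0f} per {self.period} "
                f"(about ${self.per_week():,.0f} per week)")


# Made-up quarterly forecast, as the model in the story would produce.
forecast = Forecast(amount=1_300_000, period="quarter")
print(forecast.describe())
```

It is a small thing, but a label like ‘per quarter’ printed next to the number is often all it takes to keep the conversation about the model instead of about a misread report.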
One more thing about output interpretation…a good data scientist is going to be just as good at presenting outputs and reporting on findings as they are at building the models. Data scientists need to be just as good at communicating as they are at data manipulation and model building.
Finishing things up…
This has been a long one…thanks for reading to here. Hopefully it’s been informative. Before we finish up completely, you might be asking something along the lines of ‘what other machine learning risks exist?’
If you ask 100 data scientists, you’ll probably get as many different answers about what the ‘big’ risks are – but I’d bet that if you sit down and categorize them all, the majority of them would fall into these four categories. There may be some outliers (and I’d love to add those outliers to my list if you have some to share).
What can you do as a CxO looking at machine learning / deep learning / AI to help mitigate these machine learning risks? Like my friend Gene De Libero says: “Test, learn, repeat (bruises from bumping into furniture in the dark are OK).”
Go slow and go small. Learn about your data and your business’s capabilities when it comes to data and data science. I know everyone ‘needs’ to be doing machine learning / AI but you really don’t need to throw caution to the wind. Take your time to understand the risks inherent in the process and find ways to mitigate the machine learning risks and challenges.
I can help mitigate those risks. Feel free to contact me to see how I might be able to help manage machine learning risks within your project / organization.
How much is bad data costing you? It could be very little – or it could be a great deal. In this article I give an example of what the cost of bad data really is.
A few days ago, I received a nice, well designed sales/marketing piece in the mail. In it, a local window company warned me of the dangers of old windows and the costs associated with them (higher energy costs, etc.). Note: This was the third such piece I’ve received from this company in about 3 months.
It was a well thought out piece of sales/marketing material. If I had been thinking about new windows, I most likely would have given them a call.
However…my house is less than a year old. So is every other house in the neighborhood of about a thousand homes that are all less than 5 years old. Talking to my neighbors, everyone got a similar sales pitch. I’m not a window salesperson, but I wouldn’t think we are the target market for these types of pitches.
That said, the neighborhood directly beside us is a 20+ year old neighborhood that would be ideal for the pitch. I hope this window company pitched them as well as they pitched me (and I’m assuming they did).
What I suspect happened is this window company bought a ‘targeted’ list from a list broker that promised ‘accurate and up to date’ listings of homeowners in a zip code. Sure, the list is accurate (I am a homeowner) but it’s not really targeted correctly.
The cost of bad data
I won’t get into the joys of buying lists like this because we all know some mistakes are made. There will always be bad data regardless of what your data management practices are but a good data governance/management process will help eliminate as much bad data as possible.
Of course, in this example we’re talking about a small business. What do they know about data management? Probably nothing…and most likely they don’t need to know too much but they do need to understand how much bad data is costing them.
Let’s look at the costs for this window company.
I went out to one of those list websites and built a demographic to buy lists of homeowners in my zip code. The price was about $3,000 for about 18K addresses. Next, I found a direct mailing cost estimator website that helped me estimate the cost to mail out the material that I had received from the window company. The mailing cost was about $10,000 (which seems high to me…but what do I know about mailings?). This sounds about right considering it would cost about $8,500 to send out 18,000 letters with standard postage.
I’m going to assume this company got a deal for their mailings and paid $20,000 for the 3 mailing campaigns I received letters from. With the price of the list, that brings us to $23K total cost. Spread across 54,000 letters (18,000 addresses x 3 mailings), that works out to about $0.43 per letter sent. That doesn’t seem like a lot of money to spend on sales/marketing until you realize how much of that money was wasted on homes that don’t need the service.
We have roughly 1,000 homes in our neighborhood. A random sampling of the homeowners tells me 90% of them received more than one mailing from this window company. This gives us 900 homes. I’ll assume each home only received 2 mailings, which brings a cost to this window company of about $770 ($0.43 x 900 x 2).
That’s about $770 spent trying to sell windows to homes that don’t need them. That’s a little over 3% of the budget.
So…for this small company trying to sell windows, a little over 3% of their budget was wasted on marketing their services to homes that didn’t need their services. That’s real money, even for a small company.
Of course, some of you may argue that these costs aren’t all wasted because some of the marketing material might have made it into the hands of friends / family or a homeowner may remember this window company in future years – and you are probably right. But…is the possibility of maybe potentially getting work in the future worth spending 3% of your marketing budget?
To me it isn’t, especially since that money could have been redirected to a higher potential marketing opportunity.
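If you want to run this kind of estimate on your own campaigns, one way to do it is to spread the total cost over every letter actually sent (addresses times mailings) and then count the letters that landed on the wrong doorsteps. A minimal sketch using the figures from the window company example; the function and parameter names are made up for illustration.

```python
def wasted_mailing_cost(list_cost, mailing_cost, addresses, mailings,
                        wasted_homes, mailings_per_wasted_home):
    """Estimate how much of a direct-mail budget went to the wrong homes.

    Assumes the total cost is spread evenly over every letter sent.
    """
    total_cost = list_cost + mailing_cost
    cost_per_letter = total_cost / (addresses * mailings)
    wasted_cost = cost_per_letter * wasted_homes * mailings_per_wasted_home
    return {
        "total_cost": total_cost,
        "cost_per_letter": cost_per_letter,
        "wasted_cost": wasted_cost,
        "wasted_share_percent": 100 * wasted_cost / total_cost,
    }


estimate = wasted_mailing_cost(
    list_cost=3_000,             # price of the purchased list
    mailing_cost=20_000,         # assumed price of the 3 mailings
    addresses=18_000,            # addresses on the list
    mailings=3,                  # campaigns sent to the list
    wasted_homes=900,            # homes that clearly don't need windows
    mailings_per_wasted_home=2,  # conservative count of letters per home
)

for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Swap in your own list size, costs and count of badly targeted addresses to see what your version of this problem looks like.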
The cost of bad data is high regardless of what the number actually shows. If you spend $1 because bad data ‘tricked’ you into doing so, that cost is wasted.
The real question is – what are you doing to understand how good or bad your data is?