Eric D. Brown, D.Sc.

Data Science | Entrepreneurship | ..and sometimes Photography


Machine learning risks are real. Do you know what they are? 

Machine learning risks are real and can be very dangerous if not managed or mitigated.

Everyone wants to ‘do’ machine learning and lots of people are talking about it, blogging about it and selling services and products to help with it. I get it…machine learning can bring a lot of value to an organization – but only if that organization knows the associated risks.

Deloitte splits machine learning risks into three main categories: Data, Design & Output. This isn’t a bad categorization scheme, but I like to add an additional bucket in order to make a more nuanced argument about machine learning risks.

My list of ‘big’ machine learning risks falls into these four categories:

  1. Bias – Bias can be introduced in many ways and can cause models to be wildly inaccurate.
  2. Data – Not having enough data and/or having bad data can bring enormous risk to any modeling process, but really comes into play with machine learning.
  3. Lack of Model Variability (aka over-optimization) – You’ve built a model. It works great.  You are a genius…or are you?
  4. Output interpretation – Just like any other type of modeling exercise, how you use and interpret the model can be a huge risk.

In the remainder of this article, I spend a little bit of time talking about each of these categories of machine learning risks.

Machine Learning Risks

Bias

One of the things that naive people argue as a benefit of machine learning is that it will be an unbiased decision maker / helper / facilitator.  This couldn’t be further from the truth.

Machine learning models are built by people. People have biases whether they realize it or not. Bias exists and will be built into a model. Just realize that bias is there and try to manage the process to minimize that bias.

Cathy O’Neil argues this very well in her book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Now, I’m not a huge fan of the book (the book is a bit too politically bent and there are too many uses of the words ‘fair’ and ‘unfair’…who’s to judge what is fair?) but there are some very good arguments about bias that are worth the time to read.

In addition to the bias that might be introduced by people, data can be biased as well. Bias that’s introduced via data is more dangerous because it’s much harder to ‘see’, but it is easier to manage.

For example, assume you are building a model to understand and manage mortgage delinquencies. You grab some credit scoring data and build a model that predicts that people with good credit scores and a long history of mortgage payments are less likely to default.  Makes sense, right?  But…what if a portion of those people with good credit scores had mortgages that were supported in some form by tax breaks or other benefits, and those benefits expire tomorrow?  What happens to your model if those tax breaks go away?  I’d put money on your model failing to predict the increase in defaults that is likely to follow.
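One practical way to catch this kind of data bias before it bites is to compare the distribution of a key feature in your training data against the data the model is seeing now. Here is a minimal sketch of that idea, assuming scipy and entirely made-up credit score numbers:

```python
import numpy as np
from scipy.stats import ks_2samp

# Made-up example: credit scores the model was trained on vs. scores arriving
# after conditions change (e.g., the tax breaks above expire)
rng = np.random.default_rng(7)
train_scores = rng.normal(loc=720, scale=40, size=5_000)
incoming_scores = rng.normal(loc=680, scale=55, size=1_000)

# Two-sample Kolmogorov-Smirnov test: has the credit-score distribution shifted?
stat, p_value = ks_2samp(train_scores, incoming_scores)
if p_value < 0.01:
    print(f"credit_score distribution has shifted (KS={stat:.3f}, p={p_value:.2g})")
    print("Predictions built on the old distribution may no longer hold -- investigate or retrain.")
```

A check like this won’t tell you why the population changed (that’s where domain experts come in), but it will flag that your model is now operating on data it has never seen.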

Data bias is dangerous and needs to be carefully managed. You need domain experts and good data management processes (which we’ll talk about shortly) to overcome bias in your machine learning processes.

From the mortgage example above, you can (hopefully) imagine how big a risk bias can be for machine learning.  Managing bias is a very large aspect of managing machine learning risks.

Data

The second risk area to consider for machine learning is the data used to build the original models as well as the data used once the model is in production. I talked a bit about data bias above but there are plenty of other issues that can be introduced via data. With data, you can have many different risks including:

  • Data Quality (i.e., bad data) – do you know where your data has been, who has touched it and what the ‘pedigree’ of your data is? If you don’t, you might not have the data that you think you do. (A few first-pass quality checks are sketched just after this list.)
  • Not enough data – you can build a great model on a small amount of data, but that model isn’t going to be a very good model long-term unless all your future data looks exactly like the small amount of data you used to build it. When building models (whether they are machine learning models or ‘standard’ models), you want as much data as you can get.
  • Homogeneous data – similar to the ‘not enough data’ risk above, this risk comes from a lack of data – not necessarily a lack in the amount of data, but a lack of variability in the data.  For example, if you want to forecast home prices in a city, you probably want to get as many different data sets as you can find to build these models.  Don’t use just one data set from the local tax office…who knows how accurate that data is. Find a couple of different data sets with many different types of demographic data points and then spend time doing some feature engineering to find the best model inputs for accurate outputs.
  • Fake Data – this really belongs in the ‘bad data’ risk, but I wanted to highlight it separately because it can be (and has been) a very large issue.  For example, assume you are trying to forecast revenue and growth numbers for a large multi-national organization that has offices in North America, South America and Asia. You’ve pulled together a great deal of data, including economic forecasts, and built what looks to be a great model.  Your organization begins planning its future business based on the outcome of this model and uses the model to help make decisions going forward.  How sure are you that the economic data is real?
  • Data Compliance issues – You have some data…but can you (or should you) use it?  Simple question but one that vexes many data scientists – and one that doesn’t have an easy answer.
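To make the data quality point above a bit more concrete, here is a minimal sketch of the kind of first-pass checks I’d run on any new data set, using pandas and a tiny made-up customer file:

```python
import numpy as np
import pandas as pd

# Tiny made-up customer file with a few deliberate problems
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, -1, -1, 51, np.nan],
    "zip_code": ["75001", "75001", "75001", None, "75019"],
})

# How much of each column is missing?
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicate rows often point to upstream pipeline problems
print(f"Duplicate rows: {df.duplicated().sum()}")

# Values that make no sense (negative ages here) hint at bad or fake data
print(f"Out-of-range ages: {df['age'].lt(0).sum()}")
```

None of this replaces a real data governance process, but it’s surprising how often a few lines like these surface problems that would otherwise end up baked into a model.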

Lack of Model Variability (aka over-optimization)

You spend weeks building a model. You train it and train it and train it. You optimize it and get an outstanding measure for accuracy. You’re going to be famous.  Then…the real data starts hitting the model.  Your accuracy goes into the toilet.  Your model is worthless.

What happened? You over-optimized.  I see this all the time in the financial markets when people try to build a strategy to invest in the stock market. They build a model strategy and then tweak inputs and variables until they get some outrageous accuracy numbers that would make them millionaires in a few months.  But that rarely (never?) happens.

What happens is this – an investing strategy (i.e., a model) is built using a particular set of data. The inputs are tweaked to give the absolute best output without regard to variability in the data (e.g., new data is never introduced). When the investing strategy is then applied to new, real-world data, it doesn’t perform anywhere near as well as it did on the old tested data.  The dreams of being a millionaire quickly fade as the investor watches their account value dwindle.

In the world of investing, this over-optimization can be managed with various performance measures and a method called walk-forward optimization, which tries to get as much data, across as many different timeframes as possible, into the model.
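The machine learning equivalent is time-series (walk-forward) cross-validation: always validate on data that comes after the data you trained on. A minimal sketch, assuming scikit-learn and made-up time-ordered data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Made-up time-ordered features and target -- swap in your own data
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    # Score only on the "future" slice the model never saw during training
    scores.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print("MAE per walk-forward fold:", np.round(scores, 3))
```

If the model only looks good when it is allowed to see the whole history at once, you are looking at exactly the over-optimization problem described above.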

Similar approaches should be taken in other model building exercises.  Don’t over-optimize. Make sure the data you are feeding your machine learning models is varied across data types, timeframes, demographic data sets and as many other forms of variability as you can find.

Some folks might call ‘lack of model variability’ by another name — generalization error. Regardless of what you call this risk…it’s a risk that exists and should be carefully managed throughout your machine learning modeling processes.

Output Interpretation

You spend a lot of time making sure you have good data, the right data and as much data as you can get. You do everything right and build a really good machine learning model and process.  Then, your boss takes a look at it and interprets the results in a way that is so far from accurate that it makes your head spin.

This happens all the time. Model output is misinterpreted, used incorrectly and/or the assumptions that were used to build the machine learning model are ignored or misunderstood. A model provides estimates and guidance, but it’s up to us to interpret the results and ensure the models are used appropriately.

Here’s an example that I ran across recently. This is a silly one and might be hard to believe – but it’s a good example to use. An organization had one of their data scientists build a machine learning model to help with sales forecasting. The model was built on the assumption that all data would be rolled up to quarterly data for modeling and reporting purposes. While I’m not a fan of rolling data up from a high to a low granularity, it made sense for this particular modeling exercise.
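For reference, this kind of roll-up is essentially a one-liner. Here is a minimal sketch, assuming pandas and made-up weekly sales figures:

```python
import numpy as np
import pandas as pd

# Made-up weekly sales figures -- swap in your own data
weeks = pd.date_range("2024-01-07", periods=52, freq="W")
weekly_sales = pd.Series(np.random.default_rng(1).uniform(80_000, 120_000, size=52), index=weeks)

# Roll the weekly figures up to quarterly totals; a quarterly number is naturally
# around 13x a weekly one, which is exactly the mismatch that tripped up the VP below
quarterly_sales = weekly_sales.resample("Q").sum()  # newer pandas (2.2+) prefers the alias "QE"
print(quarterly_sales)
```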

This particular model was built on quarterly data with a fairly good mean error rate and good variance measures. Looking at all the statistics, it was a good model. The output of the model was provided to the VP of Sales, who immediately got angry.  He called up the manager of the data scientist and read her the riot act. He told her the reports were off by a factor of anywhere from 5 to 10 times what they should be. He was furious and shot off an email to the data team, the sales team and the leadership team decrying the ‘fancy’ forecasting techniques and declaring that they forecast 10x growth over the next year and “had to be wrong!”

Turns out he had missed that the output was showing quarterly sales revenue instead of weekly revenue like he was used to seeing.

Again – this is a simplistic example but hopefully it makes sense that you need to understand how a model was built, what assumptions were made and what the output is telling you before you start your interpretation of the output.

One more thing about output interpretation…a good data scientist is going to be just as good at presenting outputs and reporting on findings as they are at building the models.  Data scientists need to be just as good at communicating as they are at data manipulation and model building.

Finishing things up…

This has been a long one…thanks for reading to here. Hopefully it’s been informative. Before we finish up completely, you might be asking something along the lines of ‘what other machine learning risks exist?’

If you ask 100 data scientists, you’ll probably get as many different answers about what the ‘big’ risks are – but I’d bet that if you sit down and categorize them all, the majority would fall into these four categories. There may be some outliers (and I’d love to add those outliers to my list if you have some to share).

What can you do as a CxO looking at machine learning / deep learning / AI to help mitigate these machine learning risks?  As my friend Gene De Libero says: “Test, learn, repeat (bruises from bumping into furniture in the dark are OK).”

Go slow and go small. Learn about your data and your business’s capabilities when it comes to data and data science. I know everyone ‘needs’ to be doing machine learning / AI, but you really don’t need to throw caution to the wind. Take your time to understand the risks inherent in the process and find ways to mitigate the machine learning risks and challenges.

I can help mitigate those risks. Feel free to contact me to see how I might be able to help manage machine learning risks within your project / organization.

What is the cost of bad data?

How much is bad data costing you? It could be very little – or it could be a great deal. In this article I give an example of what the cost of bad data really is.

A few days ago, I received a nice, well-designed sales/marketing piece in the mail. In it, a local window company warned me of the dangers of old windows and the costs associated with them (higher energy costs, etc.). Note: This was the third such piece I’ve received from this company in about 3 months.

It was a well thought out piece of sales/marketing material. If I had been thinking about new windows, I most likely would have given them a call.

However…my house is less than a year old. So is every other house in this neighborhood of about a thousand homes, all of which are less than 5 years old. From talking to my neighbors, it seems everyone got a similar sales pitch. I’m not a window salesperson, but I wouldn’t think we are the target market for these types of pitches.

That said, the neighborhood directly beside us is a 20+ year old neighborhood that would be ideal for the pitch. I hope this window company pitched them as well as they pitched me (and I’m assuming they did).

What I suspect happened is this: the window company bought a ‘targeted’ list from a list broker that promised ‘accurate and up to date’ listings of homeowners in a zip code. Sure, the list is accurate (I am a homeowner), but it’s not really targeted correctly.

The cost of bad data

I won’t get into the joys of buying lists like this because we all know some mistakes are made. There will always be bad data regardless of what your data management practices are but a good data governance/management process will help eliminate as much bad data as possible.

Of course, in this example we’re talking about a small business. What do they know about data management? Probably nothing…and most likely they don’t need to know too much but they do need to understand how much bad data is costing them.

Let’s look at the costs for this window company.

I went out to one of those list websites and built a demographic profile to buy a list of homeowners in my zip code. The price was about $3,000 for about 18K addresses. Next, I found a direct mailing cost estimator website that helped me estimate the cost to mail out the material that I had received from the window company. The mailing cost was about $10,000 (which seems high to me…but what do I know about mailings?). This sounds about right considering it would cost about $8,500 to send out 18,000 letters with standard postage.

I’m going to assume this company got a deal for their mailings and paid $20,000 for the three mailing campaigns from which I received letters. With the price of the list, that brings us to $23K total cost, or about $1.28 per letter sent. That doesn’t seem like a lot of money to spend on sales/marketing until you realize how much of that money was wasted on homes that don’t need the service.

We have roughly 1,000 homes in our neighborhood. A random sampling of the homeowners tells me 90% of them received more than one mailing from this window company. That gives us 900 homes. I’ll assume each home received only 2 mailings, which brings the cost to this window company to about $2,300 ($1.28 x 900 x 2).

That’s $2,300 spent trying to sell windows to homes that don’t need it. That’s 10% of the budget.
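For anyone who wants to check the back-of-the-envelope math, here it is as a quick sketch (all numbers are the rough estimates from above):

```python
# Rough estimates from the example above
list_cost = 3_000           # purchased mailing list (~18K addresses)
mailing_cost = 20_000       # assumed deal for three mailing campaigns
letters_purchased = 18_000

total_cost = list_cost + mailing_cost               # ~$23K
cost_per_letter = total_cost / letters_purchased    # ~$1.28

wasted_homes = 1_000 * 0.90                         # ~900 homes too new to need windows
wasted_spend = wasted_homes * 2 * cost_per_letter   # assume 2 mailings per home

print(f"Cost per letter: ${cost_per_letter:.2f}")
print(f"Wasted spend: ${wasted_spend:,.0f} ({wasted_spend / total_cost:.0%} of the budget)")
```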

So…for this small company trying to sell windows, 10% of their budget was wasted on marketing their services to homes that didn’t need them. That’s a big number, even for a small company.

Of course, some of you may argue that these costs aren’t all wasted because some of the marketing material might have made it into the hands of friends or family, or a homeowner may remember this window company in future years – and you are probably right. But…is the possibility of maybe getting work in the future worth spending 10% of your marketing budget?

To me it isn’t, especially since that 10% could have been redirected to a higher-potential marketing opportunity.

The cost of bad data is high regardless of what the number actually shows. If you spend $1 because bad data ‘tricked’ you into doing so, that cost is wasted.

The real question is – what are you doing to understand how good or bad your data is?

Agile Marketing Based on Analytical Data Insights: Improving Scrum Tactics in Brand Outreach

This post is written by Mathias Lanni (Executive VP, Marketing – Velocidi).

Agile management and scrum-style techniques have long been accepted in technology development, but have been increasingly adopted outside the tech industry over the years.  Fundamentally, agile tactics are a way for organizations to adapt more quickly to fast-changing markets and customer demands, without the slow-to-change, hidebound nature of top-down hierarchical organizations impeding change.

Marketing has certainly become fast-changing!   The marketing field has become extremely volatile in the past 10-20 years, with the digital revolution bringing about huge changes in buyer behavior, brand/buyer interactions, as well as basic outreach.

Agencies were already having to deal with client demands which could change rapidly based on customer demands and/or issues with their image.  Now, on top of that, digital marketing is constantly in flux, with massive shifts in strategy consistently happening in response to changes in the search engines as well as the impossible pace of internet/electronic trends.  It’s enough to drive any marketer to reassess their workflow, which is undoubtedly why agile techniques are coming into the field.  The issue is how to introduce scrum-style strategies while also making use of “Big Data” analytics to ensure the best possible decisions are made.

In this blog, we wanted to address a few ways data analytics can be integrated into scrum-style workflows in a marketing management setting, and in particular how they can be utilized to quickly settle questions that may come up due to shifting priorities.

Improved Scrum Marketing Management through Smart Use of Data

We’ll assume readers are familiar with the basics of scrum management.  Rather than go over that, we wanted to address a few specific problem areas relating to Product Owners and Scrum Masters where data analysis can be of the greatest help.

Problem 1 – Sorting Through the Backlog

One of the perennial issues with digital marketing is that there is always more that could possibly be done than even the biggest team could ever achieve.  As an easy example, there are literally dozens of social media networks out there.  Yet even the largest of brands is going to struggle to support more than a handful properly.

So when you have a long backlog of user stories to implement, how do you prioritize?

This is exactly the sort of problem a well-maintained database and analytical system can cut through easily.  By sorting through usage data, customer feedback, focus group comments, and similar information, one can almost always get clear guidance on which user stories would likely be well-received by the target audience(s).  With sufficient data, there is no need for guesswork – you’ll have clear trends indicating the right path.
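As a toy illustration, the prioritization itself can be as simple as scoring each user story with the signals you already collect. A minimal sketch in pandas, with entirely hypothetical stories, columns and weights:

```python
import pandas as pd

# Hypothetical backlog with rough signals pulled from analytics and customer feedback
backlog = pd.DataFrame({
    "user_story": ["Instagram stories", "TikTok pilot", "Email re-engagement", "Pinterest boards"],
    "audience_reach": [0.8, 0.6, 0.9, 0.3],   # share of the target audience on that channel
    "feedback_mentions": [12, 30, 8, 2],      # how often customers asked for it
    "effort_points": [5, 13, 3, 8],           # scrum estimate
})

# Crude value-per-effort score; real weights would come from your own data
backlog["score"] = (backlog["audience_reach"] * 100 + backlog["feedback_mentions"]) / backlog["effort_points"]
print(backlog.sort_values("score", ascending=False)[["user_story", "score"]])
```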

Of course, this principle also applies to selecting user stories in the first place.  A data-driven outlook will help ensure effective stories are selected, leading to a backlog full of to-do items which all have a high likelihood of paying off.

Problem 2 – Optimizing Your Points and Time Allocation

Historically, one of the biggest issues facing Scrum Masters is properly configuring their sprints.  How many points should be in the sprint, and what time allocation is best?

Don’t forget that big data can be applied to your own processes as well!  A database keeping track of the successes and failures of your own scrums will serve you well, and generally it only requires a few months of data before you can start seeing clear trends.  Allocating points doesn’t have to be a matter of gut and instinct.  You’ll be able to look up the exact time spent on similar user stories in the past when determining your time allocations, which in turn gives you clear guidance on point allocation.
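As a small example of that lookup, the history table doesn’t need to be anything fancier than actual hours per story point. A minimal sketch, assuming pandas and hypothetical column names:

```python
import pandas as pd

# Hypothetical sprint history exported from your tracking tool
history = pd.DataFrame({
    "story_type": ["landing page", "ad campaign", "landing page", "analytics report", "ad campaign"],
    "points": [3, 8, 5, 2, 8],
    "actual_hours": [10, 30, 18, 5, 26],
})

# Average hours actually spent per point, broken down by type of work
hours_per_point = (history["actual_hours"] / history["points"]).groupby(history["story_type"]).mean()
print(hours_per_point.round(1))

# Rough time allocation for a new 5-point landing-page story
print(f"Estimated hours: {hours_per_point['landing page'] * 5:.0f}")
```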

Of course, this does rely on committing to recording these numbers and doing so consistently.  This small time investment will pay off in the future – and do so with increasing reliability as the months pass.

Problem 3 – Crafting Effective Retrospectives

It’s well known that human memory is quite fallible, particularly when under stress.  This can be a problem when it comes time for your monthly retrospective.  How well will people really remember the nitty-gritty of problems faced in the previous month?

Again, this is a problem which can be solved with good data and time tracking throughout the scrum process.  The Scrum Master might even devote some time to reviewing the data logs.  Why did a particular Team Member end up spending twice as much time implementing a User Story as was originally allocated?

They might not remember this event off the top of their head without prompting, but with the data on the table, it’ll be much easier to remember.  Then the information about the problem and its solution can be integrated into the database, and into future decision-making.

Data Can Tie Your Marketing Together

These are just a few examples of how data analytical techniques and scrum-based marketing management can go hand-in-hand.  Data can be the basis for decisions throughout the process and will make the lives of both the Product Owner and the Scrum Master vastly easier.  In most cases, a trip to the database will be able to answer most pressing questions – clearing the roadblock quickly – while the ever-increasing amount of data recorded will help you quickly optimize your scrum processes on a month-by-month basis.


About Mathias Lanni EVP, Marketing – Velocidi

Mathias Lanni has helped some of the world’s leading brands take advantage of new emerging technologies to reach and engage their audiences. Through 20+ years of brand marketing experience Mathias has helped large national advertisers incorporate paid search, display advertising, conversation analytics, social media marketing, social advertising, web & app development into their traditional marketing plans. Before Velocidi, Mathias was a founding member of Edelman Digital, the world’s first global social media agency, where he led global scaling plans for the agency. Mathias currently works with www.velocidi.com 

Data and Culture go hand in hand

A few weeks ago, I spent an afternoon talking to the CEO of a mid-sized services company.  He’s interested in ‘big data’ and is interviewing consultants / companies to help his organization ‘take advantage of their data’.  In preparation for this meeting, I had spent the previous weeks talking to various managers throughout the company to get a good sense of how the organization uses and embraces data.  I wanted to see how well data and culture mixed at this company.

Our conversation started out the way these conversations always do. He started asking me about big data, how big data can help companies and what big data would mean to their organization.  As I always do, I tried to provide a very direct and non-sales-focused message to the CEO about the pros/cons of big data, data science and what it means to be a data-informed organization.

This particular CEO stopped me when I started talking about being ‘data-informed’.  He described his organization as being a ‘data-driven company!’ (the exclamation was implied in the forcefulness of his comment).  He then spent the next 15 minutes describing his organization’s embrace of data. He described how they’ve been using data for years to make decisions and that he’d put his organization up against any other when it comes to being data-driven.  He showed me sales literature that touts their data-driven culture and described how they were one of the first companies in their space to really use data to drive their business.

After this CEO finished exclaiming the virtues of his data-driven organization, I made the following comment (paraphrasing of course…but this is the gist of the comment):

“You say this is a data-driven organization…but the culture of this organization is not one that I would call data-driven at all.  Every one of your managers tells me most decisions in the organization are made by ‘gut feel’.  They tell me that data is everywhere and is used in making decisions, but only after the decision has been made.  Data is used to support a decision rather than to inform it. There’s a big difference between that and being a data-informed – let alone a data-driven – organization.”

After what felt like much more than the few seconds it was, the CEO smiled and asked me to help him understand ‘just what in the hell I was talking about’.

What am I talking about?

I’m talking about the need to view data as more than just a supporting actor in the theatrical play that is your business.  Data must go hand-in-hand with every initiative your organization undertakes.  There are some folks out there who argue that you need to build a data-driven culture, but that’s a hard thing to sell to most people, simply because they don’t really understand what a ‘data-driven’ culture is.

So…what is a ‘data-driven culture’?  If you ask 34 experts on the subject, you’ll get 34 different explanations.  I suspect if you ask another 100 experts, you’ll get 100 additional answers.  Rather than trying to be a data-driven culture, it’s much better to integrate the idea of data into every aspect of your culture. Rather than try to create a new culture that nobody really understands (or can define), work on tweaking the culture you have to be one that embraces data and the intelligent use of data.

This is what happens when you start moving toward being a data-informed organization.  Rather than using data to provide reasons for the decisions that you make, you need to incorporate data into your decision-making process. Data needs to be used by your people (an important point…don’t forget about the people) to make decisions. Data needs to be a part of every activity in the organization, and it needs to be available to be used by anyone within the organization. This is where a good data governance / data management system/process comes into play.

During my meeting with the CEO, I spent about 2 hours walking through the topics of data and culture.  We touched on many different topics in our conversation but always seemed to come back around to him not understanding how his organization isn’t “data-driven”.  He truly believed that he was doing the right things that a company needs to do to be ‘data-driven’. I couldn’t argue that he wasn’t doing the right things, but I did point out that data was treated as an afterthought in every conversation I had with his leadership team.

Data and culture go hand in hand

Since that meeting, the CEO has called me a few times and we’ve talked through some plans for helping bring data to the forefront of his organization.  This type of work is quite different from the ‘big data’ work that the CEO had originally wanted to talk about.  There’s no reason not to continue down the path of implementing the right systems, processes and people to build a great data science team within the company, but to get the most from this work, it’s best to also take a stab at tweaking your culture to ensure data is embraced and not just tolerated.

A culture that embraces data is one that ensures data is available from the CEO down to the most junior of employees.  This requires not only cultural change but also systematic changes to ensure you have proper data governance and data management in place.

Data science, big data and the whole world those words entail are much more than just something you install and use.  It’s a shift from a culture focused on making decisions by gut feel and using data to back those decisions up to one that intuitively uses data throughout the decision-making process, including starting with data to find new factors to make decisions on.

What about your organization? Do data and culture go hand in hand, or are you trying to force data into a culture that doesn’t understand or embrace it?

The Data Way

The world has become a world of data. According to Domo, the majority of the data (roughly 90% of it) that exists today has been created within the last two years. That’s a lot of data. Actually…that’s a LOT of data. And it’s your job to use that data to make better decisions and guide your organization / team to a brighter future.

Whether you’re in marketing, IT, HR, Finance, Sales or any other function within an organization, you have data and you need to figure out how to use that data – but where do you begin?

Many people grab data, throw it into Excel and start throwing pivot tables and vlookups at it. If that’s what you do – then more power to you. Personally, I can’t stand vlookups. Truth be told – they don’t like me and subsequently I hate them. Don’t get me wrong – pivot tables and vlookups (and the other useful spreadsheet functionality) can deliver very good insight into your data, but only if you know what you’re looking for.
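For what it’s worth, both of those Excel workhorses have direct equivalents outside the spreadsheet. Here is a minimal sketch in pandas with made-up sales data, where pivot_table stands in for a pivot table and merge stands in for vlookup:

```python
import pandas as pd

# Made-up sales records and a small price lookup table
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 50],
})
prices = pd.DataFrame({"product": ["A", "B"], "unit_price": [10, 25]})

# Pivot table: revenue by region and product
print(pd.pivot_table(sales, values="revenue", index="region", columns="product", aggfunc="sum"))

# The vlookup equivalent: a merge on the shared key
print(sales.merge(prices, on="product", how="left"))
```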

Of course – you have a question or questions you want answered, and that’s why you’re digging into your data. You might want to know what your material costs are going to be for next year. Maybe you want to forecast your sales revenue for the coming quarter. Or, perhaps you want to better understand the differences in pay scales between the different groups of people within your organization.

That’s all well and good, but what about all the other questions you don’t know you have? You’ll never find the answers to those questions if you stick with the pivot tables and vlookups built to answer the ‘original’ question, because you didn’t know you were supposed to be asking any additional questions.

When I say this in conversation, I tend to get a lot of questioning looks and responses like ‘that makes no sense’ or ‘I can’t ask questions I don’t know I’m supposed to ask’. Fair enough. I usually respond with the example of the creation of the Post-it Note by Art Fry at 3M. Nobody at 3M was looking to develop little sticky pieces of paper to be used as notes. They were just trying to create better adhesives when an idea struck Mr. Fry. He needed a bookmark and page marker that wouldn’t fall out. After some trial and error, the Post-it Note was born and now these little notes are part of a multi-billion dollar industry for 3M.

3M and its engineers had no idea they needed/wanted to invent the Post-it note but they were open to exploring new ideas and questions as they arose.

This is the same mindset you need to have with data. Don’t just ‘answer the question’ but keep digging and keep playing.  It can be tough to do that in Excel when stuck in pivot table and vlookup hell, but it can be done. Just keep your curiosity levels high and keep looking for those questions you didn’t know you had.

That’s the data way.

Don’t get “Theranosed”

Kaiser Fung just posted a blog titled “Tip of the day: don’t be Theranosed” where he defines “theranosed” as:

Theranos (v): to spin stories that appeal to data while not presenting any data

To be Theranosed is to fall for scammers who tell stories appealing to data but do not present any actual data. This is worse than story time, in which the storyteller starts out with real data but veers off mid-stream into unsubstantiated froth, hoping you and I got carried away by the narrative flow.

I really liked the definition of being ‘theranosed’, but anyone that’s been around long enough knows that this type of activity has occurred for many years and will continue to occur for many more.  In this particular example, a storyteller uses the appeal of data without actually using data to tell a story that led to a multi-billion dollar company valuation.  The people caught up in the story feel like they are being fed data to help back up the story they are being told and gladly go along with the narrative.

How can you ensure you and your company aren’t “theranosed”?

Well…I’m not sure you can be 100% safe from being spun stories without data, but you can build a culture of curiosity and questioning that almost ensures that at least someone in your company asks the right question and/or sees through the story being spun.

Additionally, you can build a strong data culture within your business.  Understanding how to use data to make decisions and tell stories can help you spot someone trying to “theranose” you. Understanding what it takes to analyze data and build meaningful stories with data will help you see through someone else’s BS  very quickly.

If your teams ask questions and dig into the data, you can be sure that you’ve done everything possible to minimize the possibility of being “theranosed.”

