Eric D. Brown, D.Sc.

Data Science | Entrepreneurship | ..and sometimes Photography


Do you need machine learning? Maybe. Maybe Not.

I’ve recently written about the risks of machine learning (ML), but with this post I wanted to take a step back and talk about ML in general. I want to talk about the ‘why’ of machine learning and whether you and/or your company should be investigating machine learning. Do you need machine learning? Maybe. Maybe not.

The first question you have to ask yourself (and then answer) is this:  Why do you want to be involved with machine learning? What problem(s) are you really trying to solve?  Are you trying to forecast revenue for next quarter? You can probably do just fine with standard time series modeling techniques.  Are you trying to predict house prices in cities/neighborhoods around the world? Machine learning is probably a good idea.

I use this rule of thumb when talking to clients about machine learning:

  • If you are trying to forecast something with a small number of values / features – start with standard forecasting / modeling techniques.  You can always move on to machine learning after working through the standard approaches.
  • If you need to combine multiple data sets to create new knowledge and actionable insights, you probably don’t need machine learning.
  • If you have a complex model / algorithm with many features, then machine learning is something to consider.

The key here is ‘complex’.

Sure, machine learning can be applied to simple problems, but there are plenty of other approaches that might be just as good. Take the revenue forecasting example – there are multitudes of time-series forecasting techniques you can use to create these forecasts. Even if you have hundreds of product lines, you are most likely using a few ‘features’ to forecast one outcome, which can easily be handled by Holt-Winters, ARIMA and other time-series forecasting techniques. You could throw this same problem at an ML algorithm / method and possibly get slightly better (or worse) results, but the amount of time and effort to implement an ML approach may be wasted.
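To make that concrete, here’s a minimal sketch of the kind of ‘standard approach’ I mean – a Holt-Winters forecast using statsmodels. The quarterly revenue figures are invented for the example:

```python
# A minimal Holt-Winters revenue forecast with statsmodels.
# The quarterly figures below are made up for illustration.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Three years of hypothetical quarterly revenue (in $M)
revenue = pd.Series(
    [10.2, 11.5, 13.1, 12.0, 10.9, 12.3, 14.0, 12.8, 11.6, 13.2, 15.1, 13.7],
    index=pd.period_range("2016Q1", periods=12, freq="Q"),
)

# Additive trend + seasonality; seasonal_periods=4 for quarterly data
fit = ExponentialSmoothing(
    revenue, trend="add", seasonal="add", seasonal_periods=4
).fit()

print(fit.forecast(4))  # forecast the next four quarters
```

A dozen lines, no training pipeline, no feature engineering – for many forecasting problems, this is all you need before reaching for ML.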

Where you get the most value from machine learning is when you have a problem that really vexes you. The problem is so complex that you just don’t know where to start. THAT is when you reach for machine learning.

Do you really need machine learning?

There are a LOT of people who will immediately tell you ‘yes!’ when asked if you should be investigating ML. They are also the people who are trying to sell you ML / AI services and/or platforms. They are the people who have jumped on the bandwagon and are chasing the latest buzzwords in the marketplace. In 2 years, those same people will be jumping up and down telling you that you need to implement whatever is at the top of the buzzword queue at the time. They are the same people who were telling you that you needed to implement a data warehouse and business intelligence platforms in the past. Don’t get me wrong – data warehouses and business intelligence have their places, but they weren’t right for every organization and/or every problem.

Do you need machine learning? Maybe.

Do you have a complex stream of data that you need to process and turn into knowledge and actionable intelligence? Definitely look into machine learning.

Do you need machine learning? Maybe not.

If you want to ‘do’ machine learning because everyone else is, feel free to investigate it and start building up your skills, but don’t throw an enormous budget at it until you know beyond a shadow of a doubt that you need machine learning.

Or you could call me. I can help you figure out if you really need machine learning.

Photo by marc liu on Unsplash

This one skill will make you a data science rockstar

Want to be a data science rockstar? Of course you do! Sorry for the clickbait headline, but I wanted to reach as many people as I can with this important piece of information.

Want to know what the ‘one skill’ is?

It isn’t python or R or Spark or some other new technology or platform.  It isn’t the latest machine learning methods or algorithms. It isn’t being able to write AI algorithms from scratch or analyze terabytes of data in minutes.

While those are important – very important – they aren’t THE skill. In fact, it isn’t a technical skill at all.

The one skill that will make you a data science rockstar is a so-called ‘soft skill’. The ability to communicate is what will set you apart from your peers and make you stand out in an increasingly crowded field of data scientists.

Why do I need to communicate to be a data science rockstar?

You can be the smartest person in the world when it comes to creating some wild machine learning systems to build recommendation engines, but if you can’t communicate the ‘strategy’ behind the system, you’re going to have a hard time.

If you’re able to find some phenomenal patterns in data that have the potential to deliver a multiple-X increase in revenue but can’t communicate the ‘strategy’ behind your approach, your potential is going to be unrealized.

What do I mean by ‘strategy’? In addition to the standard information (error rates/metrics, etc.), you need to be able to hit the key ‘W’ points (what, why, when, where and who) when you communicate your output/results. You need to be able to clearly define what you did, why you did it, when your approach works (and doesn’t work), where your data came from and who will be affected by what you’ve done. If you can’t answer these questions succinctly, and in a manner that a layperson can understand, you’re failing as a data scientist.

Two real-world examples – one rockstar, one not-rockstar

I have two recent examples to help highlight the difference between a data science rockstar (i.e., someone who communicates well) and someone who is, well, not so much of a rockstar. I’ll give you the background on both and let you make up your own mind about which person you’d hire as your next data scientist. Both of these people work at the same organization.

Person 1:

She’s been a data scientist for 4 years. She’s got a wide swath of experience in data exploration, feature engineering, machine learning and data management. She’s had multiple projects over her career that required a deep dive into large datasets, and she’s had to use different systems, platforms and languages during her analysis. For each project she works on, she keeps a running notebook with commentary, ideas, changes and reasons for doing what she’s doing – she’s a scientist, after all. When she provides updates to team members and management, she provides multiple layers of detail that can be read or skipped depending on the reader’s level of interest. She provides a thorough write-up of all her work, with detailed notes about why things are being done the way they are done and how potential changes might affect the outcome of her work. For project ‘wrap-up’ documentation, she delivers an executive summary with many visualizations that succinctly describes the project, the work she did, why she did what she did, and what she thinks could be done to improve things. In addition to the executive summary, she provides a thorough write-up that describes the entire process, with multiple appendices and explanatory statements for those people who want to dive deeply into the project. When people are selecting team members for their projects, her name is the first to come up.

Person 2:

He’s been a data scientist for 4 years (about 1 month longer than Person 1). His background is very technical, and he’s the ‘go-to’ person for algorithms and programming languages within the team. He’s well thought of and can do just about anything that is thrown over the wall at him. He’s quite successful and is sought after for advice by people all over the company. When he works on projects, he sort of ‘wings it’ (his words) and keeps few notes about what he’s done and why he’s chosen the things he has chosen. For example, if you ask him why he chose Random Forests instead of Support Vector Machines on a project, he’ll tell you ‘because it worked better’, but he can’t explain what ‘better’ means. Now, there aren’t many people who would argue against his choices on projects, and his work is rarely questioned. He’s good at what he does, and nobody at the company questions his technical skills, but they always question ‘what is he doing?’ and ‘what did he do?’ during/after projects. For documentation and presentation of results, he puts together the basic report that is expected, with the appropriate information, but people always have questions and are always ‘bothering him’ (again…his words). When new projects are being considered, he’s usually last in line for inclusion because there’s ‘just something about working with him’ (actual words from his co-workers).

Who would you choose?

I’m assuming you know which of the two is the data science rockstar. While Person 2 is technically more advanced than Person 1, his communication skills are a bit behind Person 1’s. Person 1 is the one everyone goes to for delivering the ‘best’ outcomes from data science at the company. Communication is the difference. Person 1 is not only able to do the technical work but can also share the outcomes in a way that the organization can easily understand.

If you want to be a data science rockstar, you need to learn to communicate. It’s the ‘one skill’ that could move you into the realm of ‘top data scientists’ and away from the average data scientists who are focusing all of their personal development efforts on learning another algorithm or another language.

By the way, I’ve written about this before here and here so jump over and read a few more thoughts on the topic if you have time.

Photo by Ben Sweet on Unsplash

Data Mining – A Cautionary Tale

For those of you that might be new to data, keep this small (but extremely important) thing in mind – beware data mining.

What is data mining?  Data mining is the process of discovering information and patterns in data.  Data mining is the first step taken in the Data -> Information -> Knowledge -> Wisdom conversion process.  Data mining is extremely important – but can cause you a lot of problems if you aren’t aware of some of the issues that can arise from data mining.

First, data mining can give you the answer you’re looking for…regardless of whether that answer is even correct. Many people treat data mining as an iterative ‘loop’ that lets you keep mining until you find the data that supports the hypothesis you’re trying to prove (or disprove). A great example of this is the ‘food science star’ Brian Wansink at Cornell. Dr. Wansink spent years in the spotlight as head of Cornell’s Food & Brand Lab, as well as heading up the US Dietary Guidelines committee that influenced public policy around food and diets in the United States.

Over the last few years, Wansink’s ‘star’ has been fading as other researchers began investigating his work after he posted an article about a graduate researcher who ‘never said no.’ As part of that post (and the subsequent investigation), emails were released that had some interesting commentary around ‘data mining’ that I thought was worth sharing. From Here’s How Cornell Scientist Brian Wansink Turned Shoddy Data Into Viral Studies About How We Eat:

When Siğirci started working with him, she was assigned to analyze a dataset from an experiment that had been carried out at an Italian restaurant. Some customers paid $8 for the buffet, others half price. Afterward, they all filled out a questionnaire about who they were and how they felt about what they’d eaten.

Somewhere in those survey results, the professor was convinced, there had to be a meaningful relationship between the discount and the diners. But he wasn’t satisfied by Siğirci’s initial review of the data.

“I don’t think I’ve ever done an interesting study where the data ‘came out’ the first time I looked at it,” he told her over email.

Emphasis mine.

Since the investigation began, Wansink has had 15 articles retracted from peer-reviewed journals, and many more are being reviewed. Wansink and colleagues were continuously looking through data, trying to find a way to ‘sort’ the data to match what they wanted it to say.

That’s the danger of data mining. You keep working your data until you find an answer you like and ignore the answers you don’t like.

Don’t get me wrong – data mining is absolutely a good thing when done right. You should go into your data with a hypothesis in mind, look for patterns, and then either accept or reject your hypothesis based on the analysis. There’s nothing wrong with then starting over with a new hypothesis, or finding patterns that help you develop a new hypothesis, but your data and your analysis have to lead you down the road to a valid outcome.

What Wansink is accused of doing is something called ‘p-hacking’, where a researcher hunts for a ‘p-value’ of 0.05 or less (corresponding to a 95% confidence level), which allows them to reject the null hypothesis. P-hacking is the art of continuing to sort / manipulate your data until you find the data points that give you a p-value of 0.05 or less. For example, let’s assume that you have a dataset of 500 rows with 4 columns. You run some analysis – for this example we’ll say a basic regression analysis – and you get a p-value of 0.2. That’s not great, as it suggests weak evidence for rejecting the null, but it does give you insight into the dataset. An ethical researcher / data scientist will take what they learned from this analysis and take a look at their data again. An unethical researcher / data scientist will massage the data to get the p-value to look better. Perhaps they make an arbitrary decision to drop any rows with readings over a certain value and re-run the analysis…and bam…they have a p-value of 0.05. That’s p-hacking and poor data mining.
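Here’s a minimal sketch of that pattern in Python (numpy + statsmodels) on purely synthetic data – the variable names and the ‘cleaning’ rule are invented for illustration. It shows how an outcome-driven filter can shrink a p-value without any new evidence:

```python
# A sketch of the p-hacking pattern described above, on synthetic data.
# The 'cleaning' rule below is deliberately arbitrary and outcome-driven.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)   # essentially no real effect

def slope_p_value(x, y):
    """p-value on the slope of a simple OLS regression of y on x."""
    model = sm.OLS(y, sm.add_constant(x)).fit()
    return model.pvalues[1]

print(f"full dataset:     p = {slope_p_value(x, y):.3f}")

# The unethical move: drop the rows that don't fit the story,
# then re-run the exact same analysis.
keep = (y * np.sign(x)) > -0.5      # arbitrary, outcome-driven filter
print(f"after 'cleaning': p = {slope_p_value(x[keep], y[keep]):.3f}")
```

Same data-generating process, same analysis – the only thing that changed is which rows were allowed to stay.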

This is where it gets tricky. There could be a very valid reason why you might have removed those rows of data. Perhaps it was ‘bad data’, or maybe it wasn’t relevant (e.g., the remaining rows have a reading less than 1 and the rows you removed have readings of 10 million), but you need to be able to defend the manipulation of the data, and unethical researchers will generally not be able to do that.

Another ‘gotcha’ related to p-hacking and over-analysis can be found in the Wansink story here:

But for years, Wansink’s inbox has been filled with chatter that, according to independent statisticians, is blatant p-hacking.

“Pattern doesn’t look good,” Payne of New Mexico State wrote to Wansink and David Just, another Cornell professor, in April 2009, after what Payne called a “marathon” data-crunching session for an experiment about eating and TV-watching.

“I also ran — i am not kidding — 400 strategic mediation analyses to no avail…” Payne wrote. In other words, testing 400 variables to find one that might explain the relationship between the experiment and the outcomes. “The last thing to try — but I shutter to think of it — is trying to mess around with the mood variables. Ideas…suggestions?”

Two days later, Payne was back with promising news: By focusing on the relationship between two variables in particular, he wrote, “we get exactly what we need.” (The study does not appear to have been published.)

Don’t do that. That’s bad data mining and bad data science.  If you have to run an analysis 400 times to find a couple of variables that give you a good p-value, you are doing things wrong.
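If it isn’t obvious why 400 attempts will almost always ‘succeed’, here’s a quick sketch on synthetic data (the variable names are invented) showing that pure noise clears the p < 0.05 bar about 5% of the time – so 400 tries should hand you roughly 20 ‘significant’ results by chance alone:

```python
# Why "run analyses until one works" fails: with pure noise,
# about 5% of tests look 'significant' at p < 0.05 by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
outcome = rng.normal(size=100)      # the thing we're 'explaining'

false_positives = 0
for _ in range(400):
    noise = rng.normal(size=100)    # a variable unrelated to the outcome
    _, p = stats.pearsonr(noise, outcome)
    if p < 0.05:
        false_positives += 1

# Expect roughly 20 of 400 (~5%) spurious 'hits'
print(f"{false_positives} of 400 pure-noise variables have p < 0.05")
```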

Data mining is absolutely a valid approach to data. Everyone does it, but not everyone does it right. Be careful of massaging the data to fit your needs and get the answer you want. Let your data tell you how it wants to be handled and what answers it’s going to give.

Is your data ready to help you make game-changing decisions?

Organizations today are facing disruption on all fronts. This should be viewed as a good thing, as it allows organizations to redefine their strategies and markets and to re-create themselves to be better prepared for the future.

This disruption is one of the driving factors behind digital transformation initiatives. In order to successfully complete these transformation projects, companies must build a foundation of properly managed data. With the right data management and governance systems and processes in place, CIOs can begin to build an intelligent organization that has the capability to make intelligent decisions based on data that is reliable, up-to-date and trustworthy.

To build the right foundation for an effective data-driven digital transformation, CIOs must first ensure their organization can effectively understand and manage its data. With the proper data management platform in place to support discovery, connectivity, quality, security, and governance across all systems and processes, organizations can fully trust their data – which means they can trust the decisions, processes, and outcomes driven by that data.

Reliable data has always been important, but it’s vitally important for organizations looking to unlock data’s potential as a driver of digital transformation. With high-quality, “clean” data, CIOs can begin to build an intelligent organization from top to bottom by providing trustworthy data, information, and knowledge for all aspects of the business.

An evolved approach to data management sets the stage for improvements across all areas of the business including finance, marketing and operations. In describing how proper data management has helped her company, Cynthia Nustad, CIO for HMS, states a few clear business benefits. “We’ve accelerated new product introduction, aligned data easier, and reduced the time to onboard customer data by more than 40%,” she says.

In addition to the improvements that data quality can bring to your existing operations, good data provides a strong base for entering the intelligence age. With good data, you can begin to build new data analytics projects and platforms, and incorporate machine learning and other forms of artificial intelligence (AI) into your analytics toolkit. If you try to implement these types of projects without proper data quality and governance systems and processes, you’ll most likely be wasting time and money.

While it’s tempting for CIOs to jump headfirst into AI and other advanced big data initiatives, successful deployments first require a focus on data management. It isn’t the most exciting area, but having good data is an absolute requirement to building an intelligent organization.

Originally published on CIO.com

Want to speed up your digital transformation initiatives? Take a look at your data

Digital transformation has taken center stage in many organizations. Need convincing?

  • IDC predicts that two-thirds of the CEOs of Global 2000 companies will have digital transformation at the center of their corporate strategies by the end of 2017.
  • Four in 10 IT leaders in the Computerworld 2017 Tech Forecast study say more than 50% of their organization has undergone digital transformation.
  • According to Gartner, CIOs are spending 18% of their budget on digitization efforts and expect to see that number grow to 28% by 2018.

Based on this data (and in my regular talks with CIOs), there’s a high probability that you have an initiative underway to digitize one or more aspects of your organization. You may even be well along the digital transformation path and feeling pretty good about your progress.  I don’t want to rain on your digital transformation parade, but before you go any further on your journey, you should take a long, hard look at your data.

Data is the driving force behind every organization today, and thus the driving force behind any digital transformation initiative. Without good, clean, accessible, and trustworthy data, your digital transformation journey may be a slow (and possibly difficult) one. Leveraging data to help speed up your digital transformation initiatives first requires proper data management and governance. Once that’s in place, you can begin to explore ways to open up the data throughout the organization.

Digital transformation is doomed to fail if some (or all) of your data is stored in silos. Those data silos may have worked great for your business in the past by segmenting data for ease of management and accessibility, but they have to be demolished in order to compete and thrive in the digital world. To transform into a truly digital organization, you can no longer allow marketing’s data to remain with marketing and finance’s data to remain within finance. Not only do these data silos make data management and governance more complex, they also hinder the types of analysis that deliver new insights into the business (e.g., analyzing revenue streams by combining marketing and financial data, as in the sketch below). Data needs to be accessible via modern data management, data governance and data integration systems (with the proper security protocols in place) in order to make it accurate and usable as a driving force for digital transformation.
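As a toy illustration of that cross-silo analysis, here’s a short pandas sketch joining a marketing dataset to a finance dataset – the tables, columns and figures are all invented for the example:

```python
# A hypothetical example of the analysis silos prevent: joining
# marketing's campaign data with finance's revenue data.
import pandas as pd

marketing = pd.DataFrame({
    "month": ["2017-01", "2017-02", "2017-03"],
    "campaign_spend": [120_000, 95_000, 140_000],
})
finance = pd.DataFrame({
    "month": ["2017-01", "2017-02", "2017-03"],
    "revenue": [1_450_000, 1_210_000, 1_680_000],
})

# Hard to do while each department guards its own database;
# trivial once the data lives in a shared, governed store.
combined = marketing.merge(finance, on="month")
combined["revenue_per_marketing_dollar"] = (
    combined["revenue"] / combined["campaign_spend"]
)
print(combined)
```

The join itself is trivial – the hard part is the governance and access that make it possible.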

Removing data silos is just one aspect of the data management and governance needed to drive digital transformation. Implementing data management and governance systems and processes that keep your data secure while keeping it available for analysis is a building block for digital transformation.

In order to speed up your transformation projects and initiatives, you really need to take a long, hard look at your data. If you have good data management and governance throughout your organization, you are one step ahead of the companies that haven’t managed their data as a strategic asset and have instead allowed it to be hoarded in silos around the organization.

Digital transformation will be one of the key areas of focus for CIOs for some time to come, and it might just be the key to remaining competitive in your market, so anything you can do today to help your transformation projects succeed should be considered immediately. Having a good data management and governance plan and system in place should help drastically speed up your digitization initiatives.

Originally published on CIO.com

Opportunity Lost: Data Silos Continue to Inhibit Your Business

According to some estimates, data scientists spend as much as 80% of their time getting data into a format that can be used. As a practicing data scientist, I’d say that is a fairly accurate estimate in many organizations.

In the more sophisticated organizations that have implemented proper data integration and management systems, the amount of time spent sifting through and cleaning data is much lower and, in my experience, more in line with the numbers reported in the 2017 Data Scientist Report by Crowdflower.

That report indicates a better balance between basic data-wrangling activities and more advanced analysis:

  • 51% of time spent on collecting, labeling, cleaning and organizing data
  • 19% of time spent building and modeling data
  • 10% of time spent mining data for patterns
  • 9% of time spent refining algorithms

Closing the Gaps

If we think about this in terms of person-hours, there’s a big difference between a data scientist spending 80% of their time finding and cleaning data and one spending 51% of their time on those same tasks. Closing the gap begins with demolishing the data silos that impede organizations’ ability to extract actionable insights from the data they’re collecting.

Digital transformation projects have become a focus of many CIOs, with the share of IT budgets devoted to these projects expected to grow from 18% to 28% in 2018. Top-performing businesses are allocating nearly twice as much budget to digital transformation projects – 34% currently, with plans to increase the share even further to 44% by 2018.

CIOs in these more sophisticated organizations – let’s call them data-driven disruptors – have likely had far more success finding ways to manage the exponential growth and pace of data. These CIOs realize the importance of combating SaaS sprawl, among other data management challenges, and have found better ways to connect the many different systems and data stores throughout their organization.

As a CIO, if you can free up your data team(s) from dealing with the basics of data management and let them focus their efforts on the “good stuff” of data analytics (e.g., data modeling, mining, etc.), you’ll begin to see your investments in big data initiatives deliver real, meaningful results.

Originally published on CIO.com

