Deep learning – when should it be used?


“When should I use deep learning?”

I get asked that question constantly.

The answer is both complicated and simple at the same time.

The answer I usually give is something along the lines of ‘if you have a lot of data and an interesting / challenging problem, then you should try out deep learning’.

How much is ‘a lot of data’? That’s the complicated part.

Let’s use some examples to try to clarify things.

  • If you have 5 years of monthly sales data and want to use deep learning to build a forecaster, you’ll most likely be wasting your time. Deep learning will work technically, but it generally won’t give you much better results than simpler machine learning or even simpler regression techniques.
  • If you have 20 years of real estate sales data with multiple features (e.g., square footage of the house, location, comparables, etc.) and want to try to predict sales prices within a neighborhood/state/country, then deep learning is definitely an approach to take. This is a wonderful use case for deep learning.
  • If you want to build a forecaster to help develop a budget for your organization, maybe deep learning is a good approach…and maybe it isn’t.
  • If you want to build a “Hotdog Not Hotdog” app, deep learning is the right approach.
  • If you want to forecast how many widgets you’ll need to build next year with the previous 10 years of data, I’d recommend going with regression first and then moving into some basic machine learning techniques (there’s a quick sketch of this right after the list). Deep learning (e.g., neural networks) could work here, but it might not make a lot of sense depending on the size of the data.
  • If you want to predict movements in the stock market using the last 100 years of stock market data combined with hundreds of technical and/or fundamental indicators, deep learning could be a good approach, but a good machine learning methodology would work as well. Just be careful of data mining and other biases you can introduce when working with time series data like the markets.
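To make the ‘regression first’ advice from the widget bullet concrete, here’s a minimal Python sketch. The CSV file and column names are made up for illustration, and scikit-learn’s TimeSeriesSplit is used so the validation respects the time ordering of the data.

    # Baseline forecasters for the (hypothetical) widget example above.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    # 10 years of monthly widget demand with a few simple lagged features
    df = pd.read_csv("widget_demand_monthly.csv")  # hypothetical file
    X = df[["month_of_year", "units_sold_last_month", "units_sold_last_year"]]
    y = df["units_sold"]

    cv = TimeSeriesSplit(n_splits=5)  # keep the validation in time order
    for name, model in [("linear regression", LinearRegression()),
                        ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
        scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
        print(f"{name}: mean absolute error ~ {-scores.mean():.1f} units")

If neither baseline is good enough and you have far more data than this, that’s the point at which a neural network starts to be worth the effort.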

Ok. So I didn’t exactly clarify things, but hopefully you get the point.

Andrew Ng uses this graphic to highlight where deep learning makes sense (from his Deeplearning.ai Coursera course):

[Figure: Deep learning vs. other approaches – model performance as a function of the amount of data]

The various lines on the chart are different approaches (regression, machine learning, deep learning) with the ‘standard’ approaches of regression and machine learning shown in red/orange.  You can see from the left of the chart that these types of approaches are similar performance-wise with small data sets.

Deep learning really begins to diverge in performance when your data set gets sufficiently large. The ‘problem’ here is that ‘sufficiently large’ is hard to define. That’s why I usually tell people to start with the basics: try regression first, then move to machine learning (random forests, SVMs, etc.) and then – once you have a feel for your data and the performance of your approach isn’t delivering the results you expected – try out deep learning.
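One hedged way to put that ordering into practice with scikit-learn is sketched below. It uses the library’s built-in California housing data as a stand-in for the real estate example earlier, and MLPRegressor stands in for a ‘real’ deep learning stack; everything here is illustrative rather than a recipe.

    # Escalate only when the simpler model stops improving.
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    X, y = fetch_california_housing(return_X_y=True)
    X, y = X[:5000], y[:5000]  # subsample to keep the sketch quick

    candidates = [
        ("ridge regression", Ridge(alpha=1.0)),
        ("random forest", RandomForestRegressor(n_estimators=300, random_state=0)),
        ("SVM (RBF kernel)", make_pipeline(StandardScaler(), SVR(C=10.0))),
        ("small neural net", make_pipeline(StandardScaler(),
                                           MLPRegressor(hidden_layer_sizes=(64, 64),
                                                        max_iter=1000, random_state=0))),
    ]
    for name, model in candidates:
        r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
        print(f"{name}: mean R^2 = {r2:.3f}")

If the last row doesn’t clearly beat the first three, the extra complexity of a deep learning approach probably isn’t buying you anything yet.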

That said, there are obviously times that deep learning makes sense initially. When you are looking at things like machine vision, natural language processing, autonomous driving or real-time text translation, you want to investigate deep learning right away. Additionally, you can use deep learning approaches for any problem you want, but the performance is best when you have a large data set.

So…when should you consider deep learning? It depends on the challenge you are trying to solve.  Sorry…there’s not an ‘easy’ answer for the question.

 

When it comes to big data, think these three words: analyze; contextualize; internalize


If you don’t know, I’m a bit of a data nerd. I’ve been writing about big data, data science, machine learning and other ‘new’ stuff for years. I believe in data science and I believe in big data. I’m a fan of machine learning too, though you probably don’t need it for the majority of problems that the majority of organizations run across.

So…with that in mind…let me say this: big data and data science are nothing new. Everyone is talking about big data, machine learning, artificial intelligence and data science like these things are ‘brand new’ to the world, but they aren’t. All of these ‘buzzword bingo’ candidates have been around for years – think 50+ years in one form or another. It’s wonderful to see the buzz around them these days since we finally have the computing power to actually implement some of these ideas in a much more scalable way.

That said…don’t let ‘scalable’ fool you into thinking that all you need to do is ‘scale’ and things will be hunky-dory. The ability to scale to handle larger problems and larger data sets is extremely important, but without the very basics of data science and applied statistics, your big data / machine learning / AI projects aren’t going to be as valuable to you or your organization as you might hope.

According to IBM, we now generate 2.5 quintillion bytes of data per day. What are we doing with all that data? Surely it isn’t all being used by good data scientists to build new models, generate revenue and deliver actionable insights to organizations? I know for a fact it isn’t, although there are plenty of companies that are taking advantage of that data (think Google and Facebook). I once wrote that ‘today we are drowning in data and starved for information’ – a small change to John Naisbitt’s 1982 masterpiece Megatrends, in which he wrote ‘we are drowning in information and starved for knowledge.’

We are working with enormous data sets today and there’s no reason to think they won’t continue to get larger. But the size of your data isn’t necessarily what you should be worried about. Beyond the important basics (data quality, data governance, etc.) – which, by the way, have very little to do with data ‘size’ – the next most important aspect of any data project is the ability to analyze data and create some form of knowledge from it.

When I talk to companies about data projects, they generally want to talk about technologies and platforms first, but that’s the wrong first step. Those discussions are needed, but I always tell them not to get hung up on Spark, Hadoop, MapReduce or other technologies / approaches. I push them to talk about whether they and their organization have the right skills to analyze, contextualize and internalize whatever data they may have. By analyzing, contextualizing and internalizing, you add meaning to data, which is how you move from data to knowledge.

To do this work, organizations need people with statistical skills as well as development skills who can take whatever data they have and infer something from it. We need these types of skills more so than we need the ability to spin up Hadoop clusters. I know 25 people I could call tomorrow to stand up big data infrastructure that could handle the largest of the large data sets…but I only know a handful of people I would feel comfortable calling to ask, “find the insights in this data set,” and trust that they have all the skills (technical, statistical AND soft skills) to do the job right.

Don’t forget, there IS a science to big data (ahem…it IS called data science after all). This science is needed to work your way up the ‘data -> information -> knowledge’ ladder. By adding context to your data, you create information. By adding meaning to your information, you create knowledge. Technology is an enabler for data scientists to add context and meaning, but it is still up to the individual to do the hard work.

Don’t get me wrong, the technical skills for these types of systems are important. Data scientists need to be able to code and use whatever systems are available to them, but the real work and the real value come from creating information and knowledge from data. You can’t work your way up the ‘data -> information -> knowledge’ ladder without being able to understand and contextualize data, and technology (generally) can’t do those very important steps for you – although with artificial intelligence, we may get there someday.

Stop thinking about the technologies and buzzwords.  Don’t think ‘Spark’, ‘python’, ‘SAS’ or ‘Hadoop’…think ‘analyze’ and ‘contextualize.’ Rather than chasing new platforms, chase new ways to ‘internalize’ data. Unless you and your team can find ways to analyze, contextualize and internalize data, your ability to make a real business impact with big data will be in jeopardy.

The Data Way


The world has become a world of data. According to Domo, the majority of the data that exists today (roughly 90% of it) has been created within the last two years. That’s a lot of data. Actually…that’s a LOT of data. And it’s your job to use that data to make better decisions and guide your organization / team to a brighter future.

Whether you’re in marketing, IT, HR, Finance, Sales or any other function within an organization, you have data and you need to figure out how to use that data – but where do you begin?

Many people grab data, throw it into Excel and start throwing pivot tables and vlookups at it. If that’s what you do – then more power to you. Personally, I can’t stand vlookups. Truth be told – they don’t like me and subsequently I hate them. Don’t get me wrong – pivot tables and vlookups (and the other useful spreadsheet functionality) can deliver very good insight into your data, but only if you know what you’re looking for.
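For what it’s worth, both of those spreadsheet moves have rough equivalents in Python’s pandas library; here’s a small sketch with made-up sales data.

    # Rough pandas equivalents of a pivot table and a vlookup, with made-up data.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "West"],
        "product": ["A", "B", "A", "A", "B"],
        "revenue": [100, 250, 80, 120, 300],
    })
    prices = pd.DataFrame({"product": ["A", "B"], "list_price": [10, 25]})

    # Pivot table: revenue by region and product
    pivot = sales.pivot_table(index="region", columns="product",
                              values="revenue", aggfunc="sum", fill_value=0)

    # "vlookup": pull the list price onto each sales row via a left join
    joined = sales.merge(prices, on="product", how="left")

    print(pivot)
    print(joined)

The advantage over the spreadsheet isn’t the syntax – it’s that a script makes it cheap to keep asking the next question.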

Of course – you have a question or questions you want answered, and that’s why you’re digging into your data. You might want to know what your material costs are going to be next year. Maybe you want to forecast your sales revenue for the coming quarter. Or perhaps you want to better understand the differences in pay scales between the different groups of people within your organization.

That’s all well and good, but what about all the other questions you don’t know you have? You’ll never find the answers to those questions if you stick with the pivot tables and vlookups built to answer the ‘original’ question, because you didn’t know you were supposed to be asking anything more.

When I say this in conversation, I tend to get a lot of questioning looks and responses like ‘that makes no sense’ or ‘I can’t ask questions I don’t know I’m supposed to ask’. Fair enough. I usually respond with the example of the creation of the Post-it Note by Art Fry at 3M. Nobody at 3M was looking to develop little sticky pieces of paper to be used as notes. They were just trying to create better adhesives when an idea struck Mr. Fry. He needed a bookmark and page marker that wouldn’t fall out. After some trial and error, the Post-it Note was born and now these little notes are part of a multi-billion dollar industry for 3M.

3M and its engineers had no idea they needed/wanted to invent the Post-it note but they were open to exploring new ideas and questions as they arose.

This is the same mindset you need to have with data. Don’t just ‘answer the question’ but keep digging and keep playing.  It can be tough to do that in Excel when stuck in pivot table and vlookup hell, but it can be done. Just keep your curiosity levels high and keep looking for those questions you didn’t know you had.

That’s the data way.

Don’t forget the “Science” in Data Science


Just a reminder to everyone out there: this isn’t Data Magic…it is Data Science. The word ‘science’ is included there for a reason.

I would LOVE for magic to be involved in data analytics. I could then whip up a couple of spells, say ‘abra cadabra’ and have my data tell me something meaningful. But that’s not how it works. You can say fancy incantations all day long, but your data is going to be meaningless until you do some work on it.

This ‘work’ that you need to do involves lots of very unglamorous activities. Lots of data munging and manipulation. Lots of trial and error and a whole lot of “well that didn’t work!”

Data science requires a systematic approach to collecting, cleaning, storing and analyzing data. Without ‘science’, you don’t have anything but a lot of data.

Let’s take a look at what the word ‘science’ means. Dictionary.com defines “science” as:

  • a branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws
  • systematic knowledge of the physical or material world gained through observation and experimentation.
  • any of the branches of natural or physical science.
  • systematized knowledge in general.
  • knowledge, as of facts or principles; knowledge gained by systematic study.
  • a particular branch of knowledge.
  • skill, especially reflecting a precise application of facts or principles

You’ll notice that the word ‘magic’ isn’t included anywhere in that definition but the word ‘systematic’ shows up a few times. While we’re at it, let’s take a look at a definition of data science (from Wikipedia):

an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured

Again…nothing about ‘abra cadabra’ in there.

If you want to ‘do’ data science correctly, you have to do the hard work. You have to follow some form of systematic process to get your data, clean your data, understand your data and then use that data to test out some hypotheses.
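As a sketch of what ‘systematic’ can look like in code – the file, the columns and the hypothesis being tested below are all invented for illustration – those same four steps might read something like this:

    # Get the data, clean it, understand it, then test a hypothesis.
    # File and column names are invented for illustration.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("signups.csv")                           # get the data

    df = df.dropna(subset=["group", "time_on_site"])          # clean: drop unusable rows
    df["time_on_site"] = df["time_on_site"].clip(lower=0)     # clean: no negative durations

    print(df.groupby("group")["time_on_site"].describe())     # understand it first

    # Hypothesis: the "new_layout" group spends more time on site than "control"
    a = df.loc[df["group"] == "new_layout", "time_on_site"]
    b = df.loc[df["group"] == "control", "time_on_site"]
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")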

Doing data science without ‘science’ is nothing more than throwing darts at a dart board and thinking the results are meaningful.


Data Analytics – Data Modeling, a Necessary first step


What do you think of when you hear the term ‘data modeling’?

Just typing ‘data modeling’ almost made me go to sleep.  Who am I kidding…I’m a data geek and this stuff is interesting…but some folks aren’t quite as excited by this stuff as I am.

Data modeling has many different definitions and connotations. For many within the IT world, data modeling conjures up database administrators sitting in a room designing tables and relationships. That type of thing does make me sleepy…but it is a necessary step in any data storage workflow and in your data strategy.

Oh. What’s that? You don’t have a data strategy?

Well…You need one.

Here’s why: much like business strategy, data strategy provides guidance on how your organization is going to capture, manage, use and integrate data into your business. Business strategy helps inform and guide data strategy, while your data strategy helps you build better business strategies and tactics.

I’ll assume you have a data strategy in place, because you do need one before you dive into data modeling. Sure, you can build data models without any type of strategy, but I can guarantee those models will be reworked many times over since they weren’t informed by any strategic thinking.

As I mentioned before, data modeling has many different definitions, many of which are very technical and beyond the scope of this short post, but I will share the steps that I like to use. Data modeling consists of the following steps:

  • Understanding your business strategies, tactics and needs
  • Understanding what data you have and who might use it in the future
  • Understanding where your data comes from (and where it might be going)
  • Understanding the context of your data
  • Ensuring data quality, consistency and governance
  • Ensuring proper metadata is included with your data

These seem pretty straightforward (and they are), but they are the key steps needed to undertake a data project. They aren’t earth-shattering revelations about how to do data modeling, but making sure they are covered in every data modeling project has helped me, my colleagues and my clients build some great data models, which led to great outcomes from the data we had.
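For the quality, consistency and metadata bullets in particular, even a lightweight check written in code goes a long way. Here’s a hedged sketch using an invented real-estate-style schema – the table, columns and rules are purely illustrative, not a prescribed tool.

    # Encode the data model's expectations so violations surface early.
    # The columns, dtypes and rules here are invented for illustration.
    import pandas as pd

    EXPECTED = {
        "listing_id":  "int64",
        "sale_price":  "float64",
        "square_feet": "float64",
        "sale_date":   "datetime64[ns]",
    }

    def check_listings(df: pd.DataFrame) -> list[str]:
        """Return a list of data-quality problems instead of failing silently."""
        problems = []
        for col, dtype in EXPECTED.items():
            if col not in df.columns:
                problems.append(f"missing column: {col}")
            elif str(df[col].dtype) != dtype:
                problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
        if "sale_price" in df.columns and (df["sale_price"] <= 0).any():
            problems.append("sale_price: non-positive values present")
        return problems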

If you don’t take the time to understand your data, how do you know that the analytics you build with that data are accurate? You don’t. Spend the necessary time in the modeling phase of your next data project and you may be surprised at the quality of the output of your data analytics.

Data Analytics – The importance of Data Preparation


How many of you would go sky diving without learning the precautions and safety measures necessary to keep you alive? How many of you would let your kid drive your car without first teaching them the basics of driving? While not as life-and-death as the above questions, data preparation is just as important to proper data analytics as learning the basics of driving before getting behind the wheel.

Experienced data scientists will tell you data prep is (almost) everything and is the area where they spend the majority of their time. Blue Hill Research reports that data analysts spend at least 2 hours per day on data preparation activities. At 2 hours per day, Blue Hill estimates that it costs about $22,000 per year per data analyst to prepare data for use in data analytics activities.

One of the reasons that prep takes up so much time is that it is generally a very manual process. You can throw tons of technology and systems at your data, but the front end of the data analytics workflow is still largely hands-on. Automated tools exist to help with data preparation, but they only take you part of the way.

Data preparation is important. But…what exactly is it?

The Importance of Data Preparation

Data prep is really nothing more than making sure your data meets the needs of your plans for that data. Data needs to be high quality, describable, in a format that is easily used in future analysis, and accompanied by some context.
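As a rough illustration – the source file, the columns and the cleanup rules below are all invented – ‘meeting the needs of your plans’ often boils down to a handful of scripted steps like these:

    # Typical data preparation steps; file and column names are invented.
    import pandas as pd

    raw = pd.read_csv("orders_export.csv")

    prepared = (
        raw.rename(columns=str.lower)                          # consistent naming
           .drop_duplicates(subset="order_id")                 # de-duplicate
           .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
                   amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"))
           .dropna(subset=["amount"])                          # drop unusable rows
    )

    # Keep a little context with the data: where it came from
    prepared.attrs["source"] = "orders_export.csv (hypothetical nightly export)"
    prepared.to_csv("orders_prepared.csv", index=False)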

There are tons of ways to ‘do’ data preparation. You can use databases, scripts, data management systems or just plain old Excel. In fact, according to Blue Hill, 78% of analysts use Excel for the majority of their data preparation work. Interestingly, 89% of those same analysts claim they use Excel for the majority of their entire data analytics workflow.

As I mentioned before, there are some tools / systems out there today to help with data prep, but they are still in their infancy. One of these companies, Paxata, is doing some very interesting stuff with data preparation, but I think we are a few years off before these types of tools become widespread.

Data preparation is integral to successful data analytics projects. To do it right takes a considerable amount of time and can often consume the majority of a data analyst’s time. Whether you use Excel, databases or a fancy system to help you with data prep, just remember the importance of data preparation.

If you don’t prepare your data correctly, your data analytics may fail miserably. The old saying of “garbage in, garbage out” definitely applies here.

How focused are you on data preparation within your organization?