If you don’t know, I’m a bit of a data nerd. I’ve been writing about big data, data science, machine learning, and other ‘new’ stuff for years. I believe in data science, and I believe in big data. I’m a fan of machine learning (but think you probably don’t need it for the majority of problems that most organizations run across).
So…with that in mind…let me say this: Big data and data science are nothing new. Everyone is talking about big data, machine learning, artificial intelligence and data science like these things are ‘brand new’ to the world, but they aren’t. All of these ‘buzzword bingo’ candidates have been around for years…think 50+ years in one form or another. It’s wonderful to see the buzz around them these days since we finally have the computing power to actually implement some of these ideas in a much more scalable way.
That said…don’t let ‘scalable’ fool you into thinking that all you need to do is ‘scale’ and things will be hunky-dory. The ability to scale to handle larger problems and larger data sets is extremely important, but without the basics of data science and applied statistics, your big data, machine learning and AI projects won’t be as valuable to you or your organization as you might hope.
According to IBM, we now generate 2.5 quintillion bytes of data per day. What are we doing with all that data? Surely it isn’t all being used by good data scientists to build new models, generate revenue and deliver actionable insights to organizations? I know for a fact it isn’t, although there are plenty of companies that are taking advantage of that data (think Google and Facebook). I once wrote that “today we are drowning in data and starved for information” (which was a small change to a line in John Naisbitt’s 1982 masterpiece Megatrends, in which he wrote ‘we are drowning in information and starved for knowledge’).
Today, we are working with enormous data sets, and there’s no reason to think these data sets won’t continue to get larger. But the size of your data isn’t necessarily what you should be worried about. Beyond the important basics (data quality, data governance, etc.) – which, by the way, have very little to do with data ‘size’ – the next most important aspect of any data project is the ability to analyze data and create some form of knowledge from that data.
When I talk to companies about data projects, they generally want to talk about technologies and platforms first, but that’s the wrong first step. Those discussions are needed, but I always tell them not to get hung up on Spark, Hadoop, MapReduce or other technologies and approaches. I push them to talk about whether they and their organization have the right skills to analyze, contextualize and internalize whatever data they may have. By having the ability to analyze, contextualize and internalize, you add meaning to data, which is how you move from data to knowledge.
To do this work, organizations need to ensure they have people with statistical skills as well as development skills, so they can take whatever data they have and infer something from it. We need these types of skills more than we need the ability to spin up Hadoop clusters. I know 25 people that I can call tomorrow to turn up some big data infrastructure for me that could handle the largest of the large data sets…but I only know a handful of people that I would feel comfortable calling and asking to “find the insights from this data set” and trust that they have all the skills (technical, statistical AND soft skills) to do the job right.
Don’t forget: there IS a science to big data (ahem…it IS called data science after all). This science is needed to work your way up the ‘data -> information -> knowledge’ ladder. By adding context to your data, you create information. By adding meaning to your information, you create knowledge. Technology is an enabler for data scientists to add context and meaning, but it is still up to the individual to do the hard work.
Don’t get me wrong, the technical skills for these types of systems are important. Data scientists need to be able to code and use whatever systems are available to them, but the real work and the value come from creating information and knowledge from data. That said, you don’t work up the ‘data -> information -> knowledge’ ladder without being able to understand and contextualize data, and technology can’t (generally) do those very important steps for you (although with artificial intelligence, we may get there someday).
Stop thinking about the technologies and buzzwords. Don’t think ‘Spark’, ‘Python’, ‘SAS’ or ‘Hadoop’…think ‘analyze’ and ‘contextualize.’ Rather than chasing new platforms, chase new ways to ‘internalize’ data. Unless you and your team can find ways to analyze, contextualize and internalize data, your ability to make a real business impact with big data will be in jeopardy.