Eric D. Brown, D.Sc.

Data Science | Entrepreneurship | ..and sometimes Photography

Tag: data science (page 2 of 3)

Data Analytics – The importance of Data Preparation

How many of you would go skydiving without learning the precautions and safety measures that keep you alive? How many of you would let your kid drive your car without first teaching them the basics of driving? While not as life-and-death as the above questions, data preparation is just as important to proper data analytics as learning the basics of driving before getting behind the wheel.

Experienced data scientists will tell you data prep is (almost) everything and is where they spend the majority of their time. Blue Hill Research reports that data analysts spend at least 2 hours per day on data preparation activities. At 2 hours per day, Blue Hill estimates that it costs about $22,000 per year per data analyst to prepare data for use in data analytics activities.

One of the reasons that prep takes up so much time is that it is generally a very manual process. You can throw tons of technology and systems at your data, but the front end of the data analytics workflow still relies heavily on manual work, even with the automated data preparation tools that are available.

Data preparation is important. But…what exactly is it?

The Importance of Data Preparation

Data prep is really nothing more than making sure your data meets the needs of your plans for that data. Data needs to be high quality, describable, in a format that is easy to use in future analysis, and accompanied by some context.
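To make that a bit more concrete, here is a minimal Python sketch of one small prep step. Everything in it is hypothetical: the field names (`customer`, `signup_date`, `revenue`), the sample rows and the cleaning rules are assumptions made up for illustration, and real data prep depends entirely on your data and your plans for it.

```python
from datetime import datetime

# Hypothetical raw records, as they might come out of a spreadsheet export.
RAW_ROWS = [
    {"customer": " Acme Corp ", "signup_date": "2012-06-16", "revenue": "1,200.50"},
    {"customer": "Globex", "signup_date": "06/17/2012", "revenue": "980"},
    {"customer": "", "signup_date": "2012-06-18", "revenue": "450"},        # missing name
    {"customer": "Initech", "signup_date": "not a date", "revenue": "75"},  # bad date
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed input formats


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(rows):
    """Split rows into cleaned records and rejected records."""
    clean, rejected = [], []
    for row in rows:
        name = row["customer"].strip()
        date = parse_date(row["signup_date"])
        try:
            revenue = float(row["revenue"].replace(",", ""))
        except ValueError:
            revenue = None
        if not name or date is None or revenue is None:
            rejected.append(row)  # keep rejects so they can be reviewed by hand
            continue
        clean.append({"customer": name, "signup_date": date, "revenue": revenue})
    return clean, rejected


clean_rows, rejected_rows = prepare(RAW_ROWS)
print(len(clean_rows), "clean,", len(rejected_rows), "rejected")
```

Keeping the rejected rows instead of silently dropping them is a deliberate choice here: it leaves a record of what was excluded, which is part of the context around the data.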

There are tons of ways to ‘do’ data preparation. You can use databases, scripts, data management systems or just plain old Excel. In fact, according to Blue Hill, 78% of analysts use Excel for the majority of their data preparation work. Interestingly, 89% of those same analysts claim that they use Excel for the majority of their entire data analytics workflow.

As I mentioned before, there are some tools / systems out there today to help with data prep, but they are still in their infancy. One of these companies, Paxata, is doing some very interesting stuff with data preparation, but I think we are a few years off before these types of tools become widespread.

Data preparation is integral to successful data analytics projects. Doing it right takes a considerable amount of time and can often take the majority of a data analyst’s time. Whether you use Excel, databases or a fancy system to help you with data prep, just remember the importance of data preparation.

If you don’t prepare your data correctly, your data analytics may fail miserably. The old saying of “garbage in, garbage out” definitely applies here.

How focused are you on data preparation within your organization?

Good data science isn’t about finding answers to questions

I just finished reading an article over on Fast Company titled “How Designers Are Helping HIV Researchers Find A Vaccine.” The story related in this article is a perfect example of what ‘good’ data science looks like. The data scientists and designers worked together to build a platform that made it easy for anyone to dive into data sets, find answers – and more importantly – find more questions.

I’ve said it before – Good data science isn’t about finding answers to questions. Good data science is about setting up your data sets, processes and systems to allow you to find more questions.  As I’ve said before:

Big Data helps you find the questions you don’t know you want to ask.

The designers and data scientists working with the HIV data were working from a similar mindset. From the article:

“We’ve already harmonized the data . . . we’ve lined everything up, put it in the space, made it so you could ask questions you didn’t set out to ask,” says Dave McColgin, UX design director at Artefact. “You can sort of stumble into additional questions, if that makes sense.”

This is good data science.

These folks didn’t take the data and throw it into a data repository, set up processing systems and technologies and then keep everyone away from it. They didn’t hoard the data or the results of any analysis. They opened the data up to everyone to get multiple sets of eyes (and brains) on the data. They focused on data visualization to make it easy to understand and conceptualize the data. They started with the idea that they wanted to see more questions asked than answered. Again…this is good data science.

For those of you who are thinking about data initiatives or currently working with data, make sure you are building your systems and processes to find more questions than answers. Otherwise, you’ll be missing out on a good portion of the value of data science.


A quick analysis of the #CIO Twitter Stream – Twitter Quality vs Quantity?

As I mentioned a few weeks ago, I’ve been capturing and analyzing the #CIO twitter stream.

I’m interested in the CIO topic, have the capabilities to do the work and there are some really interesting aspects to twitter users and messages that I’m enjoying studying…so I chose this particular topic to take a more detailed look at.

Update: Per feedback received, I wanted to make the goal of this project clear:

I am looking for ways to ‘measure’ influence and ‘quality’ of twitter users for my doctorate research. While my research is focused on the stock market, I am using the #CIO data stream because it is one that I know well and can follow easily. Using this stream, I am able to build my analysis tools and work through analysis issues that I will re-use in my other research areas. Ultimately, there’s no real “actionable” goal from this particular stream’s research other than to be able to see what is being shared, who is sharing it and how the information might be consumed and re-shared.

The current dataset:

  • Number of Tweets collected: 7,478
  • Number of different users: 2,868
  • Date Range: June 16 to June 28 2012

Collection Method:

  • Using the streaming method of the Twitter API, I am collecting every tweet that uses the hashtag “#CIO”.
  • I am collecting all fields provided to me via Twitter API. They are:
    • id (unique number for each tweet)
    • id_str(string version of id)
    • from_user (string – username from twitter)
    • from_user_id (integer unique for each twitter user)
    • to_user_id (integer describing if a tweet is sent to another user)
    • geo (geographic location if enabled by user)
    • text (twitter message)
    • profile_image_url (url for the profile image of the user who sent the message)
    • created_at (date/time of creation of twitter message)
  • Each tweet is stored in a MySQL database for further study.
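For anyone curious what storing these fields might look like, here is a rough sketch. It is not my actual collection script: it uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs without a database server, the sample tweet is made up, and the column name `from_user_id` for the integer user identifier is an assumption.

```python
import sqlite3

# Columns mirror the fields listed above; from_user_id is an assumed name
# for the integer user identifier.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    id                INTEGER PRIMARY KEY,
    id_str            TEXT NOT NULL,
    from_user         TEXT NOT NULL,
    from_user_id      INTEGER,
    to_user_id        INTEGER,
    geo               TEXT,
    text              TEXT NOT NULL,
    profile_image_url TEXT,
    created_at        TEXT
)
"""

COLUMNS = ("id", "id_str", "from_user", "from_user_id", "to_user_id",
           "geo", "text", "profile_image_url", "created_at")


def store_tweet(conn, tweet):
    """Insert one tweet dict; skip it if that id is already stored."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    # INSERT OR IGNORE is SQLite syntax; MySQL spells this INSERT IGNORE.
    sql = f"INSERT OR IGNORE INTO tweets ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    conn.execute(sql, tuple(tweet.get(col) for col in COLUMNS))
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(SCHEMA)

sample = {  # made-up tweet, not real data
    "id": 1, "id_str": "1", "from_user": "example_user", "from_user_id": 42,
    "to_user_id": None, "geo": None, "text": "Thoughts on the #CIO role",
    "profile_image_url": "http://example.com/a.png",
    "created_at": "2012-06-16 12:00:00",
}
store_tweet(conn, sample)
store_tweet(conn, sample)  # the duplicate is ignored
stored_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(stored_count)
```

Using the tweet id as the primary key means a tweet that arrives twice is only stored once, which keeps the counts used in later analysis honest.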


  • Using Python, I’ve written a script that pulls tweets with the #CIO hashtag. The script then analyzes the data.
  • Currently, I’ve analyzed the following:
    • Tweets per day
    • Number of tweets per user
    • Lexical Diversity of tweets
    • Average length of tweets
    • Number of mentions/retweets
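The measures listed above can be sketched in a few lines of Python. This is an illustration only, not the script I used: the tweets are made up, lexical diversity is assumed to mean unique words divided by total words, and the rules for spotting retweets (text starting with "RT @") and mentions (any other text containing "@") are simplifying assumptions.

```python
from collections import Counter

# Made-up tweets as (date, user, text) tuples; not real data.
TWEETS = [
    ("2012-06-16", "alice", "The #CIO role is changing fast"),
    ("2012-06-16", "bob", "RT @alice: The #CIO role is changing fast"),
    ("2012-06-17", "alice", "Big data and the #CIO"),
    ("2012-06-17", "carol", "@bob good point on the #CIO budget"),
]


def lexical_diversity(texts):
    """Unique words divided by total words across all texts (assumed definition)."""
    words = [w.lower() for t in texts for w in t.split()]
    return len(set(words)) / len(words) if words else 0.0


texts = [text for _, _, text in TWEETS]

tweets_per_day = Counter(day for day, _, _ in TWEETS)
tweets_per_user = Counter(user for _, user, _ in TWEETS)
diversity = lexical_diversity(texts)
average_length = sum(len(t) for t in texts) / len(texts)  # in characters

# Simplified rules: a retweet starts with "RT @"; any other text with "@" is a mention.
retweets = sum(1 for t in texts if t.startswith("RT @"))
mentions = sum(1 for t in texts if "@" in t and not t.startswith("RT @"))

print(dict(tweets_per_day), dict(tweets_per_user))
print(round(diversity, 2), round(average_length, 1), retweets, mentions)
```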

Below are some simple results from the analysis.


Big Data, Small Business

This post sponsored by the Enterprise CIO Forum and HP.

Big Data is a Big story these days and has been for some time.

Big Data is a big business too…and will most likely continue to grow.

Big Data is a topic that many in large organizations are talking about…as are many consulting and technology companies. HP, among others, has a great deal of future business riding on Big Data and the underlying architecture. They’ve published an interesting report titled Information Optimization: Turning Information into better Enterprise Decisions (PDF download) that’s worth a read.

In a research report titled “Big data: The next frontier for innovation, competition, and productivity” by McKinsey & Co, an argument is made that Big Data will:

…become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus

The McKinsey report describes seven key ‘insights’…they are:

  1. Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital
  2. Big Data can create Big Value (more on this in a minute)
  3. The use of big data will become a key basis of competition and growth for individual firms
  4. The use of big data will underpin new waves of productivity growth and consumer surplus
  5. While the use of big data will matter across sectors, some sectors are set for greater gains
  6. There will be a shortage of talent necessary for organizations to take advantage of big data.
  7. Several issues will have to be addressed to capture the full potential of big data.

The full report is available for download here.

All these key ‘insights’ are spot on and nothing new…for big organizations. But…what about small businesses?   Does Big Data make itself available to Small Business?  Or…more appropriately…does Small Business have access to the Big Data tools, data and talent?

In order to play in the Big Data world, you’ve got to have data. And the ability to utilize that data.

Does Small Business have data? Yes. Does Small Business have the ability to utilize that data? Maybe…maybe not.

Insight #6 from McKinsey will hit Small Business extremely hard.  If there’s a shortage of Talent in the broader market, just imagine the shortage of talent and capabilities in the small business market.

Not only is there a shortage of talent…but there seems to be a shortage in tools too. Not that the tools out there today can’t be used by small businesses…it’s just that most tools are either too expensive or not available for their data.

The benefits of Big Data are many…if you have the capability to use / consume  it.   There are some real benefits to using Data for decisions.

But for small business to use Big Data, they have to 1.) Have Data;  2.) Know how to consume that data to analyze it; and 3.) Know how to analyze that data.

Kevin Tea touches on the importance of Big Data today in an article titled Big Data – Deal With The Negatives To Enjoy The Positives over on the Enterprise CIO Forum.  Kevin writes about the positives behind Big Data and the ways companies can use data to drive innovation and performance. A few key items are:

  • Make big data more accessible and timely.
  • Segment populations to customize.
  • Use automated algorithms to replace and support human decision making.
  • Innovate with new business models, products, and services

Good points.

But…are any of these points realistic for small business?

Point #1 is a requirement. The other three are nice and would be wonderful for small businesses to take advantage of, but I wonder if they will ever be able to.

I wonder if Big Data & Small Business are even meant to play together? If I owned a small business, I’d want to know everything I could about my business, my customers and my industry. I’d then want to be able to take that data and use it for making decisions and driving innovation in my business.

I just wonder if small businesses have the right tool set and skill set to take advantage of the explosion in Data. I wonder if small business owners have the data. I wonder if Big Data will drive innovation, performance and improvements for big business but leave Small Business behind.

I’d like to see more small businesses be in a position to take advantage of the analytic capabilities that are popping up everywhere today.

Is Big Data another ‘fad’ to sell more consulting services and technology platforms? Is Big Data just another repackaging of what most smart organizations have always done? Maybe. Maybe not. Either way…the need for small business to have access to analytics and data is as important today as it has ever been.

I’d love to hear your stories of how small businesses are using data to drive their business.  Drop me a comment or email.

Image Credit: Visualization of Twitter bios in the big data community By metaroll on flickr


Using Twitter Sentiment for predicting stock price movement

I just finished giving a presentation titled “Will Twitter Make you a better investor?”…and like I always do with these presentations, I recorded one of my rehearsals to share.

In this presentation, I provide an overview of my research into using twitter sentiment and message volume as inputs into modeling stock price movements. A quick and dirty linear regression model using Twitter Sentiment, the Number of Tweets per day, the VIX Closing price and the VIX Price change delivers a simple model for the S&P 500 SPY ETF that has an accuracy of 57% over 6 months (tested on out-of-sample data). This model was built using data from July 11 2011 to August 11 2011. Note: Accuracy is a measure of predicting the direction of movement. Being accurate and making money from that accuracy are two different things.
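To show what "accuracy" means here, and why it is not the same as making money, here is a small Python sketch. The numbers are made up and are not from my research, the function name `directional_accuracy` is hypothetical, and the naive profit calculation (go long when the model predicts up, short otherwise, ignoring all costs) is an assumption for illustration.

```python
def directional_accuracy(predicted, actual):
    """Fraction of days where predicted and actual changes point the same way.

    A change of exactly zero is treated as "not up".
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if (p > 0) == (a > 0))
    return hits / len(predicted)


# Made-up daily price changes, for illustration only.
predicted_changes = [0.4, -0.2, 0.1, -0.3, 0.5]
actual_changes = [0.6, 0.1, 0.2, -0.1, -0.9]

accuracy = directional_accuracy(predicted_changes, actual_changes)

# Naive strategy (an assumption): long when the prediction is up, short otherwise.
naive_pnl = sum(a if p > 0 else -a for p, a in zip(predicted_changes, actual_changes))

print(accuracy, round(naive_pnl, 2))
```

In this toy example the direction is right on three days out of five, yet the naive strategy still loses money because the one big move happens on a day the prediction got wrong.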

Update:  Please note that the Linear Regression model described in this presentation is far from ideal. When modeling Time Series data, the linear regression model must be used with care due to autocorrelation issues.  
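One common way to check for this kind of problem is the Durbin-Watson statistic on the model's residuals. The sketch below is a generic illustration of that diagnostic in plain Python, not something taken from the presentation; the residual series are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals.

    Values near 2 suggest little first-order autocorrelation, values well
    below 2 suggest positive autocorrelation, and values well above 2
    suggest negative autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    return numerator / denominator


# Made-up residuals: one series drifts slowly, the other flips sign every day.
trending = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(round(dw_trending, 2), round(dw_alternating, 2))
```

A residual series that drifts like the first one is the typical warning sign with time series data: neighboring errors are related, so the usual regression assumptions do not hold.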

If you don’t want to listen to me yammer, you can jump down to the bottom of this post and take a look at the slides.

The presentation (if you don’t see anything…jump over to Vimeo to watch it there (~30 minutes)):

Twitter Sentiment & Investing – modeling stock price movements with twitter sentiment. from Eric D Brown on Vimeo.

The slides (if you don’t want to listen to me yammer):

Will Twitter make you a better investor?

My paper, titled Will Twitter Make You a Better Investor? A Look at Sentiment, User Reputation and their effect on the Stock Market, has been published in the Conference Proceedings for the Southern Association for Information Systems (SAIS) 2012 Conference.

You can grab a copy of the PDF here: Will Twitter Make You a Better Investor? A Look at Sentiment, User Reputation and their effect on the Stock Market

You can see the full proceedings of the SAIS 2012 Conference here.
