Don’t use your data this way

Last week, I wrote a post titled Big Data Starts with Data Management. In that post, I wrote:

Starting with data management will help mitigate these risks since a good data management approach allows organizations to keep data quality in mind from the beginning of a big data project.

Data management is a key aspect of big data projects. There’s no doubt about that.

Today, I want to share a real-world story of one company that has poor (or perhaps no) data management processes and how the lack of good processes could potentially push clients away.

This story starts in 1996 when I purchased a 1995 Chevrolet Blazer. I took the Blazer into a local GM dealer for service a few times in 1997.

Now – fast-forward to March 2014. One evening, I check my email and notice the following email:

We want to welcome you as a Preferred Email Customer of _____ Chevrolet. Thank you for letting us send periodic emails regarding your 1995 Blazer. In the future, we would like to send you emails that will include safety related recalls, service reminders and special offers from _____ Chevrolet that only our Preferred Email Customers will receive.

I was quite surprised when I first saw this email. First, I no longer own the Blazer; I traded it in for a Camaro in 1999. Second, after thinking for a minute and looking at the dealership’s name, I realized it was the same dealership I had taken my Blazer to back in ’97.

Here we have a dealership that’s taken some initiative to build a ‘preferred’ email customer list and started reaching out to those clients. Great idea, but poor execution.

There’s a real problem with their approach though. They’ve done very little in terms of data management. They’ve taken a few pieces of disparate data from their client database and stuffed them into an email.

This isn’t how you use data. You don’t just take data, throw it into an email system and blast your clients or ex-clients. You’ll do nothing but annoy.

Data management processes would help here. With the proper management systems and processes in place, this organization wouldn’t have just dumped old data into an email system. They would have had processes in place to keep the data as accurate as possible, and systems in place to ensure that any data used to contact clients was cleansed, current, and limited to clients who actually want to hear from them.

This particular organization is trying to reach out to clients, but they have used old, outdated data. When I read their email I immediately thought that this dealership had no clue how to do proper client outreach. They had no clue how to ‘clean’ their data or manage that data to ensure they were only reaching out using accurate email lists.

Data management isn’t always THE answer, but for this particular problem, it will help. Proper data management systems and processes are critical for every organization today. Don’t let your organization look as bad as this car dealership did. Make sure you’ve got proper data management processes and systems in place.

A look at Twitter messages in 2012 mentioning $SPY and S&P500 Symbols

Cross Posted at TradeTheSentiment.com

While working on the data analysis chapter of my dissertation, I came across some interesting tidbits of information and thought I’d share.

Nothing here is earth-shattering and there’s not much I (or you) can do with this…but I thought it interesting and hope someone else out there does too. I’ve shared other findings before – and continue to share my daily Bear/Bull Ratio via my Trade The Sentiment site, which is an outcome of this research.

For the data collection phase of my dissertation, I collected Twitter messages for all stocks in the S&P500 index and the SPY ETF itself.  There are many great pieces of knowledge that I’ve gathered from this work – some I’ve shared but most I won’t share because I need something to put into the dissertation. 🙂

So…here’s some data that you might find interesting (or maybe you won’t). Without further ado – and without interpretation, here you go:

SPY and all symbols in S&P 500 Index

Dates: Jan 1 – Dec 30 2012

  • Number of Twitter Messages Captured: 1,655,962
  • Number of Symbols: 501 (S&P 500 + SPY)
  • Number of days messages captured: 361
  • Number of Twitter users: 224,499
  • Average Messages per day: 4,587.15
  • Average Messages per user: 7.38
  • Date with Highest message volume: December 5 2012
  • Symbol with most Mentions: AAPL (620,964 messages or 37.5% of messages)
  • Symbol with most Bearish Mentions: AAPL with 98,402 messages with bearish sentiment
  • Symbol with most Bullish Mentions: AAPL with 78,353 messages with bullish sentiment
  • User with most Tweets: SeekingAlpha
  • Top 10 users account for 128,703 messages or 7.77% of messages
  • Top 25 users account for 197,878 messages or 11.95% of messages
  • Top 50 users account for 278,846 messages or 16.84% of messages
  • 50% of messages were sent by 849 Twitter users or 0.38% of users
  • 80% of messages were sent by 14,049 Twitter users or 12.27% of users
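For anyone curious how summary numbers like these come together, below is a minimal pandas sketch. It is not my actual dissertation code; the file name and the "user" / "date" column names are made up for illustration.

```python
# A rough sketch (not the dissertation code) of how summary stats like the ones
# above can be pulled from a tweet dataset with pandas. The file name and the
# "user" / "date" column names are assumptions for illustration.
import pandas as pd

tweets = pd.read_csv("tweets_2012.csv", parse_dates=["date"])  # hypothetical file

per_user = tweets["user"].value_counts()  # messages per user, most active first
print("Total messages:", len(tweets))
print("Unique users:", per_user.size)
print("Avg messages per day:", len(tweets) / tweets["date"].dt.date.nunique())
print("Avg messages per user:", len(tweets) / per_user.size)

# Share of all messages sent by the top N most active users
for n in (10, 25, 50):
    share = per_user.head(n).sum() / len(tweets)
    print(f"Top {n} users account for {share:.2%} of messages")

# Smallest number of users needed to cover 50% and 80% of all messages
cum_share = per_user.cumsum() / len(tweets)
for target in (0.5, 0.8):
    k = int((cum_share < target).sum()) + 1
    print(f"{target:.0%} of messages came from {k} users ({k / per_user.size:.2%} of users)")
```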

 Top 50 captured Twitter Users:

  1. SeekingAlpha
  2. BigTicks
  3. thefinancepress
  4. wallstCS
  5. CPUStocks
  6. gasoilstocks
  7. ADVFNplc
  8. MarketCurrents
  9. StockRecaps
  10. PerforM84697233
  11. TheStreet
  12. simplestockqtes
  13. Tradified
  14. SAI
  15. BigChipStocks
  16. pennystockguys
  17. DJThistle
  18. RetailerStocks
  19. TradingGuru
  20. boogidown
  21. USwwwStocks
  22. lluccipha
  23. MNYCx
  24. investorpoint
  25. takingstock614
  26. tradingview
  27. stockticks
  28. 1nvestor
  29. ForTraders
  30. FastFoodStocks
  31. StockTwits
  32. some_win
  33. ValueStocksNow
  34. PiggyStocks
  35. Insider_Trades
  36. 61point8
  37. BlueFielder
  38. tlmontana
  39. stockguy22
  40. Phil_Goodship
  41. LaMonicaBuzz
  42. Jamtrades
  43. businessinsider
  44. BUDDIEE18
  45. ZolmaxNews
  46. OneChicago
  47. olyant75
  48. onebrow1
  49. DeidreZune
  50. bored2tears

Cross Posted at TradeTheSentiment.com

Always Learning. Or: my attempt to improve my (horrid) programming habits

I’m the world’s worst developer.

Really. I am.

I don’t follow best practices and my coding style is the oft-chided “brute force” method.

I owe (blame?) my coding style to the fact that the first language I learned was FORTRAN 77, and then I quickly picked up C.  Then…I spent 3 years teaching FORTRAN 77 while a grad student. Teaching FORTRAN 77 to an engineer as their first language is kind of like teaching an artist to draw using only straight lines.  That artist will be able to create art…and perhaps even create beautiful art…but it will be with straight lines, which will limit their creative output.

So…my “brute force” development method is simple: write a line of code to “do something,” then write the next line of code. Etc., etc. I stay in my brute force mindset 99.99% of the time while coding. It works, but it is far from elegant and far from efficient.

But…for what I do, my coding style works. I’m not a professional developer…I write code for data analysis. It does the job that I need it to do.  It might take longer than it should to execute said code, but it works.

Brute force coding can be slow. Very slow, especially when working with large datasets. But…it’s my approach and I’ve been happy with it. Until yesterday.

I have a dataset of over 5 million Twitter messages. Combine that with a dataset of over 8500 stock symbols.  Using Python, I built a set of scripts that read through that large Twitter dataset to find mentions of each stock symbol, and then I aggregate the data based on various time-frames.

Initially, I wrote my code to look at fewer than 30 stock symbols. It was fast enough, especially when looking at just a few days of data. But…when I opened the universe up to over 8500 symbols, my brute force coding method’s inefficiencies became very very (very) visible.

My original script took a little over 24 hours to run through 8500 symbols and create a daily summary for those symbols covering a 1 week period. Yes. THAT is slow. At that speed, building a 1 year sample of daily summaries for 8500 symbols would take roughly 52 days.  Not good.

So…I went back to the drawing board and against my training and instinct, I set aside my brute force methods and looked for more efficient methods. It took me a while to learn a new approach, but I did it.

I had to learn new methods and a new mindset for programming. No more “do this then do this” coding…I had to think abstractly and learn new tools and processes.

Using Python, pandas, numpy and Python’s multiprocessing package, I re-wrote my code. I built the code to use efficient and ‘pythonic’ approaches to performing tasks. I then split up tasks to be spun off to multiple processes. This multiprocessing approach was the biggest efficiency booster overall, but taking advantage of built-in pandas and numpy functions helped as well.

When I began, my code took 24 hours to summarize 1 week of data.  My re-written and re-factored code now does the same task in under 4 minutes.  That’s much, much faster, yes? 🙂   Much of the time savings came from the use of the Python multiprocessing package and a dual-processor Xeon 5570 computer with 16 total threads.  I wrote my code to use 12 of those threads to keep from overloading the machine (and to be able to still use the computer while the script runs). This change, along with a few other minor efficiency changes, brought my compute time from 24 hours down to 75 minutes for the 1 week period.
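To give a feel for the multiprocessing piece, here is a stripped-down sketch of the pattern. It is not my actual script; the file names, columns and summarize_symbol() helper are stand-ins, but the shape is the same: a pool of 12 worker processes, each building daily summaries for a slice of the symbols.

```python
# Simplified sketch of spreading per-symbol summaries across worker processes.
# File names, columns, and summarize_symbol() are stand-ins, not the real script.
from multiprocessing import Pool

import pandas as pd

def init_worker():
    # Each worker process loads its own copy of the week's tweets once.
    global TWEETS
    TWEETS = pd.read_csv("tweets_week.csv", parse_dates=["date"])  # hypothetical file

def summarize_symbol(symbol):
    # Daily count of tweets mentioning one symbol (illustrative only).
    hits = TWEETS[TWEETS["text"].str.contains("$" + symbol, regex=False)]
    return symbol, hits.groupby(hits["date"].dt.date).size()

if __name__ == "__main__":
    symbols = pd.read_csv("symbols.csv")["symbol"].tolist()  # hypothetical file

    # 12 worker processes so the machine stays usable while the job runs.
    with Pool(processes=12, initializer=init_worker) as pool:
        daily_summaries = dict(pool.map(summarize_symbol, symbols))
    print(len(daily_summaries), "symbols summarized")
```

Shuttling whole DataFrames between processes gets expensive, which is why this sketch has each worker load its own copy once via the pool initializer.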

The final efficiency boost came from some built-in functions in pandas. I had been looping through an entire array to get a count of values for each symbol for each day, and that takes computing cycles. Rather than looping, I used pandas’ built-in ‘value_counts’ function.  Making this change brought my compute time from 75 minutes to less than 4 minutes for the same 1 week period.  Some great efficiency gains, I’d say.
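Here is a toy before-and-after of that change. The DataFrame is a made-up stand-in, but it shows the looping version next to the value_counts() version.

```python
# Toy before-and-after: counting messages per symbol by looping vs. value_counts().
import pandas as pd

df = pd.DataFrame({
    "symbol": ["AAPL", "SPY", "AAPL", "GOOG", "AAPL"],
    "date": pd.to_datetime(["2012-12-05"] * 5),
})

# Brute-force version: loop over the whole frame once per symbol.
slow_counts = {}
for symbol in df["symbol"].unique():
    count = 0
    for value in df["symbol"]:
        if value == symbol:
            count += 1
    slow_counts[symbol] = count

# Vectorized version: one pass with the built-in value_counts().
fast_counts = df["symbol"].value_counts()

print(slow_counts)
print(fast_counts.to_dict())
```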

So…the moral of this story?

Don’t be afraid to learn new things and new approaches.  While I still follow my brute force coding methods for many scripts, I know I can bring in more elegant and efficient methods as I need to. It might be difficult to learn something new, but it can be rewarding.

Context and Data

A few weeks ago I wrote about Big Data and Small Business.

In that post, I wrote:

As it’s defined, big data might be too big for small business, but the concepts behind big data – identifying, collecting, analyzing and using data – aren’t too big. Anyone can do those four steps regardless of business size and technical acumen.

When it comes to big data, anyone can ‘do’ big data.  Anyone can identify, collect, analyze and use the analysis to run their business.  The key to ‘doing’ big data is to find the context and the tools to make it work for you and your organization.

There’s a lot on the web about tools for analysis, but not so much about the first step in the process of analyzing data. Besides…thinking about the tools before really understanding what you want to analyze is getting ahead of yourself. That first step? Identifying the right data to analyze.

To identify the ‘right’ data, you’ve got to understand your data and how your data fits into your business.

In short, you have to understand the context of your data.

Webster’s defines the word “context” in a few ways, but we’ll go with this definition, as it’s the most relevant here:

The interrelated conditions in which something exists or occurs

So…context and your data.

In order to get the most out of your data, you have to know what data you have, where the data comes from, how the data was collected and, more importantly, the context surrounding the data when it was collected.

Before we get into the topic of context, let me say one thing about the data itself – the worst thing you can do is to immediately assume that the data that you have is valid and useful. Don’t assume.  You’ve got to understand your data and you’ve got to be sure of the integrity of your data.

Now…this isn’t a data integrity post. There are plenty of good data integrity posts out there so I won’t dive into that space just now.

I’m here to talk about context.  Context is key. 

Just collecting data isn’t enough.  Analyzing data isn’t enough either.

Understanding the context of where your data comes from and how you want to use it is the difference between good data and bad data.

An example

Your organization wants to undertake a social media listening program.  As part of this program, you are interested in understanding ‘sentiment’ of the marketplace.

Your social listening vendor offers you the ability to listen for sentiment on Twitter.  They collect messages that mention your company, products and services. After collecting the messages, they run them through a fairly simple sentiment analysis system.  The analysis system uses a keyword list to assign sentiment to the messages.  This keyword list was built with help from you and your team, but it is fairly basic and very generic.

After a few months of ‘listening’ to sentiment, you get the sense that your organization is well loved on Twitter. The sentiment is through the roof and the market loves you. You claim your listening project to be a success…you now know that the market loves you, your products and your services.

But…do they?

Context is key.

Is the keyword list used for ‘sentiment’ something that is useful for your business?

I don’t want to get too deep into sentiment analysis here, but context is very important in this regard. Is the sentiment keyword list generated with domain knowledge?  Was proper contextual planning used when developing the keywords to listen for?  What does a ‘standard’ client look like for you…are they generally more sarcastic than others? If so…how does that affect your sentiment analysis outcome?

As you can see…context is key. Domain context as well as context around the data you are collecting.

Take a look at a tweet that says: “Just great.  Company Y’s new product X has fifteen features, none of which address my issue!”

Now…to me, with a background in product management and software, that tweet reads as sarcastic.  The user isn’t actually saying the new features are great…they’re being sarcastic about the fact that none of the new features actually solve their biggest problem.

But…with a keyword list built to be generic, the word “great” might be considered to automatically mean that the user is expressing positive sentiment.  The word ‘issue’ might be tagged as negative…or maybe not.  But…without context around the domain, it would be difficult to build a keyword list that accurately classifies this message.
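To make that concrete, here is a toy keyword-based scorer run against that tweet. The word lists are invented for the example, not a real sentiment lexicon.

```python
# Toy illustration of why a generic keyword list misreads sarcasm.
# The word lists here are invented for the example, not a real sentiment lexicon.
POSITIVE = {"great", "love", "awesome"}
NEGATIVE = {"issue", "problem", "broken"}

def naive_sentiment(text):
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweet = "Just great. Company Y's new product X has fifteen features, none of which address my issue!"
# "great" (+1) and "issue" (-1) cancel out, so this prints "neutral".
# Drop "issue" from NEGATIVE and the same tweet scores "positive".
# Either way, the clearly negative, sarcastic meaning is lost without domain context.
print(naive_sentiment(tweet))
```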

Again…I don’t want to dive too deeply into sentiment analysis…it’s a very interesting field that could be discussed for years. The key in this argument is to understand that context is everything here.

There are other examples of context and data that I could provide (and may yet provide in the future), but just remember the following: Context is key.

Data without context is just data.

Image Credit: Context logo by Context Travel on flickr

A quick analysis of the #CIO Twitter Stream – Twitter Quality vs Quantity?

As I mentioned a few weeks ago, I’ve been capturing and analyzing the #CIO twitter stream.

I’m interested in the CIO topic, have the capabilities to do the work and there are some really interesting aspects to twitter users and messages that I’m enjoying studying…so I chose this particular topic to take a more detailed look at.

Update: Per feedback received, I wanted to make the goal of this project clear:

I am looking for ways to ‘measure’ influence and ‘quality’ of twitter users for my doctorate research. While my research is focused on the stock market, I am using the #CIO data stream because it is one that I know well and can follow easily. Using this stream, I am able to build my analysis tools and work through analysis issues that I will re-use in my other research areas. Ultimately, there’s no real “actionable” goal from this particular stream’s research other than to be able to see what is being shared, who is sharing it and how the information might be consumed and re-shared.

The current dataset:

  • Number of Tweets collected: 7,478
  • Number of different users: 2,868
  • Date Range: June 16 to June 28 2012

Collection Method:

  • Using the streaming method of the Twitter API, I am collecting any tweet that uses the hashtag “#CIO” (a rough sketch of this kind of collector follows this list).
  • I am collecting all fields provided to me via Twitter API. They are:
    • id (unique number for each tweet)
    • id_str (string version of id)
    • from_user (string – username from twitter)
    • from_user_id (integer unique for each twitter user)
    • to_user_id (integer describing if a tweet is sent to another user)
    • geo (geographic location if enabled by user)
    • text (twitter message)
    • profile_image_url (url for the profile of the user who sent the message)
    • created_at (date/time of creation of twitter message)
  • Each tweet is stored in a MySQL database for further study.
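For reference, a bare-bones collector of this kind looks roughly like the sketch below. It is not my actual script: it uses tweepy’s older (pre-4.0) streaming interface, the credentials are placeholders, and a print() stands in for the MySQL insert.

```python
# Bare-bones "#CIO" stream collector (illustrative; tweepy's pre-4.0 streaming API).
# Credentials are placeholders and print() stands in for the MySQL insert.
import tweepy

class HashtagListener(tweepy.StreamListener):
    def on_status(self, status):
        # In the real setup, each tweet's fields are written to a MySQL table.
        print(status.id, status.user.screen_name, status.created_at, status.text)

    def on_error(self, status_code):
        # Returning False on a 420 (rate limited) disconnects the stream.
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=HashtagListener())
stream.filter(track=["#CIO"])
```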

Analysis:

  • Using Python, I’ve written a script that pulls tweets with the #CIO hashtag. The script then analyzes the data.
  • Currently, I’ve analyzed the following:
    • Tweets per day
    • Number of tweets per user
    • Lexical Diversity of tweets
    • Average length of tweets
    • Number of mentions/retweets
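As a rough illustration (not the actual analysis script), metrics like these can be computed from the stored tweets along the following lines; the sample rows are made up.

```python
# Rough sketch of the metrics listed above (not the actual analysis script).
# Assumes tweets were pulled from the database into (user, text, created_date) rows;
# the sample rows below are made up.
from collections import Counter

rows = [
    ("user_a", "RT @user_b: The #CIO role is changing with cloud and BYOD", "2012-06-17"),
    ("user_b", "The #CIO role is changing with cloud and BYOD", "2012-06-17"),
    ("user_c", "#CIO leadership means more than keeping the lights on", "2012-06-18"),
]

texts = [text for _, text, _ in rows]
words = [w.lower() for text in texts for w in text.split()]

tweets_per_day = Counter(day for _, _, day in rows)
tweets_per_user = Counter(user for user, _, _ in rows)
lexical_diversity = len(set(words)) / len(words)       # unique words / total words
avg_length = sum(len(t) for t in texts) / len(texts)   # average characters per tweet
mentions = sum(t.count("@") for t in texts)
retweets = sum(1 for t in texts if t.startswith("RT "))

print(dict(tweets_per_day), dict(tweets_per_user))
print(f"lexical diversity: {lexical_diversity:.2f}, avg length: {avg_length:.1f} chars")
print(f"mentions: {mentions}, retweets: {retweets}")
```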

Below are some simple results from the analysis.


#CIO Twitter Stream Content Visualization

On Monday July 16, I started saving all tweets with the hashtag “#CIO” using twitter’s API. I’m using the same collection/storage script that I’m using for my Twitter Sentiment for Investing Decisions research and just added another keyword term to store.

Since I have the ability to capture, store and analyze twitter data, I thought I’d point it at one of the areas that is most interesting to me…the CIO role.

The collector started capturing data around 8AM on Monday July 16.

At 8AM today (July 18), I grabbed the captured tweets to take a look at what people are talking about.

There were 1,089 tweets collected during this time. Per my standard process, I removed common English words as well as the terms “RT” and “CIO”, since those two were the most-used words in the captured tweets and added little value (the term CIO was used 1,104 times and RT was used 356 times).

A few of the other often used terms:

  • Business – used 94 times
  • Cloud – used 102 times
  • BYOD – used 78 times
  • Leadership – used 58 times
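A minimal sketch of that filter-and-count step looks something like the following. The stop-word list here is a tiny stand-in; the real one covers common English words.

```python
# Minimal sketch of the stop-word filtering and term counting described above.
# The stop-word list is a tiny stand-in; the real one covers common English words.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "are", "rt", "cio"}

def term_counts(tweets):
    words = [
        w.strip("#@.,!?:").lower()
        for text in tweets
        for w in text.split()
    ]
    return Counter(w for w in words if w and w not in STOPWORDS)

sample = ["RT @someone: Cloud and BYOD are top of mind for the #CIO"]
print(term_counts(sample).most_common(10))
```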

To get a good visualization of the words found, I used Wordle, which gave me the following (you can view the visualization on Wordle or click the image below to jump to Wordle):

Stay tuned…I’m planning on keeping an eye on this over the next few weeks/months to get a feel for what type of content is coming across this stream and whether there’s any type of analysis that can be done to understand this content better.