Data Analytics & Python (Cross Post)

Crosspost – This post first appeared on Python Data as Data Analytics & Python.

So you want (or need) to analyze some data. You’ve got some data in an Excel spreadsheet or a database somewhere and you’ve been asked to take that data and do something useful with it. Maybe it’s time for data analytics & Python?

Maybe you’ve been asked to build some models for predictive analytics. Maybe you’ve been asked to better understand your customer base based on their previous purchases and activity.  Perhaps you’ve been asked to build a new business model to generate new revenue.

Where do you start?

You could go out and spend a great deal of money on systems to help you in your analytics efforts, or you could start with tools that are already available to you.  You could open up Excel, which is very much overlooked by people these days for data analytics. Or…you could install open source tools (for free!) and begin hacking away.

When I was in your shoes in my first days playing around with data, I started with Excel. I quickly moved on to other tools because the things I needed to do seemed difficult to accomplish in Excel. I then installed R and began to learn ‘real’ data analytics (or so I thought).

I liked (and still do like) R, but it never felt like ‘home’ to me.  After a few months poking around in R, I ran across Python and fell in love. Python felt like home to me.

With Python, I could quickly cobble together a script to do just about anything I needed to do. In the 5+ years I’ve been working with Python now, I’ve not found anything that I cannot do with Python and freely available modules.

Need to do some time series analysis and/or forecasting? Python and statsmodels (along with others).

Need to do some natural language processing?  Python and NLTK (along with others).

Need to do some machine learning work? Python and sklearn (along with others).

You don’t HAVE to use Python for data analysis. R is perfectly capable of doing the same things Python is, and in some cases R has more capabilities than Python does because it’s been used as an analytics tool for much longer than Python has.

That said, I prefer Python and use Python in everything I do. Data analytics & Python go together quite well.

 

I learned Python…and much more

I’ve spent the last two weeks with my head buried in programming languages.

I’ve been needing to re-write some scripts for data analysis for my research. I initially wrote some scripts in R but found that R is particularly slow when it comes to this type of analysis (more accurately I should say that my implementation of these analysis techniques is slow).

So I started looking for a more economical way to do this analysis.  I’m using PHP to do some of the up-front data collection, so my logical choice was to dust off my PHP skills and build some analysis scripts using PHP.

So I got out my PHP books and started coding. After a few days, I had a pretty impressive set of scripts that would take my collected data, run a Bayes classification filter on that data for sentiment and then summarize that data.  I was proud of myself…until I realized that the implementation of my classification algorithm would be difficult to justify in an academic setting…or at least that I’d have to spend a lot of time defending and justifying it at a later date. This was also one of the reasons that I wanted to re-write the R scripts.

So…I revisited my approach.  Was there anything written in PHP that was well received in the academic world? Of course not.

One approach that is used by many researchers in text classification and sentiment analysis is to use the Python language and the Natural Language Toolkit (NLTK). There are plenty of academic articles citing the NLTK…so that helps me with defending my algorithms in my dissertation work.

Now…I’d never looked at Python. I couldn’t have written a “Hello World” program in Python.  But…it needed to be done, so I found some resources on the web and dove in.  Over the course of a few hours I wrote my analysis and summary scripts in Python…and was absolutely amazed at how quick this language is. My buddy Jeff is probably getting tired of me telling him how great Python is…but oh well…he’ll keep hearing it 🙂
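To give a feel for what a classify-then-summarize script can look like, here is a minimal sketch. It is not the research code described in this post and it does not use NLTK. It is a toy naive Bayes classifier written from scratch with only the standard library, and the function names (`train`, `classify`) and the training sentences are made up purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, made-up training data; real work needs far more examples.
training = [
    ("i love this great product", "pos"),
    ("what a great and happy day", "pos"),
    ("i hate this terrible product", "neg"),
    ("what a terrible and sad day", "neg"),
]
word_counts, label_counts = train(training)

# Classify a few messages, then summarize how many fell under each label.
messages = ["great happy product", "sad terrible day", "love this day"]
summary = Counter(classify(m, word_counts, label_counts) for m in messages)
print(dict(summary))  # {'pos': 2, 'neg': 1}
```

For anything you have to defend academically, a well-cited library implementation (NLTK ships its own naive Bayes classifier) is the safer choice than a hand-rolled one like this.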

I was able to get the time that my analysis takes down from 8 to 9 hours in R to about 1.5 hours in Python. Talk about a time saver!  Now…most of that time savings is probably due to new approaches to the analysis rather than just a pure Python vs R speed issue…but the re-writing forced me to rethink my approach.

Why tell you about my newfound skillz (I’m told you have to use ‘z’ in this usage of the word)?

Part of me wanted to brag a bit 🙂

But, more importantly, learning a new programming language isn’t necessarily about the language itself…it’s about the discovery process.   For me, learning Python forced me to rethink my approaches to the data analysis I was working on…and the outcome is a faster analysis with potentially more accurate results as well as a more defensible algorithm. Learning a new language forced me to think through my approach. It forced me to think about the inputs and outputs.

When was the last time you took a step back and rethought your approach?   You don’t need to learn Python to do it…just take a step back from your day-to-day grind and really look at what you are doing. Is it working for you?  Is it working for your team and/or organization?

If the answer isn’t an unequivocal ‘yes’, then maybe you need to rethink your script(s) and look for a new approach.