I’ve needed to rewrite some data analysis scripts for my research. I initially wrote them in R but found that R is particularly slow for this type of analysis (more accurately, I should say that my implementation of these analysis techniques is slow).
So…I started looking for a more economical way to do this analysis. I’m using PHP to do some of the up-front data collection, so my logical choice was to dust off my PHP skills and build some analysis scripts in PHP.
So I got out my PHP books and started coding. After a few days, I had a pretty impressive set of scripts that would take my collected data, run a Bayes classification filter on that data for sentiment and then summarize that data. I was proud of myself…until I realized that the implementation of my classification algorithm would be difficult to justify in an academic setting…or at least that I’d have to spend a lot of time defending and justifying it at a later date. This was also one of the reasons that I wanted to rewrite the R scripts.
So…I revisited my approach. Was there anything written in PHP that was well received in the academic world? Of course not.
Many researchers in text classification and sentiment analysis use the Python language and the Natural Language Toolkit (NLTK), and there are plenty of academic articles citing the NLTK…so that helps me with defending my algorithms in my dissertation work.
Now…I’ve never looked at Python. I couldn’t have written a “Hello World” program in Python. But…it needed to be done, so I found some resources on the web and dove in. Over the course of a few hours I wrote my analysis and summary scripts in Python…and was absolutely amazed at how quick this language is. My buddy Jeff is probably getting tired of me telling him how great Python is…but oh well…he’ll keep hearing it 🙂
I was able to get the time that my analysis takes down from 8 to 9 hours in R to about 1.5 hours in Python. Talk about a time saver! Now…most of that time savings is probably due to new approaches to the analysis rather than a pure Python vs R speed issue…but the rewriting forced me to rethink my approach.
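To give a sense of what a classify-then-summarize script involves, here is a minimal sketch of a naive Bayes sentiment filter in plain Python. It is an illustration only, not the scripts described in this post: the training sentences, labels, and function names are all made up, and it uses only the standard library rather than NLTK.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data; real work would need far more examples.
TRAINING = [
    ("great product love it", "pos"),
    ("really happy with this", "pos"),
    ("terrible service hate it", "neg"),
    ("very unhappy and disappointed", "neg"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Count documents and words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def summarize(texts, model):
    """Tally how many texts fall under each predicted label."""
    return Counter(classify(text, model) for text in texts)
```

Calling `summarize(list_of_texts, train(TRAINING))` returns a `Counter` of predicted labels. A real analysis would need better tokenization, much more training data, and some evaluation of accuracy, which is part of why leaning on an established toolkit is easier to defend.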
Why tell you about my newfound skillz (I’m told you have to use ‘z’ in this usage of the word)?
Part of me wanted to brag a bit 🙂
But, more importantly, learning a new programming language isn’t necessarily about the language itself…it’s about the discovery process. For me, learning Python forced me to rethink my approaches to the data analysis I was working on…and the outcome is a faster analysis with potentially more accurate results as well as a more defensible algorithm. Learning a new language forced me to think through my approach. It forced me to think about the inputs and outputs.
When is the last time you took a step back and rethought your approach? You don’t need to learn Python to do it…just take a step back from your day-to-day grind and really look at what you are doing. Is it working for you? Is it working for your team and/or organization?
If the answer isn’t an unequivocal ‘yes’, then maybe you need to rethink your script(s) and look for a new approach.