I’m the world’s worst developer.
Really. I am.
I don’t follow best practices and my coding style is the oft-chided “brute force” method.
I owe (blame?) my coding style to the fact that the first language I learned was FORTRAN 77, and then I quickly picked up C. Then…I spent 3 years teaching FORTRAN 77 while a grad student. Teaching FORTRAN 77 to an engineer as their first language is kind of like teaching an artist to draw using only straight lines. That artist will be able to create art…and perhaps even create beautiful art…but it will be with straight lines, which will limit their creative output.
So…my “brute force” development method is simple: write a line of code to “do something”, then write the next line of code, and so on. I stay in my brute force mindset 99.99% of the time while coding. It works, but it is far from elegant and far from efficient.
But…for what I do, my coding style works. I’m not a professional developer…I write code for data analysis. It does the job that I need it to do. The code might take longer to execute than it should, but it works.
Brute force coding can be slow. Very slow, especially when looking at large datasets. But…it’s my approach and I’ve been happy with it. Until yesterday.
I have a dataset of over 5 million Twitter messages. Combine that with a dataset of over 8500 stock symbols. Using Python, I built a set of scripts that read through that large Twitter dataset to find mentions of each stock symbol and then aggregate the data over various time frames.
Initially, I wrote my code to look at fewer than 30 stock symbols. It was fast enough, especially when looking at just a few days of data. But…when I opened the universe up to over 8500 symbols, my brute force coding method’s inefficiencies became very very (very) visible.
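To make the scaling problem concrete, here is a minimal sketch of what a brute force pass over the data might look like. It is an illustration only, not the original script: the toy data, the column names, the function name and the assumption that symbols appear as "$TICKER" cashtags are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the real tweet and symbol datasets.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2013-01-07 09:30", "2013-01-07 15:45", "2013-01-08 10:00",
    ]),
    "text": [
        "Buying $AAPL and $MSFT today",
        "$AAPL looking strong",
        "Not sure about $MSFT",
    ],
})
symbols = ["AAPL", "MSFT", "GOOG"]


def brute_force_daily_counts(tweets, symbols):
    """Count tweets mentioning each symbol per day, one comparison at a time."""
    counts = {}
    for symbol in symbols:                    # outer loop: every symbol
        tag = "$" + symbol                    # assumption: cashtag-style mentions
        for _, row in tweets.iterrows():      # inner loop: every tweet
            if tag in row["text"].split():
                key = (symbol, row["created_at"].date())
                counts[key] = counts.get(key, 0) + 1
    return counts


daily = brute_force_daily_counts(tweets, symbols)
```

The work grows with the number of symbols times the number of tweets, so going from 30 symbols to 8500 multiplies the running time by a factor of almost 300 before anything else changes.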
My original script took a little over 24 hours to run through 8500 symbols and create a daily summary for those symbols covering a 1-week period. Yes. THAT is slow. At that speed, building a 1-year sample of daily summaries for 8500 symbols would take roughly 52 days (52 weeks at about a day each). Not good.
So…I went back to the drawing board and, against my training and instinct, set aside my brute force methods and looked for more efficient ones. It took me a while to learn a new approach, but I did it.
I had to learn new methods and a new mindset for programming. No more “do this then do this” coding…I had to think abstractly and learn new tools and processes.
Using Python, pandas, numpy and Python’s multiprocessing package, I re-wrote my code. I built the code to use efficient and ‘pythonic’ approaches to performing tasks. I then split up tasks to be spun off to multiple processors. This multi-process approach was the biggest efficiency booster overall, but taking advantage of built-in pandas and numpy functions helped as well.
When I began, my code took 24 hours to summarize 1 week of data. My re-written and re-factored code now does the same task in under 4 minutes. That’s much, much faster, yes? 🙂 Much of the time savings came from using the Python multiprocessing package on a dual-processor Xeon 5570 computer with 16 total threads. I wrote my code to use 12 of those threads to keep from overloading the machine (and to still be able to use the computer while the script runs). This change, along with a few other minor efficiency changes, brought my compute time from 24 hours down to 75 minutes for the 1-week period.
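A minimal sketch of this kind of fan-out, using the standard library's multiprocessing.Pool, is shown below. It is not the original script: the toy data, the function names and the cashtag matching are hypothetical, and only the choice of 12 workers comes from the description above.

```python
from multiprocessing import Pool

# Hypothetical stand-ins for the real tweet and symbol datasets.
TWEETS = [
    "Buying $AAPL and $MSFT today",
    "$AAPL looking strong",
    "Not sure about $MSFT",
]
SYMBOLS = ["AAPL", "MSFT", "GOOG"]


def count_mentions(symbol):
    """Count the tweets that mention one symbol (runs in a worker process)."""
    tag = "$" + symbol  # assumption: cashtag-style mentions
    return symbol, sum(tag in text.split() for text in TWEETS)


def count_all(symbols, workers=12):
    """Hand each symbol to a pool of worker processes and collect the counts."""
    # 12 workers mirrors the 12-of-16-threads choice described above.
    with Pool(processes=workers) as pool:
        return dict(pool.map(count_mentions, symbols))


if __name__ == "__main__":
    print(count_all(SYMBOLS))
```

Each symbol is independent of the others, which is what makes this kind of job easy to split: no worker needs to wait on another's result. The `if __name__ == "__main__"` guard matters on platforms that start workers by re-importing the script.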
The final efficiency boost was found by using some built-in functions in pandas. I had been looping through an entire array to get a count of values for each symbol for each day, which burns a lot of computing cycles. Rather than looping, I used pandas’ built-in ‘value_counts’ function. Making this change brought my compute time from 75 minutes to less than 4 minutes for the same 1-week period. Some great efficiency gains, I’d say.
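The difference between the two styles can be sketched as follows. This is an illustration under assumptions, not the original code: the `mentions` table, with one row per symbol mention, and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical table of matches: one row per symbol mention found in a tweet.
mentions = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-01-07", "2013-01-07", "2013-01-07", "2013-01-08",
    ]),
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
})

# Loop version: walk every row and tally by hand.
loop_counts = {}
for date, symbol in zip(mentions["date"], mentions["symbol"]):
    loop_counts[(date, symbol)] = loop_counts.get((date, symbol), 0) + 1

# Built-in version: value_counts tallies the symbols within each day.
daily_counts = mentions.groupby("date")["symbol"].value_counts()
```

Both versions produce the same tallies, but the built-in call does its counting inside pandas rather than in a Python-level loop, which is typically where the speedup comes from on large tables.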
So…the moral of this story?
Don’t be afraid to learn new things and new approaches. While I still follow my brute force coding methods for many scripts, I know I can bring in more elegant and efficient methods as I need to. It might be difficult to learn something new, but it can be rewarding.
This is interesting, and in fact I think you did it the right way. First, try the simplest solution that works for you. If this is OK, then stop there and have some coffee 😉 If there is a problem, try to improve and iterate. One comment though: I think there is not enough information to really understand what you did. But it looks like brute force would be to perform 5 000 000 * 8500 computations. Now I don’t know if you did that already, but I guess that for one tweet there are only a few possible symbols. If…
Hi Nicolas –
Thanks for the comment.
I’m a huge fan of ‘simple first’…my whole life is built around that motto 🙂
Good suggestions here. I’ve approached the problem in a similar manner in my re-write to help with efficiency. The real issue is just the large dataset. Multiprocessing provides a huge efficiency gain, but I think there are more gains to be made. I’ll be trying your suggestions to see if they bring more speed. Thanks!
This was actually informative – not like most of what I see online. Sharing 🙂