Kill your Spreadsheets: Leaving Excel behind and eloping with iPython Notebooks
I should start by saying that I do love Excel. Truly.
I’ve used it extensively for many years, even earning the status of “power user.” I knew all of the shortcuts, almost never touched my mouse, and even knew just enough VBA to be dangerous. After years of such heavy use, navigating through Excel feels akin to playing the piano, with muscle memory comfortably guiding you through the gentle flow of spreadsheet Zen (as one previous colleague out it, “once you know the dark side of Excel, you become more powerful than any Jedi…” Gee, I wonder what ever happened to that guy…).
It’s comfortable, it’s reliable. I gets the job done.
However, for all the wonders of Excel, there are
about a billion several areas where the almighty spreadsheet software falls short. To name just a few:
- Lack of repeatability: You can certainly create elaborate analyses in Excel, however small changes in required output or even updated data can often mean you need to repeat the entire process.
- Importing Data: We often do a lot of pre-processing of data using SQL. Excel makes importing directly from our data warehouse difficult compared to iPython.
- Excel for Mac lacks
all of the most importantmuch of the functionality of the Windows version.
Enter iPython Notebooks.
Move over Excel, iPython is here.
At Jana, much of our codebase is built on Python, and, by extension, the bulk of our data analysis is conducted using iPython Notebooks and libraries such as Pandas. Coming to Jana I’ll admit I was pretty stuck in my Excel-based ways, however after observing some of the work our data science and product analytics teams were doing, it became pretty clear that converting to iPython was a much smarter and faster way to work.
So I set out with a simple goal and a noble quest: wean myself off of Excel, and shift entirely over to iPython Notebooks (full disclosure: I am not a software engineer, but working with 50+ extremely talented python developers certainly helped).
I am happy to report that so far it’s been a success. However coming from Excel to iPython/Pandas there are a few intricacies I’ve struggled to wrap my head around, and I suspect others like me might struggle with as well.
The worst of which has been Multi or Hierarchical Indexes in Pandas DataFrames.
In the world of Excel, every column has a header, and accessing values in that column is relatively straightforward (your basic spreadsheet tab shares the same structure as a flat database table, essentially). In Pandas however, this gets complicated when you take a DataFrame and create a Pivot Table.
Consider this DataFrame of fake temperatures taken twice daily for zip codes around Boston:
date temp zip 0 2015-10-10 40.10 02138 1 2015-10-10 43.20 02138 2 2015-10-11 44.50 02138 3 2015-10-11 42.70 02138 4 2015-10-12 48.20 02138 5 2015-10-12 47.30 02138 6 2015-10-10 51.10 02110 7 2015-10-10 52.30 02110 8 2015-10-11 49.40 02110 9 2015-10-11 45.90 02110 10 2015-10-12 43.70 02110 11 2015-10-12 44.60 02110
Suppose we want to get the average daily temperature for each zip code. Using Pandas we’d create a simple pivot table (Excel users, pay attention to how easy this is):
>>> pivot = pandas.pivot_table(daf,index='date',columns='zip',aggfunc=numpy.mean) >>> print pivot temp zip 02110 02138 date 2015-10-10 51.70 41.65 2015-10-11 47.65 43.60 2015-10-12 44.15 47.75
However we are now stuck with a somewhat “clunky” Mutli-Index. To an experienced engineer or data scientist this may seem trivial, but as an Excel user I find these indices difficult to work with, and I’d always end up “collapsing” this new Multi-Index by re-indexing the DataFrame and then manually assigning column names (which is time intensive and decidedly not repeatable).
To get around this problem, I created a simple function that essentially “collapses” the hierarchical index into a nice, clean DataFrame where each column corresponds to a single column name:
>>>def collapse_headers (df): df.columns = ['_'.join(col).strip() for col in df.columns.values] df.reset_index() return df.reset_index()
collapse_headers gives you the following:
date temp_02110 temp_02138 0 2015-10-10 51.70 41.65 1 2015-10-11 47.65 43.60 2 2015-10-12 44.15 47.75
Presto! You are now back to the “flat” structure you are used to dealing with and ready for the next step in your analysis.