One of the awesome things about becoming a data scientist in industry, especially at smaller companies or startups, is that in a lot of cases you can build a predictive model and see useful results, fast. And sometimes you start seeing something that feels like it could be high impact, just around the corner.
All those infographics about “what is a data scientist?” (always some Venn diagram) emphasize the combination of statistics/ML, engineering and business/narrative-building skills, the sort of wishful-thinking list that bad recruiters then mistakenly put on “data analyst” job reqs. The first and second categories, statistics/ML and engineering: great, no debating those.
The third, business/narrative building, is an interesting one. It is important, but it can lead to a lot of problems, as I have found out myself. I do believe “business” and “narrative building” often fall in the same bucket, for the following reason. To get people to buy into a data science idea, particularly one that might change the business and organization in a significant way, there is (and should be) a large burden of proof on the data scientist. If it’s a pretty big idea, that burden is not just an empirical one: you also have to put a good story behind what you’re trying to sell. That story pretty much has to resonate with everyone. It’s not an easy skill to learn, of course, because the narrative you need to build varies completely by situation, and it’s more of an art than a science. For once I’ll brag: I’ve gotten half decent at this. I have found that a good narrative can really accelerate the adoption of, and excitement surrounding, some data science idea or model.
But the problem is that when you are eager to see the results a model or insight can drive, and really think you will, there is a tendency to construct a narrative around something empirical too fast. Worse, there is an inclination to build a narrative around a finding that might even contradict your original hypothesis.
I annoy myself every time I do this, and yet I’ve found I have a short memory. For a data scientist, I like to talk, which gets me in trouble. Here’s a realistic example. (See how I’m making a narrative even around how I make narratives? Pretty disgusting.)
- 2:15 pm: I find something strange and ‘highly significant’ when building a new logistic regression based on a half-baked idea I had when riding the subway. It’s not what I expected to find — BUT GOLLY LOOK AT THAT P-VALUE.
- 2:22 pm: Feeling like I’ve done some great work, I victoriously smash my hand twice on my desk, rattling a co-worker’s keyboard, and stand up to get a coffee from the kitchen.
- 2:26 pm: What the hell does this variable being significant mean exactly?
- 2:30 pm: I throw a few quick questions up on Slack, asking for a little more background on this one variable. Brilliantly, I preface my question with “yo guys, pretty sure i found somthin DOPE, but im wondering ” (not bothering of course to clean up my typos because I’m now in that kinda mood).
- 2:37 pm: Nobody seems to have noticed in those channels yet (the nerve…). I send a couple direct messages. Somebody points me to an answer.
- 2:45 pm: I’m pretty sure I’ve triangulated what that answer was with why the variable in the model is so significant. I slam a pen down and go get another coffee.
- 2:58 pm: I finish distilling this finding down to a great sounding narrative in my head, nodding to myself approvingly while Return of the Mack plays just loud enough to annoy my colleague, who is doing actual, real work.
- 3:08 pm: After a little code cleanup, I say to myself, “okay, but evankjanacom, let’s sit with this for a bit, look at your code more tomorrow, make sure there’s nothing funny in there. Ah but this is cool so I’ll just get a coffee I need a quick break anyway”
- 3:15 pm: A colleague asks me what’s new, how’s the modeling going? I blatantly disregard the 7 minutes ago self that told current me to wait before sharing these results. I spill the narrative.
- 3:18 pm: Now three other people know because they got semi-interested (or maybe because I have my feet up on the table like an idiot?).
- 3:20 pm: A couple other people are now saying, oh right that makes sense, b/c xyz.
- 3:30 pm, back at desk: I check in an ipython notebook. I move on to something else, forgetting this slip of principle that will come back to bite me.
- 4:18 pm, still at desk: I get a direct message saying, hey, let’s highlight this finding! Reinforcing the jackass thing I just did earlier.
Two weeks later, Friday
- 4:45 pm: Going back to put this variable into a predictive model in production. Hold on a minute, why is this unit test failing there?
- 5:40 pm: As my girlfriend shakes her head disapprovingly from the other side of the room, with a resigned and calm indifference created by years of me doing this kind of thing, I realize that the “insight” I came up with two weeks ago only existed because, in cell 217 of my ipython notebook, there was:
df['X1'][df['X1'] >= 0] = 1
where it should have been
df['X1'][df['X1'] >= 0] = 1
- 6:17 pm: I realize that awesome narrative, which has pretty much now been accepted as truth because I’ve said it about 80 times in the office, is totally wrong. The insight should have actually been the exact opposite.
- 6:31 pm: I send out a pitiful message warning the team not to actually use the model yet, because it’s exactly 180 degrees wrong. Our plan to run an experiment starting Monday has been foiled.
I’m not sure how many data scientists have found themselves in this situation. My guess is a decent number, although I do know many are, wisely, more cautious than I have been. What I do know is that there can be so much quick, nonlinear experimentation when you make these models — drop this predictor, add this one, log that one, try defining your label differently, look at this diagnostic now — that it’s incredibly easy to make a tiny mistake like ordering a couple lines of code wrong. But those little mistakes can cause all sorts of problems.
What I apparently haven’t learned yet (but I will now! yea, yea, that’s what I say every time) is (a) to keep my mouth shut early on and (b) to do a better job organizing, testing and auditing predictive modeling code. Most importantly though, (c) avoid the temptation to start building a narrative behind something too readily. Especially something that contradicts your prior hypothesis.
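For what it’s worth, here is roughly what the “testing” part could look like for a one-line recode like the one in cell 217. This is a minimal sketch, not the actual fix from the story: the `binarize` helper, its `inclusive` flag and the sample values are all made up for illustration, and it uses plain Python lists so it runs without pandas. The idea is just to pin down the boundary case by hand before any p-value gets a narrative.

```python
def binarize(values, threshold=0, inclusive=False):
    """Return 1 where a value passes the threshold, else 0.

    `inclusive` (a hypothetical flag) decides whether a value exactly
    equal to the threshold counts as 1, so the boundary is a deliberate
    choice rather than an accident of which operator got typed.
    """
    if inclusive:
        return [1 if v >= threshold else 0 for v in values]
    return [1 if v > threshold else 0 for v in values]


def check_binarize():
    # Hand-computed cases, including the boundary value 0 (made-up data).
    sample = [-2.0, 0.0, 3.5]
    assert binarize(sample, inclusive=True) == [0, 1, 1]
    assert binarize(sample, inclusive=False) == [0, 0, 1]
    # The input list must be left untouched: no silent in-place recoding.
    assert sample == [-2.0, 0.0, 3.5]
    return True


check_binarize()
```

Pulling the recode out of the notebook cell and into a named function is the design choice here: it gives the transformation one place to live and one place to be checked, instead of depending on the order cells happened to run in.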