Using Statistical Models in Production


Happy New Year!


Here at Jana, we keep a clear division between the data we use for analytics and the data our interactive production systems use. As a data engineer, I work mainly with analytics data, and much of it is not readily available in production. This separation isn’t an issue if you are building tools to analyze campaign performance or to track revenue or active users, but it becomes one when you build something from analytics data and then decide you want to use it in production.

The first time I made a model using analytics data, I was encouraged not to worry about data availability in production, but rather to make the best model I could using any data we had in analytics. When it came time to test the model in production, I had to reimplement it using production-accessible sources of data. The biggest challenge was that the data wasn’t conveniently stored in a database we could easily query (as it is in analytics), but in individual “member documents”.

Luckily, the model I built was pretty simple – just a set of coefficients to specify a logistic regression. I fit the coefficients using analytics data, so all I needed to do in production was calculate the values of the predictors and evaluate the logistic function at those values. Since I only needed data from one member at a time in production, I wrote some code to scrape the member doc to get the data I needed.
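To make the evaluation step concrete, here is a minimal sketch in Python. It is not the code that ran in production: the intercept, the coefficient values, the predictor names, and the idea that a member document can be read like a dictionary are all assumptions made for illustration.

```python
import math

# Hypothetical values standing in for coefficients fit offline on analytics data.
INTERCEPT = -1.5
COEFFICIENTS = {
    "days_active_last_30": 0.08,  # assumed predictor name
    "apps_installed": 0.25,       # assumed predictor name
}


def extract_predictors(member_doc):
    """Read each predictor from one member document (assumed dict-like).

    Missing fields default to 0.0 so the model can still be evaluated.
    """
    return {name: float(member_doc.get(name, 0.0)) for name in COEFFICIENTS}


def predict_probability(member_doc):
    """Evaluate the logistic function at this member's predictor values."""
    predictors = extract_predictors(member_doc)
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in predictors.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    doc = {"days_active_last_30": 10, "apps_installed": 2}
    print(round(predict_probability(doc), 4))
```

Because the coefficients are fixed ahead of time, scoring a member only needs that member's own document, which is why one document at a time is enough.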

Here is a basic outline of my workflow for testing a model in production:

  1. Gather training data from SQL tables
  2. Use Python to create and validate a statistical model
  3. Implement the model in production

In all, the process of reimplementing the model for use in production took me two weeks. It was a valuable experience for a number of reasons: I got to dig into the codebase the rest of my team works with regularly, I compiled my own local version of mCent for the first time, and I enjoyed making a model from scratch using analytics data and following it all the way into production.

That being said, it would be really handy if we had a way to test analytics models on real mCent users before we commit the engineering effort to implement them in production. Thanks to Greg, now we do! He built a tool for custom algorithm targeting that lets us do just that. The tool is very lightweight, and testing a statistical model with it requires no additional engineering effort. To use it, we simply compute clusters for sets of member IDs and then map a specific campaign to each cluster. For example, I just made a model that groups mCent members into clusters 1, 2, and 3, corresponding to low, medium, and high levels of activity in shopping apps. To see whether these groups interact differently with a new shopping app, I created three different campaigns and made one visible to each cluster using the custom algorithm targeting tool.
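The cluster-to-campaign idea can be sketched in a few lines of Python. This is only an illustration of the mapping, not the targeting tool itself: the activity scores, the cutoffs, the campaign names, and every function name here are hypothetical.

```python
# Hypothetical campaign names, one per cluster (1 = low, 2 = medium, 3 = high).
CAMPAIGN_BY_CLUSTER = {
    1: "shopping_campaign_low",
    2: "shopping_campaign_medium",
    3: "shopping_campaign_high",
}


def assign_cluster(score, low_cutoff=0.33, high_cutoff=0.66):
    """Turn a model score into cluster 1, 2, or 3 using assumed cutoffs."""
    if score < low_cutoff:
        return 1
    if score < high_cutoff:
        return 2
    return 3


def cluster_members(scores_by_member):
    """Map each member ID to the cluster number the model puts it in."""
    return {
        member_id: assign_cluster(score)
        for member_id, score in scores_by_member.items()
    }


def campaign_for(member_id, clusters):
    """Return the campaign visible to this member, or None if unclustered."""
    return CAMPAIGN_BY_CLUSTER.get(clusters.get(member_id))


if __name__ == "__main__":
    clusters = cluster_members({"m1": 0.10, "m2": 0.50, "m3": 0.90})
    for member_id in sorted(clusters):
        print(member_id, clusters[member_id], campaign_for(member_id, clusters))
```

Members the model never scored get no campaign in this sketch, which keeps them out of the test entirely.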

Here is the new workflow for testing models in production:

  1. Gather training data from Snowflake tables
  2. Use Python to create and validate a statistical model
  3. Upload member IDs and their cluster numbers (output of the model)
  4. Test the model on real mCent members
  5. Repeat steps 1-4 until you have a model that is ready to be implemented in production
  6. Implement the model in production
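Step 3 comes down to writing out pairs of member ID and cluster number. A rough sketch of that export follows, assuming a two-column CSV with a header row; the format the tool actually accepts is not described here, so treat the column names and layout as placeholders.

```python
import csv
import io


def clusters_to_csv(clusters):
    """Serialize {member_id: cluster_number} as CSV text.

    The "member_id" and "cluster" column names are assumptions, not the
    tool's documented upload format.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["member_id", "cluster"])
    # Sort by member ID so repeated exports of the same model output match.
    for member_id in sorted(clusters):
        writer.writerow([member_id, clusters[member_id]])
    return buffer.getvalue()


if __name__ == "__main__":
    print(clusters_to_csv({"m2": 3, "m1": 1}), end="")
```

Keeping the export deterministic makes it easy to diff the output of one model iteration against the next while repeating steps 1-4.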

This way, we can test a model on real members before committing the engineering effort to implement it in production, which lets us iterate faster.

