See that dip in the graph above, marked by an orange circle? That’s what happens when you publish a buggy version of your application to the Google Play Store. The graph is generated by Crashlytics, one of the great tools available for free in Twitter’s Fabric.io. Not only do your users have a bad experience when they update or download the app, but you are also inundated with emails from the service, coworkers, users, etc., and the pressure is on to get a fix out as quickly as possible.
Let me caveat this post a bit before I continue. At Jana, we know humans aren’t perfect. We all make mistakes. In the process of engineering, testing, and shipping, a few bugs might slip through the cracks between the unit and integration tests. We would love for our test coverage to catch them before they reach our users, and we would like to think it mostly does ;), but some bugs do make it into production, as the graph above shows. That is why we focus on Mean Time To Repair (MTTR).
MTTR here at Jana means we identify the problem and issue a fix as quickly as possible to mitigate the spread of the issue. Ideally, most users won’t ever see the problem and it’s business as usual for them. In the good ol’ days of publishing to Google Play, you could upload your application package (aka “an APK”) and within a couple of hours have it reach your users, either as an update or as the only version available to download.
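To make the metric concrete, here is a minimal sketch of how MTTR could be computed: the average time between detecting an issue and having the fix live. The function name and the incident timestamps are made up for illustration; this is not our actual tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average (fix_live - detected) over a list of (detected, fix_live) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Start the sum from an empty timedelta so the durations add up cleanly.
    total = sum((fix_live - detected for detected, fix_live in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: (when we noticed the problem, when the fix reached users).
incidents = [
    (datetime(2015, 5, 1, 9, 0), datetime(2015, 5, 1, 11, 0)),    # fixed in 2 hours
    (datetime(2015, 5, 8, 14, 0), datetime(2015, 5, 11, 14, 0)),  # fixed in 3 days
]

mttr = mean_time_to_repair(incidents)
print(mttr)
```

A single slow release drags the average up sharply, which is why the time an update takes to reach users matters so much for this metric.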
At Jana, we are pretty stoked about our continuous integration/deployment process. Once your pull request gets merged into the release branch of the GitHub repository, everything from running unit and integration tests, to publishing to the Play Store (Alpha track), to sending out release notes is automatic and unattended. If everything is on the up and up through that process, all one needs to do is log into the Google Play Developer Console and promote the application to Beta or Production. At first, we just outright published the application, then sat back and watched Crashlytics for any issues that might arise. If there was an issue, we sorted out the problem, fixed it, and the process repeated itself.
We used to be confident that we could get a fix to our users in a couple of hours. But recently, Google announced a change to the Play Store review process and indicated that developers would not notice a difference. Since then, we have noticed that some of our updates to the Play Store take longer than usual, up to 3 days at times. We are not sure if this is the new norm or just an anomaly during a time of change, but we know it hurts our MTTR. After an internal discussion, we settled on trying out “Staged Rollouts” for moving our applications to production in place of a hard push to 100% of the market.
The premise of Staged Rollouts is that we can release the app to a subset of our users and monitor for bugs or crashes. We can turn on flags or experiments to test out new code paths in the new build before it has propagated to a significant number of users, and halt the rollout if an issue crops up (only the previous build will then be available to the market). The result is a better experience for our users overall. Since adopting this approach, we have successfully averted problems that would have hurt stability for our users. We care deeply about the experience of our users when they use our application and, tangentially, about our 99+% crash-free sessions.
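The decision we make at each step of a rollout can be sketched roughly as follows. This is only an illustration: the stage percentages, the threshold constant, and the function name are assumptions made for the example, and in practice the call is made by a person watching Crashlytics rather than by a script.

```python
# Hypothetical rollout stages, as fractions of the user base.
STAGES = [0.05, 0.10, 0.20, 0.50, 1.0]

# Assumed threshold, loosely based on our 99+% crash-free sessions.
MIN_CRASH_FREE = 0.99


def next_rollout_step(current_fraction, crash_free_rate):
    """Decide what to do with a rollout given the crash-free rate of the new build.

    Returns a (decision, fraction) tuple where decision is one of
    "halt", "advance", or "done".
    """
    if crash_free_rate < MIN_CRASH_FREE:
        # Too many crashes: stop here so no additional users get the new build.
        return ("halt", current_fraction)
    for stage in STAGES:
        if stage > current_fraction:
            # Healthy so far: widen the rollout to the next stage.
            return ("advance", stage)
    # Already at the last stage with a healthy build.
    return ("done", current_fraction)


print(next_rollout_step(0.05, 0.995))  # healthy build at 5% of users
print(next_rollout_step(0.20, 0.97))   # crashy build at 20% of users
```

The point of starting small is that a "halt" at an early stage affects only a small slice of users, while everyone else never sees the bad build.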
By the way, we’re hiring.