Alerts are great! They let me relax and keep me confident that my code is working and the business isn’t going off the rails. Writing alerts is not so great, even when you have a clean alerting system in place and a solid idea of what you want to alert on. Tuning alerts in particular is tough: how bad is bad enough that a human can or should intervene? It’s easy to think that when choosing alerting thresholds, more sensitive is better. Don’t want to miss those critical alerts! But in a lot of ways, more sensitive can be worse. Noisy alerts drain engineering and product focus by forcing people to look into alerts that aren’t actionable. Much worse, though, is that over time you become so desensitized to the noise that you tune it out and miss the alert telling you that something critical has actually happened.
This happened to me a few weeks ago. At Jana we have account managers that interface with the brands who sponsor the free internet we provide. In return we run what we call campaigns for our sponsors. I recently switched to a new team, and as one of my ramp-up tickets, I was asked to alert the account managers when a campaign is coming to a close. As is common when coding in a project you don’t know, you make some assumptions. And we all know how the old saying goes…
The mistake I made had to do with a pretty specific Jana-ism about what it means for a campaign to end, which resulted in this comment from my account manager:
(it wasn’t every campaign ever, but it was definitely a lot of them).
So! How do we avoid being sad donkeys / giving our account managers / clients / fellow employees heart attacks when alerts go off?
- Make your alerts configurable, with those constants clearly defined at the top of the file. Thresholds are often chosen arbitrarily to start: we make guesses based on prior analysis or on assumptions, but it’s hard to know what you’ll find once you get your hands on prod data. So make them easy to tune!
- Do a dry run (or 10). We have a cool counters system in our analytics pipeline: it’s super easy to fire a counter for an arbitrary event and then take a look in our SQL tables once you have a few days of data. You could also write to a log, or send an email just to yourself. That way you don’t inconvenience anyone while you get the noise down.
- Make alert tuning a clear part of your original ticket, or write the ticket to tune the alert at the same time you write the ticket to create it (adapted for whatever task management system you or your company uses). Sometimes it’s not one of those “I’ve made a giant mistake” moments; rather, the alert is a little too sensitive, or you decide to take the average of a few days’ worth of data instead of one. Having the ticket already written keeps me honest about following up on the alert.
- Keep tools around to help you clean up your mistakes. We all make mistakes, but it’s way better to tell a ticked-off user that you’ll get the alerts out of their newsfeed in 10 minutes flat than to need to build a tool right then.
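To make the first two tips concrete, here’s a minimal sketch of what a tunable alert with a dry-run mode might look like. Everything in it is hypothetical — the constant names, the campaign dict shape, and the `send_alert` hook are illustrative, not Jana’s actual code:

```python
from datetime import date, timedelta

# Thresholds live at the top of the file so they're easy to find and tune.
CAMPAIGN_END_WARNING_DAYS = 7   # alert when a campaign ends within this many days
DRY_RUN = True                  # print instead of actually notifying anyone


def campaigns_ending_soon(campaigns, today):
    """Return campaigns whose end_date falls within the warning window."""
    cutoff = today + timedelta(days=CAMPAIGN_END_WARNING_DAYS)
    return [c for c in campaigns if today <= c["end_date"] <= cutoff]


def send_alert(campaign):
    # Placeholder: in real code this would notify the account manager.
    print("ALERT: %s ends soon" % campaign["name"])


def run_alert(campaigns, today=None):
    today = today or date.today()
    ending = campaigns_ending_soon(campaigns, today)
    for campaign in ending:
        if DRY_RUN:
            # Dry run: log what we *would* have sent, bother nobody.
            print("DRY RUN: would alert for %s" % campaign["name"])
        else:
            send_alert(campaign)
    return ending
```

Once a few dry runs confirm the volume looks sane, you flip `DRY_RUN` off; when the threshold turns out wrong, the tuning ticket is a one-line constant change.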
I hope some of these tips can help you create actionable alerts in a timely manner. Good luck and let me know if you have any other tips in the comments!