At Jana, we’ve been seeing an increasing number of emails being sent out daily as the number of members rapidly increases. We send different types of emails to members based on their activity: daily emails to recently active members, re-engagement emails to members who haven’t been active, activation emails to new members, and so on.
As the number of daily emails sent reached north of 50,000, we decided to build a monitoring system that would automatically check if the emails were actually being sent out as we expected.
Email processing jobs
Our email processing jobs are broken up by country and by email type. Every job is further broken up into small celery tasks, and our in-house counters fire every time an email is sent to keep track of the total number of sends. At the end of a job, a reporting task aggregates the numbers from each celery task and checks the health of the total number of successfully processed emails against past history. If the number of emails sent seems reasonable, we report the status of the email job as “healthy”; if not, we report it as “unhealthy” and send out an internal alert.
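The aggregation step at the end of a job can be sketched roughly as follows. Our in-house counters aren’t public, so this hypothetical version assumes each celery task simply returns a dict of counts (e.g. `{"sent": 120, "failed": 3}`) that the reporting task sums up:

```python
from collections import Counter

def aggregate_job_counts(task_counts):
    """Sum the per-task send counters for one email job.

    Hypothetical sketch: assumes each task's result is a plain dict
    of counts, e.g. {"sent": 120, "failed": 3}.
    """
    totals = Counter()
    for counts in task_counts:
        totals.update(counts)  # adds counts key by key
    return dict(totals)
```

The resulting total for “sent” is what gets compared against past history in the health check described below.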
Our health status monitoring of the jobs is based on past history. We take the past 14 non-zero data points and compute their mean and standard deviation. If the number from the reporting task falls outside both the mean +/- two standard deviations and the mean +/- 20%, we send an alert. We exclude days with zeroes so that the mean of our “expected behavior” isn’t skewed. The two checks, one based on standard deviation and the other on percent change, are in place so that we only send out alerts when the abnormal behavior is “very” abnormal. We discovered through trial and error that looking at percent change alone was susceptible to noise in spiky datasets, and that using standard deviation as our only check wasn’t as effective when we saw an occasional but expected swing in an otherwise stable dataset.
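The check above can be sketched in a few lines of Python. This is a simplified stand-in for our actual implementation, with a hypothetical `is_healthy` function: it drops zero days, looks at the last 14 remaining points, and only flags a value as unhealthy when it falls outside both the two-standard-deviation band and the 20% band:

```python
from statistics import mean, stdev

def is_healthy(history, current, window=14):
    """Return True if `current` looks normal against recent history.

    Sketch of the check described above: take the last `window`
    non-zero data points, then flag `current` as unhealthy only if it
    falls outside BOTH the mean +/- two standard deviations band AND
    the mean +/- 20% band.
    """
    nonzero = [n for n in history if n > 0][-window:]
    if len(nonzero) < 2:
        return True  # not enough history to judge; assume healthy
    m = mean(nonzero)
    sd = stdev(nonzero)
    outside_stdev = abs(current - m) > 2 * sd
    outside_percent = abs(current - m) > 0.20 * m
    # Alert only when both checks agree the value is abnormal.
    return not (outside_stdev and outside_percent)
```

For example, against a stable history of 100 sends a day, a day with 110 sends passes (it is inside the 20% band even though the standard deviation is tiny), while a day with 300 sends fails both checks and triggers an alert.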
Here is a snapshot from a graph that visualizes the number of emails sent to active members in India. The green points represent days when the number of sends was “healthy”, the red “unhealthy”, and the blue “zero”. The yellow region signifies the mean +/- two standard deviations of the past 14 data points from a given day and the light green region depicts the mean +/- 20%.
We can see that the standard deviation shrinks (from 8/21 on) when the trend of the past 14 data points stabilizes, and that internal alerts are sent out when we see abnormal behavior (9/4-9/7, 9/13). These internal alerts were actually sent out when we introduced a bug that sent multiple emails to a member when we were only supposed to send one a day, and when we failed to completely process an email job.
Brazil is a country where we’ve been seeing rapid growth over the past month. We can see that the initial increase in the numbers outgrows the health checks that we’ve implemented, but the monitoring system self-adapts and sends out fewer alerts as time passes.
Things we’ve learned
We’ve noticed that the monitoring system works very well in countries that have high numbers of sends and successfully sends out internal alerts when there is abnormal behavior. On the other hand, we’ve observed a lot of noise in countries that have low numbers and where having a day with zero is part of the expected behavior. Even with the noise, we believe that this kind of self-adapting monitoring system is much more effective than setting hard limits and sending alerts when the number drops below X, or is below Y% of the previous day.
This monitoring system also isn’t restricted to monitoring the health of email sends and can be applied to pretty much any data set that has some sort of expected behavior.