python

Using Generic “Counters” to Measure It

One of our core values here at Jana is “Measure It”. This has been a really great value to have—because it encourages experimentation, and lets us know when we’re killing it! We have enough traffic and enough ideas bouncing around the office that we needed to implement a good, simple, and extensible method to measure things on the fly.

We use Apache Kafka to pass around our analytics data and HDFS (the Apache Hadoop File System) to capture and store it. We have dozens of Hadoop jobs that aggregate the data into various tables and create reports. This is great for specialized data—data we know we want and have decided to put some thought and time into. But the process is a little overkill for a quick “hey I wonder if” type of question. For that, we have the counter table. This table takes a few standard keys (a tracking id, the number of times the event occurred, etc) but it also has 6 flexible keys—these are anything-goes string fields.

As an example: part of my job is ensuring that we’re giving our members money quickly and reliably. We’ve noticed lately that sometimes transactions are failing because we’re trying to top-up members up on the wrong operator. We have a sneaking suspicion that this is happening because members are changing their operator and trying with ones they know are wrong.

To test the hypothesis I put in counters that “fire” (get recorded) at the end of each transaction:

tracking_client.count(
    tracking_id=member.tracking_id,
    key1=’airtime_topup_operator_mismatch’,
    key2=’member_changed_operator’ if matched else ‘member_didnt_change_operator’
    key3=status  # (‘succeeded’ or ‘failed’),
    key4=operator_name,
    key5=predicted_operator_name,
    count=1
)

Then, after waiting a day or two to collect some data, I can run the following sql query on the counters table, no custom hadoop jobs required!

SELECT key2 as 'changed', key3 as 'status', sum(count) from counter
 WHERE key1='operator_mismatch_counter'
 GROUP BY key2, key3

and with a little math, I get:

status count percent
member_changed_op failed 788 2.87
member_changed_op recalled 1830 6.66
member_changed_op success 24876 90.48
member_changed_op total 27494
member_didnt_change_op failed 14436 2.30
member_didnt_change_op recalled 22465 3.58
member_didnt_change_op success 591045 94.12
member_didnt_change_op total 627946

Showing that members did succeed more when using the pre-predicted operator. Additionally, the percent of recalled transactions is almost double for members that changed their op. This supports our hypothesis that members attempt to topup with multiple operators, recalling transactions that don’t succeed quickly, including ones that they know our incorrect.

We’re in the process of migrating from MySQL to Amazon RedShift, but the idea will remain the same.

If any of these things sound interesting, we’re hiring 🙂

Discussion

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Pingbacks & Trackbacks

  1. Real-Time Event Counting in Cassandra | jana technology blog