One challenge anyone trying to accomplish something faces is the question: am I accomplishing this effectively? Answering that question can go a long way toward honing your efforts and letting you focus on the right things. We take it very seriously here at Jana: one of our core values is “Measure it”. One of our most important tools in keeping with that value is our analytics pipeline. It has been through several iterations, growing and evolving alongside the company, and it processes millions of events a day in service of a wide variety of needs, from supporting sales pitches to revealing bugs in our code to providing data for the key metrics we track. In this blog post, I want to zoom in on one event: we will follow the journey of a single data usage report, originating on a user’s phone and ending up as a data point in our Snowflake database. Snowflake is a cloud data warehouse, and the winner out of the many replacements we tested for MySQL. This will be a survey view of the pipeline, wide but not deep; I’ll include links for further reading throughout, such as http://snowflake.net/product/architecture/ if you would like to learn more about Snowflake.
Step 1: From the Android App to the Backend Server
Our datum’s voyage begins in a background process in our Android app, mCent, which periodically sends along information, collected by a separate background process, about how much data the phone is using per installed app, whether that usage was cellular or over wifi, and other interesting attributes. These attributes are serialized into a string, which we send to an endpoint on our engineering server.
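To make this concrete, here is a minimal sketch of what the client-side step might look like. The field names, endpoint URL, and use of JSON are all illustrative assumptions, not the actual mCent payload format:

```python
import json
import urllib.request

# Hypothetical shape of a single data usage report; the field names are
# illustrative, not the real mCent attributes.
report = {
    "package_id": "com.example.game",
    "wifi_bytes": 1048576,
    "cellular_bytes": 524288,
    "collected_at_ms": 1456531200000,  # epoch milliseconds
}

# Serialize the attributes into a string and POST them to a
# (placeholder) backend endpoint.
payload = json.dumps(report).encode("utf-8")
request = urllib.request.Request(
    "https://api.example.com/data_usage",  # placeholder URL
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request)  # network call omitted in this sketch
```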
Step 2: From the Backend Server to Kafka
On the second leg of the trip, in a method called from the endpoint hit above, the data is deserialized out of the string. A few transformations are applied, such as converting the epoch day timestamp from milliseconds to seconds, and the data is collected into a map of Android package id -> dictionary of attributes for that package id. For each package id in the map, we create and send a message to the data usage Kafka topic. Kafka is a message broker built for real-time data feeds that facilitates communication between different systems. You can read more about it here: http://kafka.apache.org/.
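This step can be sketched in a few lines of Python. The payload shape and topic name are assumptions, and the Kafka producer calls (shown commented, using the kafka-python client as one option) are left out so the transformation itself is the focus:

```python
import json

# Hypothetical deserialized payload from the endpoint: one report per app.
raw_reports = [
    {"package_id": "com.example.game", "wifi_bytes": 1048576,
     "cellular_bytes": 524288, "collected_at_ms": 1456531200000},
    {"package_id": "com.example.news", "wifi_bytes": 2048,
     "cellular_bytes": 0, "collected_at_ms": 1456531200000},
]

# Transform: convert the epoch timestamp from milliseconds to seconds and
# build a map of package id -> dictionary of attributes for that package.
by_package = {}
for report in raw_reports:
    attrs = dict(report)
    attrs["collected_at"] = attrs.pop("collected_at_ms") // 1000
    by_package[attrs.pop("package_id")] = attrs

# One Kafka message per package id. The topic name and producer client
# are assumptions; any Kafka producer library would do.
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
for package_id, attrs in by_package.items():
    message = json.dumps({"package_id": package_id, **attrs}).encode("utf-8")
    # producer.send("data_usage", message)
```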
Step 3: From Kafka to HDFS through Camus
As part of a continually running loop on our analytics machine, we periodically pull down data from our various Kafka topics using Camus. Not the French absurdist author, but LinkedIn’s implementation of a Kafka-to-HDFS service, which you can read more about here: https://github.com/linkedin/camus. HDFS is, to quote Wikipedia, “a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster”. You can read more about it here: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html. We use Camus to pull down the raw data, which we keep in HDFS as Avro files (Avro is an Apache data serialization system: https://avro.apache.org/). In our example case, this would be data usage data such as the timestamp of collection, wifi data used, and so on. At this point, if you convert the Avro to JSON, it already looks quite similar to how it will look as a row in our Snowflake database.
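For a sense of what an Avro file holds, a record schema for a data usage report might look roughly like this. The record name, namespace, and fields are hypothetical, chosen to mirror the attributes described in this post rather than our actual schema:

```json
{
  "type": "record",
  "name": "DataUsageReport",
  "namespace": "com.example.analytics",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "package_id", "type": "string"},
    {"name": "wifi_bytes", "type": "long"},
    {"name": "cellular_bytes", "type": "long"},
    {"name": "collected_at", "type": "long"}
  ]
}
```

Because the schema travels with the data, converting a record like this to JSON (and eventually to a database row) is mechanical, which is part of why the Avro form already resembles its final Snowflake shape.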
Step 4: From the HDFS to S3
Shortly after storing the data in HDFS, we upload it to S3, in buckets namespaced by Kafka topic and Camus pull time. S3 is Amazon’s online data storage service; Amazon Web Services outsources a lot of maintenance and infrastructure pain and provides a well-documented API and web tool to interact with. You can read more about S3 here: https://aws.amazon.com/s3/
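The topic/pull-time namespacing could be sketched like this. The exact key layout is an assumption, and the boto3 upload call is shown commented since it needs real AWS credentials and a real bucket:

```python
from datetime import datetime, timezone

def s3_key(topic: str, pull_time: datetime, filename: str) -> str:
    """Build an S3 key namespaced by Kafka topic and Camus pull time.

    The layout here (topic/year/month/day/hour/file) is illustrative,
    not the actual scheme.
    """
    return "{}/{}/{}".format(topic, pull_time.strftime("%Y/%m/%d/%H"), filename)

key = s3_key(
    "data_usage",
    datetime(2016, 2, 27, 5, 0, tzinfo=timezone.utc),
    "part-00000.avro",
)
# key == "data_usage/2016/02/27/05/part-00000.avro"

# The upload itself would then be a one-liner with boto3:
# import boto3
# boto3.client("s3").upload_file(local_path, "analytics-bucket", key)
```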
Step 5: From S3 to EMR to S3
After copying the data from HDFS to S3, we need to process it so that we keep only the latest data usage report per user and app, since those are the most up to date. We do so with Scalding jobs, run through AWS EMR. Scalding is a Scala library developed by Twitter that simplifies writing MapReduce jobs. What was once a several-hundred-line Java file in our old Hadoop system is now a Scalding job under 50 lines that is much easier to grok. Additionally, all of these jobs are unit tested with Scalding’s testing support, which lets you specify input data and expected output data. You can read more about that here: https://github.com/twitter/scalding. EMR, or Elastic MapReduce, is another AWS service that simplifies the management and maintenance of MapReduce machines. It can be configured to spin up a cluster of the right size just when you need it, spin it down when you don’t, and do various other nice infrastructure work. You can read more about that here: https://aws.amazon.com/elasticmapreduce/. In our example, the Scalding job looks at all data usage reports since the last Camus pull, groups them by user identifier, package id, and day timestamp of the report, then chooses the report with the latest raw timestamp from each group. This ensures the data we keep is the most up to date data we have. It then outputs this data as tab-separated text files, which are again stored on S3.
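The grouping-and-latest logic the Scalding job performs can be sketched in plain Python (the real job is Scala running over MapReduce, so this is just the semantics, with made-up rows):

```python
from itertools import groupby

# Illustrative reports: (user_id, package_id, day_ts, raw_ts, wifi_bytes)
reports = [
    ("u1", "com.example.game", 1456531200, 1456540000, 100),
    ("u1", "com.example.game", 1456531200, 1456590000, 250),  # later report
    ("u2", "com.example.game", 1456531200, 1456550000, 75),
]

# Group by (user, package, day) and keep the report with the latest raw
# timestamp from each group -- the same dedup the Scalding job performs.
group_key = lambda r: (r[0], r[1], r[2])
latest = [
    max(group, key=lambda r: r[3])
    for _, group in groupby(sorted(reports, key=group_key), key=group_key)
]

# Write the surviving rows as tab-separated text, as the job does for S3.
tsv = "\n".join("\t".join(str(field) for field in row) for row in latest)
```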
Step 6: From S3 to Snowflake
Finally, on the last leg of the datum’s journey to its warehouse, we take it from S3 and load it into Snowflake. This is done by reading the data from S3, then running SQL update/merge commands through JayDeBeApi, a module that allows Python codebases to talk to databases over JDBC (more here: https://github.com/baztian/jaydebeapi).
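A merge along these lines might look as follows. The table and column names are hypothetical (our real schema isn’t shown here), and the JayDeBeApi connection is left commented because it needs a JVM, the Snowflake JDBC driver jar, and real credentials:

```python
# Hypothetical table/column names illustrating an upsert of staged S3 data.
merge_sql = """
MERGE INTO data_usage AS target
USING staging_data_usage AS source
  ON target.user_id = source.user_id
 AND target.package_id = source.package_id
 AND target.day_ts = source.day_ts
WHEN MATCHED THEN UPDATE SET
  target.wifi_bytes = source.wifi_bytes,
  target.cellular_bytes = source.cellular_bytes
WHEN NOT MATCHED THEN INSERT
  (user_id, package_id, day_ts, wifi_bytes, cellular_bytes)
  VALUES (source.user_id, source.package_id, source.day_ts,
          source.wifi_bytes, source.cellular_bytes)
""".strip()

# Running it through JayDeBeApi (all connection details are placeholders):
# import jaydebeapi
# conn = jaydebeapi.connect(
#     "net.snowflake.client.jdbc.SnowflakeDriver",
#     "jdbc:snowflake://account.snowflakecomputing.com/",
#     {"user": "analytics", "password": "..."},
#     "snowflake-jdbc.jar",
# )
# conn.cursor().execute(merge_sql)
```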
Step 7: Usefulness at last
Now that the data is loaded into Snowflake, we can finally start using it. Whether we simply query it through Snowflake’s query interface, build dashboards around it, or show it off to users and clients, it is now ready to be useful. It’s been a long road for our datum, moving through many codebases and services, but it has finally reached the end of its journey. To summarize everything in one sentence: we send data from client phones to our main backend codebase, through Kafka, pull it down to HDFS with Camus, upload it to S3, process it with EMR and Scalding, write it back to S3, and finally load it into Snowflake with SQL commands.