Here at Jana we love to keep our user-facing logic (both frontend and backend) separate from our analytical logic. We think the two are conceptually different and therefore have different requirements. Here are some of the reasons behind this idea:
- User-facing features require data to be accessible quickly and on demand. You don’t want your users to wait 10 minutes for an aggregation job to finish or for some data to become available. To achieve this you are sometimes forced to store and parse smaller chunks of data (smaller time windows, aggregated data, etc.).
- Analytical data needs to be fine-grained over long periods of time (usually over all time). This enables ad-hoc analysis for our various teams (marketing, sales, fraud, the list goes on). At the same time, we don’t need the data to be available immediately. Currently most of our data is updated six times a day, which, given our volume, is amazing.
These differences result in different infrastructure designs (database choices, tools, even box setups) and make the interaction between the two a bit of a challenge. Our long-standing rule has been to not query the analytical data from the backend instances. While this has helped us keep the two kinds of data logic clean and separate, there are cases where a user-facing feature requires heavy, time-consuming algorithms. One such case for us is app recommendations based on a member’s social network.
To achieve this we have a recommendation task that runs over a large volume of social data to generate recommendation options for each of our daily active users. These options still depend on our backend code to prune out bad recommendations (for example, the user has been recommended that app before). Because of this, the analytics side can’t make the recommendations directly. Our solution was to have a shared data store on Amazon S3: the analytics side writes the recommendation options there, and a periodic job on the backend instances reads them, prunes them and makes the actual recommendations (saving them in Cassandra to be served later in our app). This has been challenging for us on two fronts.
We need to make sure the backend instance only reads the latest data once it has been completely written to S3. To achieve this we used a very simple solution: a combination of the right file naming convention (based on recommendation periods) and state files. State files are very much like OS PID files. They are minimal files whose presence indicates that a batch has been completely written to S3. This way the code can tell which batch is the latest and whether it is ready to be consumed.
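The state-file idea can be sketched roughly as follows. This is a minimal illustration, not our production code: the `recs/<period>/` key layout, the `_DONE` marker name and both function names are assumptions made up for the example, and the list of key names stands in for whatever an S3 bucket listing would return.

```python
def latest_ready_batch(key_names, marker="_DONE"):
    """Return the prefix of the newest batch that has a state file, or None.

    Assumes (hypothetically) keys shaped like 'recs/<period>/<file>' where
    <period> sorts chronologically as a string, e.g. an ISO date.
    """
    ready = [
        name[: -len(marker)]            # keep 'recs/<period>/'
        for name in key_names
        if name.endswith("/" + marker)  # only batches whose marker exists
    ]
    return max(ready) if ready else None


def batch_files(key_names, prefix, marker="_DONE"):
    """Return the data files of one batch in a stable order, marker excluded."""
    return sorted(
        name for name in key_names
        if name.startswith(prefix) and not name.endswith("/" + marker)
    )


# The newest period is still being written, so it has no state file yet.
keys = [
    "recs/2015-06-01/part-0000",
    "recs/2015-06-01/part-0001",
    "recs/2015-06-01/_DONE",
    "recs/2015-06-02/part-0000",
]
prefix = latest_ready_batch(keys)
files = batch_files(keys, prefix)
```

The writer uploads the marker last, so a reader that only trusts batches with a marker never sees a half-written batch.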
Reading from S3
At first this seems like an easy task. Our recommendation batches come as multiple files (similar to Hadoop output files but generated by our great analytics data store, Snowflake), with each line holding a possible recommendation (sorted by score). You might ask: what can go wrong? Everything! Connections to S3 can time out or die for various reasons, your recommendation logic can fail to generate a recommendation because the state of the member has changed drastically, etc. For these reasons you also want your job to be able to resume gracefully from where it stopped, without redoing the recommendations it has already made.
I wanted to share our final solution to this as a Python code snippet that uses Boto (a great Python AWS library) and Python’s generators to provide a record-by-record interface to S3. Please do let us know what you think or whether this was helpful for your project.
Here is a sample of the methods provided in the snippet and a quick usage example:
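What follows is not the original snippet, only a minimal sketch of the record-by-record generator idea with a resumable checkpoint. The names `iter_records`, `read_lines` and `checkpoint` are hypothetical. `read_lines` is assumed to be a callable that returns the lines of one file; with Boto it would wrap an S3 download, and it is replaced here by an in-memory dict so the example runs without AWS credentials.

```python
def iter_records(file_names, read_lines, checkpoint):
    """Yield one record at a time, skipping what `checkpoint` marks as done.

    `checkpoint` is a plain dict such as {"file": "part-0000", "line": 3},
    meaning the first 3 lines of part-0000 (and all earlier files) are done.
    Persisting it between runs is left to the caller.
    """
    for name in sorted(file_names):
        if name < checkpoint.get("file", ""):
            continue  # the whole file was consumed in an earlier run
        skip = checkpoint.get("line", 0) if name == checkpoint.get("file") else 0
        for line_no, line in enumerate(read_lines(name)):
            if line_no < skip:
                continue
            yield line.rstrip("\n")
            # Execution resumes here only when the consumer asks for the next
            # record, i.e. after it finished handling the previous one.
            checkpoint["file"], checkpoint["line"] = name, line_no + 1


# In-memory stand-in for the files of one batch.
fake_files = {"part-0000": ["app_a\n", "app_b\n"], "part-0001": ["app_c\n"]}
read = lambda name: iter(fake_files[name])

checkpoint, handled = {}, []
try:
    for record in iter_records(fake_files, read, checkpoint):
        if record == "app_b":
            raise RuntimeError("simulated failure while recommending")
        handled.append(record)
except RuntimeError:
    pass

# A second run picks up at the record that failed, not at the beginning.
resumed = list(iter_records(fake_files, read, checkpoint))
```

Because the checkpoint only advances after the consumer comes back for the next record, a record that fails mid-processing is retried on the next run, while records that were already handled are skipped.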
Interested in recommendation systems and other cool data engineering stuff? We at Jana do a lot with data (and have a lot of it!) and we’re always hiring!