Monitoring Mobile Network Performance

I have a confession to make. I secretly love flying. It’s the buildup before you go on a big trip: going through security, waiting in front of the gate watching other planes take off, hearing the captain say “flight attendants, please prepare for takeoff” as you click the buckle across your waist. Then the need for speed builds as the engines rev up and the plane barrels down the runway.

And as a level-headed traveler, I’m hoping to avoid all delays, traffic jams, layovers, lost baggage, and smelly passengers, much like requests going from a client to a server (minus the excitement). Now, let’s think about a flight from India to the US. It takes an average of 20 hours of in-flight time. Compare that with a flight from Boston to New York: only 1 hour. Hopefully, we don’t have to wait that long for network requests to work. You can think of the in-flight time as latency, that is, how long it takes for a packet to travel from its source to its destination. This depends on how far it has to travel, how many times it gets rerouted, etc. Chances are if you’re flying to India, you’ll need to make a stop somewhere along the way. Similarly, a packet may have to hop between multiple networks in order to get to its final destination. The round trip time (RTT) is the time from when a packet is sent until its acknowledgment arrives back at the sender (the client sends a request to the server, and the server acknowledges it).

Now let’s say that we have 2 planes: one that can hold 10 passengers (plane A) and the other that can hold 100 (plane B). At customs, it takes 10 minutes for all of plane A’s passengers to get through the checkpoint and 2 hours for plane B’s. Doing some simple math, we get 1 passenger/minute for plane A (10/10) and 5/6 of a passenger/minute for plane B (100/120). The rate at which customs can process passengers is analogous to the bandwidth of a network (measured in bits (b) or bytes (B) per second). In reality, the bandwidth of a network is a theoretical limit, whereas throughput is the actual rate at which bits/bytes are transferred (never higher than the bandwidth). If you think of a channel as a pipe, bandwidth is the maximum rate at which water can travel through the pipe section, but throughput is the actual rate of water going through it. When measuring bits/sec, it is important to note that you are actually getting the throughput of a network, and that can fluctuate quite drastically. The two terms are often used interchangeably, but we’ll be specifically talking about throughput here.

How to use this knowledge to measure a request:

OK, so now that we’ve gone through some of the terms, let’s talk about what happens when a request is actually sent on mCent. The request has to establish a connection with our server, the server has to process it, and the response has to make its way back. How long does that take and where are the bottlenecks?

At a high level, when an mCent user makes a request, it first goes to the baseband processor on the member’s phone, which relays the request to a cell tower (or cell site). It’s here that latency can vary widely based on the number of devices the cell site is servicing. Skipping a couple of steps, the request is sent to a core network owned by a carrier (such as Airtel India, Vodafone, AT&T, Verizon, etc.), which sits between the carrier’s private network and the public internet. Once the request is received by the mCent cluster, we process it and send the response back to the user. Most of this we don’t have much control over, but we are able to specify network protocols that can reduce the round trip time.

Now let’s look at some strategies to measure requests from our app. We want to be able to slice these measurements a few ways when analyzing bottlenecks. Most commonly we look at timing percentiles by endpoint, status code (200 successes, 400 failures, etc.), request/response size, and network type (LTE, Wi-Fi, HSPA, etc.; for a deeper description of what these are you can check them out here), so we’ll want to be sure to keep track of these things as well:

Time to first byte (TTFB) – We can measure this from the user’s device by timing how long it takes from when the request is sent to when the first byte of the response is received from the server, before the response is processed. This includes any latency and throughput limitations and, in my opinion, is the most important metric in evaluating network performance. We can measure every part of a request, but this gives the most accurate portrayal of performance from the user’s perspective.
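As a rough illustration of the TTFB measurement described above, here is a minimal Java sketch. It is not mCent's client code: `TtfbTimer`, `RequestSender`, and `Result` are hypothetical names, and the request is simulated with an in-memory stream so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Times how long a request takes to yield its first response byte. */
final class TtfbTimer {

    /** Hypothetical hook: sends the request and returns the response body stream. */
    interface RequestSender {
        InputStream send() throws IOException;
    }

    /** Nanoseconds until the first byte, plus that byte (-1 if the body was empty). */
    static final class Result {
        final long elapsedNanos;
        final int firstByte;

        Result(long elapsedNanos, int firstByte) {
            this.elapsedNanos = elapsedNanos;
            this.firstByte = firstByte;
        }
    }

    static Result measure(RequestSender sender) {
        long start = System.nanoTime(); // monotonic clock, safe for measuring intervals
        try (InputStream in = sender.send()) {
            int first = in.read(); // blocks until the first byte or end of stream
            return new Result(System.nanoTime() - start, first);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real response body so the example runs offline.
        byte[] body = "{\"ok\":true}".getBytes(StandardCharsets.UTF_8);
        Result r = measure(() -> new ByteArrayInputStream(body));
        System.out.println("TTFB: " + r.elapsedNanos / 1_000_000.0 + " ms");
    }
}
```

On a device, the `RequestSender` would wrap the real HTTP call (for example, opening an `HttpURLConnection` and returning its `getInputStream()`), so the clock covers connection setup, the uplink, server time, and the first byte of the downlink.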

Throughput – For our app, we used Facebook’s Network Connection Class: we started sampling when the request was sent and stopped sampling when the response was received. We took the kbit/sec value provided by that library and also recorded the hour of the day at which it was measured so we could get an idea of the fluctuation throughout the day. Here’s a brief preview of how this could work (in Java).
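The sketch below is a stand-in that uses only the Java standard library, not Network Connection Class itself: it times a full response read, converts bytes and elapsed time to kbit/sec, and tags the sample with the local hour. The names `ThroughputSample`, `kbitsPerSecond`, and `sample` are made up for illustration, and 1 kbit is assumed to mean 1000 bits.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.time.LocalTime;
import java.util.function.Supplier;

/** One throughput measurement, tagged with the hour of day it was taken. */
final class ThroughputSample {
    final double kbitsPerSec;
    final int hourOfDay;

    ThroughputSample(double kbitsPerSec, int hourOfDay) {
        this.kbitsPerSec = kbitsPerSec;
        this.hourOfDay = hourOfDay;
    }

    /** Converts a byte count and a duration to kbit/sec (assuming 1 kbit = 1000 bits). */
    static double kbitsPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        double bits = bytes * 8.0;
        double seconds = elapsedNanos / 1_000_000_000.0;
        return bits / seconds / 1000.0;
    }

    /** Starts timing when the request is sent and stops once the response is fully read. */
    static ThroughputSample sample(Supplier<InputStream> sendRequest) {
        byte[] buffer = new byte[8192];
        long totalBytes = 0;
        long start = System.nanoTime();
        try (InputStream in = sendRequest.get()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                totalBytes += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Guard against a zero interval on very fast (e.g. in-memory) reads.
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return new ThroughputSample(kbitsPerSecond(totalBytes, elapsed), LocalTime.now().getHour());
    }

    public static void main(String[] args) {
        // In-memory stand-in for a 30KB response so the example runs offline.
        ThroughputSample s = sample(() -> new ByteArrayInputStream(new byte[30_000]));
        System.out.println(s.kbitsPerSec + " kbit/sec at hour " + s.hourOfDay);
    }
}
```

A single small response gives a noisy number, so samples like these are more useful aggregated by hour and by network type than read individually.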

Server processing time – If we know how long our endpoints take to run, we can make adjustments accordingly. There are a lot of tools out there to monitor this, but if you’re building it yourself, it’s not too hard to set up. We use a Flask application, so we take timings before and after each request is processed. This helps us know if there’s some logic that we added in our API that’s slowing everything down.
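The before/after timing idea is not specific to Flask. The following is a minimal sketch of the same pattern, written in Java rather than as Flask code: a wrapper records how long each handler call takes, keyed by endpoint. `TimedEndpoints`, `timed`, `durationsFor`, and the `/offers` endpoint are all hypothetical names, not part of mCent's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Records per-endpoint processing times by wrapping request handlers. */
final class TimedEndpoints {
    private final Map<String, List<Long>> durationsNanos = new ConcurrentHashMap<>();

    /** Returns a handler that behaves like the original but also records its run time. */
    <Q, R> Function<Q, R> timed(String endpoint, Function<Q, R> handler) {
        return request -> {
            long start = System.nanoTime(); // "before request"
            try {
                return handler.apply(request);
            } finally {
                // "after request": runs even if the handler throws
                long elapsed = System.nanoTime() - start;
                durationsNanos
                    .computeIfAbsent(endpoint, k -> Collections.synchronizedList(new ArrayList<Long>()))
                    .add(elapsed);
            }
        };
    }

    /** Snapshot of the recorded durations for one endpoint (empty if none yet). */
    List<Long> durationsFor(String endpoint) {
        List<Long> recorded = durationsNanos.get(endpoint);
        return recorded == null ? new ArrayList<Long>() : new ArrayList<Long>(recorded);
    }

    public static void main(String[] args) {
        TimedEndpoints timings = new TimedEndpoints();
        Function<String, String> offers = timings.timed("/offers", body -> body.toUpperCase());
        offers.apply("hello");
        System.out.println("/offers calls timed: " + timings.durationsFor("/offers").size());
    }
}
```

Recording in a `finally` block means slow failures are timed too, which matters when a 400 or 500 response is the slow path.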

There are also many performance monitoring services out there like New Relic or Runscope that are worth checking out for more out of the box solutions.
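If you do roll your own, the slicing mentioned earlier (timing percentiles by endpoint or by network type) takes little code once the timings are recorded. This sketch assumes the nearest-rank definition of a percentile, which is one of several in common use. The class name, method names, and sample values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Nearest-rank percentiles over timing samples, overall or per group. */
final class TimingPercentiles {

    /** Smallest sample such that at least p percent of samples are at or below it. */
    static long percentile(List<Long> samples, double p) {
        if (samples.isEmpty() || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size()); // 1-based rank
        return sorted.get(rank - 1);
    }

    /** Applies percentile() to each group, e.g. keyed by endpoint or network type. */
    static Map<String, Long> percentileByGroup(Map<String, List<Long>> byGroup, double p) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : byGroup.entrySet()) {
            result.put(entry.getKey(), percentile(entry.getValue(), p));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up TTFB samples in milliseconds, grouped by network type.
        Map<String, List<Long>> ttfbMs = new HashMap<>();
        ttfbMs.put("LTE", Arrays.asList(120L, 80L, 100L, 300L));
        ttfbMs.put("HSPA", Arrays.asList(900L, 400L, 650L, 2000L));
        System.out.println("p50 by network: " + percentileByGroup(ttfbMs, 50));
        System.out.println("p95 by network: " + percentileByGroup(ttfbMs, 95));
    }
}
```

Percentiles are more telling than averages here because a handful of very slow requests on a congested cell site can hide behind a healthy-looking mean.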

Just by measuring these things, we can get a better sense of how to optimize our requests and make better decisions about our architecture. For example, we found that requests from India have very low throughput (4KB/sec) and we were returning rather large responses (30KB) on certain endpoints. That would take 7.5s of transfer time! We looked into the responses and realized that a lot of this was unnecessary and redundant information being returned.
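The 7.5s figure is simply response size divided by throughput. Here is a small sketch of that arithmetic. It ignores latency and connection setup, and the 10KB "trimmed" size is a made-up number for comparison, not a measured one.

```java
/** Back-of-the-envelope transfer time: payload size divided by throughput. */
final class TransferTime {

    static double seconds(double payloadKilobytes, double throughputKilobytesPerSec) {
        if (throughputKilobytesPerSec <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return payloadKilobytes / throughputKilobytesPerSec;
    }

    public static void main(String[] args) {
        System.out.println(seconds(30, 4)); // 30KB at 4KB/sec, prints 7.5
        System.out.println(seconds(10, 4)); // hypothetical trimmed 10KB response, prints 2.5
    }
}
```

At throughput this low, every kilobyte removed from a response saves a quarter of a second for the user.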

Also, we can monitor country level trends and see when it would make the most sense to move toward multiple data centers so that we would not have to transmit the data as far to our users. When we better understand mobile network conditions in countries halfway around the world, we can make informed adjustments to improve the user experience. Even with poor network conditions, it’s possible to make our app fly.

Throughput in India by network type. Measured by mCent (Kbits/sec)

Love fine tuning app performance? Join us!

 

 
