A/B testing is standard practice for companies of all shapes and sizes. While there is a certain time in a product’s maturity to be testing 41 shades of blue, making decisions based on (properly analyzed) data from real users generally wins out over intuition. At Jana, we run dozens of experiments each month and most of them go smoothly. We use a homegrown experimentation framework, but there are plenty of off-the-shelf solutions to choose from. Along with these technical solutions come countless articles and tutorials on best practices for analyzing experiment data and applying proper statistical techniques. While there is a lot to consider in the analysis step, danger can lurk in other places too. Before p-values and multiple-comparison traps come into play, you need to make sure your samples include the right users to compare.
Let’s say you’ve just designed and implemented a new feature for your app and you’re ready to roll it out to users to see how it impacts engagement. You want to run a randomized controlled trial that puts some fraction of your users in a control variant (A) and another fraction in a treatment variant (B) that can use the new thing you’ve built. You already have a set of metrics to track for users in each group so you can pick a winner. All that’s left is to drop a few lines of code into your app that will decide which group a user is in…but where do you put them? Your choice can have a much bigger impact on your experiment than you may think.
Here are a few tips from our experience at Jana:
Save variant assignments
This one seems obvious, but occasionally gets missed. Make sure you save a record of which experiment variant each user was assigned to. It’s entirely possible to write an experiment framework that correctly decides what experience each person should see but never logs any of its decisions, making any later analysis almost impossible. Commercial experiment frameworks should all offer this as a feature, and it’s easy to bake into your own by adding a logger and/or a database write to the function that does the group assignment. At Jana, we’ve even added an alert that notifies us if an experiment is running but no assignment logs are coming in.
It’s tempting to pre-compute experiment variant assignments and store them for later. The problem is that you often end up including users who never actually show up for the experiment. Scanning your user database will naturally assign lapsed users to your experiment along with active ones, and you’ll need to do a lot of extra analytics later to filter them out. Pre-computing also makes it hard to predict how many users will actually end up in your final analysis. Ideally, you have enough users in each group to draw statistically robust conclusions without subjecting everyone to constant UX changes or running costly experiments at scale. Instead, assign variants (and log your assignments) at the moment the experience is actually used. That way, you know exactly which individuals were active and participating in the experiment.
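Both points can be combined in one place: make the assignment function itself record its decision, and only call it when the user actually reaches the experiment. A rough sketch of what this might look like (all names here are illustrative, not part of any particular framework; in production the log line would likely be a database write as well):

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("experiments")

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to a variant and log the decision.

    Called at the moment the user actually reaches the experiment,
    not in an up-front batch job, so only active participants are logged.
    """
    # Stable hash of (experiment, user): the same user always lands in
    # the same variant, so repeated calls are safe and idempotent.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    variant = variants[int(digest, 16) % len(variants)]

    # Persist the assignment so the analysis step knows exactly who
    # participated and which experience they saw.
    logger.info(json.dumps({
        "event": "experiment_assignment",
        "experiment": experiment,
        "user_id": user_id,
        "variant": variant,
        "assigned_at": datetime.now(timezone.utc).isoformat(),
    }))
    return variant
```

Because the assignment is a pure function of (experiment, user), you can call it wherever the experience is rendered without worrying about a user flip-flopping between variants.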
Give users a chance to change
Think carefully about the most appropriate time to assign and log a user’s variant in an experiment. Let’s say you are running many concurrent experiments, and as soon as a user opens your app, you run a batch job making assignments (and logs) for each one. This is nice and easy from an engineering perspective, but it presents problems for analysis. Consider a new feature that users reach by opening a menu and clicking on a tab. Only 20% of users ever click on this menu. That means 80% of the users in your treatment group will never even know the new thing exists, and we can hardly expect them to change their behavior. It’s better to only assign and log a user as being in an experiment once they have been exposed to a UX change they can act on. In this example, you would only put a user in the experiment after they clicked on the menu. Make sure the entry point for both the treatment and the control is in the same place to reduce the chance of biases like engaged users being more likely to click the menu in the first place.
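A minimal sketch of exposure-time enrollment. Here `assign_and_log` is a hypothetical helper standing in for your framework’s bucketing-plus-logging call, passed in for illustration; the key point is that it runs at the shared entry point, not at app open:

```python
def handle_menu_tab_click(user_id: str, assign_and_log) -> str:
    """Shared entry point for control and treatment.

    The user is only enrolled (assigned and logged) here, once they
    have clicked through to the tab and can actually see a difference.
    `assign_and_log` is a stand-in for the real bucketing + logging call.
    """
    variant = assign_and_log(user_id, "new_feature_tab")
    if variant == "B":
        return "new_feature_ui"  # treatment: render the new thing
    return "old_tab_ui"          # control: render the existing tab
```

Keeping one entry point for both groups means control and treatment users face the same selection pressure (they all had to click the menu), so the 80% who never open it simply never enter the experiment.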
Sometimes you need to know which variant of an experiment someone is in before they reach the point where they can act on it. For example, you might need to start loading a UI or pulling down extra data for a new feature before the user has requested it. In these cases, an engineer might “peek” at which variant a user would end up in if they happen to interact with the relevant part of the app. A peek allows resources to be gathered without causing an assignment to be logged (see “Save variant assignments” above). Peeking is fine from an experiment design standpoint as long as it doesn’t alter the UX in some way, like slowing the app down while a background process completes.

The biggest risk, though, is forgetting to log when a user actually does get a chance to interact with the experimental feature. You don’t want to log at the time of the peek, because the user has had no chance to act, but you need to remember your job is not done. A few times at Jana we have run experiments where we “peek” at a user’s assignment, then use the result of the peek to determine later whether a user was or was not in the experiment. This comes up frequently when testing things early in a user’s lifecycle, like a splash screen or a registration page: there just isn’t enough time to make a network call to a server to log an experiment assignment before the user needs to see something on the screen. Be extra careful to keep track of these assignments and log them later.
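One way to keep a peek and its eventual log entry paired up is to hold peeked assignments until the real exposure happens. This is a hedged sketch, not our actual framework; `bucket_fn` and `log_fn` stand in for your deterministic variant chooser and your logging call:

```python
class ExperimentClient:
    """Separates peeking at a variant from logging an exposure."""

    def __init__(self, bucket_fn, log_fn):
        self._bucket = bucket_fn  # deterministic (user, experiment) -> variant
        self._log = log_fn        # e.g. a network call or database write
        self._pending = {}        # peeked assignments awaiting real exposure

    def peek(self, user_id, experiment):
        # No logging here: the user has not had a chance to act yet.
        variant = self._bucket(user_id, experiment)
        self._pending[(user_id, experiment)] = variant
        return variant

    def log_exposure(self, user_id, experiment):
        # Called once the user actually sees the feature, e.g. after a
        # pre-loaded splash screen renders. Flushes the earlier peek.
        variant = self._pending.pop((user_id, experiment), None)
        if variant is None:
            variant = self._bucket(user_id, experiment)
        self._log(user_id, experiment, variant)
        return variant
```

Because bucketing is deterministic, `log_exposure` produces the same variant whether or not a peek happened first; the pending map simply makes forgotten exposure logs easy to spot in testing.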
If you’re interested in helping us run great experiments, we’re hiring!