Experimental Design with A/B Tests
Tech culture emphasizes the ability to quickly test and iterate.
Rather than trying to prove in theory that something works, companies prefer to put a small group of users or customers through the new experience to see whether it’s actually better. That’s why experimental design and analysis is so important in BizOps interviews: it shows the interviewer that you can set up these experiments correctly and understand their outcomes.
Experimental design and analysis matters more for product BizOps teams than for go-to-market BizOps teams, but understanding the thought process is important in both.
Over the course of this lesson, we will review how to set up a successful experiment.
What is A/B Testing?
A/B testing is a simple process tech companies use to determine whether a change to the business or product positively affects business outcomes. Typically, the team will create different variants of experiences and assign random groups of users to go through each of them. Then, the team will compare the data coming out of each variant to see which yields the best outcome in terms of user or business metrics.
Let’s take the example of an email campaign.
Facebook might have an email campaign trying to bring users onto the site by highlighting the activities within the users’ network (e.g., posts and messages).
The team might have three variants of this email:
- Variant 1: The email directly states what the activity is, something like “Your friend Amy posted this message on her wall.”
- Variant 2: The email highlights the people who have been active, something like “Your friends Amy, Michael, and Joe all recently posted.”
- Variant 3: No email campaign.
Variants 1 & 2 each test a hypothesis about what brings users onto the site: the activity or the people, respectively. Variant 3 is called the control, or the original experience, because it receives no change and serves as the baseline for comparison. The team would then compare user engagement across the variants to see which was most successful in bringing people to Facebook, and adopt that variant as the preferred experience.
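As a rough illustration of how users might be split across the three variants, here is a minimal sketch that hashes a user ID into a variant. The variant names, the experiment name, and both function names are hypothetical, and real experimentation platforms handle assignment in their own ways.

```python
import hashlib

# Hypothetical labels for the three email variants described above.
VARIANTS = ["activity_email", "people_email", "no_email"]


def assign_variant(user_id: str, experiment: str = "reengagement_email") -> str:
    """Map a user to one variant, deterministically and roughly uniformly.

    Hashing the experiment name together with the user ID means the same
    user always lands in the same variant for this experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def visit_rate(visited: int, assigned: int) -> float:
    """Share of assigned users who came back to the site."""
    return visited / assigned if assigned else 0.0


# Example usage: assign a few made-up user IDs.
for uid in ["amy", "michael", "joe"]:
    print(uid, assign_variant(uid))
```

Comparing visit_rate across the three groups is then the starting point for deciding which variant brought the most people back.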
The A/B Testing Process
It’s important to demonstrate in interviews for bizops roles that work with product teams that you have some basic understanding of how companies run A/B tests or experiments. There are many nuances when it comes to setting up A/B tests. While this list is by no means exhaustive, it helps illustrate some of the key things to think about.
- Before running any A/B test, you need to set the three types of metrics that we discussed in the last lesson: true north, guidepost, and counter metrics. Setting metrics is an art and a science because it requires balancing speed, confidence, and feasibility.
- For an A/B test to be valid, users need to be randomly assigned to the groups, and the groups should be balanced on characteristics that we would expect to affect behavior, such as age, other demographics, and geographical location.
- For companies with global user bases, you would typically limit the A/B test to one geographical location and language to simplify the testing process.
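One common way to randomize while controlling for a known characteristic is stratified assignment: shuffle users within each stratum (say, each country) and split each stratum evenly. The sketch below assumes a single stratum label per user; the function name and the input shape are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(users, seed=0):
    """Randomly split users into control and treatment within each stratum.

    users: iterable of (user_id, stratum) pairs, e.g. ("u1", "US").
    Returns a dict mapping user_id to "control" or "treatment".
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_stratum = defaultdict(list)
    for user_id, stratum in users:
        by_stratum[stratum].append(user_id)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        for i, user_id in enumerate(members):
            # Alternating after the shuffle keeps each stratum split evenly.
            assignment[user_id] = "treatment" if i % 2 == 0 else "control"
    return assignment
```

Because every stratum is split evenly, neither group ends up over-represented in one country or age band by chance.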
Typically, you gradually increase the share of users who see the new experience. This share is referred to as the ramp percentage.
There are several steps in this process.
Step 1: Test a small group
First, run the test with a small group of users to make sure the new product is providing the expected experience. Typically you want to test the experience with 1% of users to ensure the experience is not broken.
Step 2: Check for directional indicators
Then, increase the ramp percentage to 5% or 10% to check for a directional indication of your hypothesis. For example, if your hypothesis is that people are interested in seeing emails about their friends’ activities on Facebook, you would want to check that users receiving these emails are opening and clicking on them.
Step 3: Increase the ramp percentage
Finally, you ramp the experiment to 50% original experience and 50% new experience. This ramp percentage will allow you to most quickly build statistical confidence in understanding whether the new experience is better than the old one or not. We will talk more about interpreting A/B test results next.
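The ramp steps above can be sketched in code. The idea in this minimal example is to give each user a stable bucket between 0 and 100, so that anyone exposed at 1% stays exposed at 10% and at 50%. The function names and the experiment name are hypothetical, and this is one possible approach rather than how any particular company does it.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> float:
    """Map a user to a stable number in [0, 100) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return (int(digest, 16) % 10_000) / 100.0


def in_new_experience(user_id: str, experiment: str, ramp_percentage: float) -> bool:
    """True if the user should see the new experience at this ramp percentage."""
    return bucket(user_id, experiment) < ramp_percentage


# Example usage: the same users are checked at each ramp step (1%, 10%, 50%).
users = [f"user{i}" for i in range(1000)]
for ramp in (1, 10, 50):
    exposed = sum(in_new_experience(u, "new_email", ramp) for u in users)
    print(f"ramp {ramp}%: {exposed} of {len(users)} users exposed")
```

Keeping early users in the new experience as the ramp grows avoids flipping people back and forth between experiences mid-test.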
If you want to read through a step-by-step guide for how an email campaign A/B test is done, see HubSpot’s guide.
In addition to what we’ve covered in the metrics module, we want to highlight three practical considerations when setting metrics for A/B tests.
First, consider what’s available to track. What metrics is the business tracking today and how can you most efficiently leverage that existing infrastructure?
Then, consider the duration required to see impact. If the goal is to test whether an email drives customer retention in 3 months, then the test will take at least 3 months to show impact. An early indicator like email click-through rate or one-week retention could be good enough to indicate if the strategy is effective. This is where leading metrics become much more appealing than lagging metrics.
Finally, think about any unintended consequences that the test you’re running could have on the success metrics you’ve set. For example, email campaigns can result in people unsubscribing from your email, so the increase in engagement also comes with a decrease in the ability to reach your audience. There can also be network effects from your test, so the framing has to extend beyond the immediate test to the entire business and product ecosystem.
Interpreting Results
The goal of an A/B test is to understand whether one experience is better than another at a statistically significant level. Typically, the company will have an agreed-upon threshold for how certain it needs to be that an observed difference is not just random noise, most commonly a 95% confidence level. This corresponds to a significance level of 0.05: the experiment is considered statistically significant when the p-value (the probability of seeing a difference at least this large if the two experiences actually performed the same) falls below 0.05. If these words sound familiar to you from a high school or college stats class, you should be fine. If you’re curious for a detailed explanation of the statistics involved, see this article.
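To make the significance check concrete, here is a minimal sketch of a two-proportion z-test, one standard way to compare conversion rates between two groups. It uses only the Python standard library. The function name and the example counts are made up for illustration, and a real analysis would more likely rely on an established statistics package.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the new experience.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both groups perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Example usage with made-up numbers: 10% vs 15% click-through on 1,000 users each.
p = two_proportion_p_value(100, 1000, 150, 1000)
print("significant at the 0.05 level:", p < 0.05)
```

A p-value below 0.05 here corresponds to clearing the 95% threshold discussed above.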
In the context of a BizOps interview, you won’t need to know all the details of the calculations, but it will be helpful for you to understand the following practical concerns:
- The more data you have around an experiment, the better you’ll be able to detect smaller differences in performance. That’s why holding an experiment for a longer period is helpful.
- When splitting users into two experiences, you can achieve statistical significance the fastest at 50% each. So if you’re limited on time for data collection, ramp to 50% right away once you’re certain the experience is not broken.
- A statistically significant result can still be wrong. At a 95% confidence level, roughly 1 in 20 tests of a change with no real effect will still appear significant by chance. Companies accept that level of uncertainty for the sake of being able to make decisions quickly. That’s why, rather than relying only on the statistical significance of the outcome metric, teams often want to confirm that the funnel metrics are also moving in the right direction.
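The first two points above can be illustrated with the standard error of the difference between two conversion rates: the smaller it is, the smaller the difference you can detect. This sketch assumes a simple conversion metric with the same base rate in both groups; the function name and the numbers are hypothetical.

```python
import math


def standard_error(total_users: int, treatment_share: float, base_rate: float) -> float:
    """Standard error of the difference in conversion rate between two groups.

    total_users: users in the experiment across both groups.
    treatment_share: fraction of users in the new experience (between 0 and 1).
    base_rate: assumed conversion rate in both groups.
    """
    n_treatment = total_users * treatment_share
    n_control = total_users * (1 - treatment_share)
    variance = base_rate * (1 - base_rate) * (1 / n_treatment + 1 / n_control)
    return math.sqrt(variance)


# Example usage: compare different splits of the same 10,000 users.
for share in (0.01, 0.10, 0.50):
    print(f"{share:.0%} in new experience:", standard_error(10_000, share, 0.10))
```

With the total held fixed, the 50/50 split gives the smallest standard error, and collecting four times as much data cuts it in half.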
Recap
Tech prizes quick iteration, so running experiments will likely be part of your job in bizops. You may be asked to propose a hypothetical experiment during a case question, so it's helpful to review how to set up and interpret results from one of tech's favorite tools - A/B tests.
To show you know how to design and analyze experiments:
- Set intelligent metrics first. Be sure to spend time thinking through counter (guardrail) metrics to monitor any potential negative side effects of your test.
- Set expectations for an acceptable level of uncertainty before diving into the details of the test. This is a great question to get into with your interviewer.
- Select user groups carefully - they should be randomly selected, and confounding variables should be controlled for (as much as possible, anyway).
- Start with small groups and ramp up accordingly. Consider how long you should expect to run a test to see results.
- Don't forget to articulate how your results may be wrong!
