
How to Answer A/B Testing Questions


A/B testing follows the same fundamental principles of hypothesis testing: formulating hypotheses, collecting data, choosing a test statistic, calculating p-values, and making decisions based on statistical significance.

When asked to design an A/B test, follow this 8-step framework below to guide your answer.

  • Step 1: Define the problem. Clarify the problem and goals. Discuss the user journey and specific user segments to define the scope.
  • Step 2: Identify key metrics. Define 1-2 metrics for each of the following: north star, primary success, guardrail, and secondary metrics.
  • Step 3: Select the unit of randomization. Discuss the unit of randomization and triggering criteria that you’ll select to ensure representative samples.
  • Step 4: Formulate hypotheses. State the null and alternative hypotheses.
  • Step 5: Select a statistical test. Consider the nature of the data and the hypothesis to select the appropriate statistical test.
  • Step 6: Conduct power analysis. Discuss the inputs and output of the power analysis. Discuss the ramp-up strategy, if needed.
  • Step 7: Analyze test results. Talk through how and when you would analyze test results.
  • Step 8: Evaluate and make recommendations. Discuss what evidence you’d need to launch the A/B test to 100% of the population.

A/B Test Framework

We’ll use this example interview question to walk through the framework:

“Imagine you ran a campaign to attract new customers to use a product. Identify success metrics for the campaign, then design an experiment to determine if the campaign should continue.”

Step 1: Define the problem

Define the problem statement

Clarify the problem and goals to align with the interviewer's problem statement. This is one of the most important steps since the rest of your answer might not be accurate if you miss crucial details here.

Examples of helpful clarifying questions include:

  • Why is this experiment important for the company?
  • What decisions will the experiment drive?
  • What does the company hope to learn from it?

Often the interviewer will turn the question back to you, so we recommend reviewing the company’s mission, key metrics, and products. You can find this information in a company’s earnings reports, such as this report from Amazon.

Assess the product or feature's current state and the proposed change. To increase your understanding of the feature, ask clarifying questions such as,

  • What are the steps in the current feature/product workflow?
  • What is the proposed change in the workflow?
  • In which app versions/platforms will the change be tested?
  • In which countries will the change be tested?
  • Will this change apply to premium users, free users, or both?

Define the problem scope

After defining the problem statement, discuss the user journey and the specific user segments for the proposed test to clearly define the scope. While the relevant segments depend on the specific scenario, common ones include platform (e.g. mobile app vs. web), geographical region, and user type.

In the campaign metrics example, you could say,

“Let’s start with some clarifying questions:

  • The goal of the campaign is to attract new customers, but what is the bigger goal or KPI this falls under? Is it growth, profitability, or something else?
    • This is important to understand when deciding metrics. Ideally, growth and profitability go hand in hand but sometimes there can be trade-offs. For example, growth can be achieved by acquiring users with financial incentives, but if they don’t convert to long-term profitable users, profitability will suffer.
  • What specific customer action are we interested in when we say “use a product?” Is it sign-ups, buying the product, or some other action?
    • Here, we are told it is sign-ups for a paid subscription product.
  • Are we interested in specific user segments?
    • The interviewer turns this question back to the interviewee. While exact user segments depend on the campaign and company, one segment of interest could be users who already use other products of the company, i.e. existing users vs new users.
  • What is the cost and format of the campaign? Is it an email campaign?
    • Ideally, the cost of the campaign should not exceed the value it brings to the company. We are told the campaign is a pop-up on the product’s landing page that offers users an incentive to sign up (e.g. a free month). So, the cost of the campaign is the cost of the incentive.

With this information, we’ll first clarify the user journey. When a user arrives at the landing page of the product, the campaign is in the form of a pop-up that offers an incentive to sign up. Once a user has signed up, they no longer see the campaign pop-up.

If they haven’t signed up, we could choose to show the campaign each time they arrive at the landing page or impose a frequency cap. We could also test multiple variants if we don’t have a strong point of view. Inputs from product and user research will help guide this decision in a real-world scenario.”

Step 2: Identify key metrics

Metrics should be easily measurable and quantifiable, with clear definitions and methods for data collection. They should also be sensitive enough to detect meaningful differences between the variations being tested.

Define 1-2 metrics in each of the categories below:

North star

A north star metric is a single, high-level metric that encapsulates a product or business's core value or ultimate goal. It can also serve as a tie-breaker when there is a tradeoff in different company goals.

For example, Monthly Active Users (MAU) reflects user engagement and retention, which are critical for a company’s success. If a feature change increases revenue but reduces MAU, the company may not ship the change.

Primary success

Primary success metrics directly measure progress toward the desired outcomes of the A/B test. They are closely tied to the north star metric but focus on a specific aspect of the user journey or product experience.

For example, documents created per user is a suitable primary success metric if an experiment aims to improve the document creation workflow.

Guardrail

Guardrail metrics monitor and mitigate risks associated with achieving the north star or primary success metrics. These metrics act as "guardrails" to ensure that the pursuit of growth or success does not come at the expense of other important factors such as user experience, sustainability, or compliance.

For example, when increasing ad load in an app, user time spent on the app is a guardrail metric to ensure user engagement is not negatively affected.

Secondary

Secondary metrics provide valuable insights into specific aspects of performance or user behavior but are not considered as critical as the north star or primary success metrics. These metrics are often used for deeper analysis, optimization, or understanding of user needs and preferences.

In the campaign metrics example, you could say,

“Let’s identify the key metrics of interest:

Primary success metric: Sign-ups per user. This directly measures the success of the campaign.

North Star metric: Long-term return on investment (ROI). We should assess the financial impact of the campaign by comparing the long-term revenue generated from new customers acquired through the campaign to the costs incurred in running the campaign. A positive ROI indicates that the campaign is profitable, while a negative ROI may indicate that adjustments are needed. A concrete way to measure this is to compare lifetime value (LTV) to the customer acquisition cost (CAC).

Guardrail metric: Conversion rate, i.e. the percentage of sign-ups generated by the campaign that ultimately become paying customers. This metric helps evaluate the effectiveness of the campaign in converting potential customers into actual buyers. Another guardrail metric is retention, i.e. whether these customers remain active beyond the incentive period. For example, if the incentive is a free month, MAU (Monthly Active Users) can be used to measure retention.

Secondary metric: User engagement, i.e. the level of engagement of new customers with the product, such as the frequency of usage, time spent on the platform, or interactions with customer support. Higher levels of engagement indicate that the campaign successfully attracted interested and active users. It might also be useful to measure the bounce rate after a user has clicked on the campaign pop-up to understand friction points in the sign-up process.”
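To make the LTV vs. CAC comparison above concrete, here is a minimal Python sketch; the dollar amounts, sign-up counts, and function names are purely hypothetical illustrations.

```python
# Hypothetical sketch of the LTV vs. CAC comparison for the campaign.
# None of the numbers below are real; they only illustrate the calculation.

def customer_acquisition_cost(campaign_cost: float, new_signups: int) -> float:
    """Total campaign spend divided by the number of new sign-ups it drove."""
    return campaign_cost / new_signups

def roi(ltv: float, cac: float) -> float:
    """Return on investment per acquired customer: (value - cost) / cost."""
    return (ltv - cac) / cac

# Example: a free-month incentive worth $10 redeemed by 5,000 new sign-ups,
# with an estimated lifetime value of $60 per subscriber.
cac = customer_acquisition_cost(campaign_cost=10 * 5_000, new_signups=5_000)
print(f"CAC = ${cac:.2f}, ROI = {roi(ltv=60, cac=cac):.0%}")
```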

Step 3: Select the unit of randomization

Once you’ve defined key metrics, you should select the unit of randomization and triggering criteria. These prevent sampling bias by ensuring that the sample populations for the control and treatment groups are representative of the target population. It is very important that randomization and triggering are implemented correctly, since errors here can invalidate the entire A/B test and lead to incorrect conclusions.

Unit of randomization refers to the individual elements or entities that are randomly assigned to different variations of the experiment (e.g. individual users, visitors to a website, sessions, transactions, or any other discrete entity that interacts with the variations being tested).

For example, in a website A/B test, the unit of randomization could be individual visitors, who are randomly assigned to one of the experiment variations (A or B) when they land on the website.

One of the key assumptions of A/B testing is the Stable Unit Treatment Value Assumption (SUTVA). SUTVA assumes the units are independent of each other, meaning that the treatment assigned to one unit should not affect the outcome of another unit.

Violation of SUTVA can lead to biased estimates of treatment effects and undermine the validity of causal inference. SUTVA may be violated in certain situations such as in social networks where users may be influenced by the behavior or preferences of their connections. For example, the adoption of a new feature or product recommendation may spread through social networks, leading to increased engagement or conversion rates among connected users.

In marketplace platforms, changes to pricing, product listings, or search algorithms may impact user behavior and transaction volumes in both the control and treatment groups. For example, changes to search ranking algorithms may affect the visibility and discoverability of products in the control group as well, influencing user purchasing decisions. In such cases, consider randomization by cluster, time, or geographical region.

Ideally, the unit of randomization should be the same as the unit of analysis (e.g., the user level). However, this sometimes does not hold. In an A/B test with user-level randomization and session-level conversion metrics, we will have multiple sessions from a single user in a given group. Since these measurements are not independent, the variance estimate is biased. The delta method or bootstrapping can correct this bias.
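For example, a user-level bootstrap resamples whole users (the unit of randomization) so that correlated sessions from the same user stay together. Below is a minimal sketch, assuming session-level conversion flags keyed by user; the data and function name are hypothetical.

```python
import numpy as np

def user_level_bootstrap_ci(sessions_by_user, n_boot=2000, alpha=0.05, seed=0):
    """Confidence interval for a session-level conversion rate, bootstrapping
    over users so that correlated sessions from one user are kept together."""
    rng = np.random.default_rng(seed)
    users = list(sessions_by_user)  # resample the unit of randomization
    estimates = []
    for _ in range(n_boot):
        sample = rng.choice(users, size=len(users), replace=True)
        conversions = sum(sum(sessions_by_user[u]) for u in sample)
        sessions = sum(len(sessions_by_user[u]) for u in sample)
        estimates.append(conversions / sessions)
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical data: each user maps to a list of per-session conversion flags.
data = {"u1": [0, 1, 0], "u2": [1], "u3": [0, 0], "u4": [1, 1, 0, 0]}
print(user_level_bootstrap_ci(data))
```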

Select the triggering criteria

Triggering criteria determine when these randomization units are exposed to the experiment and how the random assignment process occurs.

For example, a user's first visit to a website, a specific action taken by the user (e.g. clicking on a link), or a specific time period (e.g. during a promotional campaign) are different types of triggering criteria.

Triggering criteria should ensure that the random assignment occurs at an appropriate point in the user journey and that units are exposed to the experiment consistently and fairly. For example, when testing a change to a button on a particular page, the triggering criteria should be when a user visits that page and views the button. Assigning users who never visit the page adds noise to the experiment and makes it harder to detect changes in key metrics.

In the campaign metrics example, you could say,

“We’ll randomize by user and assign users to the experiment when they first arrive at the landing page.

Randomizing by session is not ideal in this scenario, since it may be confusing for users if they see the campaign in some sessions and not in others. Assigning users before they arrive on the landing page would also introduce noise and make it harder to detect changes, since users who never see the campaign would still be in the experiment.”
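In practice, per-user assignment is often implemented as a deterministic hash of the user ID and experiment name, so a returning user always sees the same variant. Here is a minimal sketch; the experiment name and 50/50 split are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a user: the same user and experiment always
    map to the same variant, keeping the experience consistent across sessions."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Hypothetical usage: assign when the user first reaches the landing page,
# then log the exposure for analysis.
print(assign_variant("user_123", "signup_incentive_campaign"))
```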

Step 4: Formulate hypotheses

State the null and alternative hypotheses in this format:

“The alternative hypothesis is that by implementing feature X, we expect to see a change of Y% in the primary success metric, with no significant regression in guardrail metrics, and an increase or no significant change in the north star metric.

The null hypothesis is no change in metrics by implementing feature X.”

At this point, you can also state additional business or product-related insights you hope to get from the A/B test.

In the campaign metrics example, you could say,

“The alternative hypothesis is that the campaign will increase the primary success metric by X%, while not hurting guardrail metrics and creating a neutral or slight increase in the north star metric.

We will also monitor secondary metrics to learn about user engagement and friction points in the sign-up process. The null hypothesis is the status quo, i.e. no change in key metrics. We will discuss the effect size X% later, when we get to power analysis.”

Step 5: Select a statistical test

Choosing a statistical test invites a more technical discussion than formulating hypotheses. The appropriate statistical test depends on the nature of the data and the hypothesis being tested. The most commonly used statistical tests include:

  • Z-test
  • T-test
  • Chi-square test
  • Analysis of variance (ANOVA)

Multiple statistical tests may be applicable in a given scenario. Practically speaking, A/B testing platforms usually abstract the statistical test away from the user, so the final decision of which test to use is often built into the platform. Also, Z-test and t-test results are nearly identical for large sample sizes, so the choice can come down to how the platform implements the test. It’s still important to discuss these considerations in the interview, since this is a crucial step in hypothesis testing, and if the A/B test results are analyzed off-platform, the data scientist will need to decide which test to use.

Refer to the table below for the uses, statistical formulas, and examples of these statistical tests.

Statistical Tests

In the campaign metrics example, you could say,

“Since our primary success metric is binary, i.e. a user can either sign up or not, the Z-test can be used.”
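For a binary sign-up metric, the two-proportion z-test can be computed directly from the group counts. Below is a minimal sketch; the conversion counts are hypothetical.

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in sign-up rates between groups."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Hypothetical results: 4.0% sign-up rate in control vs. 4.6% in treatment.
z, p = two_proportion_ztest(conv_c=800, n_c=20_000, conv_t=920, n_t=20_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```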

Step 6: Conduct power analysis

Discuss the inputs and outputs of the power analysis. Refer to “How to Answer Power Analysis Questions” to review the four key inputs for power analysis (i.e. effect size, power, significance level, and variance) to determine the sample size required.

In the campaign metrics example, you could say,

“The main inputs for power analysis include:

  • Effect size: typically based on input from the product team and other stakeholders. The effect size should be practically significant, i.e. it has a meaningful impact on the business.
  • Power: the probability of detecting a true effect, which is typically 0.8.
  • Significance level (alpha): the acceptable probability of a false positive (rejecting a true null hypothesis), which is typically 0.05.
  • Variance: typically estimated using historical data.

The output of power analysis is the sample size required to detect the minimum detectable effect (MDE). From the sample size, we can estimate the duration of the experiment needed to get this sample size.

Other factors to consider when determining the duration are seasonality or variability in the metric (e.g. weekday user behavior might be different from weekend) and app adoption, if the campaign feature requires users to update their mobile apps.”
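The sample size and duration estimates above can be sketched with the standard two-proportion formula; the baseline sign-up rate, minimum detectable effect, and traffic figures below are hypothetical.

```python
import math
from scipy.stats import norm

def sample_size_per_group(p_baseline, mde, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect an absolute lift of `mde`
    in a conversion rate (standard two-proportion formula)."""
    p_treat = p_baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # critical value for the desired power
    variance = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Hypothetical inputs: 4% baseline sign-up rate, 0.5 pp minimum detectable effect.
n = sample_size_per_group(p_baseline=0.04, mde=0.005)
daily_visitors = 10_000   # hypothetical landing-page traffic
exposure = 0.5            # 50% of traffic enrolled in the experiment
days = math.ceil(2 * n / (daily_visitors * exposure))
print(f"{n} users per group, roughly {days} days at 50% exposure")
```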

Discuss ramp-up strategy

A ramp-up strategy in A/B testing allows for more controlled and reliable testing. For sudden or large-scale changes that impact a big sample size, you may need to discuss a ramp-up strategy to mitigate risks. While highly recommended, a ramp-up strategy is not mandatory; for small or low-risk changes, it might be okay to start the test directly at the full required exposure.

In the campaign metrics example, you could say,

“To expose 50% of the population to get the required sample size, a gradual ramp-up strategy could be:

  • Day 1: Launch A/B test to 1% of population
  • Day 2: Monitor key metrics for unexpected data. If everything looks good, ramp up to 10% and continue to monitor.
  • Day 3: Assuming no unexpected data, ramp up to 50%.”

Step 7: Analyze test results

Talk through how and when you would analyze test results. For example, you might emphasize running the test for the required duration and getting the required sample size before analyzing the data. You can also discuss how you’d organize the data into relevant segments, such as by app version, user type, or platform.

There could be follow-up questions about potential pitfalls here, such as:

  • “The product manager looks at the test results before the test end date and wants to make a decision without waiting for the test to finish. Why is this not recommended?”
  • “Let’s say you have 30 metrics in your A/B test and only one of them is statistically significant. How would you proceed?”

These questions prompt you to discuss pitfalls like peeking and multiple comparisons, which are defined below.

Peeking: checking the results of an ongoing A/B test multiple times before the predefined test duration or sample size is reached. Peeking increases the Type I error rate (i.e. the probability of a false positive), which can lead to incorrect conclusions.
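A quick simulation illustrates the problem: in an A/A test with no true effect, repeatedly checking the accumulating data and stopping at the first "significant" result rejects far more often than the nominal 5%. The simulation parameters below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_experiments, n_checks, n_per_check = 2_000, 10, 500

false_positives = 0
for _ in range(n_experiments):
    # A/A test: both groups come from the same distribution, so any
    # "significant" result is a false positive.
    a = rng.normal(size=n_checks * n_per_check)
    b = rng.normal(size=n_checks * n_per_check)
    for check in range(1, n_checks + 1):
        n = check * n_per_check
        z = (b[:n].mean() - a[:n].mean()) / np.sqrt(2 / n)
        if 2 * norm.sf(abs(z)) < 0.05:  # peek and stop at "significance"
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")
```

With ten peeks, the observed false positive rate ends up well above the nominal 5%, even though each individual check uses a valid test.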

Multiple comparisons: if your test has many variants and multiple metrics, you are effectively making many comparisons, which increases the likelihood of observing false positives.

Some common methods to address the multiple comparisons issue include the following (a short code sketch follows the list):

  • Bonferroni correction: adjusts the significance level (α) for each individual hypothesis test to control the overall familywise error rate, or probability of making at least one Type I error when conducting multiple hypothesis tests simultaneously. It divides the desired significance level (α) by the number of tests being conducted. While effective in controlling the overall Type I error rate, the Bonferroni correction can be conservative, leading to reduced statistical power.
  • False discovery rate (FDR) correction: controls the expected proportion of false positives among all significant results. This method allows for a more flexible balance between controlling Type I errors and maximizing statistical power.
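Both corrections are available in statsmodels’ multipletests function; here is a minimal sketch with hypothetical p-values for several metrics from the same test.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values for several metrics from one A/B test.
p_values = [0.001, 0.012, 0.030, 0.042, 0.200, 0.510]

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Significant after Bonferroni:", reject_bonf)  # conservative, fewer rejections
print("Significant after FDR (BH): ", reject_fdr)    # more power, controls FDR
```

Bonferroni will typically flag fewer metrics than the Benjamini-Hochberg FDR procedure, reflecting the power trade-off described above.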

In the campaign metrics example, you could say,

“Assuming the test has run for the required duration, we’d analyze the data by overall metrics and relevant segments (e.g. app version, user type, and platform).

It’s pretty unusual for all of your metrics to look perfect. For example, you might notice some regression in monthly retention even though the primary success metric increased. Further analysis of specific user segments could help inform the final decision.”

Step 8: Evaluate and make recommendations

Evaluate the results and make a recommendation. If the impact on metrics is as expected, i.e. a statistically significant positive change in the primary success metric, neutral or positive north star metric, and no significant regression to the guardrail metrics, the A/B test can be launched to 100% of the population.

However, there are other factors you should discuss with the interviewer before making this decision:

Risk and uncertainty

Evaluate the potential risks and uncertainties associated with launching the variation. Are there any potential drawbacks or unintended consequences? What is the level of uncertainty surrounding the test results?

Stakeholder feedback

Gather feedback from key stakeholders, including leadership, product managers, marketers, and customer support teams. What are their perspectives on the test results and proposed changes? This is especially important if there are trade-offs, for example, a positive impact on the primary success metric but a negative impact on guardrail metrics.

Novelty effects

Consider whether the introduction of a new feature or change temporarily influences users' behavior, leading to an initial spike or deviation in metrics before stabilizing or reverting to baseline levels over time. To account for novelty effects, consider creating a holdback group, i.e., a small group of users who continue to get the Control group’s experience, to measure the long-term effects.

In the campaign metrics example, you could say,

“If the impact on metrics is as expected, i.e. a statistically significant positive change in the primary success metric, neutral or positive north star metric, and no significant regression to the guardrail metrics, the A/B test can be shipped to 100% of the population.

If there is a trade-off, further discussions with stakeholders are needed. The leadership at the company may be okay with a short-term trade-off, as long as the north star metric isn’t regressing. Other factors to consider are novelty effects and the risk of shipping.

In some cases it’s also possible that we might want to keep the experiment running longer before we’re comfortable making the final decision. Alternatively, we may run a follow-up experiment. We could change the incentive if we felt the free-month incentive didn’t work for whatever reason, or we could offer different types of incentives if we saw a high abandonment rate.”