Design Test for Stories Reaction Feature

In this mock interview, Apurvaa (Senior Data Scientist @ Instacart, ex-Amazon) answers the prompt, "You work at a social media company, and the product manager wants to test adding a new reaction feature to Stories. How would you measure whether it’s successful?"

This is an open-ended A/B testing/experiment design question. The interviewer is testing whether you can convert this somewhat vague question into a well-defined problem statement, establish scope, and come up with reasonable hypotheses and metrics using a structured framework. They also want to evaluate your understanding of the end-to-end A/B testing process and your familiarity with the underlying statistical concepts at each step.

Use the A/B testing framework to answer this question.

Step 1: Define the problem statement and scope

This is the most important part of the process. If you miss crucial details, you might work towards an incorrect solution. Make sure to align with the interviewer on key goals, assumptions, and details. The interviewer is looking for you to lead the discussion and establish a well-defined scope. Discuss the user journey and specific user segments to define the scope.

Clarifying questions:

  • Why is this experiment important for the company? What decisions will the experiment drive?
    • This is important to understand when deciding metrics. The experiment goal should also tie to the company’s KPIs and mission. Here, the interviewer turns the question back to the interviewee.
    • By introducing this new feature, users should become more engaged with Stories, feel more engaged and connected with their friends, and spend more time in the app in general. This ties into the primary mission of social media companies, which is to improve connectivity and bring the world closer together, so adding features that increase user engagement matters to the company.
  • How will this new reaction show up on the user’s app? How will they know about it? Does it change the user workflow in any way?
    • Here, we are told that when users want to react to a story, the new reaction will show up as an additional icon in the existing reaction menu (it will be as visible as the existing reactions). In addition, users for whom the new reaction is available will get a notification informing them about it when they first open the app. This doesn’t change the user workflow in any way.
  • Is this meant to be for all users? Example: if someone follows you but you don’t follow them back, would they have access to this new reaction as well? Any app versions/platforms restrictions?
    • Here, we are told it’s for all users, with no restrictions on app versions or platforms. You can be proactive here and mention some user segments that might be interesting to look at in the analysis phase. Example: distinguishing between celebrities, who have many followers but may not follow most of them back, and regular users, because their usage patterns differ. You can mention that this will be discussed in more detail later.

Step 2: Identify key metrics

The interviewer is looking for well-defined metrics (1-2 in each category) relevant to the scenario. Try to be as specific as possible (e.g. include the exact definition and time period when discussing the metrics). Consider whether a ratio metric or a continuous metric makes more sense.

Let’s identify the key metrics of interest:

Primary success metric: We are interested in measuring adoption of the new reaction as well as changes in user engagement with reactions and Stories. Relevant metrics include stories-with-reaction rate (# stories with a reaction / # stories viewed), new reaction adoption rate (# times the new reaction is used / # times users reacted to a story), interaction rate (# interactions / # stories viewed), and # stories created.

North Star metric: User engagement, measured by time spent in the app (minutes per day) and DAU (daily active users), is what social media companies typically care most about. We want to make sure the new reaction doesn’t hurt overall user engagement.

Guardrail metric: Reaction rate (# reactions / # stories viewed) is important to monitor since it can alert us to unexpected bugs, such as the new reaction causing existing reactions to stop working properly. If # stories viewed per user goes down, this could also indicate bugs, or that users find the new reaction annoying. Crash rate and bounce rate are also good guardrail metrics to ensure system health isn’t declining.

Secondary metric: Repeat usage (users who use the new reaction multiple times) is interesting to analyze to identify potential super users. Notification engagement rate (i.e. how users respond to the notification about the new reaction) can also help determine whether the notification is effective in informing users about the new reaction feature, or whether further iteration is needed.
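To make these definitions concrete, here is a minimal pandas sketch computing two of the ratio metrics above from a hypothetical event log (the table layout and column names are illustrative assumptions, not a real schema):

```python
import pandas as pd

# Hypothetical story event log; column names are illustrative assumptions.
events = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3, 2],
    "story_id": [10, 10, 11, 10, 11, 12, 12],
    "event":    ["view", "reaction", "view", "view",
                 "new_reaction", "view", "reaction"],
})

is_reaction = events["event"].isin(["reaction", "new_reaction"])

# Stories-with-reaction rate: # stories with a reaction / # stories viewed
stories_with_reaction_rate = (
    events.loc[is_reaction, "story_id"].nunique()
    / events.loc[events["event"] == "view", "story_id"].nunique()
)

# New reaction adoption rate: # times new reaction used / # times reacted
new_reaction_adoption_rate = (
    (events["event"] == "new_reaction").sum() / is_reaction.sum()
)

print(f"{stories_with_reaction_rate:.2f}, {new_reaction_adoption_rate:.2f}")
```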

Step 3: Select unit of randomization

Discuss the unit of randomization and triggering criteria that you’ll select to ensure representative samples.

This is an important technical aspect to discuss since the experiment design can vary quite a bit depending on the unit of randomization. The interviewer wants to understand if you are familiar with the key assumptions of an A/B test. This is particularly important for this experiment because of network effects, which mean we can’t randomize at the user level (the most common unit of randomization).

We can’t randomize at the user level because it can lead to inconsistencies in the user experience if some users in a user’s network are in Treatment while others are in Control. This also violates SUTVA (the Stable Unit Treatment Value Assumption), i.e. the assumption that the treatment of one unit does not affect the outcomes of another unit. If this assumption does not hold, it becomes harder to draw valid conclusions from the experiment. Discuss other ways to randomize, with the pros and cons of each.

  1. Geo randomization: Randomize by country or market, i.e. in some countries all users located there get the new reaction, while in other, similar countries none of the users get it. Pros: This mitigates the inconsistent user experience and SUTVA violation issues, and is relatively easy to implement technically. Cons: Finding similar countries or markets might be challenging since user behavior and usage of Stories can vary by market, and even if we find some, the sample size will probably be very small, making it difficult to adequately power the experiment. User location can change, and users might have network connections in multiple countries, leading to potential inconsistencies in the user experience.
  2. Cluster randomization: Randomize by user network cluster. Control: user clusters (i.e. users and everyone in their immediate networks) don’t get the new reaction; Treatment: all users in these clusters get it. The first step is to assign all users to clusters, then randomly assign each cluster to Control or Treatment. Pros: mitigates the inconsistent user experience and SUTVA violation issues. Cons: complex to implement, since we first need a robust clustering algorithm (see the sketch after this list for the basic mechanics).
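To illustrate the mechanics of option 2, here is a minimal sketch using networkx, with Louvain community detection standing in for whatever clustering algorithm the company actually uses (the graph and algorithm choice are assumptions for illustration; production social graphs need far more scalable methods):

```python
import random
import networkx as nx

# Stand-in friendship graph; in practice this comes from the social graph.
G = nx.karate_club_graph()

# Step 1: partition users into network clusters (Louvain is one common
# community-detection choice; requires networkx >= 2.8).
clusters = nx.community.louvain_communities(G, seed=42)

# Step 2: randomly assign each *cluster*, not each user, to an arm, so a
# user and their immediate network share the same experience.
random.seed(42)
assignment = {}
for cluster in clusters:
    arm = random.choice(["control", "treatment"])
    for user in cluster:
        assignment[user] = arm

counts = {arm: sum(v == arm for v in assignment.values())
          for arm in ("control", "treatment")}
print(counts)
```

One design caveat worth raising: users are correlated within clusters, so the effective sample size is smaller than the raw user count (a design effect), which feeds into the power analysis in Step 6.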

Triggering criteria: Assign users to the experiment when they open the app, and if technically feasible, only if their homepage has Stories available to view. This reduces noise/variance, which increases the power of the experiment.

Step 4: Formulate hypotheses

The alternative hypothesis is that the new reaction will increase the primary success metric by X%, without hurting guardrail metrics and with a neutral or slightly positive effect on the north star metric. We will also monitor secondary metrics to learn about user engagement, notification effectiveness, and stickiness of the new reaction.

The null hypothesis is the status quo, i.e. no change in key metrics. We will discuss the effect size X% later, when we discuss power analysis.
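Stated formally for a ratio metric such as stories-with-reaction rate (the notation is ours, not from the prompt; a two-sided test is the common default even when we expect an increase):

```latex
H_0\colon\ p_T - p_C = 0 \qquad\qquad H_1\colon\ p_T - p_C \neq 0
```

where p_T and p_C are the metric values in Treatment and Control, and the test is powered to detect an absolute difference of at least X% (the minimum detectable effect).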

Step 5: Select a statistical test

Consider the nature of the data and the hypothesis to select the appropriate statistical test.

Be prepared for follow-up questions about why you chose a particular test.

If we’re using an experimentation platform, it will have built-in statistical tests; typically a Z-test is used for both continuous and ratio metrics when sample sizes are large. It is also worth mentioning variance reduction here, because cluster randomization results in smaller effective sample sizes than user-level randomization. You’re usually not expected to discuss variance reduction techniques in detail, but there is published research on the topic.
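As a concrete example, here is a two-proportion Z-test on stories-with-reaction rate using statsmodels (the counts are made up; note that under cluster randomization the plain Z-test understates the variance, so in practice you would apply a cluster-level correction such as the delta method):

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: stories with at least one reaction, out of stories
# viewed, in each arm (treatment first).
reacted = [4_300, 4_050]
viewed = [50_000, 50_000]

stat, p_value = proportions_ztest(count=reacted, nobs=viewed)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # reject H0 if p < 0.05
```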

Step 6: Conduct power analysis

Discuss the inputs and output of the power analysis. Discuss the ramp-up strategy, if needed.

The interviewer is testing whether you know how the different inputs to power analysis affect the output, and how you use the output to estimate the duration of the experiment. They are also evaluating whether you have experience running an A/B test in a real world situation where risk management and practical constraints are very important.

The main inputs for power analysis include:

  • Effect size: typically based on input from the product team and other stakeholders. The effect size should be practically significant, i.e. it has a meaningful impact on the business.
  • Power: the probability of detecting a true effect, typically set to 0.8.
  • Significance level (alpha): the probability of a false positive, i.e. detecting an effect when there is none, typically set to 0.05.
  • Variance: typically estimated using historical data.

The output of the power analysis is the sample size required to detect the minimum detectable effect (MDE). From the sample size, we can estimate how long the experiment needs to run to collect it.

Other factors to consider when determining the duration are seasonality or variability in the metric (e.g. weekday user behavior might differ from weekend behavior) and app adoption, if the new feature requires users to update their mobile apps.
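Putting the inputs together, here is a sketch of the sample-size and duration calculation with statsmodels (the baseline rate, MDE, and traffic numbers are all invented for illustration; with cluster randomization you would further inflate the sample size by the design effect):

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed inputs: 8% baseline stories-with-reaction rate and a 0.1pp
# absolute lift as the minimum detectable effect (MDE).
baseline, mde = 0.08, 0.001
effect_size = proportion_effectsize(baseline + mde, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8,
    ratio=1.0, alternative="two-sided",
)

# Translate sample size into duration, rounding up to whole weeks to
# average over weekday/weekend seasonality.
daily_eligible_users = 100_000  # assumed traffic at 50% exposure
days = math.ceil(2 * n_per_arm / daily_eligible_users)
weeks = math.ceil(days / 7)
print(f"n per arm ~ {n_per_arm:,.0f}; run for ~{weeks} week(s)")
```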

To reach the 50% exposure needed for the required sample size, a gradual ramp-up strategy could be:

  • Day 1: Launch A/B test to 1% of population
  • Day 2: Monitor key metrics for anything unexpected. If everything looks good, ramp up to 10% and continue monitoring.
  • Day 3: Assuming nothing unexpected, ramp up to 50%.
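One common way to implement the ramp-up is deterministic hashing, so units keep their assignment as the percentage grows (the salt and bucket scheme below are illustrative assumptions; with the cluster design above you would hash cluster IDs rather than user IDs):

```python
import hashlib

def rollout_bucket(unit_id: str, salt: str = "new_reaction_v1") -> int:
    """Deterministically map a unit (user or cluster) to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_experiment(unit_id: str, rollout_pct: int) -> bool:
    # Low buckets enter first, so ramping 1% -> 10% -> 50% only ever
    # adds units; no one already exposed is dropped or reshuffled.
    return rollout_bucket(unit_id) < rollout_pct

print(in_experiment("user_42", rollout_pct=10))
```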

Step 7: Analyze test results

Talk through how and when you would analyze test results.

The interviewer is evaluating whether you are familiar with the nuances and potential pitfalls in running and analyzing an A/B test. This is very important since mistakes here can invalidate the test results and lead to incorrect conclusions.

Assuming the test has run for the required duration, we’d analyze the data overall and by relevant segments (e.g. user type: new vs. tenured users, celebrities vs. regular users, frequent vs. infrequent users; and platform).

It's pretty unusual for all your metrics to look perfect. For example, you might notice cannibalization effects, i.e. the new reaction might decrease usage of existing reactions even though your primary success metric increased. Further analysis of specific user segments can help inform what the next steps should be.
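Here is a sketch of that per-segment breakdown, again with made-up numbers (when slicing into many segments, remember to apply a multiple-comparisons correction such as Bonferroni, or you will find spuriously "significant" segments):

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Made-up per-segment aggregates of the primary metric's numerator
# and denominator.
df = pd.DataFrame({
    "segment": ["new", "new", "tenured", "tenured"],
    "arm":     ["treatment", "control", "treatment", "control"],
    "reacted": [900, 780, 3_400, 3_270],
    "viewed":  [10_000, 10_000, 40_000, 40_000],
})

for segment, grp in df.groupby("segment"):
    grp = grp.set_index("arm")
    stat, p = proportions_ztest(
        count=grp.loc[["treatment", "control"], "reacted"].to_numpy(),
        nobs=grp.loc[["treatment", "control"], "viewed"].to_numpy(),
    )
    print(f"{segment}: z = {stat:.2f}, p = {p:.4f}")
```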

Step 8: Evaluate and make recommendations

The interviewer is evaluating whether you consider factors beyond the test results when making a recommendation, since multiple factors come into play in a real-world setting.

If the impact on metrics is as expected, i.e. a statistically significant positive change in the primary success metric, a neutral or positive north star metric, and no significant regression in the guardrail metrics, the feature can be shipped to 100% of the population.

If there is a trade-off, further discussion with stakeholders is needed. Company leadership may be okay with a short-term tradeoff, as long as the north star metric isn’t regressing. Other factors to consider are novelty effects and the risk of shipping.

In some cases, we might want to keep the experiment running longer before we’re comfortable making a final decision. Alternatively, we may run a follow-up experiment. Example: if the new feature is very popular with a small percentage of users but not used by the majority, we could run follow-up UX studies to understand why and identify design changes that we can test in future iterations of the product.