How to Answer Power Analysis Questions

Power analysis usually comes up in an interview when the interviewer wants to assess your conceptual knowledge about the key inputs.

Example questions include:

  • “What are the key inputs to run power analysis?”
  • “How does the required sample size change if we increase the power?”

The actual power analysis calculation is usually done using online sample size calculators, but you should know the 4 key inputs to the calculation and how they affect the sample size.

Below, we’ll describe the 4 key power analysis inputs. Then, we’ll explain how to use the output to estimate the experiment’s duration.

The 4 key inputs

Effect size or minimum detectable effect (MDE)

Effect size is the magnitude of the difference or relationship between variables that you are interested in detecting. MDE refers to the smallest effect size you want the test to be able to detect.

It's good practice to align with the product team and other stakeholders to understand what constitutes a meaningful effect from a business perspective and is feasible in terms of sample size and resource requirements. Aim for an MDE that strikes a balance between sensitivity to detect meaningful effects and feasibility. The smaller the MDE, the bigger the sample needed.

If you want your test to be able to detect a >=1% change in your key metric, the MDE is 1%. A smaller MDE of 0.5% will require a larger sample size and may not be a meaningful enough change for the business. A larger MDE of 2% requires a smaller sample size, but it’s often easier to make smaller and more incremental product changes compared to larger ones, so an MDE of 2% might be too high.
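The MDE-to-sample-size relationship can be sketched with the standard normal-approximation formula for a two-sample test of means. This is a minimal illustration, not a production calculator; the function name and the example variance of 0.09 are assumptions for demonstration:

```python
from statistics import NormalDist

def sample_size_per_group(mde, variance, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test of means.

    Normal-approximation formula:
        n = 2 * variance * (z_{1-alpha/2} + z_{power})^2 / mde^2
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided significance threshold
    z_power = z(power)           # quantile for the desired power
    return 2 * variance * (z_alpha + z_power) ** 2 / mde ** 2

# Smaller MDE -> larger sample; halving the MDE quadruples n:
for mde in (0.02, 0.01, 0.005):
    print(f"MDE={mde:.3f}: n per group ≈ {sample_size_per_group(mde, variance=0.09):.0f}")
```

Since n scales with 1/MDE², halving the MDE quadruples the required sample size, which is why very small MDEs quickly become infeasible.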

Significance level (α)

The probability of a false positive, i.e., the probability of rejecting the null hypothesis when it is actually true. Commonly used significance levels include 0.05 and 0.01, indicating a 5% and 1% chance of a Type I error, respectively.

Power (1 − β)

The probability of correctly rejecting the null hypothesis when it is false. In other words, power represents the ability of a study to detect a true effect. Common values are 0.8 and 0.9. For a fixed sample size, there is a tradeoff between power and significance level: tightening α (fewer false positives) reduces power, and raising power requires either a larger sample or a more lenient α.
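To make the cost of higher power (or a stricter α) concrete: in the normal approximation, the required sample size scales with the factor (z₁₋α/₂ + z_power)². A small sketch (the helper name is illustrative):

```python
from statistics import NormalDist

def z_sum_squared(alpha, power):
    """(z_{1-alpha/2} + z_{power})^2 — the factor that scales the sample size."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2

base = z_sum_squared(alpha=0.05, power=0.80)
more_power = z_sum_squared(alpha=0.05, power=0.90)
stricter_alpha = z_sum_squared(alpha=0.01, power=0.80)

print(f"power 0.8 -> 0.9 inflates n by {more_power / base - 1:.0%}")
print(f"alpha 0.05 -> 0.01 inflates n by {stricter_alpha / base - 1:.0%}")
```

Raising power from 0.8 to 0.9 at α = 0.05 inflates the required sample by roughly a third, and tightening α from 0.05 to 0.01 costs even more.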

Variance

Variance is the variability of the data within each group. The higher the variance, the larger the sample size needed. Variance reduction methods such as Controlled-experiment Using Pre-Experiment data (CUPED) can be used to reduce the required sample size.

Use Lehr's (rough) rule to remember how effect size and variance affect the sample size. It says that the per-group sample size n for a two-sided two-sample t-test with 80% power and significance level α = 0.05 should be:

n ≈ 16 s² / d²

where s² is an estimate of the population variance and d = μ₁ − μ₂ is the effect size, i.e. the to-be-detected difference in the mean values of the two samples.
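As a sanity check on Lehr's constant: the exact normal-approximation factor 2 (z₁₋α/₂ + z_power)² at α = 0.05 and 80% power works out to about 15.7, which Lehr rounds up to 16:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
# Exact normal-approximation constant for alpha = 0.05 (two-sided), power = 0.80:
constant = 2 * (z(0.975) + z(0.80)) ** 2
print(round(constant, 1))  # → 15.7, which Lehr's rule rounds up to 16
```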

Using the output

Once you have the required sample size (n) from the power analysis, you should also know how to estimate the duration of the test/experiment. The test should be run long enough to gather enough samples to meet the sample size requirement.

To estimate the duration:

  1. Determine how quickly you can collect data from each user or observation in your test (data collection rate). This could be measured in users per day, observations per hour, etc., depending on the test.
  2. Then, estimate the test duration using this formula: Estimated test duration = (n × number of test cells) / data collection rate

If you need 1000 users per test cell, and based on past traffic data you get ~100 users per day, then for a 2-cell test (aka a two-sample test) you will need to run the test for 1000 × 2 / 100 = 20 days.

Other key factors to consider when deciding the experiment or test duration include:

  • Seasonality and trends: Take into account any seasonal patterns or trends in your data that may affect the results of the experiment. It's important to run the test for a duration that captures a representative sample of your audience across different time periods. Example: weekdays vs weekends, accounting for mobile app adoption time if you’re testing new versions of an app which requires users to update their app versions.
  • Business cycle: Consider the typical buying or engagement cycle of your audience. Depending on your business model, it may be necessary to run the test for a duration that spans multiple cycles to account for variations in behavior. Example: to measure changes in monthly user retention, the test probably needs to be run for >1 month.
  • Experiment duration constraints: Factor in any practical constraints or limitations on the duration of the experiment, such as budget constraints, timeline restrictions, or external deadlines.
  • Contingencies: It's also a good idea to build in some buffer time for unexpected delays or complications that may arise during the experiment.
Senior candidates should additionally emphasize the practical constraints and trade-offs involved in determining sample size, such as budget, time constraints, or logistical challenges, and proactively discuss ways to decrease the required sample size (for example, variance reduction techniques such as CUPED or matching).