Design Test for Driver Matching Algorithm
In this mock interview, Apurvaa (Senior Data Scientist @ Instacart, ex-Amazon) answers the prompt, "You work at a ride-sharing company, and your team wants to test a new rider-driver matching algorithm that should reduce match times. How would you test this?"
This is an open-ended A/B testing/experiment design question. The interviewer is testing whether you can convert this somewhat vague question into a well-defined problem statement, establish scope, and come up with reasonable hypotheses and metrics using a structured framework. They also want to evaluate your understanding of the end-to-end A/B testing process and your familiarity with the underlying statistical concepts at each step.
Use the A/B testing framework to answer this question.
Step 1: Define the problem statement and scope
This is the most important part of the process. If you miss crucial details, you might work towards an incorrect solution. Make sure to align with the interviewer on key goals, assumptions and details. The interviewer is looking for you to lead the discussion and establish a well-defined scope. Discuss the user journey and specific user segments to define the scope.
Clarifying questions:
- Why is this experiment important for the company? What decisions will the experiment drive?
- This is important to understand when deciding metrics. The experiment goal should also tie to the company’s KPIs and mission. Here, the interviewer turns the question back to the interviewee.
- By improving the matching algorithm, users will get rides faster, which improves their experience and makes them more likely to become repeat customers. It should also increase the efficiency of the system as a whole, since reduced matching time leaves room for more rides overall. This ties into the primary mission of ride-sharing companies: to improve transportation and make it easier and more efficient to go from point A to point B. So a feature that reduces the time a rider spends waiting for their ride and makes the system more efficient in general is clearly important for the company.
- Clarifying questions about the new algorithm: I assume this is a backend change and the rider/driver won’t know about it in the experiment? Is this the current workflow: rider requests a ride and has to wait for some time to be matched with a driver, and is sent a notification once they have been matched? Does this new algorithm change the rider/driver workflow in any way?
- Here, we are told that this is purely a backend change, so the new algorithm won’t be visible to riders/drivers. This doesn’t change the rider/driver workflow in any way. The interviewer asks a follow-up question here: We are considering informing riders/drivers about this faster algorithm eventually though. Do you think that should be a part of the experiment?
- Discuss relevant hypotheses here: that's interesting because there is a psychological aspect to this as well, i.e. if you tell users you've made matching faster, they may be more willing to wait. It could also backfire if the new algorithm isn't actually that much faster. In either case, this is a confounding factor that we can test as a follow-up if we see good results from the current experiment.
- Is this meant to be for all types of rides? For example, most ride-sharing companies offer different categories: priority vs. standard vs. wait-and-save. Are there any app version or platform restrictions?
- Here, we are told it’s for standard rides only, and no app version/platform restrictions.
- Any particular segments of interest?
- Here, the interviewer turns the question back to the interviewee. Discuss some segments that could be of interest. Example: distinguishing between times of high demand and low supply (where matching times might be longer and the new algorithm is likely to be more impactful) vs low demand high supply periods. We can discuss this in more detail when we talk about the analysis phase.
Step 2: Identify key metrics
The interviewer is looking for well-defined metrics (1-2 in each category) relevant to the scenario. Try to be as specific as possible (e.g. include the exact definition and time period when discussing the metrics). Consider whether using a ratio metric or continuous metric makes more sense.
Let’s identify the key metrics of interest:
Primary success metric: We are interested in measuring the effectiveness of the new matching algorithm as well as its impact on overall system efficiency. Average match time per ride (clarify with the interviewer whether seconds is the appropriate unit here), average total time per ride and driver utilization (minutes spent driving/minutes available) are relevant metrics. If the new algorithm is successful in reducing ride wait time in most cases, we should also expect ride request rate i.e. ride requests/app opens to go up.
North Star metric: Total number of rides and gross bookings (the total dollar value of transactions invoiced to rideshare riders, plus any applicable taxes, tolls, and fees) are key metrics that rideshare companies typically care about. We want to make sure the new algorithm doesn’t hurt overall usage and revenue of the app.
Guardrail metric: Ride cancellation rate due to no match can alert us to unintended consequences of the new matching algorithm. Measuring instances where the driver's wait time at pickup exceeds the standard threshold (typically 5 minutes) can tell us whether riders accustomed to longer match times aren't ready when matched quickly, so that shorter match times inadvertently result in longer waits for the driver. Crash rate and latency are also good guardrail metrics to ensure system health isn't declining.
Secondary metric: Ride feedback (average rating provided by both riders and drivers) can help us understand if riders and drivers are noticing the match time improvements and if that’s improving the user experience. Driver availability i.e. time marked available by drivers, could also help understand if drivers are changing their behavior in response to faster match times and lower overall ride time (better system efficiency).
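To make these metric definitions concrete, here is a minimal pandas sketch of how a few of them could be computed from hypothetical rides and driver-session tables. The table and column names are assumptions for illustration only, not an actual schema:

```python
import pandas as pd

# Hypothetical schemas (illustrative only):
# rides: ride_id, requested_at, matched_at, cancelled_no_match (bool),
#        driver_wait_min (minutes the driver waited at pickup)
# driver_sessions: driver_id, minutes_available, minutes_driving

def experiment_metrics(rides: pd.DataFrame, driver_sessions: pd.DataFrame) -> dict:
    matched = rides.dropna(subset=["matched_at"])
    match_time_sec = (matched["matched_at"] - matched["requested_at"]).dt.total_seconds()

    return {
        # Primary: average match time per ride, in seconds
        "avg_match_time_sec": match_time_sec.mean(),
        # Primary: driver utilization = minutes driving / minutes available
        "driver_utilization": driver_sessions["minutes_driving"].sum()
                              / driver_sessions["minutes_available"].sum(),
        # Guardrail: share of ride requests cancelled because no match was found
        "no_match_cancel_rate": rides["cancelled_no_match"].mean(),
        # Guardrail: share of matched rides where the driver waited > 5 min at pickup
        "long_driver_wait_rate": (matched["driver_wait_min"] > 5).mean(),
    }
```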
Step 3: Select the unit of randomization
Discuss the unit of randomization and triggering criteria that you’ll select to ensure representative samples.
This is an important technical aspect to discuss since the experiment design can vary quite a bit depending on the unit of randomization. The interviewer wants to understand if you are familiar with the key assumptions of an A/B test. This is particularly important for this experiment because of network effects, which means we can’t randomize at the user level (most common unit of randomization).
We can’t randomize at the rider level because of spillover effects since the same driver supply is shared across riders (Control and Treatment in this case). This is also a violation of SUTVA (Stable Unit Treatment Value Assumption) i.e. the treatment of one unit does not affect the outcomes of another unit. Example: if 2 riders request a ride and only 1 driver is available equidistant from both, the rider in Treatment has higher chance of matching because of the improved algorithm, and so the experience of the other user in Control depends on what Treatment rider does. If the SUTVA assumption does not hold, it makes it harder to draw valid conclusions from an experiment. Discuss other ways to randomize with the pros and cons of each.
- Geo randomization: Randomize by city or region i.e. in some cities, all users located in those cities get the new algorithm, while in other similar cities none of the users get the new algorithm. Pros: This mitigates the SUTVA violation issues, and is relatively technically easy to implement. Cons: Finding similar cities might be challenging since traffic patterns and other external factors that affect ride times such as weather, events, etc could vary by city. Even if we do account for some of these factors, the sample size will probably be very small making it difficult to adequately power the experiment.
- Switchback testing: Switch between test and control treatments based on time, rather than randomly splitting the population. At any given time, everyone in the same network receives the same treatment. Over time, we flip between test and control and collect metrics for both, which we can compare to evaluate the impact of our change.
- Discuss with the interviewer what an appropriate time interval to switch is. The time interval should be long enough to capture the effect we want to measure, but also short enough to allow for as many test and control samples as possible.
- In addition to time intervals, it’s also possible to segment users into independent clusters, such as cities. The treatment at each interval can then be assigned separately for each cluster. This provides better sampling of test and control at different times of day and increases our total test and control data points, which helps reduce the variance and increase the power of the test.
- Finally, it may also be helpful to apply burn-in and burn-out periods to exclude users who are exposed to the experiment at a time near the switching boundary, since they may be influenced by both the test and control experiences. Say a rider opens their app at 9:03 am and receives the control treatment. Only 3 minutes have passed since the switch from test to control, so driver availability is still heavily impacted by the spillover effect of the test treatment during the previous hour. We want to exclude this user's metrics from the analysis to avoid cross-contamination of test and control effects. For example, a switchback window of 1 hour with 5-minute burn-in and burn-out periods would only include exposures that fire during the middle 50 minutes of the window (see the sketch below for one way assignment and exclusion could work). Pros: mitigates the SUTVA violation issues and reduces bias from external confounding factors such as traffic. Cons: complex to implement, since we first need a robust switchback algorithm, and there is still a violation of the independence-of-observations assumption. To give a concrete example, the average match time in one area during one 10-minute chunk of time is highly correlated with the average match time in the same area during the next 10-minute chunk, much more so than one ride request is correlated with the next. We can discuss this in more detail when we discuss statistical tests.
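To make the switchback mechanics above concrete, here is a minimal sketch of deterministic per-cluster assignment over 1-hour windows with 5-minute burn-in/burn-out exclusion. The hashing scheme, salt, and parameter values are illustrative assumptions, not a production design:

```python
import hashlib
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)      # switchback window length
BURN = timedelta(minutes=5)      # burn-in / burn-out at each boundary

def window_start(ts: datetime) -> datetime:
    # Truncate the timestamp to the start of its 1-hour window.
    return ts.replace(minute=0, second=0, microsecond=0)

def assignment(cluster_id: str, ts: datetime, salt: str = "match_algo_v2") -> str:
    # Deterministically hash (cluster, window, salt) so every exposure in the
    # same cluster and window receives the same treatment.
    key = f"{salt}:{cluster_id}:{window_start(ts).isoformat()}"
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 2
    return "treatment" if bucket == 0 else "control"

def in_analysis_window(ts: datetime) -> bool:
    # Exclude exposures within 5 minutes of a switch boundary, since driver
    # supply is still contaminated by the previous period's treatment.
    offset = ts - window_start(ts)
    return BURN <= offset <= WINDOW - BURN

# Example: a rider opens the app in cluster "sf_bay" at 9:03 am.
ts = datetime(2024, 5, 1, 9, 3)
print(assignment("sf_bay", ts))   # same arm for everyone in sf_bay, 9-10 am
print(in_analysis_window(ts))     # False: within the 5-minute burn-in
```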
Triggering criteria: Assign users to the experiment when they open the app, not when they request a ride since lower match time could affect their decision to request a ride and lead to biased results.
Step 4: Formulate hypotheses
The alternative hypothesis is that the new algorithm will improve the primary success metric (i.e., reduce average match time) by X%, while not hurting guardrail metrics and producing a neutral or slightly positive change in the north star metric. We will also monitor secondary metrics to learn about user feedback and whether the change drives higher driver availability.
The null hypothesis is status quo i.e. no change in key metrics. We will discuss the effect size X% later, when we discuss power analysis.
Step 5: Select a statistical test
Consider the nature of the data and the hypothesis to select the appropriate statistical test.
Be prepared for follow-up questions about why you chose a particular test.
If we’re using an experimentation platform, it will have built-in statistical tests; typically the Z test is used for both continuous and ratio metrics at large sample sizes. In this case, a standard Z test is not recommended because it would likely lead to an underpowered experiment, since the total number of units is low. For example, a 2-week experiment with a 1-hour switchback window and 5 clusters gives only 14 × 24 × 5 = 1,680 total units. The violation of the independent-observations assumption also makes results from standard Z or t tests unreliable. We can use the Delta method to calculate a variance that accounts for the non-independence and then apply a t test. A better (more easily generalizable) alternative here is bootstrapping, especially since we might be interested in the median time to match rather than the mean, or in the impact on the 1st and 99.9th percentiles of match time, in which case standard statistical tests don't apply since the Central Limit Theorem doesn't hold. The bootstrapped confidence intervals are obtained as follows (a code sketch follows the steps below):
- Collect a bootstrap sample with replacement from the set of test buckets and separately from the set of control buckets.
- Calculate the difference in means between test and control samples.
- Repeat steps one and two many (e.g. thousands of) times. This gives us a distribution of the metric deltas. The 95% confidence interval is the range from the 2.5% quantile to the 97.5% quantile of that distribution of deltas.
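A minimal NumPy sketch of these three steps, assuming test and control are arrays of per-bucket match times (e.g. one value per switchback window and cluster); using the median and 1,000 resamples are illustrative choices:

```python
import numpy as np

def bootstrap_ci(test, control, stat=np.median, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the test-minus-control difference in `stat`."""
    rng = np.random.default_rng(seed)
    test, control = np.asarray(test), np.asarray(control)

    deltas = np.empty(n_boot)
    for i in range(n_boot):
        # Step 1: resample test and control buckets independently, with replacement.
        t = rng.choice(test, size=test.size, replace=True)
        c = rng.choice(control, size=control.size, replace=True)
        # Step 2: difference in the statistic (median match time here).
        deltas[i] = stat(t) - stat(c)

    # Step 3: the 2.5% and 97.5% quantiles of the deltas form the 95% CI.
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Evidence of a reduction in match time would be a CI entirely below zero:
# lo, hi = bootstrap_ci(test_match_times, control_match_times)
```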
Step 6: Conduct power analysis
Discuss the inputs and output of the power analysis. Discuss the ramp-up strategy, if needed.
The interviewer is testing whether you know how the different inputs to power analysis affect the output, and how you use the output to estimate the duration of the experiment. They are also evaluating whether you have experience running an A/B test in a real world situation where risk management and practical constraints are very important.
In this case, since our recommendation is to use bootstrapping rather than a standard statistical test, power analysis is not needed. The rule of thumb for bootstrapping is to have > 1000 samples.
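For completeness, had we instead gone with a standard t/Z test on per-ride match times, power analysis would look roughly like the sketch below, which uses statsmodels to show how the inputs (effect size, significance level, power) map to a required sample size. The numbers are placeholders, not estimates from real data:

```python
from statsmodels.stats.power import TTestIndPower

# Placeholder inputs: a 5-second reduction against a 50-second standard
# deviation gives a standardized (Cohen's d) effect size of 0.1.
effect_size = 5 / 50
n_per_arm = TTestIndPower().solve_power(effect_size=effect_size,
                                         alpha=0.05, power=0.8,
                                         alternative="two-sided")
print(round(n_per_arm))  # rides needed in each arm to detect the effect
```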
Step 7: Analyze test results
Talk through how and when you would analyze test results.
The interviewer is evaluating whether you are familiar with the nuances and potential pitfalls in running and analyzing an A/B test. This is very important since mistakes here can invalidate the test results and lead to incorrect conclusions.
Assuming the test has run for the required duration, we’d analyze the overall metrics and relevant segments (e.g. new vs. tenured users, frequent vs. infrequent users, high-demand/low-supply vs. low-demand/high-supply periods, and platform). As discussed previously, it would also be useful to look at the distribution of match time, since the average may have decreased while edge cases got worse, which could have a disproportionately negative impact on some users’ experience.
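A short pandas sketch of the kind of segment-level distribution check described above, assuming an exposures table with one row per analyzed ride and illustrative column names for variant, segment, and match time:

```python
import pandas as pd

def match_time_percentiles(exposures: pd.DataFrame) -> pd.DataFrame:
    """Match-time percentiles by variant and segment, not just the mean."""
    return (
        exposures
        .groupby(["segment", "variant"])["match_time_sec"]
        .quantile([0.50, 0.95, 0.99])
        .unstack()                      # one column per percentile
        .rename(columns={0.50: "p50", 0.95: "p95", 0.99: "p99"})
    )

# Example segments: high-demand/low-supply vs. low-demand/high-supply periods,
# new vs. tenured riders, platform. A lower mean with a worse p99 in some
# segment would be a red flag worth a follow-up experiment.
```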
Step 8: Evaluate and make recommendations
The interviewer is evaluating whether you consider factors other than just the test results when making a recommendation since multiple factors are also considered in a real-world setting.
If the impact on metrics is as expected, i.e. a statistically significant improvement in the primary success metric, a neutral or positive north star metric, and no significant regression in the guardrail metrics, the new algorithm can be shipped to 100% of the population.
If there is a trade-off, further discussions with stakeholders are needed. The leadership at the company may be okay with a short term tradeoff, as long as the north star metric isn’t regressing. Other factors to consider are novelty effects and the risk of shipping.
In some cases it's also possible that we might want to keep the experiment running for a longer time before we're comfortable with making the final decision. Alternatively, we may do a follow-up experiment. Example: if we find that the new algorithm improves average match time but worsens it for >99th percentile, we can follow up with the ML/algorithms team about guardrails and test future iterations of the algorithm before deciding to launch.