Design an ETA System for a Maps App
In this interview, Stefan designs a system that predicts the estimated time of arrival (ETA) for a maps app.
A special challenge to call out here is distribution shift. The situation on a given road or neighborhood changes daily, so the model must be robust to high data variability. All ML models need to account for this, but distribution shift is particularly relevant for this problem.
Related questions in Exponent's ML system design question bank include:
- Design an automated comment moderation system
- Design an ML monitoring system (drift, performance, outliers, and quality) for a fantasy sports app
The depth and topics of a real interview will vary, depending on the interview length and prompts from the interviewer. Similarly, the written solution below takes a more detailed approach than the video. Study both resources to understand how different approaches and levels of detail could be given for the same interview prompt.
Answer framework
While our framework can apply to any ML system design interview question, this mock features a slightly adjusted framework that’s tailored to the given prompt. As you become more familiar with answering ML system design questions, you can also practice adjusting the framework when the prompt calls for a more nuanced approach. The alternative framework used for this mock involves 6 steps:
- Step 1: Clarify data acquisition
- Step 2: Bridge the problem space and data space
- Step 3: Parametrize the inference function
- Step 4: Train learned functions
- Step 5: Validate the overall approach
- Step 6: Deploy the model
Step 1: Clarify data acquisition
Let’s start by making sure we have access to all the data we need. In the problem setup, we know that:
- The map, System 1, contains all necessary information about roads:
- for each road segment,
- distance
- speed_limit
- free_flow_speed (the fastest that cars typically travel)
- priority_class (e.g. major highway, local road, etc.)
- The historical travel data, System 3, contains all necessary information about how people travel on roads:
- for each (segment, 2m_interval),
- num_cars
- avg_speed
- The shortest paths functionality finds the shortest path, given a weighted graph
- No additional labeling is required
To ensure we understand the overall function of the product, we’ll ask some clarifying questions:
- Is this product intended to work everywhere, or in some particular region?
- In terms of product requirements, can we assume access to low-latency infrastructure (e.g., compute nodes, a key-value store)?
Based on the interviewer’s responses, we now understand that:
- We’ll predict data for the whole world
- We have access to a high-performance cluster of CPUs and a key-value store
Now that we understand the overall product infrastructure and environments, let’s create a high-level plan.
The key aspects of the ML problem include:
- Data: cleaning the map and travel data in preparation for basic training
- Model: parametrizing the inference function and discussing the interface
- Training: engineering features for the model
- Validation: setting up the train-val split
- Deployment: clarifying how the model will be used in the product
Now that we’ve created a v1 of the system design, we can start filling in the details.
Step 2: Bridge the problem space and data space
So far, the raw data can be organized into 2 tables.
Table 1, “map_table,” which has the following columns:
- segment_id (index column)
- distance
- speed_limit
- free_flow_speed
- priority_class
Table 2, “travel_table_1,” which, per the (segment, 2m_interval) keying above, has the following columns:
- segment_id (index column)
- year (index column)
- interval_within_year (index column)
- num_cars
- avg_speed
The raw data from the systems above is not exactly convenient for our purposes. When solving this problem, we don’t want to deal with large sets of unstructured records.
The data tables should be >99% correct to produce high quality results. One automated cleaning method would be to run a filter on the data tables to eliminate rows with null or invalid values, like negative numbers.
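As a minimal sketch of what such an automated filter could look like (the function names and the row representation here are illustrative, not part of the described systems):

```python
def is_valid_row(row, numeric_columns):
    """Return True if every listed column is present and non-negative."""
    for col in numeric_columns:
        value = row.get(col)
        if value is None or value < 0:
            return False
    return True

def clean_table(rows, numeric_columns):
    """Keep only rows that pass the validity filter."""
    return [row for row in rows if is_valid_row(row, numeric_columns)]
```

The same predicate could also run as a WHERE clause in SQL or as a Filter step in a data pipeline; the point is that the check is mechanical and can be applied uniformly to both tables.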
We also want to have more convenient data repositories that we can index on, so we’ll create some JOIN tables that will help us work with the data in a more effective way.
Table 3, “travel_table_2,” will have the following columns:
- segment_id (index column)
- interval_within_week (index column)
- num_cars
- avg_speed
We can create this downstream table in 2 ways. One method is to write a SQL query that creates this table all in one go. The other method, which we’ll select, is to create an offline data pipeline that defines a task in Python. The task would read each of the records in “travel_table_1” and output the columns in “travel_table_2.” This would happen in Apache Beam or some other MapReduce-related framework. This data pipeline wouldn’t be run often, but rather something we’d run to cache intermediate results for future queries.
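A minimal sketch of such a pipeline task in plain Python (Beam specifics omitted; `to_interval_within_week` and `remap_record` are hypothetical names, and the arithmetic assumes 2-minute intervals counted from midnight on January 1, with no leap-second or DST handling):

```python
import datetime

INTERVALS_PER_DAY = 24 * 60 // 2            # 2-minute intervals per day
INTERVALS_PER_WEEK = 7 * INTERVALS_PER_DAY  # 5040 intervals per week

def to_interval_within_week(year, interval_within_year):
    """Map (year, interval_within_year) to the matching 2-minute
    interval within the week, with Monday 00:00 as interval 0."""
    start_weekday = datetime.date(year, 1, 1).weekday()  # Monday = 0
    offset = start_weekday * INTERVALS_PER_DAY
    return (offset + interval_within_year) % INTERVALS_PER_WEEK

def remap_record(record):
    """One pipeline task: read a travel_table_1 record and emit the
    corresponding travel_table_2 row, re-keyed by week interval."""
    out = dict(record)
    out["interval_within_week"] = to_interval_within_week(
        out.pop("year"), out.pop("interval_within_year")
    )
    return out
```

In Beam, `remap_record` would become the body of a `DoFn` applied via `ParDo`, followed by a group-by on the new key.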
Now that we have a table indexed by a weekly time interval, we can also create an online data processing pipeline to compute the mean m, i.e. ETA, in the following table.
Table 4, “record_table,” will have the following columns:
- segment_id
- interval_within_week
- num_cars (aggregate)
- avg_speed (aggregate)
- ETA
These records map from (road, time) to ETA, so we’ll be able to use them for training and validation of our functionality. To get the overall num_cars, we’ll sum num_cars across every interval. To get the overall avg_speed, we’ll take a weighted average of all the average speeds, where the weights are proportional to num_cars. Then we’ll compute the ETA by dividing the distance of the segment by the avg_speed.
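These aggregation rules can be sketched as follows (a simplified illustration; `build_record` is a hypothetical helper, and distance and speed are assumed to use consistent units so their ratio is a time):

```python
def build_record(segment_distance, interval_records):
    """Aggregate all (num_cars, avg_speed) observations for one
    (segment_id, interval_within_week) key into a record_table row."""
    total_cars = sum(r["num_cars"] for r in interval_records)
    # Weighted average speed, with weights proportional to num_cars
    weighted_speed = sum(
        r["num_cars"] * r["avg_speed"] for r in interval_records
    ) / total_cars
    return {
        "num_cars": total_cars,
        "avg_speed": weighted_speed,
        "ETA": segment_distance / weighted_speed,
    }
```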
Step 3: Parametrize the inference function
Once we’ve got our data set up in a usable state, we’ll parametrize our inference function so we can learn it.
First, let’s define the interface by defining an inference function. To start, let’s say the input is a road segment’s ID and the 2-minute interval within the week.
def f(segment_id, interval_within_week) → ETA
We’ve opted to use the same interval index across weeks, rather than unique intervals for each week, so the model can exploit weekly patterns in the data. This v1 function provides a good baseline built on simple statistics over historical data.
Step 4: Train learned functions
Now we need to figure out its learned parameters, i.e. train the model. We’ll use a simple parametrization formula that predicts the travel time on a given road at a given time during the week, using the mean historical value:
ETA = f(segment_id, interval_within_week) = m
In this context, training is simple: we’ll compute the historic mean ETA for each (segment_id, interval_within_week) pair and store it in a dictionary that we look up at inference time.
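A minimal sketch of this training and lookup logic (the record layout is a hypothetical flattening of record_table, assuming each record already carries its computed ETA):

```python
from collections import defaultdict

def train(records):
    """'Training' here is computing the historic mean ETA per
    (segment_id, interval_within_week) key into a dictionary."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r["segment_id"], r["interval_within_week"])
        sums[key] += r["ETA"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def f(model, segment_id, interval_within_week):
    """Inference is a dictionary lookup: ETA = m."""
    return model[(segment_id, interval_within_week)]
```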
Step 5: Validate the overall approach
Next, we’ll validate the function. Validation is an often overlooked part of ML system design. Because our data spans time, we’ll use backtesting: the model is evaluated on later data while being trained only on earlier data, just as it will be when deployed. Since the model predicts future ETAs, we’ll compute metrics that assess how accurately it does so.
Let’s start with the train-val split. Two common ratios are 80-20 and 90-10, and there are arguments for both. With 90-10, the model has more training data, so it will probably produce more accurate results. With 80-20, the model has more validation data, so the validation process will probably be more reliable. Here we see a tradeoff between the two options.
For a relatively complicated function with little data, a 90-10 split would probably be a better fit. However, for this problem, we have more than enough training data and a relatively simple model, so we should be more concerned about validation. Therefore, we’ll do an 80-20 train-val split, where 20% of the months will be randomly selected as the validation dataset.
For metric computation, we’ll determine how to compute metrics for just one record, identified by (segment_id, (year, interval_within_year)). First, we’ll compute pred_eta, using only the train records that precede the record we’re computing metrics for. Then, we’ll fetch true_eta. Lastly, we’ll compute the absolute difference between pred_eta and true_eta.
Then, we’ll aggregate the metrics to calculate the mean, variance, and inter-quartile range.
In sum, we’ve set up the validation to perform the following:
- Perform an 80-20 train-val split on the month level
- Make a model using the train records
- Compute a distance metric on each of the val records (the higher the metric, the less accurate the prediction)
- Compute statistics about those metrics
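The metric computation and aggregation steps above could be sketched as follows (illustrative only; this assumes a trained model dictionary as in Step 4, and uses Python's statistics module to get the interquartile range):

```python
import statistics

def absolute_errors(model, val_records):
    """Distance metric per validation record: |pred_eta - true_eta|."""
    return [
        abs(model[(r["segment_id"], r["interval_within_week"])] - r["ETA"])
        for r in val_records
    ]

def summarize(errors):
    """Aggregate per-record metrics into mean, variance, and IQR."""
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return {
        "mean": statistics.mean(errors),
        "variance": statistics.variance(errors),
        "iqr": q3 - q1,
    }
```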
Step 6: Deploy the model
The final component here is figuring out how we’ll use this system. This system is relatively simple to deploy because not much computation is needed for inference. We’re just doing lookups.
During deployment, the production function will use all available historic data. The high-performance key-value store would be used to store the function. The actual user application would call an ETA backend that uses 2 key components:
- The ETA function, which assigns ETAs to nearby road segments
- Shortest paths functionality, which computes the best path and its ETA
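As an illustration of how these two components could interact, here is a sketch where Dijkstra's algorithm uses per-segment ETAs fetched from the key-value store as edge weights (the graph representation and all names here are assumptions for the sketch, not the actual system's API):

```python
import heapq

def route_eta(graph, eta_store, interval_within_week, source, target):
    """Shortest-path ETA: Dijkstra where each edge (road segment) is
    weighted by its predicted ETA.
    graph: node -> list of (neighbor, segment_id)
    eta_store: (segment_id, interval_within_week) -> ETA"""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, segment_id in graph.get(node, []):
            new_cost = cost + eta_store[(segment_id, interval_within_week)]
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None  # target unreachable
```

In production the `eta_store` lookup would be a call to the high-performance key-value store, and the graph search would run over the map's road network rather than an in-memory dict.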
Interview improvements
This interview could be improved in some key ways:
- The inference function could be made more powerful. A function that uses a weighted average, with high granularity in the recent past and lower granularity in the more distant past, would perform better.
- Similarly, the validation metrics could be split into “more recent” and “less recent” buckets, since more recent data is more important.
- Lastly, data cleaning and filtering methods would improve the model performance. The solution assumes the data is perfect, but this is never the case.
Other considerations
Some meaningful modifications to explore include:
- Determining how the “shortest paths” and “predict travel time along segment” pieces of the functionality can interact.
- Improving the base setup by using a more general representation of road segments. In this setup, each segment has an ID, but it could be more meaningful to build on top of something like a road embedding.