How to Answer ML System Design Questions
ML system design interviews typically last 45 minutes to 1 hour.
During the interview, the interviewer (typically an ML engineer) asks you to design a system end-to-end, including pre-processing the data, training and evaluating the model, and deploying the model.
These questions assess your ability to consider real-world aspects of productionizing an ML model, such as efficiency, monitoring, preventing harmful model outputs, and building inference infrastructure. They also test your ability to model a business problem as an ML problem.
A framework for answering ML system design interview questions
ML system design interview questions are challenging because they require you to synthesize many ML concepts into a working solution. You have the added pressure of working within a limited time frame.
A framework helps you stay focused, budget your time strategically, and communicate with the interviewer. This lesson will teach a simple framework to use in your interview.
The 6-step ML system design interview framework
An effective ML system design interview answer follows these steps:
- Step 1: Define the problem. Identify the core ML task and ask clarifying questions to determine the appropriate requirements and tradeoffs. (8 minutes)
- Step 2: Design the data processing pipeline. Illustrate how you’ll collect and process your data to maintain a high-quality dataset. (8 minutes)
- Step 3: Propose a model architecture. Come up with a suitable model architecture that addresses the needs of the core ML task identified in Step 1. (8 minutes)
- Step 4: Train and evaluate the model. Explain how you’ll train the model and measure its performance. (8 minutes)
- Step 5: Deploy the model. Determine how you’ll deploy, serve, and monitor the model. (8 minutes)
- Step 6: Wrap up. Summarize your solution and present additional considerations you would address with more time. (5 minutes)

The time estimates can vary based on where the interviewer wants to spend more time.
We’ll use the example, “Design a Spotify recommendation system,” to demonstrate how to apply each framework step.
Step 1: Define the problem
Time estimate: 8 minutes
Before you start designing your ML system, define the problem. This establishes the parameters for the rest of the interview and ensures you're on the same page as the interviewer. At this stage, the interviewer assesses your ability to clarify the problem scope and correctly identify the most important system requirements.
To define the problem, first identify the core ML learning problem and then ask clarifying questions about the system requirements.
Identify the core ML learning problem
When identifying the learning problem, explicitly state what you'd like to learn from the data and what function(s) you’ll use to get there. Then, clarify what model and datasets you need for your system. Below, we’ve defined common tasks and their typical models and datasets:
- A recommendation task ranks samples according to their similarity to the input. You typically use a collaborative filtering model (e.g. user-based or item-based) and a large dataset with (user, item, rating) rows.
- A regression task predicts a continuous scalar value. You typically use a regularized form of linear regression and a (potentially small) dataset that maps a set of features to a scalar value (the property of interest).
- A classification task categorizes the input into one of various discrete categories. You typically use logistic regression and a dataset that maps a set of features to a category.
- A generation task outputs new samples conditioned on inputs that match the training distribution. You typically use a neural network and a dataset that associates samples from the input space and samples from the output space (e.g. (description, image) pairs).
- A ranking task predicts an ordering of a set of elements (often called documents). You typically use a regression model to predict a ranking score that you’ll sort on and a dataset that maps from (element, element set) to (goodness of element). The most common definition of "good" in this setting is relevance, i.e. how relevant the document is to the query intent.
The example solutions above are appropriate default options for simple datasets, but most real-world data doesn’t have such simple structure. For example, for more complex regression or classification tasks, consider options such as SVMs, generalized additive models, spline-based methods, and neural networks. Always evaluate the problem’s goal, constraints, and metrics to determine the most appropriate solution.
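To make these defaults concrete, here’s a minimal sketch of the regression and classification baselines named above, assuming scikit-learn and synthetic data (every name and shape here is illustrative):

```python
# Minimal baselines for the regression and classification defaults above,
# trained on synthetic data purely for illustration.
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                                 # 8 toy features
y_reg = X @ rng.normal(size=8) + rng.normal(size=1000) * 0.1   # continuous target
y_clf = (y_reg > 0).astype(int)                                # binary category

X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X, y_reg, y_clf, random_state=0)

reg = Ridge(alpha=1.0).fit(X_tr, yr_tr)        # regularized linear regression
clf = LogisticRegression().fit(X_tr, yc_tr)    # default classification baseline
print("R^2:", reg.score(X_te, yr_te), "accuracy:", clf.score(X_te, yc_te))
```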
In the Spotify example, you could say:
"We are trying to build an ML-based recommender system on Spotify, which recommends artists to users, based on their liked playlists, songs, and artists.
The success of this system will depend on user engagement, which is defined by number of clicks. If a user clicks on a recommendation, that's a point towards the algorithm. If they don't, then we can agree it was a bad recommendation.
We can go deeper and assess the amount of time they engaged with the recommendation, but to keep things simple for now, let’s go with just a click."
Clarify requirements and tradeoffs
Once you’ve identified the core ML learning problem, clarify the system’s goals. There’s no single correct answer, so ask follow-up questions to help determine the appropriate requirements and tradeoffs. Some important topics to cover include:
- Minimum requirements for accuracy and performance: What are this system's minimum accuracy and efficiency requirements? Can tradeoffs of accuracy for performance be made when there are spikes in traffic?
- Traffic/bandwidth: Approximately how many users will access the model at once, and what is the average amount of traffic (in tokens/second or images/second)? Is the traffic relatively uniformly distributed, or are there occasional traffic spikes? How many Daily Active Users (DAUs) will use the system?
- Data sources and requirements: What data sources are available for use, and are there noisy or missing values? Will the data contain toxic or problematic content? What data privacy requirements, relevant legal jurisdictions, or copyright restrictions exist for this product?
- Computational resources and constraints: What computational resources are available for training and serving the model? How easy is it to parallelize the workload, either through model or data parallelization?
In the Spotify example, you could say:
"Two clarifying questions:
- What kind of raw data do we already have access to, and do we need to collect any raw data?
- What is the condition of the raw data?
We’ll assume that we’ll have click data from users as one data source. The other source will be metadata about the users, e.g. age group, location, and account history.
Understanding the condition of the raw data helps us plan for what kind of pipeline and transformations are needed to get the data into a usable format. Let’s assume we get click data in a JSON serialized format. These are usually events that come in and land in an object store. The user metadata is a bit simpler, as it’s available directly within the Postgres account table. However, we must keep in mind that it contains PII, so it will have to be handled carefully."
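To make the assumed click-data format concrete, here’s a hypothetical event of the kind that might land in the object store. Every field name is an illustrative assumption, not Spotify’s actual schema:

```python
import json

# A hypothetical click event as it might arrive in the object store.
raw_event = """
{
  "user_id": "u_12345",
  "event_type": "click",
  "timestamp": "2024-01-15T12:34:56Z",
  "item": {"song_name": "some song", "artist_name": "some artist", "genre": "pop"},
  "context": {"surface": "home_recommendations", "position": 3}
}
"""

event = json.loads(raw_event)                    # deserialize one event
print(event["user_id"], event["item"]["genre"])  # u_12345 pop
```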
Step 2: Design the data processing pipeline
Time estimate: 8 minutes
Designing a data pipeline shows your interviewer that you understand the importance of high-quality data, not just high-quality algorithms. At this step, show your interviewer that you’re thinking through the key factors that affect data quality, such as:
- What kind of data is needed? Numbers, text, images, multimodal, etc.
- How will you collect the data? Programmatic labeling, synthetic data augmentation, human annotation, etc.
- Do you need to do any kind of feature engineering? Would it be helpful to pre-compute some features, such as categorizing people’s ages into bins of “adolescent,” “adult,” etc.?
- What kind of data pre-processing do you need to do? Tokenization, normalization, encoding categorical features in numerical form, removing low-quality data, imputing missing values, synthetically augmenting data, etc.
- Are there privacy concerns involved with the kind of data you’re using? If so, can you remove identifying information or apply filtering or pre-processing techniques that induce k-anonymity (for sufficiently large k)?
- How do you ensure that no data contamination is occurring? For example, if your data segments are generated by the same process (the same spammer creates multiple spam emails in the same spam classification dataset), then ensure that those segments end up in the same split of your data, as in the sketch below.
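For the spam example above, a group-aware split keeps all emails from one sender in the same split. Here’s a minimal sketch assuming scikit-learn, with synthetic data and group assignments:

```python
# Group-aware split: all emails from the same sender stay in one split,
# preventing leakage between train and test. Data here is synthetic.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(-1, 1)                       # 20 toy samples
y = np.random.default_rng(0).integers(0, 2, size=20)   # toy spam labels
groups = np.repeat(np.arange(5), 4)                    # 5 senders, 4 emails each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No sender appears in both splits.
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```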
In the Spotify example, you could say:
"Having clarified the data conditions and sources in the previous step, we’re ready to design a data processing pipeline. We’ll use the above two points to create data processing pipelines and fetch what we need to create our features. Then, we’ll access the raw click data and the Postgres table for the account information. Afterwards, we’ll create our features.
To collect and process the data, we’ll have to decide between using a batch-based or real-time solution. A batch-based system is usually easier to manage, whereas training and inferencing in real time is compute-intensive and expensive. It’s usually better to run at least one of the two in batch, preferably the training (as this takes the most time). However, we can do inferencing in real time if needed.
Ideally, both training and inferencing would be in batch. There would be some serverless job that pulls the latest recommendations stored by the batch job in a cache. This way, the recommendations are available at all times, but they refresh every few hours. For this scenario, we’ll use a batch-based system for both training and inferencing.
Since we have click data coming in as JSON events and landing in an object store, we’ll design the data pipeline as an ETL pipeline. We’ll create an abstracted data model to illustrate what we want our data to look like in the end, before feeding it into the model. Generally, we want our features to be as independent of one another as possible, because this avoids complicated correlations between features.
We’ll take the following feature engineering steps:
- Read the data in its raw format
- Deserialize it
- Define the first 4 features:
  - Age group
  - Location (city, state, country)
  - Array of the user’s favorite artists, capped at the 100 most recent to keep it simple. Each element in the array is a map object with artist info (artist name, active # days, trending rank, genre, number of followers, etc.)
  - Array of the last 100 listened songs (song name, artist name, active # days, trending rank, genre, number of followers, number of likes, average listen time for the song, standard deviation of listen time for the song, etc.). To keep the listen-time metric simple, we can categorize it as (full, partial, skipped).
- Fetch the fields from the deserialized JSON records
- Clean them in preparation for feature engineering:
  - Mask PII (date of birth, full names, emails, etc.)
  - Parse location (convert from coordinates to city, state)
  - Discard what we don’t need. For example, the only PII we need is the user ID and date of birth, which we use to categorize users into an age group
  - Normalize fields: convert everything to lowercase, remove spaces and punctuation, remove noise, deduplicate records, and format timestamps correctly
- Fetch the artist and song details from the JSON array to create the arrays of songs/artists mentioned above
- After cleaning the data, it will land in a Postgres database. The data will contain information on the click event that happened. This information usually comes from card elements on the UI, which contain the song title, artist, artist ranking, genre, song duration, how long the song was played, volume levels throughout the song, and more.
We’ll store all of these features in a new table, and then write them to a feature store for model consumption."
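As a condensed sketch of the cleaning steps above (the field names, age bins, and helper functions are assumptions for illustration, not a production pipeline):

```python
import json
import re
from datetime import datetime, timezone

# Illustrative age bins for reducing date-of-birth PII to a coarse group.
AGE_BINS = [(0, 12, "child"), (13, 17, "adolescent"), (18, 64, "adult"), (65, 200, "senior")]

def age_group(date_of_birth: str) -> str:
    """Reduce date-of-birth PII to an age bin, then discard the raw value."""
    born = datetime.fromisoformat(date_of_birth).replace(tzinfo=timezone.utc)
    age = (datetime.now(timezone.utc) - born).days // 365
    return next(label for lo, hi, label in AGE_BINS if lo <= age <= hi)

def normalize(text: str) -> str:
    """Lowercase and strip punctuation and surrounding whitespace."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()

def build_features(raw_event: str, account_row: dict) -> dict:
    """Deserialize one click event, join it with the Postgres account row,
    keep only what the model needs, and mask the PII we don't."""
    event = json.loads(raw_event)
    return {
        "user_id": account_row["user_id"],                     # retained identifier
        "age_group": age_group(account_row["date_of_birth"]),  # PII reduced to a bin
        "song_name": normalize(event["item"]["song_name"]),
        "artist_name": normalize(event["item"]["artist_name"]),
    }
```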
Check out Designing a Data Processing Pipeline for in-depth information on this topic.
Step 3: Propose a model architecture
Time estimate: 8 minutes
Select and justify the model
Now that you have your data, it’s time to pick and train a model. At this step, the interviewer evaluates your ability to select an appropriate ML model among today's various models. The interviewer also evaluates the reasoning behind your decision. Justify your model selection by addressing the following:
- Type of learning problem: What models are typically used for your core ML learning problem?
- Use case: Will this model make predictions ingested by another system, or will users directly interact with the model? Will the model need frequent re-training, adaptation, or personalization?
- Parsimony: What’s the simplest possible model you can select that offers sufficient accuracy?
- Practical constraints: Do any safety, privacy, storage, and/or business constraints affect the model selection?
In the Spotify example, you could say:
"Now that we’ve created a data pipeline, we’ll consider the types of models typically used for recommendation systems. Traditionally recommendation systems take advantage of data from other users and use that to recommend something to new or even existing users. This is known as collaborative filtering, which has the potential to become a challenge if there is a lack of data from other users. Additionally, these days recommendation systems are getting more involved with deep learning and traditional supervised techniques like decision trees, XGBoosts, etc. There’s a huge library of paths to choose from."
For a quick refresher of ML models, check out “A Tour of Machine Learning Algorithms” and “6 Natural Language Processing Models you should know.”
Select a model architecture
Suggest suitable model architectures that fit the system requirements (e.g. latency or memory optimization). For example, potential model architectures for a classification ML task include logistic regression as a baseline classifier, a more complex feed-forward neural network, or a search-optimized two-tower architecture.
Among the architectures, select one that best fulfills the scope of the problem, matches the amount of data available, and optimizes for tradeoffs between efficiency, accuracy, sensitivity, and/or interpretability. For example, you could explain to your interviewer that you’ve selected a simpler neural network model to optimize for training and serving performance at the cost of some accuracy.
In the Spotify example, you could say:
"To satisfy the use case we currently have, let’s start with a simple architecture. Assuming we have the required data, we’ll move forward with the collaborative filtering element. With music, trends are traditionally developed through mutual sharing between listeners.
The simplest model we can select essentially creates a feature vector for each user. Each feature vector acts as a unique signature for a user, composed of the user’s features (age group, location, array of favorite-artist maps, array of favorite-song maps).
We’ll score each of these vectors between -1 and 1. This scoring method consolidates the vector into a single number that represents a user and their preferences. We’ll also score each item we could recommend between -1 and 1, depending on its popularity and number of plays. Scoring users and items on the same scale lets us compare them directly (normalization).
We’ll then organize these scores into a user-item matrix, with each user on a row and each item on a column. We’ll compute the product of each user’s score with each candidate song’s score and set a threshold between -1 and 1.
Depending on how close the product is to 1, we’ll decide whether to provide that item as a recommendation to the user. If we want very specific and limited recommendations, we can set the threshold high, and vice versa. Generally, it’s better to start with a low threshold to collect as much information as possible. Then, we can begin to pinpoint the optimal threshold value for future recommendations."
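Here’s a minimal sketch of this score-product-and-threshold logic in NumPy; the user and item scores and the threshold are made-up values:

```python
import numpy as np

# Hypothetical consolidated scores in [-1, 1]: one per user, one per item.
user_scores = np.array([0.9, -0.2, 0.5])           # 3 users
item_scores = np.array([0.8, -0.7, 0.1, 0.95])     # 4 candidate items

# User-item matrix of score products: rows are users, columns are items.
matrix = np.outer(user_scores, item_scores)

threshold = 0.3    # start low to collect broad feedback, then tune upward
recommendations = {
    user: np.where(matrix[user] >= threshold)[0].tolist()
    for user in range(len(user_scores))
}
print(recommendations)   # {0: [0, 3], 1: [], 2: [0, 3]}
```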

Your interviewer may ask you to whiteboard the model architecture into an ML system design diagram. Your whiteboard should include the data sources and pipelines identified in Step 2, the desired output, post-training storage, and inference.
Step 4: Train and evaluate the model
Time estimate: 8 minutes
Train the model
Once you’ve selected a model, it’s time to train it. Decide what optimizer algorithm you’ll use, what metrics you’ll need to monitor during training, and how you’ll tune hyperparameters. The metrics you monitor during training are critical because they alert you when something goes wrong and can indicate when to stop training.
Your training plan also depends on the type of hardware available to you. You may need to parallelize the training jobs and distribute your data and model parameters across multiple machines. Lastly, certain models may not require you to train all parameters. You may be able to fine-tune a pre-trained model, rather than train it from scratch, and take advantage of sparse, low-rank, or low-/mixed-precision training methods.
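As one example of monitoring a metric that indicates when to stop, here’s a toy training loop that watches validation loss and stops early when it plateaus. The data, linear model, and patience value are all illustrative:

```python
# Toy gradient-descent loop with early stopping on a monitored validation metric.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(size=500) * 0.1
X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

w = np.zeros(10)                 # linear model weights
lr, patience = 0.01, 5
best_loss, bad_steps = np.inf, 0
for step in range(1000):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # MSE gradient
    w -= lr * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)        # monitored metric
    if val_loss < best_loss - 1e-6:
        best_loss, bad_steps = val_loss, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:    # validation loss stopped improving
            print(f"early stop at step {step}, val MSE {best_loss:.4f}")
            break
```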
In the Spotify example, you could say:
"To create training inputs, we’ll take the process data, code non-numerical data, and featurize the rest of the data. The training will produce a user-item matrix. This matrix will then be used to create a probabilistic prediction as a recommendation for an item to the user.
The user is then presented with these recommendations. If a user clicks on any recommendation, the click data is collected as positive feedback. Any recommended items that were not clicked are considered negative feedback. The number of clicks over the total number of recommendations (the click-through rate) is the model’s accuracy metric."
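In code, this metric reduces to a simple ratio of clicks to served recommendations; the data below is illustrative:

```python
# Click-through rate over served recommendations; data is illustrative.
served = [("u1", "song_a"), ("u1", "song_b"), ("u2", "song_c"), ("u2", "song_d")]
clicked = {("u1", "song_a"), ("u2", "song_d")}   # positive feedback

labels = [1 if pair in clicked else 0 for pair in served]  # 0 = negative feedback
ctr = sum(labels) / len(labels)
print(f"CTR: {ctr:.2f}")   # 0.50
```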
Evaluate the model
After selecting the model, tell your interviewer how you’ll evaluate it. Your interviewer assesses how knowledgeable you are of different evaluation standards and metrics. Present a robust evaluation plan by considering where your model will be used and how an incorrect prediction may negatively impact the user. Some key evaluation standards include:
- Accuracy: F1, precision, recall, and confusion matrices, etc.
- Bias: Group fairness, etc.
- Calibration: Ensuring a model’s predicted confidence matches the probability that its prediction is correct.
- Sensitivity/robustness: Assessing whether minor input changes affect a model’s prediction.
- Comparisons against baselines: Comparing against the simplest possible model, a random baseline, and/or a human baseline.
Be prepared to talk about the pros and cons of your chosen evaluation metrics. For example, if it’s a ranking task and you pick precision@k, address how it compares to NDCG@k. Acknowledging the tradeoffs among the possible evaluation metrics demonstrates your ability to optimize a model for its particular purpose.
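To illustrate the tradeoff, here’s a minimal sketch of both metrics on a toy ranking with binary relevance labels. Precision@k ignores where relevant items sit within the top k, while NDCG@k rewards placing them nearer the top:

```python
import numpy as np

def dcg(relevance: list[int]) -> float:
    """Discounted cumulative gain: position i is discounted by log2(i + 2)."""
    rels = np.asarray(relevance, dtype=float)
    return float((rels / np.log2(np.arange(2, len(rels) + 2))).sum())

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k results that are relevant; order within top-k ignored."""
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance: list[int], k: int) -> float:
    """DCG normalized by the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0

ranked = [0, 1, 1, 0, 1]           # binary relevance of results, in ranked order
print(precision_at_k(ranked, 3))   # ≈0.67: two of the top three are relevant
print(ndcg_at_k(ranked, 3))        # ≈0.53: lower, because the top slot is irrelevant
```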
In the Spotify example, you could say:
"Once we’ve established the accuracy metric, we’ll use the features for the positive recommendations and the features for the negative recommendation to see the difference. This difference will indicate if certain features played a larger role in affecting user behavior versus the other. This data can then be used to create a feature weighting algorithm that learns to get better at weighing features. Consequently, the collaborative filtering algorithm will also improve."
Check out Evaluating a Model for ML Systems for a deeper dive into model evaluation.
Step 5: Deploy the model
Time estimate: 8 minutes
Although details such as the function of each ML framework’s compiler may be out of scope for a general ML system design interview, know how these components fit into the bigger picture. The three main questions to answer are:
- How will you decide when to deploy a new model? Select appropriate evaluation metrics and identify strategies to test your model on production data, such as A/B tests, canary deployment, feature flags, and/or shadow deployment.
- How will the model be served? Select the hardware (e.g. remote, on the edge), optimize and compile the model (e.g. NVCC, XLA), and consider how you’ll handle different patterns in user traffic.
- How will you continuously monitor the health and performance of the deployed model? Unlike other software systems, post-production is very important for ML systems: you’re constantly improving performance and benchmarking models. Consider what kind of dataset you’ll use as a source of ground truth, how you’ll determine when the model’s performance has regressed enough to require intervention, and what other tools you’ll build to troubleshoot model serving issues.
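As one example of such monitoring, here’s a simple data-drift check that compares a feature’s live distribution against its training snapshot with a two-sample Kolmogorov-Smirnov test. The data and alarm threshold are illustrative:

```python
# Simple data-drift check: compare a feature's live distribution against the
# training distribution with a two-sample KS test. Threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)   # snapshot from training data
live_feature = rng.normal(0.3, 1.0, size=5000)    # recent serving traffic (shifted)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:                                # drift alarm threshold
    print(f"possible drift: KS={stat:.3f}, p={p_value:.2e} -> investigate/retrain")
```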
In the Spotify example, you could say:
"The last step in this process is to understand when and how best to deploy our model into production.
First, we’ll define the appropriate metrics, which we previously identified as engagement. Then, we can roll out an A/B test plan for this model to understand whether it improves the user experience.
Second, we’ll need to understand the compute and storage resources we have to train, test, validate, and run inference. Let’s say we’re using a cloud platform like AWS. We can take advantage of AWS SageMaker (to house, train, and test the model), Lambda (to serve requested recommendations), and ElastiCache (to store the recommendations), and provide them back to the application via an API endpoint. We can then auto-scale the resources to handle changing volumes of traffic from the application."
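Here’s a sketch of the serving path described above, using the generic redis-py client (ElastiCache exposes a Redis-compatible endpoint); the key naming and payload layout are assumptions:

```python
# Serving precomputed recommendations from a cache. The batch job is assumed
# to have written a JSON list under a per-user key; key names are illustrative.
import json
import redis   # redis-py; ElastiCache exposes a Redis-compatible endpoint

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_recommendations(user_id: str) -> list[str]:
    payload = cache.get(f"recs:{user_id}")
    if payload is None:
        return []          # fall back to e.g. globally popular items
    return json.loads(payload)
```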
Check out Deploying an ML Model for a deeper dive into model deployment.
Step 6: Wrap up
Time estimate: 5 minutes
In the last few minutes of the interview:
- Debrief: review the problem scope, data processing pipeline, and how you would train, evaluate, and deploy the model.
- If there’s time, discuss your system design's main bottlenecks and tradeoffs. Why did you decide that those bottlenecks or tradeoffs would be acceptable? How would you scale the system to handle more data or inference/training requests? How would you adjust the model and/or data processing in the future to handle distribution shifts?
Ending with a high-level overview and additional considerations shows the interviewer you have a comprehensive understanding of the system. You’re also demonstrating your technical design skills by proactively identifying extra components and tradeoffs you’d consider in a less time-constrained setting. Once you’ve wrapped up, check in with your interviewer to see if there are follow-up questions.
In the Spotify example, you could say:
"To recap, we’ve just designed a high-level system to recommend artists on Spotify. We first identified our data sources as user metadata and click data. We then opted for a batch-based system to process the data, used a collaborative filtering model to score each user’s feature vectors, and collected click data to train the model. We then discussed the factors affecting model deployment, such as engagement and compute and storage resources.
The other consideration to shed additional light on is post-production work. Machine learning is very dynamic, since incoming data changes constantly. This affects the model and its performance, so it’s important to have monitoring and observability on model drift, data drift, and feature drift. It’s essential to observe the model’s performance to ensure that we’re still meeting our metric. We can check model performance continuously by observing the metric we’re testing against (engagement, measured by clicks)."
Common pitfalls
- Rushing into the solution. Rather than jumping into the design, first analyze the specific problem you’re trying to solve by clarifying the system requirements, the context of the problem, the scale of the data, etc. Once you develop a baseline model, get the interviewer's input about what pieces to focus on.
- Looking for the “right” answer. In most cases, there are no strictly right or wrong answers. Some are better justified than others, and your interviewer expects you to thoroughly justify your answers by explaining why you chose your design over possible alternatives.
- Defaulting to state-of-the-art (SotA) models. It's certainly important to check ML benchmark leaderboards to identify the current SotA models for a given task. However, keep in mind that SotA models are often less efficient to train and run inference with (requiring more compute or data). They're also usually evaluated on academic benchmarks only, rather than in real-world settings. Practice building your own models and research other models so that you have a holistic understanding of the available options.
- Overcomplicating the model. When training models, many things can go wrong, so start with a low-capacity, v1 solution. Once you have a v1 solution of the system that would work on clean data, expand the model capacity to account for additional pieces of complexity (e.g. messy data and corner cases). Starting with a basic model also budgets time for the interviewer to identify the pieces of the ML design they’d like you to focus on. Taking those hints shows that you can collaborate and incorporate feedback on your design.
- Overlooking model evaluation and validation. Model selection is just one part of the problem, so budget time for the other steps. Clarify how you’ll initially validate a model learned from some data (your strategy should involve quantitative and qualitative analysis), and discuss how continual validation will happen (e.g. using a metrics dashboard).