
Rubric for ML Concepts Interviews


Interviewers are hiring for an ML engineer who can train, implement, and launch a fully functioning ML model. We collaborated with expert ML engineers from Microsoft and Pinterest to identify the rubric signals and criteria outlined in this lesson.

A successful candidate should be able to demonstrate expertise in the following signals:

  1. Data handling: assesses your ability to source data and transform it into a more workable format.
  2. Model selection: assesses your ability to select a model, given a particular task, and discuss the bias-variance tradeoff on that model’s parameterization.
  3. Optimization: assesses your ability to identify a loss function and reasonable optimization scheme for training.
  4. Evaluation: assesses your ability to evaluate the performance of a model using appropriate metrics.
  5. Production: assesses your ability to discuss launching a model into production and identify when to replace the model with a model refresh or an entirely new model.

This list captures the life cycle of most ML models. However, senior-level candidates should demonstrate significantly more knowledge in the following areas:

  • State-of-the-art approaches in your particular area of expertise
  • Corrections for model failures, while training and while in production
  • Engineering constraints in the real world, and how they affect productionized models
  • Domain expertise in a particular field of ML (e.g. CV, audio engineering, NLP, recommendation systems)

An interviewer will assess these signals using a rating scale of “very weak” to “very strong.”

The overall ratings for these rubric signals translate as follows:

  • Very Weak: While rare, this rating can be given if your answers don’t match up with the experiences and skills mentioned in your resume.
  • Weak: This rating is common, but getting it doesn't prevent you from moving forward. When reviewing, the hiring manager may follow up with the interviewer to understand what happened. Depending on the context, you may still move on to the next round of interviews.
  • Strong: This rating is also relatively common, and it means you gave the correct answer but fell just short of a “very strong” rating. You may have missed some details or approached the solution with too basic a strategy.
  • Very Strong: More rare than “strong,” this rating signifies that you knew the exact way to solve something, and you were confident in your response.

ML Concepts Rubric

Keep in mind that most questions won’t cover all of these rubric signals at once, since each interview question will focus on particular ML concepts. Additionally, companies may assign different weights to each question; stereotypical machine learning questions (e.g. “How would you handle an exploding gradient, given a neural network model?”) will likely be weighted less heavily, whereas in-house, domain-level questions (e.g. “What data types are used with PySpark MLlib?”) will be weighted more heavily.
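For the stereotypical question above, one standard answer is gradient clipping. Here is a minimal PyTorch sketch; the model, data, and clipping threshold are illustrative assumptions rather than part of the question:

```python
import torch

# Illustrative model and data; any torch.nn.Module would work here.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 128), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm before the optimizer step so no single
# update can explode, then apply the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```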

The sections below break down the rubric for each signal, using example questions to make the ratings more concrete. These example questions are asked in the context of this example interview scenario:

You’re interviewing for a company that makes product recommendations, and the interviewer asks you various questions related to recommender systems.

Data handling

Data handling, more colloquially referred to as data wrangling, is the process by which raw data is transformed into clean, valuable signals for ML models. The majority of your day-to-day work in ML falls into this category, so a good portion of the interview will draw from it.

In the sample product recommendation scenario, a potential data question is: “We are building a recommendation engine as a neural network with 10,000 features. How might we go about reducing the input?”

Based on your response, the interviewer could make the following ratings:

  • Very Weak: Arbitrarily tries combinations of features without a strategy, failing to produce more workable data.
  • Weak: Suggests a method of feature selection/reduction, but doesn’t know any details beyond a simple call to scikit-learn.
  • Strong: Suggests multiple methods of feature selection/reduction, with some understanding of how each method works.
  • Very Strong: Suggests multiple methods of feature selection/reduction, with an in-depth understanding of how each method works. For example, may explain why one might want to shrink a neural network, or point out that the number of parameters in the first layer scales with the input dimension, so reducing the input reduces the amount of computation.
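To make the stronger answers concrete, here is a minimal sketch of two common reduction methods using scikit-learn; the synthetic matrix X stands in for the 10,000-feature input from the question, and the target size of 256 is an arbitrary assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10_000))  # stand-in for the 10,000-feature input
y = rng.integers(0, 2, size=1000)    # stand-in binary labels

# Unsupervised reduction: project onto the top principal components.
X_pca = PCA(n_components=256).fit_transform(X)

# Supervised selection: keep the features most predictive of the label.
X_sel = SelectKBest(f_classif, k=256).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)  # (1000, 256) (1000, 256)
```

A “very strong” answer would also contrast the two: PCA preserves variance but mixes raw features into opaque components, while feature selection keeps the original, interpretable features.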

Senior candidates should be incredibly competent in this material, so it’s less likely that they’ll be asked data transformation or featurization questions. Instead, they might be asked how to create data pipelines that improve existing workstreams. To prepare, research the data streams the company already possesses. For example, Google’s search engine ingests roughly 200 petabytes of crawled internet data per day. A Google interviewer may ask, “How would you extract useful information from this data at scale?”
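As a rough illustration of the pipeline framing, here is a hedged PySpark sketch; the bucket path, column names, and the specific aggregation are all hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("crawl-features").getOrCreate()

# Hypothetical input: partitioned Parquet dumps of crawled pages.
pages = spark.read.parquet("s3://example-bucket/crawl/date=2024-01-01/")

# Extract a small, useful signal at scale: per-domain page counts and
# average page size, after filtering out obvious junk.
domain_stats = (
    pages
    .filter(F.col("http_status") == 200)
    .groupBy("domain")
    .agg(
        F.count("*").alias("page_count"),
        F.avg("content_length").alias("avg_bytes"),
    )
)

domain_stats.write.mode("overwrite").parquet(
    "s3://example-bucket/features/domain_stats/"
)
```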

Model selection

Model selection covers topics like binary classification models (e.g. logistic regression, support vector machines, perceptrons, decision trees), types of neural networks (e.g. deep, convolutional, recurrent, and graph), and reinforcement learning.

In the sample product recommendation scenario, you may be asked: “We want to build a new recommender system based on user features. What type of model would you use?”

The answers here can vary widely depending on the company, application, and specific context. Based on your response, the interviewer could make the following ratings:

  • Very Weak: Fails to ask clarifying questions, jumps into solutions. Suggests a model that will not work, and clearly doesn’t know what a recommender system is.
  • Weak: Asks some basic clarifying questions about the model. Suggests a model that may not work, but can speak to how adjustments could be made.
  • Strong: Asks some thoughtful clarifying questions around the model’s goals and available data. Suggests a more basic model, such as a decision tree or a probability estimator. Might have needed a hint, but otherwise would have gotten the answer on their own with more time.
  • Very Strong: Clarifies the scope, available data, and values of the system. Confidently suggests a suitable model like XGBoost. Explains how the algorithm works and how to formulate the data. Estimates how long the model will take to train and considers scaling opportunities.
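As a rough sketch of what the “very strong” XGBoost answer might look like in code, recommendation is framed here as predicting whether a user interacts with an item; the feature layout, data, and hyperparameters are illustrative assumptions:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: one row per (user, item) pair, with user
# and item features concatenated; label = 1 if the user interacted.
X = rng.normal(size=(5000, 40))
y = rng.integers(0, 2, size=5000)

model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)

# Score candidate items for one user and recommend the top-k.
candidates = rng.normal(size=(100, 40))
scores = model.predict_proba(candidates)[:, 1]
top_k = np.argsort(scores)[::-1][:10]
```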

Senior candidates should also be prepared to discuss recent generative models and/or state-of-the-art models. The number of business applications for these models is growing rapidly, and it’s becoming increasingly necessary to hire people who have experience with them. When appropriate, discuss how Google Bard, ChatGPT, or other models can advance the company’s business strategy.

Optimization

Optimization is a rich area for interview questions, which cover topics like loss functions, hyperparameters, and different optimizers.

In the sample product recommendation scenario, you may be asked: “We want to significantly reduce the complexity of our model. How can we construct a loss function that still values our priorities in our dataset?”

Based on your response, the interviewer could make the following ratings:

  • Very Weak: Demonstrates a clear lack of understanding about loss functions.
  • Weak: Suggests a standard loss function. Or, suggests a loss function with regularization strategies that can’t be easily optimized. Might have needed a hint from the interviewer.
  • Strong: Suggests a loss function with L1 regularization. Clearly explains how the optimization scheme would work.
  • Very Strong: Constructs a loss function that includes L1 regularization and can be optimized. Mentions other regularization strategies that avoid model overfitting.
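To illustrate what “a loss function with L1 regularization that can be optimized” might look like, here is a minimal NumPy sketch: mean squared error plus an L1 penalty, minimized with proximal gradient descent (ISTA). The data and regularization strength are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = rng.normal(size=5)            # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=200)

lam = 0.1  # regularization strength (assumed)

def loss(w):
    # Mean squared error plus an L1 penalty on the weights.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

# Proximal gradient descent (ISTA): gradient step on the smooth MSE term,
# then soft-thresholding handles the non-smooth L1 term.
w = np.zeros(50)
step = 0.01
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

print(loss(w), np.sum(w != 0))  # most weights end up exactly zero
```

The soft-thresholding step is what drives weights to exactly zero, which is the “reduce complexity” property the question is after.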

Senior candidates should already be incredibly competent in this material, so it’s unlikely that they’ll receive simple optimization questions. However, they may receive questions whose answers indirectly relate to optimization. For example, given a prompt asking them to fix a modeling problem, they might discuss regularization or other strategies to reduce the feature space.

Evaluation

Topics relevant to the evaluation signal include interpreting and calculating metrics for both classification and regression ML models.

In the sample product recommendation scenario, you may be asked: “How do you know your recommendation engine is working well? What metrics would you use?”

Based on your response, the interviewer could make the following ratings:

  • Very Weak: Demonstrates zero experience evaluating models and is unable to suggest relevant metrics.
  • Weak: Suggests a metric that won’t work. Or suggests a metric, but can’t explain how the metric actually works to evaluate the model or how it is computed.
  • Strong: Clarifies what the model predicts. Suggests useful metrics, such as precision, recall, TPR, FPR, AUC, or accuracy.
  • Very Strong: Suggests how the evaluation metric might depend on the inference model (e.g. decision trees vs. regression). Proactively mentions the cold start problem for new vs. returning users and suggests solutions to resolve it.
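As a small illustration of computing the metrics a “strong” answer names, here is a scikit-learn sketch on hypothetical held-out labels and model scores:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical held-out data: 1 = user engaged with the recommendation,
# score = model's predicted probability of engagement.
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.3 + rng.random(1000) * 0.7
y_pred = (y_score >= 0.5).astype(int)  # threshold for hard predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
```

Note that precision and recall depend on the chosen threshold, while AUC evaluates the raw scores across all thresholds; pointing this out is part of what separates a “strong” from a “very strong” answer.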

Senior candidates should also discuss metrics tied to desired business outcomes. They should consider a metric’s overall goal while developing the model, and they should demonstrate the ability to benchmark model efficacy.

Production

Model production involves skills such as deriving appropriate signals from A/B testing, incorporating user telemetry data into model health checks, and recognizing when a model needs to be replaced or refreshed.

In the sample product recommendation scenario, you may be asked: “How can we tell when a model needs to be refreshed?”

Based on your response, the interviewer could make the following ratings:

  • Very Weak: Doesn’t understand how a model can drift over time and needs significant help to arrive at a plausible answer.
  • Weak: Suggests only basic performance metrics, perhaps after a hint or two.
  • Strong: Suggests more appropriate performance metrics and how they change over time. Notes that a model needs to be replaced once performance drops below a threshold.
  • Very Strong: Provides clear reasoning of how to measure a model’s health. Describes how to properly measure how the input and output of the model drift over time.
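One way a “very strong” answer might operationalize input drift is a two-sample test between the training-time and live distributions of a monitored feature. A minimal SciPy sketch; the distributions, the feature being monitored, and the significance threshold are all assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical distributions of one input feature: at training time vs.
# in production, where user behavior has shifted.
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=10_000)

stat, p_value = ks_2samp(train_feature, live_feature)

# Assumed policy: flag the model for a refresh when the input distribution
# drifts significantly on any monitored feature.
if p_value < 0.01:
    print(f"Input drift detected (KS={stat:.3f}); schedule a model refresh.")
```

In practice this check would run per feature on a schedule, alongside output-distribution and business-metric monitoring, which is the fuller picture a “very strong” answer describes.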

Senior candidates should be prepared to discuss appropriate next steps given specific model performance metrics. They should demonstrate, through concrete examples, their knowledge of when a model should be refreshed versus replaced.

There are key strategies to follow and pitfalls to avoid when answering questions in each of these rubric categories. We discuss these best practices in the following lessons: