Deploying an ML Model

Deploying a model involves some of the most complex engineering challenges in the ML model lifecycle. There are many decisions to make, such as:

  • Whether the model will run in the cloud or on-device
  • How the model will be optimized and compiled
  • What hardware to serve the model with
  • How to handle user traffic
  • How to ensure that the new model outperforms the production model
  • How to continuously monitor the model

Although fine-grained details such as how each ML framework’s compiler works are mostly out of scope for an ML system design interview, you should understand the bigger picture of how these components fit together. The graphic below outlines how deployment interacts with the different parts of an ML system.

[Figure: ML Model Overview — deployment in the context of the full ML system]

In the sections below, we describe how to discuss the three main components of ML deployment:

  1. Deploying the model
  2. Serving the model
  3. Monitoring the model

Deploying the model

In general, you should only deploy a new model when you’re reasonably confident that it will perform measurably better than the current production model on real-world data. Beyond picking appropriate evaluation metrics, you should also consider how to test your model on production data, such as via A/B tests, canary deployment, feature flags, or shadow deployment.
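
As an illustration, here is a minimal sketch of canary-style traffic splitting in Python. The `Model` stub, the endpoint names, and the 5% canary fraction are all hypothetical; in a real system the models would be served endpoints and the routing decision would typically live in a load balancer or feature-flag service.

```python
import random

CANARY_FRACTION = 0.05  # fraction of traffic routed to the new model

class Model:
    """Stand-in for a deployed model endpoint (hypothetical)."""
    def __init__(self, name):
        self.name = name

    def predict(self, request):
        return {"model": self.name, "score": 0.9}

production_model = Model("prod-v1")
candidate_model = Model("candidate-v2")

def predict(request):
    # Route a small random slice of traffic to the candidate ("canary").
    # If its live metrics hold up, ramp CANARY_FRACTION toward 1.0 and
    # promote the candidate; if they regress, roll it back to 0.
    model = candidate_model if random.random() < CANARY_FRACTION else production_model
    return model.predict(request)

print(predict({"user_id": 42}))
```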

Serving the model

First, select the hardware by determining whether the model will be served remotely or on the edge (in the browser or on-device). Serving the model remotely may allow you to use more compute resources, but network latency can slow response times. Serving on the edge, on the other hand, may be more efficient and offer better security and privacy (because user information isn’t sent elsewhere), but it limits model capacity. Some of these trade-offs can be softened with modern model compression or knowledge distillation techniques.
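
As one example of these techniques, here is a minimal sketch of Hinton-style knowledge distillation in PyTorch, where a small on-device student is trained to mimic a larger server-side teacher. The temperature and weighting values are illustrative defaults, not prescribed settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    # Hard term: still fit the ground-truth labels directly.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```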

Once you’ve selected the hardware, optimize and compile the model. Many compilers already exist for common pairs of ML frameworks and hardware (e.g. NVCC, the compiler for NVIDIA GPUs, supports PyTorch via the CUDA Toolkit; XLA optimizes and compiles TensorFlow code for TPUs, GPUs, and CPUs). However, your code may still require additional optimizations for maximal efficiency. For example, you can vectorize iterative operations into single batched operations that run on the hardware where the data already lives.
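
To make the vectorization point concrete, here is a small PyTorch sketch contrasting an element-wise Python loop with a single vectorized kernel that runs on whatever device the tensors already live on. The dot product is just a stand-in workload.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(100_000, device=device)
w = torch.randn(100_000, device=device)

def dot_loop(x, w):
    # Slow: a Python-level loop launches one tiny op (and a device sync
    # via .item()) per element instead of one fused kernel.
    total = 0.0
    for i in range(x.shape[0]):
        total += (x[i] * w[i]).item()
    return total

def dot_vectorized(x, w):
    # Fast: a single vectorized kernel that runs where the tensors live.
    return torch.dot(x, w).item()

print(dot_vectorized(x, w))
```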

Lastly, determine how you’ll handle different patterns in user traffic. Predictions can be batched asynchronously, or they can be handled as soon as they arrive, which incurs less latency per request but uses computational resources less efficiently. Whenever user traffic spikes, consider falling back to a smaller, less accurate model or to a single model (instead of ensembling predictions from multiple models).
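
Below is a rough sketch of asynchronous micro-batching with Python’s asyncio: requests queue up until either a batch-size or a wait-time threshold is hit, then run through one batched forward pass. The `run_model` stub and both thresholds are hypothetical; serving frameworks typically provide dynamic batching like this out of the box.

```python
import asyncio

MAX_BATCH = 32    # flush when this many requests are queued...
MAX_WAIT_MS = 10  # ...or when the oldest request has waited this long

queue: asyncio.Queue = asyncio.Queue()

def run_model(batch):
    """Stand-in for one batched forward pass (hypothetical model)."""
    return [x * 2 for x in batch]

async def predict(x):
    # Each caller enqueues its input plus a future for its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def batcher():
    while True:
        items = [await queue.get()]  # block until at least one request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(items) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        inputs = [x for x, _ in items]
        for (_, fut), y in zip(items, run_model(inputs)):
            fut.set_result(y)

async def main():
    asyncio.get_running_loop().create_task(batcher())
    print(await asyncio.gather(*(predict(i) for i in range(5))))

asyncio.run(main())
```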

Monitoring the model

Once the model has been deployed, continue monitoring its health and performance.

Performance regressions are common in real-world settings, especially because data and user behaviors constantly shift. A model that was once accurate on a dataset can become obsolete, and a new model or a new set of features may be needed. It’s critical to set up infrastructure and observability that detects drift in features, data, or models, and then benchmarks competing models when appropriate.
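
As a concrete example of feature-drift detection, here is a sketch of the population stability index (PSI), a common drift statistic that compares a feature’s training-time distribution against live traffic. The thresholds in the comment are a widely used rule of thumb rather than a universal standard, and the shifted “live” data here is simulated.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution and live traffic.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) in sparse bins; live values outside the training
    # range fall out of the histogram (production code would add overflow bins).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)  # distribution at training time
live_feature = rng.normal(0.3, 1.2, 10_000)   # live traffic has shifted
print(f"PSI = {population_stability_index(train_feature, live_feature):.3f}")
```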

To evaluate on real-world data, you need some source of ground truth. Do you have a hand-labelled dataset of gold standard data that’s continuously updated, or will you rely on less direct metrics (e.g. number of clicks on recommended movies)? How do you determine when the model’s performance has regressed enough to require intervention? What other tools will you build to monitor and troubleshoot model serving issues (e.g. high inference latency, high memory use, numerical instability)? For more information, check out this guide for monitoring the health of deployed models.
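
For instance, here is a minimal in-process sketch of latency monitoring: time each inference call, track the p99, and compare it against a latency budget. The 200 ms budget and the toy workload are hypothetical; production systems typically export these metrics to a dedicated monitoring stack instead.

```python
import time
import numpy as np

LATENCY_P99_BUDGET_MS = 200  # hypothetical service-level objective

class LatencyMonitor:
    """Track per-request inference latency and flag budget violations."""
    def __init__(self):
        self.samples_ms = []

    def observe(self, fn, *args):
        # Time one inference call and record its latency in milliseconds.
        start = time.perf_counter()
        result = fn(*args)
        self.samples_ms.append((time.perf_counter() - start) * 1000)
        return result

    def p99_ms(self):
        return float(np.percentile(self.samples_ms, 99))

    def healthy(self):
        return self.p99_ms() <= LATENCY_P99_BUDGET_MS

monitor = LatencyMonitor()
for _ in range(1000):
    # Toy stand-in for a model's forward pass.
    monitor.observe(lambda x: sum(i * i for i in range(x)), 10_000)
print(f"p99 = {monitor.p99_ms():.2f} ms, healthy = {monitor.healthy()}")
```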

Keep in mind that designing an effective model is just one part of the ML system design interview. Prior to selecting and deploying a model, you need to clarify system requirements and design a data pipeline. Check out our framework lesson to learn how to integrate these steps into an organized solution.