Design a System to Predict Netflix Watch Times

Introduction

A system design interview tests your ability to build a system in a practical setting and adapt it to fit the particular business needs of the company and the users. This often requires discussing the tradeoffs between different methods and making concrete decisions. Interviewers generally look for candidates to display holistic knowledge of a system, as well as how decisions in one part of the system can affect decisions in other parts.

In this conversational interview, the candidate explores key considerations of designing a system for predicting watch times. Rather than designing the system end-to-end and following the typical interview framework, the interviewer tests the candidate's understanding of handling large amounts of data, as well as of efficient algorithms for scaling predictions across different levels of granularity (country, region, family, individual).

Question 1: Problem scope

The interviewer asks, “How would you approach this problem?”

As with any machine learning system, we’ll need to ask a lot of clarifying questions about the data that’s available to us, as well as the scope of the problem. This includes:

  1. How much data do we have? What user information do we have access to?
  2. Can we get watch-time logs on a per-user basis?
  3. How representative is the data sample that we have of the general population?
  4. How fine-grained does our prediction system need to be?
  5. Do we want to tailor our predictions based on anything more than just good predictions of previous watch times?

Once we have the answers to these questions, we can start thinking about the kinds of systems we might choose to build to predict watch times.

First, we should consider the kinds of data we have available, and what kind of useful features we might be able to pull out from each example. To help organize things, we’ll split features into user features and movie/TV show features.

User features include things like demographic information, user profile/ID, past interactions, etc., while movie/show ("item") features include things like genre, actors, item length, and popularity with other users. We might also want to model time-based features, such as the time/date of the watch-time log and previous activity during the session.
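
As a concrete illustration, here is a minimal sketch of such a feature split in Python; the field names are hypothetical assumptions, not Netflix's actual schema:

```python
from dataclasses import dataclass

# Hypothetical feature schema sketching the user/item/context split
# described above; field names are illustrative, not a real schema.
@dataclass
class UserFeatures:
    user_id: str
    country: str
    age_bucket: str
    past_watch_minutes: float      # aggregate of prior interactions

@dataclass
class ItemFeatures:
    item_id: str
    genres: list[str]
    actors: list[str]
    runtime_minutes: int
    popularity: float              # e.g., recent completion rate

@dataclass
class ContextFeatures:
    hour_of_day: int
    day_of_week: int
    minutes_into_session: float    # previous activity this session
```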

Question 2: Features and tradeoffs

The interviewer asks, “How would you represent each of these features and what are the tradeoffs between these representations?”

Most of these features could be represented using one-hot features, i.e. one feature for each possible actor, genre, series, etc. This lets our model learn from the presence or absence of individual features, which may provide strong signals for watch time, and it also makes it easy to add new features to the model as necessary. One-hot features are simple to implement, since they don't require learning any kind of continuous embedding space, and small modifications (e.g. using a continuous number from 0 to 1 to represent how prevalent an actor is within a series) can add additional information.

However, with too many features, the input to our model can grow dramatically in size, and we might risk overfitting to individual features that appear very few times in the training set. This can become especially bad when we have a huge combination of potential features for each aspect of the problem we want to model.

One way to address some of the issues with one-hot encodings is to restrict our options to the top-k actors/genres so that the feature space doesn't blow up. Alternatively, we could learn a continuous embedding space that places similar actors or films near each other in a fixed-length representation, which generalizes well to new actors or features; however, such a space is difficult to learn properly, and on its own it ignores the fact that we might want to model items and users jointly.
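
Here is a minimal sketch of the top-k idea, assuming hypothetical actor names; anything outside the vocabulary collapses into a shared rare/unknown slot:

```python
from collections import Counter

# Multi-hot encoding restricted to the top-k most frequent values.
# Actors outside the vocabulary map to a shared "rare/unknown" slot,
# keeping the feature vector at a fixed length of k + 1.
def build_vocab(actor_lists, k):
    counts = Counter(a for actors in actor_lists for a in actors)
    return {actor: i for i, (actor, _) in enumerate(counts.most_common(k))}

def encode(actors, vocab):
    vec = [0.0] * (len(vocab) + 1)            # last slot = rare/unknown
    for a in actors:
        vec[vocab.get(a, len(vocab))] = 1.0
    return vec

catalog = [["Actor A", "Actor B"], ["Actor B", "Actor C"], ["Actor B"]]
vocab = build_vocab(catalog, k=2)
print(encode(["Actor B", "Actor Z"], vocab))  # [1.0, 0.0, 1.0]
```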

Question 3: Model selection

The interviewer asks, “What kind of model would you use for watch time prediction, considering that we need to accommodate old and new users?”

The simplest approach here would be to train a single monolithic model on all of the features available to us. This lets us aggregate information across different demographic groups, which gives us more learning signal, but it may sacrifice performance on smaller subgroups that we care about (e.g. less populous countries). We can opt to train more demographic-specific models, but we risk overfitting if these groups get very small. At the smallest level, we would like to model a specific user's individual interactions, but many users don't have enough data points to train a reasonably accurate model.
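
To make the fallback between these levels concrete, here is a toy policy; the thresholds and model names are illustrative assumptions, not production values:

```python
# Toy fallback policy matching the tradeoff above: prefer the most
# specific model that has enough training data behind it. Thresholds
# are illustrative assumptions.
MIN_USER_EXAMPLES = 500
MIN_GROUP_EXAMPLES = 50_000

def pick_model(user_examples: int, group_examples: int) -> str:
    if user_examples >= MIN_USER_EXAMPLES:
        return "per-user model"
    if group_examples >= MIN_GROUP_EXAMPLES:
        return "demographic-specific model"
    return "global monolithic model"

print(pick_model(user_examples=120, group_examples=80_000))
# -> "demographic-specific model"
```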

Question 4: Unsupervised methods

The interviewer asks, “Can you leverage the scale of data to model user behavior in an unsupervised way?”

An unsupervised approach such as collaborative filtering relies on measuring similarity between users and using that similarity to predict the watch time. For example, if user A is very similar to user B and user B watched this particular movie for a long time, then you might predict that user A will watch for a similar amount of time.
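
Here is a minimal user-based sketch of that idea, using a toy dense matrix of minutes watched; a real system would use sparse storage and precomputed neighbors:

```python
import numpy as np

# Predict a user's watch time for an item as a similarity-weighted
# average over other users who watched it. Rows = users, cols = items,
# values = minutes watched, 0 = unwatched (toy data).
def predict_watch_time(ratings, sims, u, item):
    watched = ratings[:, item] > 0            # users who saw this item
    watched[u] = False                        # exclude the target user
    w = sims[u, watched]
    if w.sum() == 0:
        return ratings[ratings > 0].mean()    # fall back to a global mean
    return np.dot(w, ratings[watched, item]) / w.sum()

ratings = np.array([[30.0,  0.0, 45.0],
                    [25.0, 60.0, 40.0],
                    [ 0.0, 55.0,  0.0]])
unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
sims = unit @ unit.T                          # users-by-users cosine similarity
print(predict_watch_time(ratings, sims, u=2, item=0))  # ~25 minutes
```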

Question 5: Embeddings

The interviewer asks, “How would you build embeddings for users and items to measure similarity over?”

This representation would most likely be a large, sparse N-users-by-M-items matrix, where we fill in a value whenever a user has watched a specific item. The simplest approach would be to set entry (i, j) to 1 if user i has watched item j, but this binary signal obviously isn't very helpful for predicting actual watch time.

Another option is to instead store the actual watch time in each entry, which lets us calculate similarity metrics based on watch times and use them to make predictions for new users.

Other options could include values of -1/0/1 to indicate whether a user disliked, didn't rate, or liked an item. We can integrate other signals into additional matrices as well.
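
As a sketch of the watch-time variant, a sparse format such as SciPy's CSR stores only the observed triples; the tiny log below is hypothetical:

```python
from scipy import sparse

# N-users x M-items matrix holding watch times; only the observed
# (user, item, hours) triples are stored, all other cells stay empty.
logs = [(0, 2, 1.5), (0, 7, 0.3), (1, 2, 2.0)]   # (user, item, hours)
users, items, hours = zip(*logs)
R = sparse.csr_matrix((hours, (users, items)), shape=(3, 10))
print(f"{R.nnz} stored values out of {R.shape[0] * R.shape[1]} cells")
print(R[0, 2])                                   # 1.5 hours: user 0, item 2
```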

Question 6: Matrix storage

The interviewer asks, “How do you anticipate storing this matrix?”

Storing this matrix is difficult because it's very large, but since it's also very sparse, we only need to store a few values per user. We can also shard the data so that we only perform a similarity search over the small set of users/items we expect our user to be most similar to.
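
Here is a toy sharding sketch; using country as the shard key is an illustrative assumption, and any coarse grouping that clusters likely neighbors would do:

```python
from collections import defaultdict

# Shard the matrix rows by a coarse key so that similarity search for
# one user only scans the shard holding their most likely neighbors.
shards = defaultdict(dict)                  # country -> {user_id: sparse row}

def add_user(country, user_id, watched):    # watched: {item_id: hours}
    shards[country][user_id] = watched

def candidate_neighbors(country, user_id):
    return {uid: row for uid, row in shards[country].items() if uid != user_id}

add_user("BR", "u1", {42: 1.5, 7: 0.3})
add_user("BR", "u2", {42: 2.0})
add_user("JP", "u3", {99: 1.1})
print(candidate_neighbors("BR", "u1"))      # only u2; u3 is never scanned
```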

Question 7: Similarity metrics

The interviewer asks, “What kind of similarity metrics would you use to compare users?”

A few canonical similarity metrics include Euclidean distance, Jaccard distance, cosine similarity, and Pearson correlation.

  1. Euclidean distance is the standard distance metric, but it doesn't match the feature space we're considering: it penalizes differences in watch time and would prioritize users with few items watched, since their mostly-zero vectors sit close to everything.
  2. Jaccard distance is the size of the intersection divided by the size of the union, i.e. a measure of overlap between two users. This works great for binary representations.
  3. Cosine similarity measures the angle between two vectors and discards length information. It works for any vectors and captures direction: it is maximally small when vectors point in opposite directions (one user liked what another disliked).
  4. Pearson correlation measures how linearly correlated two numerical vectors are; equivalently, it is cosine similarity after centering each vector by its mean.

Each metric makes a few assumptions about the kind of data being compared.

  1. Euclidean distance is generally not good for collaborative filtering problems, since we assume very sparse features and matrices.
  2. Jaccard distance requires binary representations and can only compare one-hot features. We can of course turn any continuous representation into a one-hot one by thresholding.
  3. Cosine similarity works well for continuous embedding spaces where we have captured some notion of positive vs. negative signal (like vs. dislike).
  4. Pearson correlation works well when we might expect a linear relationship between vectors (i.e. we want to measure similarity in magnitude as well).

For this particular problem, Pearson correlation makes the most sense, since we're predicting watch time (a continuous value) and we expect two people with similarly large or small watch times for a show to have similar profiles.
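
A minimal sketch of Pearson similarity between two users, computed only over the items both have watched, since unobserved cells shouldn't count as zero watch time:

```python
import numpy as np

# Pearson correlation over co-watched items (nonzero in both vectors).
def pearson_sim(u, v):
    mask = (u > 0) & (v > 0)
    if mask.sum() < 2:
        return 0.0                     # not enough overlap to correlate
    a, b = u[mask], v[mask]
    a, b = a - a.mean(), b - b.mean()  # center each user's watch times
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

u = np.array([2.0, 0.0, 1.0, 3.0])     # hours watched per item
v = np.array([4.0, 1.0, 2.0, 6.0])
print(pearson_sim(u, v))               # ~1.0: proportional watch times
```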

Question 8: Supervised vs. unsupervised

The interviewer asks, “What tradeoffs does the unsupervised approach make compared to the supervised approach?”

Originally, we were motivated to use an unsupervised approach so we could get a personalized model for each user without training a separate supervised model per user. Collaborative filtering helps us avoid overfitting, since we measure similarities across all users, and it makes it easy to add new features or users. We can also easily add tasks or other signals to make other predictions, such as what a user might watch next based on their last watched TV show/movie, by measuring similarity between items.

However, we do need to build more infrastructure to perform large-scale similarity comparisons across our users and store the sparse matrices, and we also have a significant cold-start issue where new users do not have any information that can be used to predict watch time.

A simple way to combine the supervised and unsupervised approaches is to use a latent factor model. The latent factor model is a relatively simple linear model that learns a mean term across all users and items, a residual term per item and per user (e.g. how much more than the mean users typically watch this item, or this user watches in general), and a similarity term (typically a dot product between learned user and item vectors). The latent factor model also lets us model time by having the latent factors depend on certain temporal features, and lets us take the same approach to add interactions between users (e.g. social graph information).
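
Here is a minimal sketch assuming the standard latent factor formulation (global mean plus user/item residuals plus a dot product of learned vectors); the dimensions, learning rate, and regularization are illustrative:

```python
import numpy as np

# Latent factor model: predicted watch time = global mean + per-user
# residual + per-item residual + user-vector . item-vector.
rng = np.random.default_rng(0)
n_users, n_items, d = 100, 50, 8
lr, reg = 0.01, 0.02
b_u, b_i = np.zeros(n_users), np.zeros(n_items)
P = 0.1 * rng.standard_normal((n_users, d))     # user vectors
Q = 0.1 * rng.standard_normal((n_items, d))     # item vectors

data = [(0, 3, 1.5), (0, 7, 0.4), (5, 3, 2.1)]  # (user, item, hours)
mu = np.mean([w for _, _, w in data])           # global mean term

def predict(u, i):
    return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

for _ in range(200):                            # SGD over the toy data
    for u, i, w in data:
        err = w - predict(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

print(predict(0, 3))                            # moves toward 1.5 hours
```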

Question 9: Deployment

The interviewer asks, “What issues might you run into during deployment of this system in production and how would you fix them?”

There are a lot of issues that come up in production, most of which stem from temporal changes.

  1. Shows change in popularity depending on the time of year or even the day of the week (e.g. Christmas, Valentine's Day).
  2. Depending on changes to the UI, the way users interact with Netflix could change which items should be prioritized to increase watch time.
  3. User preferences shift over time, and trends come and go regularly; our model needs to keep up with virality and the fast pace of social media, both locally and globally.

Depending on how severe these factors are, we may need to retrain the model on new data or change how we integrate fast-moving trend information. This process relies on collecting high-quality evaluation data and regularly testing our model on new data to see whether dips in performance are due to temporary effects (e.g. UI changes, seasonality) that can be integrated into the model, or whether they indicate a distribution shift that would require retraining or adjusting the model.
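
As a toy illustration of that monitoring loop, the check below compares recent error against a trailing baseline; the window sizes and 15% threshold are assumptions, and a real pipeline would also slice metrics by region, device, and so on:

```python
import numpy as np

# Flag a possible regression when recent prediction error exceeds the
# trailing baseline by more than a tolerance.
def drift_alert(errors, baseline_days=28, recent_days=7, tol=0.15):
    errors = np.asarray(errors, dtype=float)     # one MAE value per day
    baseline = errors[-(baseline_days + recent_days):-recent_days].mean()
    recent = errors[-recent_days:].mean()
    return recent > baseline * (1 + tol)

daily_mae = [0.50] * 30 + [0.62] * 7             # hypothetical eval metric
print(drift_alert(daily_mae))                    # True: recent error jumped
```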

Conclusion

This interview covered many different systems and the tradeoffs inherent in choosing specific kinds of models. As in any interview, many aspects of the methods were glossed over or not discussed in detail. System design interviews are tough, and the direction an interview takes can depend heavily on the interviewer. In general, try not to rush your decisions, and justify your thinking with nuanced discussion.