
Design a Spotify Recommendation System

Below is a supplemental solution that utilizes our interview framework. Check out the mock video and read through the written solution to evaluate how you would structure your answer.

ML Spotify Mock

Step 1: Define the problem

Identify the core ML learning problem

We are trying to build an ML-based recommender system for Spotify that recommends artists to users based on their liked playlists, songs, and artists.

The success of this system will depend on user engagement, defined here as the number of clicks. If a user clicks on a recommendation, that's a point in the algorithm's favor. If they don't, we can agree it was a bad recommendation.

We can go deeper and assess the amount of time they engaged with the recommendation, but to keep things simple for now, let's go with just a click.

Clarify requirements and tradeoffs

Two clarifying questions:

  1. What kind of raw data do we already have access to, and do we need to collect any raw data?
  2. What is the condition of the raw data?

We’ll assume we have click data from users as one data source. The other source will be metadata about the users, e.g. age group, location, and other account information.

Understanding the condition of the raw data helps us plan for what kind of pipeline and transformations are needed to get the data into a usable format. Let’s assume we get click data in a JSON-serialized format. These are usually events that come in and land in an object store. The user metadata is a bit simpler, as it's available directly in the Postgres account table. However, we must keep in mind that it contains PII, so it will have to be handled carefully.
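As a rough illustration, a deserialized click event and an account-table row might look something like the following; the field names are hypothetical assumptions, not a known Spotify schema:

```python
# Hypothetical shapes of the two raw data sources; all field names are assumptions.
click_event = {
    "event_id": "evt-123",
    "user_id": "user-42",
    "timestamp": "2024-01-15T18:03:21Z",
    "item_type": "artist_recommendation",
    "item_id": "artist-777",
    "clicked": True,            # whether the user clicked the recommendation card
    "listen_duration_sec": 95,  # how long the recommended song was played, if at all
}

account_row = {
    "user_id": "user-42",
    "date_of_birth": "1994-06-02",    # PII: only used to derive an age group, then dropped
    "email": "user@example.com",      # PII: masked during cleaning
    "location": (40.7128, -74.0060),  # coordinates, later parsed into city/state
}
```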

Step 2: Design the data processing pipeline

Having clarified the data sources and their condition in the previous step, we’re ready to design a data processing pipeline. We’ll pull the raw click data from the object store and the account information from the Postgres table, and then create our features from them.

To collect and process the data, we’ll have to decide between a batch-based and a real-time solution. A batch-based system is usually easier to manage, whereas inference and training in real time are compute-intensive and expensive. It’s usually better to have at least one of the two run in batch, preferably training (as this takes the most time). However, we can do inference in real time if needed.

Ideally, both training and inference would run in batch, with a serverless job pulling the latest recommendations that the batch job stored in a cache. This way, the recommendations are available at all times but refresh every few hours. For this scenario, we’ll use a batch-based system for both training and inference.

Since we have click data coming in as JSON events landing in an object store, we’ll design the data pipeline as an ETL pipeline. We’ll create an abstracted data model to illustrate how we want our data to look in the end, before feeding it into the model. Generally, we want our features to be as independent of one another as possible, because this prevents complicated correlations between features from creeping in.
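A minimal sketch of that target data model, assuming the four features described in the list below (the class and field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class UserFeatureRecord:
    """Illustrative target record produced by the ETL pipeline, one row per user."""
    user_id: str
    age_group: str                    # e.g. "25-34", derived from date of birth
    location: Dict[str, str]          # e.g. {"city": "Austin", "state": "TX"}
    favorite_artists: List[Dict[str, Any]] = field(default_factory=list)  # up to 100 artist maps
    recent_songs: List[Dict[str, Any]] = field(default_factory=list)      # up to 100 song maps
```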

We’ll take the following feature engineering steps:

  1. Read the data in its raw format
  2. Deserialize it
  3. Define the four features we want to end up with:
    1. Age group
    2. Location (city and state, within the United States)
    3. Array of the user's favorite artists, capped at the 100 most recent to keep it simple. Each element in the array is a map with artist info (artist name, number of active days, trending rank, genre, number of followers, etc.)
    4. Array of the last 100 listened-to songs, where each element is a map with song info (song name, artist name, number of active days, trending rank, genre, number of followers, number of likes, average listen time for the song, standard deviation of listen time for the song, etc.). To keep the listen-duration metric simple, we can categorize it as full, partial, or skipped.
  4. Fetch the fields from the deserialized JSON records
  5. Clean them in preparation for feature engineering
    1. Mask PII (date of birth, full names, emails, etc.)
    2. Parse location (convert from coordinates to city, state)
    3. Discard fields we don’t need. For example, the only PII we need is the user ID and the user's date of birth, which we use to categorize them into an age group
    4. Normalize fields: convert everything to lowercase, remove spaces and punctuation, remove any noise, deduplicate records, and format timestamps correctly.
    5. Fetch the artist and song details from the JSON array to create the arrays of songs/artists mentioned above.
    6. After cleaning, the data will land in a Postgres database. It will contain information on the click event that happened. This information usually comes from card elements in the UI, which contain the title of the song, the artist, the artist's ranking, the genre of music, the song duration, how long the song was played for, volume levels throughout the song, and more.

We’ll store all of these features in a new table, and then write them to a feature store for model consumption.
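A minimal sketch of the cleaning and feature-engineering transformations described above, using the hypothetical record shapes from earlier; the helper names and thresholds are assumptions for illustration:

```python
import hashlib
from datetime import date

FULL_THRESHOLD = 0.9      # assumed cutoff for a "full" listen
PARTIAL_THRESHOLD = 0.25  # assumed cutoff for a "partial" listen

def mask_pii(value: str) -> str:
    """Replace a PII field with a one-way hash so it can't be read downstream."""
    return hashlib.sha256(value.encode()).hexdigest()

def to_age_group(date_of_birth: date, today: date) -> str:
    """Bucket date of birth into a coarse age group; the raw value is then discarded."""
    age = today.year - date_of_birth.year - (
        (today.month, today.day) < (date_of_birth.month, date_of_birth.day)
    )
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

def normalize_text(text: str) -> str:
    """Lowercase and strip punctuation/extra spaces so artist and song names deduplicate cleanly."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def categorize_listen(listened_sec: float, song_length_sec: float) -> str:
    """Collapse listen duration into the simple full / partial / skipped categories."""
    ratio = listened_sec / song_length_sec if song_length_sec else 0.0
    if ratio >= FULL_THRESHOLD:
        return "full"
    return "partial" if ratio >= PARTIAL_THRESHOLD else "skipped"
```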

Step 3: Propose a model architecture

Select and justify the model

Now that we’ve created a data pipeline, we’ll consider the types of models typically used for recommendation systems. Traditionally, recommendation systems take advantage of data from other users to recommend items to new or even existing users. This is known as collaborative filtering, which can become a challenge if there is a lack of data from other users (the cold-start problem). These days, recommendation systems also increasingly involve deep learning as well as traditional supervised techniques like decision trees and XGBoost, so there’s a huge library of paths to choose from.

Select a model architecture

To satisfy our current use case, let’s start with a simple architecture. Assuming we have the required data, we’ll move forward with collaborative filtering; with music, trends traditionally develop through mutual sharing between listeners.

The simplest model we can select essentially creates a feature vector for each user. Each feature vector is keyed by a unique user ID and is composed of the user’s features (age group, location, array of favorite-artist maps, array of favorite-song maps).

We’ll score each of these vectors between -1 and 1. This scoring method consolidates each vector into a single number that represents a user and their preferences. We’ll also score each item we might recommend between -1 and 1, depending on its popularity and number of plays. This puts users and items on the same scale (normalization) so they can be compared directly.

We’ll then organize these scores into a user-item matrix, with each user on a row and each item on a column, and compute the product of each user’s score with each candidate song’s score. Finally, we’ll set a threshold between -1 and 1.

Depending on how close the product is to 1, we’ll decide if we want to provide that item as a recommendation to the user. If we want to give very specific and limited recommendations, we can set the threshold high, and vice versa. Generally, it’s better to start with a low threshold to collect as much information as possible. Then, we can begin to pinpoint the optimal threshold value for future recommendations.
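As a rough numerical sketch of this idea (the scores and threshold below are made up for illustration; a production system would learn these rather than hard-code them):

```python
import numpy as np

# Hypothetical per-user preference scores and per-item popularity scores, both in [-1, 1].
user_scores = np.array([0.8, -0.3, 0.1])        # one score per user
item_scores = np.array([0.9, 0.2, -0.7, 0.5])   # one score per candidate artist/song

# Outer product gives a user-item matrix: rows are users, columns are items.
user_item = np.outer(user_scores, item_scores)

# Recommend items whose product clears the threshold; start low to gather feedback.
threshold = 0.1
recommendations = {
    user_idx: np.where(user_item[user_idx] >= threshold)[0].tolist()
    for user_idx in range(user_item.shape[0])
}
print(recommendations)  # {0: [0, 1, 3], 1: [2], 2: []}
```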

Step 4: Train and evaluate the model

Train the model

To create training inputs, we’ll take the processed data, encode the non-numerical data, and featurize the rest. The training will produce a user-item matrix, which is then used to create a probabilistic prediction of whether to recommend an item to the user.

The user is then presented with these recommendations. If a user clicks on any recommendation, the click data is collected as positive feedback; any recommended items that were not clicked are considered negative feedback. The number of clicks over the total number of recommendations serves as the accuracy metric for the model.
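A quick sketch of that click-through accuracy metric, assuming we log one row per served recommendation with a boolean flag:

```python
def click_through_rate(served: list) -> float:
    """Clicks over total recommendations served; each row has a boolean 'clicked' field."""
    if not served:
        return 0.0
    return sum(1 for rec in served if rec["clicked"]) / len(served)

# Example: 2 clicks out of 4 recommendations -> 0.5
print(click_through_rate(
    [{"clicked": True}, {"clicked": False}, {"clicked": True}, {"clicked": False}]
))
```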

Evaluate the model

Once we’ve established the accuracy metric, we’ll compare the features of the positive recommendations with those of the negative recommendations. This difference will indicate whether certain features played a larger role than others in affecting user behavior. That signal can then be used to create a feature-weighting algorithm that learns to weigh features better. Consequently, the collaborative filtering algorithm will also improve.
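One simple way to surface that difference, assuming the features have already been numerically encoded, is to compare mean feature values between clicked and non-clicked recommendations (a rough heuristic, not the only option):

```python
import numpy as np

def feature_gap(positive: np.ndarray, negative: np.ndarray) -> np.ndarray:
    """Per-feature difference in means between clicked and non-clicked recommendations.

    Larger absolute values suggest a feature separated the two groups more strongly,
    which can seed an initial feature-weighting scheme.
    """
    return positive.mean(axis=0) - negative.mean(axis=0)

# Example with 3 encoded features per recommendation.
clicked = np.array([[0.9, 0.2, 0.5], [0.8, 0.1, 0.6]])
not_clicked = np.array([[0.3, 0.2, 0.1], [0.2, 0.3, 0.2]])
print(feature_gap(clicked, not_clicked))  # approximately [0.6, -0.1, 0.4]
```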

Step 5: Deploy the model

The last step in this process is to understand when and how best to deploy our model into production.

First, we’ll define the appropriate metrics, which we previously discussed as engagement. Then, we can roll out an A/B test plan for this model to understand whether it improves the user experience.

Second, we’ll need to understand the compute and storage resources we have to train, test, validate, and run inference. Let’s say we are using a cloud platform like AWS. We can take advantage of AWS SageMaker (to house, train, and test the model), Lambda (to serve recommendation requests), and ElastiCache (to store the recommendations), and provide the results back to the application via an API endpoint. We can then auto-scale these resources to handle changing volumes of traffic from the application.
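A minimal sketch of what the serving piece could look like: a Lambda-style handler that reads precomputed recommendations from a Redis-backed ElastiCache cluster. The endpoint, key layout, and response shape here are assumptions, not a prescribed setup:

```python
import json
import os

import redis  # assumes the redis client library is packaged with the Lambda deployment

# Hypothetical ElastiCache (Redis) endpoint, supplied via environment configuration.
cache = redis.Redis(host=os.environ.get("RECS_CACHE_HOST", "localhost"), port=6379)

def handler(event, context):
    """Return the latest batch-computed recommendations for a user, if any are cached."""
    user_id = event["pathParameters"]["user_id"]
    cached = cache.get(f"recs:{user_id}")  # key layout is an assumption
    recommendations = json.loads(cached) if cached else []
    return {
        "statusCode": 200,
        "body": json.dumps({"user_id": user_id, "recommendations": recommendations}),
    }
```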

Step 6: Wrap up

To recap, we’ve just designed a high-level system to recommend artists on Spotify. We first identified our data sources as user metadata and click data. We then opted for a batch-based system to process the data, used a collaborative filtering model to score each user’s feature vectors, and collected click data to train the model. We then discussed the factors affecting model deployment, such as engagement and compute and storage resources.

The other consideration worth shedding additional light on is post-production work. Machine learning is very dynamic, since incoming data changes constantly. This affects the model and its performance, so it’s important to have monitoring and observability for model drift, data drift, and feature drift. It’s essential to observe the model's performance to ensure that we are still meeting our metric. We can check model performance continually by observing the metric we are testing against (engagement) along with downstream signals such as churn.
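As one concrete example of drift monitoring, a Population Stability Index (PSI) check compares a feature's recent distribution against its distribution at training time; the binning and alert threshold below are common defaults rather than fixed rules:

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a feature's training-time distribution and its recent production distribution."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero / log of zero for empty bins.
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# A PSI above roughly 0.2 is a common signal that the feature has drifted
# and the model may need retraining.
rng = np.random.default_rng(0)
print(population_stability_index(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))
```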