
Design an App Suggestion System for Phones


In this mock interview, Kevin (ML Engineer @ Reddit, ex-Meta, ex-Amazon) answers the prompt, "Design an app suggestion system for phones".

This problem is very open-ended and can be approached in different ways. One key challenge is clearly defining and narrowing the requirements before starting. Another difficulty is obtaining training data—for example, if we’re recommending apps users have installed, how do we identify negative examples (apps they wouldn’t use)?

In this mock, the focus is on recommending apps that users have already installed on their phones, not suggesting new apps to download from the app store.

Key concepts to review for this problem include designing ranking systems, building ML models for recommendations and rankings, and evaluating ranking systems both offline and online.


Define problem and scope

The interviewer wants to see how you approach an open-ended question and turn it into a machine learning problem. Start by asking clarifying questions to ensure you and the interviewer are aligned and to narrow the scope to something manageable within the time frame. Then, define the business metrics you aim to optimize, and finally, translate those into a solvable ML problem.

Your clarifying questions should guide your solution. For example, if you ask about scale, your solution should account for it rather than treating it as a simple “checklist” item.

For senior candidates, it’s important to lead the discussion by making assumptions and confirming them with the interviewer. Less senior candidates might directly ask the interviewer for clarification instead.

Clarifying questions:

Are we recommending apps to download or apps to use? This is a crucial question because it changes the entire problem. Recommending apps to download from an app store involves tens or even hundreds of thousands of options, while recommending apps a user has already installed typically involves only 25–50 apps. This dramatically affects the scale of the problem.

For this mock, assume we are recommending apps that users have already downloaded and might want to use.

Can we assume the average user has 25–50 apps installed? This assumption helps narrow down the problem’s scale. Recommending an app from a small pool of 25–50 options is much simpler than choosing from thousands or more.

Are we recommending apps for specific, relevant moments? For example, should we recommend Spotify when the user is at the gym? This helps define the focus of the recommendations and clarify the business and machine learning goals.

Goals:

Business goals: These are the high-level objectives for the business, typically focused on improving key metrics like engagement with the phone or apps. Examples include increasing daily active users, boosting app downloads, or enhancing the overall phone experience. All these metrics likely tie back to revenue—for instance, more app downloads generate revenue for app makers, and a better phone experience leads to more phones sold.

ML goals: Based on the business goals, the ML objective is to frame this as a problem to solve with machine learning. For example, the goal might be to recommend apps that increase engagement, such as improving click rates.

Additional considerations:

  • User Satisfaction: A senior candidate might point out the importance of maintaining a good user experience, avoiding fatigue from too many or irrelevant recommendations.
  • Privacy: An even more senior candidate might address privacy concerns, noting that users may not want their personal data used for training. This could lead to exploring solutions like federated learning to protect user data.

Scale:

In this problem, scale isn’t a major limitation because we’re not recommending apps from the app store. Instead, we’re focusing on the ~25–50 apps each user has already installed.

Let’s assume there are about 500 million smartphone users.

In this scenario, the model would be stored on each user’s phone. This reduces the need to handle large-scale processing centrally, except for sending updates to users’ models.

However, scale does come into play when training the model. With data from 500 million users, and assuming each user has 25–50 apps and spends 4–5 hours per day on their phone, we have a massive dataset to work with during training.
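To make that concrete, here is a rough back-of-envelope estimate; the sessions-per-day figure is an assumption added for illustration, not something stated in the mock:

```python
# Back-of-envelope estimate of daily training data volume.
users = 500_000_000       # ~500M smartphone users (assumption from above)
sessions_per_day = 10     # hypothetical: ~10 phone sessions per user per day
apps_per_user = 30        # within the assumed 25-50 installed apps

# One label per installed app per session (opened vs. not opened).
examples_per_day = users * sessions_per_day * apps_per_user
print(f"{examples_per_day:.1e} labeled examples per day")  # 1.5e+11
```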

High level design

For most recommendation and ranking problems, the process typically involves three main steps:

  1. Candidate generation: Narrow down a large pool of items to a smaller set of potential recommendations.
  2. Scorer: Assign a relevance score to each candidate based on various factors.
  3. Re-ranking: Refine the final order of recommendations based on additional criteria, like user preferences or business priorities.
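As a rough sketch, the three stages compose like this; the function bodies below are placeholders, since the scorer and re-ranker are designed in the sections that follow:

```python
import random
from typing import List, Tuple

def score(user: dict, app: str) -> float:
    # Placeholder: in practice, the ML model's predicted probability
    # that this user opens this app right now (see "Scorer" below).
    return random.random()

def rerank(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    # Placeholder: in practice, adjust for freshness, fatigue, and
    # fairness before the final sort (see "Reranker" below).
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def recommend_apps(user: dict, installed_apps: List[str], k: int = 4) -> List[str]:
    candidates = installed_apps                               # 1. candidate generation
    scored = [(app, score(user, app)) for app in candidates]  # 2. scoring
    return [app for app, _ in rerank(scored)[:k]]             # 3. re-ranking
```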

Low level design

Candidate generation

In this case, candidate generation isn’t necessary because we’re only ranking between 25–50 items. A senior candidate would recognize this step is redundant for such a small pool.

However, if we were ranking apps for users to download from the app store, where there are thousands of options, candidate generation would be an essential step to narrow down the choices.

Scorer

There are several engagement metrics we could optimize, such as:

  • Likelihood the user clicks the app recommendation.
  • Likelihood the user does not downvote the recommendation.
  • Time spent on the recommended app.

For simplicity, we’ll focus on the likelihood that the user clicks the app.

Choosing a model

There are various models to use for the scoring step. It’s helpful to list a few, explain their pros and cons, and then decide which one to use based on the problem requirements.

Logistic regression

Pros:

  • Simple, fast, and interpretable.
  • Small model size, easy to store on a user’s phone.

Cons:

  • Assumes a linear relationship between features and the target.
  • Cannot capture non-linear patterns.

Ensemble trees (e.g., random forest)

Pros:

  • Fast and interpretable.
  • Can capture non-linear patterns.
  • Requires little storage (just a few MBs).

Cons:

  • Cannot use embedding features (e.g., text descriptions).
  • Cannot handle multiple labels (e.g., predicting different engagement types).

Neural networks (NN)

Pros:

  • Can handle embeddings for features like text descriptions of apps.
  • Can predict multiple engagement types (e.g., likelihood to open, time spent).

Cons:

  • Requires large amounts of data to train effectively.
  • Slower and needs more memory.
  • Less interpretable (“black-box” nature).

Final choice

Although we’re focusing on the likelihood of the user clicking an app, it’s worth noting that companies often optimize for multiple engagement metrics in real-world scenarios.

Given the scale of this problem (500M users using their phones 4–5 hours per day), we have abundant training data. This makes a neural network the most suitable choice, as it can leverage the data to capture complex patterns and handle multiple engagement metrics if needed.
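A minimal PyTorch sketch of such a scorer follows; the architecture, embedding sizes, and feature dimensions are illustrative assumptions rather than anything specified in the mock:

```python
import torch
import torch.nn as nn

class AppClickScorer(nn.Module):
    """Tiny MLP predicting P(user opens app | user, app, context)."""

    def __init__(self, num_apps: int = 100_000, app_dim: int = 16, dense_dim: int = 32):
        super().__init__()
        # Learned embedding per app ID; dense_dim covers the user,
        # context, and temporal features after preprocessing.
        self.app_embedding = nn.Embedding(num_apps, app_dim)
        self.mlp = nn.Sequential(
            nn.Linear(app_dim + dense_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # one logit; sigmoid turns it into a probability
        )

    def forward(self, app_ids: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.app_embedding(app_ids), dense], dim=-1)
        return self.mlp(x).squeeze(-1)

# Score one user's ~30 installed apps in a single batch.
model = AppClickScorer()
app_ids = torch.randint(0, 100_000, (30,))
dense = torch.randn(30, 32)
click_probs = torch.sigmoid(model(app_ids, dense))
```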

Reranker

Here, we need to consider factors like freshness, fairness, user fatigue, and the potential boosting of new apps that haven’t been used much yet. For example, we may want to avoid recommending the same apps repeatedly, in order to keep recommendations fresh, reduce user fatigue, and ensure fairness.
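A minimal sketch of these adjustments follows; the multipliers are made-up knobs that would be tuned through experimentation in practice:

```python
from typing import List, Set, Tuple

def rerank(scored: List[Tuple[str, float]],
           recently_shown: Set[str],
           new_apps: Set[str],
           fatigue_penalty: float = 0.8,    # hypothetical values;
           freshness_boost: float = 1.2):   # tuned via experiments
    """Adjust raw model scores for fatigue and freshness, then sort."""
    adjusted = []
    for app, s in scored:
        if app in recently_shown:
            s *= fatigue_penalty   # shown recently: damp to reduce fatigue
        if app in new_apps:
            s *= freshness_boost   # under-exposed app: boost for freshness
        adjusted.append((app, s))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

print(rerank([("maps", 0.9), ("music", 0.7), ("new_game", 0.6)],
             recently_shown={"maps"}, new_apps={"new_game"}))
# maps and new_game both end up near 0.72; music stays at 0.7.
```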

A more senior candidate would likely mention challenges like cold start (recommending new apps with little or no usage data) and bias: since the training data mostly includes frequently used apps, newer apps rarely appear as positive examples in the training set.

Data

Labels

A tricky part of this problem is how to define and collect the labels. If we only consider labels from apps we recommend to users, the dataset may be limited. For instance, negative labels would only come from the apps we recommended but the user didn’t click. This also introduces a bias because we only get labels after recommending an app.

If we use whether the user opened the app as a label, we still need to figure out how to define negative labels.

Here’s one way we might define the labels:

  • Positive labels: The user opens the app.
  • Negative labels: The user is on their phone but does not open the app during that session.

However, this approach may lead to a significant data imbalance, which we’ll discuss next.
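A sketch of how these labels could be derived from a single phone session; the log format here is hypothetical:

```python
from typing import List, Set, Tuple

def session_labels(installed_apps: List[str],
                   opened_in_session: Set[str]) -> List[Tuple[str, int]]:
    """One (app, label) pair per installed app for a single session.

    Positive (1): the user opened the app during the session.
    Negative (0): installed but not opened during the session.
    """
    return [(app, int(app in opened_in_session)) for app in installed_apps]

labels = session_labels(
    installed_apps=["maps", "messages", "music", "camera", "banking"],
    opened_in_session={"maps", "messages"},
)
# [('maps', 1), ('messages', 1), ('music', 0), ('camera', 0), ('banking', 0)]
# Note: negatives dominate, which is the class imbalance discussed below.
```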

Features

For this problem, we want to include a variety of feature types to build a strong model. We don’t need to list every single feature, but it’s important to show that we understand the key categories of features that can be used. These could include app-related, user-related, user-app interaction-related, and temporal features. This demonstrates to the interviewer that we know what types of features are relevant, and with more time, we could develop a more detailed list.

Examples:

  • User & Context: Demographics (age, gender), general location (e.g., city or state), whether the user is on Wi-Fi or mobile data, frequently used apps.
  • Temporal: Time of day, day of the week, week of the year, etc. Seasonality could influence app usage (e.g., holiday-related apps).
  • Geographic: GPS location.
  • Apps & User App Interactions: App category, subcategory, app ratings, app developer, and historical user app interactions (e.g., number of times opened in the last X days, average time spent on the app in the last X days).
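As one concrete example, the historical user-app interaction features could be rolled up from a raw usage log; the log schema below is hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def interaction_features(usage_log, now, window_days=7):
    """Aggregate (app, timestamp, seconds_used) events into per-app
    features: opens in the window and average seconds per open."""
    cutoff = now - timedelta(days=window_days)
    opens, seconds = defaultdict(int), defaultdict(float)
    for app, ts, secs in usage_log:
        if ts >= cutoff:
            opens[app] += 1
            seconds[app] += secs
    return {app: {"opens": opens[app], "avg_seconds": seconds[app] / opens[app]}
            for app in opens}

log = [
    ("maps", datetime(2024, 5, 30), 120.0),
    ("maps", datetime(2024, 5, 31), 60.0),
    ("music", datetime(2024, 5, 1), 600.0),  # falls outside the 7-day window
]
print(interaction_features(log, now=datetime(2024, 6, 1)))
# {'maps': {'opens': 2, 'avg_seconds': 90.0}}
```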

Feature preprocessing/engineering

In this section, we want to show the interviewer that we understand how to preprocess and handle different types of features effectively. It’s important to mention the pros and cons of each approach and demonstrate that we know how to work with different feature types.

Missing data

  • If a feature has too much missing data (e.g., 80% missing), it might be best to drop the feature completely. We can also check if the missing values are correlated with other features, though the correlation could be weak due to the missing data itself.
  • If only a few rows are missing: We can either drop the rows with missing data or fill them in. Filling missing data:
    • For categorical features, we can fill with the most frequent value (mode).
    • For numerical features, we can use the median or mean. The median is more robust to outliers.
    • We can also use KNN imputation, which fills missing values based on the values of nearby rows (see the sketch after this list).
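A short pandas/scikit-learn sketch of these strategies; the columns and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "app_category": ["social", None, "games", "social"],
    "opens_last_7d": [14, 2, 13, 1],
    "avg_daily_minutes": [35.0, 120.0, np.nan, 15.0],
})

# Categorical: fill with the most frequent value (mode).
df["app_category"] = df["app_category"].fillna(df["app_category"].mode()[0])

# Numerical, option 1: median fill (more robust to outliers than the mean).
median_filled = df["avg_daily_minutes"].fillna(df["avg_daily_minutes"].median())

# Numerical, option 2: KNN imputation fills from similar rows; here row 2
# resembles row 0 (similar open counts), so its minutes come out near 35.
knn_filled = KNNImputer(n_neighbors=1).fit_transform(
    df[["opens_last_7d", "avg_daily_minutes"]]
)
```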

Class imbalance

In our case, users are more likely to not open an app than to open it, which can cause class imbalance.

  • Downsampling. Pros: reduces data size; simple to implement. Cons: changes the data distribution; removes data, which could lead to loss of information.
  • Upsampling. Pros: increases data size; can balance classes. Cons: changes the data distribution; synthetic data (e.g., SMOTE) could introduce inaccuracies.
  • Class weights. Pros: adjusts for imbalance without changing the data. Cons: requires careful tuning; poorly tuned weights may bias the model.
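For instance, class weights can be applied directly in the loss function; the 30:1 ratio below is just the rough imbalance implied by one opened app out of ~25–50 installed apps per session:

```python
import torch
import torch.nn as nn

# Assume ~30 negatives (unopened apps) per positive (opened app).
# pos_weight up-weights the loss on positives to counter the imbalance.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(30.0))

logits = torch.tensor([2.0, -1.5, -3.0])  # raw scorer outputs
labels = torch.tensor([1.0, 0.0, 0.0])    # 1 = opened, 0 = not opened
loss = loss_fn(logits, labels)
```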

Feature engineering

Categorical features:

  • Use one-hot encoding if the feature has a low number of unique values (low cardinality).
  • Use label encoding for features with many unique values (high cardinality).
  • For advanced models, we can use embeddings.

Numerical features:

  • Normalize the features if the model is sensitive to scale (e.g., logistic regression, neural networks).

Text features (e.g., app descriptions):

  • Remove stop words and punctuation.
  • Use a pre-trained model to obtain an embedding.

Image features (e.g., app logos):

  • Resize the image to the model’s expected input dimensions.
  • Normalize pixel values and apply color scaling.
  • Use a pre-trained Convolutional Neural Network (CNN) to extract features from the image.

Interaction features:

  • If not using a neural network with an explicit interaction layer, we can manually create interaction features, such as multiplying two numerical features or taking the dot product of two embedding vectors (see the combined sketch below).
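Putting the categorical, numerical, and interaction-feature handling together, a scikit-learn sketch with hypothetical columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "app_category": ["social", "games", "finance", "social"],  # low cardinality
    "opens_last_7d": [14, 2, 5, 9],
    "avg_daily_minutes": [35.0, 120.0, 8.0, 15.0],
})

# Manual interaction feature: product of two numerical features.
df["opens_x_minutes"] = df["opens_last_7d"] * df["avg_daily_minutes"]

preprocess = ColumnTransformer([
    # One-hot encode the low-cardinality categorical feature.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["app_category"]),
    # Standardize numericals for scale-sensitive models (LR, neural nets).
    ("num", StandardScaler(),
     ["opens_last_7d", "avg_daily_minutes", "opens_x_minutes"]),
])

X = preprocess.fit_transform(df)  # ready for model training
```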

These preprocessing and feature engineering steps help make the data ready for modeling and improve the performance of the model.

Metrics

To evaluate if our model is effective, we focus on two main types of metrics: offline (for testing the model before deploying) and online (for monitoring its performance after deployment).

Offline metrics

Offline metrics help us determine if the model is working well before deploying it in production. These metrics tell us how good the model is on test data.

Common metrics for binary classification include:

  • Logloss: Measures how well the model’s predicted probabilities match the actual outcomes.
  • Normalized logloss: Logloss divided by the logloss of a naive baseline that always predicts the base rate. This makes comparisons between models easier, particularly on imbalanced datasets.
  • PRAUC (Precision-Recall AUC): Good for imbalanced datasets, measuring how well the model handles positive class predictions.
  • AUC (Area Under the ROC Curve): Measures overall model performance across different thresholds.

Make sure you understand the advantages of each metric, e.g., PRAUC is better for imbalanced datasets compared to AUC.
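Computing these on a held-out test set with scikit-learn could look like the following; the labels and predictions are dummy values:

```python
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score

y_true = [1, 0, 0, 1, 0, 0, 0, 0]                    # 1 = user opened the app
y_prob = [0.8, 0.3, 0.1, 0.6, 0.2, 0.05, 0.4, 0.15]  # predicted probabilities

print("logloss:", log_loss(y_true, y_prob))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
print("PR AUC (average precision):", average_precision_score(y_true, y_prob))

# Normalized logloss: model logloss divided by the logloss of a baseline
# that always predicts the base rate; below 1.0 means we beat the baseline.
base_rate = sum(y_true) / len(y_true)
baseline = log_loss(y_true, [base_rate] * len(y_true))
print("normalized logloss:", log_loss(y_true, y_prob) / baseline)
```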

Online metrics

Online metrics are used to monitor the model’s performance after it’s deployed in production. They help us make sure the model is doing what it’s supposed to.

Shadow testing:

  • Run the model in production but don’t use its results yet.
  • Log its predictions and compare online metrics (like logloss) with the offline results, while also watching for operational issues (like errors or slow responses).

A/B testing:

  • Test the model with a small group of users before fully deploying it.
  • Measure if the model improves business metrics, such as app downloads or user engagement.
  • Track guardrail metrics (e.g., revenue) to ensure no negative impact. If these metrics regress, the experiment should be stopped.