Skip to main content

Predict User App Deletion

HardPremium

Given a dataset that includes features such as time spent on and the number of interactions in-app, alongside the outcome of whether a user deletes our app, develop an ML solution to predict whether users will delete our app.

Instructions

Download the dataset here and try out this ML interview coding question on your local .ipynb environment. There are many ways to tackle this task but we encourage you to look at the hints below to ensure you are on the correct path.

Dataset

Features are grouped semantically into User, User-Behavioral, Context and App.

FeatureDescriptionData TypeCategory
F1CountrycategoricalUser
F2GendercategoricalUser
F3AgenumericUser
F4How did the user download the appcategoricalUser
F5Number interactions 7dnumericUser-Behavioral
F6Number of interactions 14dnumericUser-Behavioral
F7Number of interactions 30dnumericUser-Behavioral
F8Time spent on app - daily avgnumericUser-Behavioral
F9Time since last usenumericUser-Behavioral
F10Time of daynumericContext
F11Day of weeknumericContext
F12IOS or AndroidcategoricalContext
F13Version of OScategoricalContext
F14App versioncategoricalApp

Currently active users are indicated with Label: 1, while users who deleted the app in the last N days are indicated with Label: 0. Label is what we would like to predict.

This is the formatted dataset that we will be training our model on:

F1F2F3F4F5F6F7F8F9F10F11F12F13F14Label
USM17E1310452171I1721
UKU34E375613637A341
AR19P1912457124A520
874515I20
FP6199128620
USM17E1310452171I1221
UKU34E37813637I341
BR19P19257124A521
M64515I20
INFP6199128621

Don't forget to take a look at the dataset and do data cleaning if required e.g. handling missing values.

Check for label skew by looking at the empirical deletion rate. How would that affect your train/test/validation split, accuracy metrics and model training?

What type of model do we need? Since our label only has 2 values, we should use a binary classification model.

How to split train/test/validation sets? Consider a time-based split to prevent data leakage and ensure that the model learns from historical patterns to accurately predict future outcomes, capturing inherent temporal dynamics and weekly patterns.

What metrics do we use to measure performance of a binary model? Consider AUC and calibration.

Don't forget to preprocess the features to address basic data representation and scaling issues e.g. normalization for numerical features, encoding categorical features.

Can you develop a solution employing a deep neural network approach?