Predict User App Deletion
Given a dataset that includes features such as time spent in-app and the number of in-app interactions, alongside the outcome of whether a user deleted our app, develop an ML solution to predict whether users will delete our app.
Instructions
Download the dataset here and try out this ML interview coding question in your local .ipynb environment. There are many ways to tackle this task, but we encourage you to look at the hints below to make sure you are on the right path.
Dataset
Features are grouped semantically into User, User-Behavioral, Context and App.
Currently active users are indicated with Label: 1, while users who deleted the app in the last N days are indicated with Label: 0. Label is what we would like to predict.
This is the formatted dataset that we will be training our model on.
Don't forget to take a look at the dataset and do data cleaning if required e.g. handling missing values.
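A quick way to find and handle missing values is with pandas. A minimal sketch, assuming a small toy frame; the column names here (`time_spent`, `num_interactions`) are made up and the real dataset's schema may differ:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the downloaded CSV
# (in practice: df = pd.read_csv("dataset.csv"))
df = pd.DataFrame({
    "time_spent": [12.0, np.nan, 7.5],
    "num_interactions": [3.0, 8.0, np.nan],
})

# Count missing values per column before deciding how to impute
missing_counts = df.isna().sum()

# Impute: median for a skew-robust fill, or a sentinel such as 0
df_filled = df.fillna({
    "time_spent": df["time_spent"].median(),
    "num_interactions": 0,
})
```

An alternative, used in the pipeline below, is to keep a sentinel value and add an explicit is-missing indicator feature so the model can distinguish "missing" from a real value.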
Check for label skew by looking at the empirical deletion rate. How would that affect your train/test/validation split, accuracy metrics and model training?
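The empirical deletion rate is just the fraction of rows with Label 0. A minimal sketch on a toy frame (the real rate must be computed from the downloaded dataset); the `pos_weight` formula at the end is one common way to compensate for skew with `torch.nn.BCEWithLogitsLoss`:

```python
import pandas as pd

# Toy frame: 8 active users (Label 1), 2 deleters (Label 0)
df = pd.DataFrame({"Label": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]})

# Empirical deletion rate: fraction of users who deleted the app
deletion_rate = (df["Label"] == 0).mean()

# If the label is skewed, a class weight helps; for BCEWithLogitsLoss,
# pos_weight is conventionally the ratio of negatives to positives.
neg = int((df["Label"] == 0).sum())
pos = int((df["Label"] == 1).sum())
pos_weight = neg / pos
```

A skewed label also argues for stratified (or time-aware) splits and for ranking metrics such as AUC over raw accuracy, since always predicting the majority class can look deceptively accurate.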
What type of model do we need? Since our label only has 2 values, we should use a binary classification model.
How to split train/test/validation sets? Consider a time-based split to prevent data leakage and ensure that the model learns from historical patterns to accurately predict future outcomes, capturing inherent temporal dynamics and weekly patterns.
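A time-based split can be sketched as follows, assuming the dataset carries a timestamp column (the column name `timestamp` and the 70/15/15 ratios here are illustrative choices, not part of the task):

```python
import pandas as pd

# Toy frame with weekly timestamps; the real dataset's columns may differ
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22",
        "2024-01-29", "2024-02-05", "2024-02-12", "2024-02-19",
        "2024-02-26", "2024-03-04",
    ]),
    "Label": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})
df = df.sort_values("timestamp")

# Oldest 70% for training, next 15% validation, newest 15% test:
# the model never trains on data from the period it is evaluated on.
n = len(df)
train = df.iloc[: int(n * 0.70)]
val = df.iloc[int(n * 0.70) : int(n * 0.85)]
test = df.iloc[int(n * 0.85) :]
```

Splitting on time rather than at random prevents future information from leaking into training and matches how the model would be used in production: trained on the past, scored on the future.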
What metrics do we use to measure performance of a binary model? Consider AUC and calibration.
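Both metrics are cheap to compute from predicted probabilities and labels. A minimal sketch on made-up predictions, using scikit-learn's `roc_auc_score` for a quick check (the full pipeline below uses torcheval's `BinaryAUROC`, which computes the same quantity):

```python
from sklearn.metrics import roc_auc_score

# Made-up model outputs and ground-truth labels
preds = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]
labels = [1, 1, 0, 0, 1, 0]

# AUC: probability that a random positive outranks a random negative;
# here every positive scores above every negative, so AUC is 1.0
auc = roc_auc_score(labels, preds)

# Simple calibration ratio: sum of predicted probabilities over the
# number of positives; close to 1.0 means predictions match the base rate
calibration = sum(preds) / sum(labels)
```

Note that AUC only measures ranking quality: a model can have perfect AUC yet badly over- or under-predict the overall rate, which is exactly what the calibration ratio catches.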
Don't forget to preprocess the features to address basic data representation and scaling issues e.g. normalization for numerical features, encoding categorical features.
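One compact way to do both at once is scikit-learn's `ColumnTransformer`; the pipeline below instead does the equivalent by hand (BatchNorm for scaling, manual one-hot encoding). A minimal sketch with a made-up two-column matrix (column 0 numerical, column 1 categorical):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: [time spent, platform]; values and names are illustrative
X = np.array([[5.0, "ios"], [15.0, "android"], [25.0, "ios"]], dtype=object)

pre = ColumnTransformer([
    # standardize the numerical column to zero mean, unit variance
    ("num", StandardScaler(), [0]),
    # one-hot encode the categorical column; handle_unknown="ignore"
    # maps categories unseen at fit time to an all-zeros vector
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])
X_t = pre.fit_transform(X)  # 1 scaled column + 2 one-hot columns
```

The `handle_unknown="ignore"` option matters in practice: the encoder must be fit on the training split only and then applied to validation/test, where new category values may appear.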
Can you develop a solution employing a deep neural network approach?
Note that this is just one of the many ways to do this ML task. In this case, we have opted to use a deep neural network, but you may use other binary classification models e.g. CART.
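For comparison, a CART-style alternative takes only a few lines with scikit-learn's `DecisionTreeClassifier` (an optimized CART implementation). A minimal sketch on synthetic data standing in for the real preprocessed features:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: two features, label positive when their sum is > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shallow tree to limit overfitting; depth is a tunable hyperparameter
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Per-user probability of the positive class, comparable to the
# sigmoid output of the neural network below
probs = clf.predict_proba(X)[:, 1]
```

Trees need no feature scaling and handle label-encoded categoricals directly, which makes them a quick baseline before investing in a deep model.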
The code below outlines a complete pipeline for training and evaluating a neural network model on a binary classification task using PyTorch, a popular deep learning framework. The pipeline is broken down into several key components:
- Custom dataset handling:
  - The CustomDataset class is a subclass of PyTorch's Dataset. It handles a specific dataset format for binary classification, where features may be numerical or categorical.
  - Numerical features are used directly, with additional handling for missing values.
  - Categorical features are one-hot encoded: each unique category value is transformed into a binary vector.
  - The dataset class calculates the size of the input feature vector after preprocessing, which is needed to define the neural network architecture.
- Neural network model:
  - The PredictionModel class defines a simple feedforward neural network using an OrderedDict to organize layers, including batch normalization, ReLU activations, and a final sigmoid activation that outputs a probability for binary classification.
  - The network accommodates different input feature sizes, as determined by the preprocessing in CustomDataset.
- Training and evaluation loops:
  - The train_loop and test_loop functions perform the training and evaluation of the model, respectively.
  - In the training loop, the model parameters are updated using gradient descent via backpropagation. The Adam optimizer is used for adjusting the weights.
  - In the testing loop, the model's performance is evaluated using the BinaryAUROC metric, the area under the receiver operating characteristic curve, a common metric for binary classification tasks. Additionally, a calibration metric compares the sum of predictions to the sum of labels, providing a simple check of how well the model's average output matches the observed rate.
Python

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from torcheval.metrics import BinaryAUROC
from collections import OrderedDict


class CustomDataset(Dataset):
    def __init__(self, path, vocabs_reverse=None):
        # feature indices of categorical feats;
        # the other feats are assumed to be numerical
        self.categorical_feats = set([0, 1, 3, 11])
        self.num_feats = 14

        # make map between feature index and feature type (str or numeric)
        feat_map = {}
        for idx in range(self.num_feats):
            feat_map[idx] = "num" if idx not in self.categorical_feats else "str"
        assert len(feat_map) == self.num_feats

        with open(path) as fp:
            next(fp)  # skip header
            self.lines = fp.readlines()

        if vocabs_reverse is None:
            # build vocabs for categorical features
            vocabs = {idx: set() for idx in self.categorical_feats}
            for line in self.lines:
                tokens = line.strip().split(",")
                for idx in self.categorical_feats:
                    vocabs[idx].add(tokens[idx])
            # build reverse lookup for categorical feats;
            # this is used when one-hot encoding categorical feats
            self.vocabs_reverse = {
                idx: {v: i for i, v in enumerate(vocabs[idx])}
                for idx in self.categorical_feats
            }
        else:
            # reuse the vocabs built on the training set so that train and
            # test examples are encoded into vectors of the same size
            self.vocabs_reverse = vocabs_reverse
        self.vocabs_size = {idx: len(v) for idx, v in self.vocabs_reverse.items()}

        # input feature vector size: one-hot sizes for categorical feats,
        # plus a (value, is_missing) pair for each numerical feat
        num_numerics = self.num_feats - len(self.categorical_feats)
        self.input_size = sum(self.vocabs_size.values()) + 2 * num_numerics

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        line = self.lines[idx]
        tensors = []
        values = line.strip().split(",")
        # only the first num_feats columns are features;
        # the last column is the label and must not leak into the input
        for feat_idx in range(self.num_feats):
            value = values[feat_idx]
            if feat_idx in self.categorical_feats:
                value_idx = self.vocabs_reverse[feat_idx].get(value)
                if value_idx is None:
                    # category unseen at training time: all-zeros one-hot
                    vec = torch.zeros(self.vocabs_size[feat_idx], dtype=torch.long)
                else:
                    vec = F.one_hot(torch.tensor(value_idx), self.vocabs_size[feat_idx])
                tensors.append(vec)
            else:
                try:
                    vec = torch.tensor([float(value)])
                    vec_is_missing = torch.tensor([0])
                except ValueError:
                    # feature missing, so set is_missing indicator feature to 1
                    vec = torch.tensor([-1.0])
                    vec_is_missing = torch.tensor([1])
                tensors.append(vec)
                tensors.append(vec_is_missing)
        # concatenate all features into one tensor
        feature_tensor = torch.cat([t.float() for t in tensors])
        label = torch.tensor([float(values[-1])])
        return feature_tensor, label


class PredictionModel(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.model = nn.Sequential(OrderedDict([
            ("norm", nn.BatchNorm1d(in_features, affine=False)),
            ("fc1", nn.Linear(in_features, out_features=20)),
            ("relu1", nn.ReLU()),
            ("batchnorm1", nn.BatchNorm1d(20)),
            ("fc2", nn.Linear(in_features=20, out_features=10)),
            ("relu2", nn.ReLU()),
            ("batchnorm2", nn.BatchNorm1d(10)),
            ("fc3", nn.Linear(in_features=10, out_features=1)),
            ("sigmoid", nn.Sigmoid()),
        ]))

    def forward(self, x):
        return self.model(x)


def train_loop(dataloader, model, optimizer, loss_fn, device, epoch_num):
    model.train()
    for batch_idx, (feats, label) in enumerate(dataloader):
        feats = feats.to(device)
        label = label.to(device)
        # forward
        preds = model(feats)
        loss = loss_fn(preds, label)
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def test_loop(dataloader, model, loss_fn, device, epoch_num):
    model.eval()
    test_loss = 0
    sum_preds = 0
    sum_labels = 0
    auc = BinaryAUROC()
    with torch.no_grad():
        for feats, label in dataloader:
            feats = feats.to(device)
            label = label.to(device)
            # forward
            preds = model(feats)
            test_loss += loss_fn(preds, label).item()
            # metrics
            sum_preds += torch.sum(preds).item()
            sum_labels += torch.sum(label).item()
            auc.update(torch.squeeze(preds), torch.squeeze(label))
    calibration = sum_preds / sum_labels
    print("calibration = {}, auc = {}".format(calibration, auc.compute().item()))


def main():
    dataset_train = CustomDataset("foo_train.csv")
    # share the training vocabs so train and test feature vectors match in size
    dataset_test = CustomDataset("foo_test.csv",
                                 vocabs_reverse=dataset_train.vocabs_reverse)
    # drop_last avoids a size-1 final batch, which BatchNorm cannot normalize
    dataloader_train = DataLoader(dataset_train, batch_size=5,
                                  shuffle=True, drop_last=True)
    dataloader_test = DataLoader(dataset_test, batch_size=5)
    model = PredictionModel(in_features=dataset_train.input_size)
    loss_fn = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    device = "cpu"
    for epoch in range(100):
        train_loop(dataloader_train, model, optimizer, loss_fn, device, epoch)
        test_loop(dataloader_test, model, loss_fn, device, epoch)


if __name__ == "__main__":
    main()

In this mock interview, Satyajit (Pinterest, Target) answers the prompt, "Design a machine learning system that predicts user app deletion in the next (n) weeks."
While running the test_loop I am getting an error: RuntimeError: running_mean should contain 28 elements not 38. I think the cause is a mismatch between the categorical feature vocabularies built separately on the train and test sets, which produces input vectors of different sizes.