Train a Model to Detect Bots
Not all ML system design interviews will give you the broad prompt, “Design an X system.” Interviewers may ask more specific questions, such as how you’d design or fix a particular part of the pipeline. In this case, you can tailor the framework to the given problem.
This mock interview shows an example of how to answer the prompt:
“At our company, we want to prevent bots and malicious actors from creating accounts and making posts. To solve this problem, we would like to train a model to flag accounts as potential bots for manual review. Since bots are rarer than real humans, our initial training dataset consists of a set of account meta information, where the majority of the dataset consists of humans.”
The interviewer and interviewee focus on the specific problem of class imbalance, which leads to a rich discussion on sampling and training methods, robustness, fairness, and more.
Step 1: Define the problem
To understand the problem more clearly, we can ask the following clarifying questions about the dataset:
- How much data do we have?
- What kind of metadata features are available?
- How imbalanced is the dataset?
- What happens once we flag a bot account?
- How do we get gold standard labels for our dataset?
Assuming we want to focus on the problem of class imbalance, we can consider the basic case of a generic binary classifier trained with empirical risk minimization (ERM) on our given dataset.
If we have a minor imbalance (30% minority class), we might expect this classifier to do well. Although we might want to monitor the accuracy of our classifier on the minority class closely, we should generally be able to train a good classifier without too many issues.
If we have a more severe imbalance (5% minority class), we need to be much more careful, especially when evaluating the model's success, since metrics like accuracy can be misleading. For example, a model trained to predict that every account is a real user would achieve 95% accuracy but be totally unusable for our task. Since ERM trains models to minimize the empirical risk over the dataset, many models can become similarly deficient under such severe imbalance.
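As a quick sanity check, here is how that 95%-accuracy failure mode looks in code. This is our own illustration (the `evaluate_always_human` helper is hypothetical, not from any library):

```python
# Hypothetical illustration: with a 95/5 human/bot split, a classifier that
# always predicts "human" scores 95% accuracy yet catches zero bots.

def evaluate_always_human(labels):
    """Accuracy and bot recall of a model that predicts 'human' (0) for everyone."""
    n = len(labels)
    accuracy = sum(1 for y in labels if y == 0) / n
    recall = 0.0  # no bot is ever flagged, so recall on the bot class is zero
    return accuracy, recall

labels = [0] * 95 + [1] * 5  # 5% minority (bot) class
acc, rec = evaluate_always_human(labels)
print(acc, rec)  # 0.95 0.0
```

High accuracy here tells us nothing about the model's usefulness, which is why the evaluation section below leans on precision, recall, and related metrics instead.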
There are many tangents an interviewer could take, such as discussing potential variations of ERM (e.g., invariant risk minimization, or IRM), how dataset size can affect the class imbalance and the generalization of a classifier trained with ERM, or how to deal with reliable vs. unreliable labelers.
Step 2: Propose a model architecture
Assuming that 5% of our dataset has been labeled as bots, we can deal with the class imbalance and model training through two orthogonal families of approaches: data-based and algorithmic.
When describing different approaches to solving a problem, ask the interviewer or explicitly state that you’re making an assumption rather than moving forward blindly. A key part of these design interviews is vocalizing your thought process. Describing the pros and cons of each approach helps the interviewer understand how you’ve broken down the problem and makes follow-up discussions more focused.
Data-Based Approaches:
- Collect more data: the simplest approach is to collect more data to increase the minority class size. Although this might be expensive up front, it will save time and money in the long run as we improve and scale up our bot detection pipeline. If we can’t collect more data, we may have to consider other approaches.
- Undersampling: we undersample the majority human class to provide a more balanced dataset for the model to train on. If we have a very large dataset, then reducing our dataset size may not affect the final model too much. However, with a much smaller dataset, undersampling may reduce our dataset so much that our model underfits the problem and cannot learn properly.
- Oversampling: we oversample the minority bot class by showing the model more of the same examples. If we have a diverse set of bot examples, this may be a good approach, but if our examples are not representative of the class as a whole (e.g., they all come from a specific region), then our model may overfit to this specific distribution. In this case, our trained model may fail to generalize to data from other regions.
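The two sampling strategies above can be sketched in a few lines. This is a minimal illustration assuming the dataset fits in memory; the helper names are our own:

```python
import random

def undersample_majority(majority, minority, seed=0):
    """Randomly drop majority examples until the two classes are balanced."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def oversample_minority(majority, minority, seed=0):
    """Randomly repeat minority examples until the two classes are balanced."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

humans = [("human", i) for i in range(95)]
bots = [("bot", i) for i in range(5)]

balanced_small = undersample_majority(humans, bots)  # 10 examples total
balanced_large = oversample_minority(humans, bots)   # 190 examples total
```

Note the tradeoff made concrete: undersampling shrinks the dataset from 100 to 10 examples, while oversampling repeats the same 5 bot examples roughly 19 times each.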
Algorithmic Approaches:
- Weighted loss: We can apply a penalty weight to each class in the loss function. For example, with a log-likelihood loss, we can penalize the loss incurred on the minority class much more than the loss incurred on the majority class. A simple method for estimating a good weighting is the inverse of the empirical fraction of each class in the dataset. In this case, we have 5% bots, so we would weight our loss on the bot class roughly 20 times more (the inverse of its 5% frequency) than on the majority human class.
- Bayesian approach: If the model produces probability estimates, we can use the output probabilities as an update to a prior based on dataset-level statistics. This allows us to set a prior based on data (e.g., we might believe there is a 5% probability a given user is a bot and a 95% probability they are a human), which we can then update over time as we get more data about the account.
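Both algorithmic approaches fit in a short sketch. The function names are hypothetical, and the weighting and 5% prior come from the problem setup above:

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos, w_neg=1.0):
    """Mean log loss with a larger penalty on the minority (bot=1) class.

    p_pred holds the model's predicted probability of the bot class.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip for numerical stability
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1 - p)
    return total / len(y_true)

def posterior_bot_probability(likelihood_ratio, prior=0.05):
    """Bayes update of a 5% bot prior, given the model's likelihood ratio
    P(features | bot) / P(features | human)."""
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)
```

With `likelihood_ratio=1.0` (the model is uninformative), the posterior stays at the 5% prior; evidence for bot-like behavior pushes it up, and we can keep updating as new account activity arrives.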
Interviewers are looking for candidates to mention different approaches and describe when one should be used over another. A more technical interviewer might choose to dive deeper into the algorithmic approaches and ask specifics about the loss functions.
Select and justify the model
It's useful to discuss various models and their pros and cons. Any decision you make should be backed up by a motivation based on the problem definition and acknowledge potential tradeoffs. Discussing the pros and cons this way shows the interviewer that you’re thinking deeply about the problem rather than just arbitrarily selecting a model.
Although there are many different models we can use for a binary classification task, a good place to start is with the simplest ones.
- A binary decision tree learns to optimally split the feature space with binary decisions. These models are more interpretable since we can directly see the path along the tree that led to the decision. However, they produce a single binary decision that does not allow us to trade off between false positives and negatives.
- A logistic regression model learns a linear decision boundary in the feature space and produces a probability distribution over classes, based on the distance from the boundary. Although these models are less interpretable, they allow us to tune a classification threshold for making a binary decision.
- Other models include SVMs, neural networks, etc., each with their own pros and cons.
Rather than choosing a single model, we might also choose to ensemble our models, or combine the decisions from multiple models together. There are many ways to perform ensembling, and each has its own pros and cons. We might take a model vote and classify based on the majority decision. If our models produce probability distributions, we can also aggregate by taking a weighted sum over our classifiers. This weight can be made dynamic if we think that some classifiers are better at classifying certain examples, similar to a mixture-of-experts model.
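The two ensembling schemes just described (majority vote and a weighted probability sum) can be sketched as follows; the helper names are our own illustration:

```python
def majority_vote(predictions):
    """Binary majority vote across models; ties go to the positive (bot) class."""
    votes = sum(predictions)
    return 1 if votes * 2 >= len(predictions) else 0

def weighted_average(probabilities, weights):
    """Weighted sum of per-model bot probabilities (weights should sum to 1)."""
    return sum(p * w for p, w in zip(probabilities, weights))

# Three models disagree on an account:
label = majority_vote([1, 0, 1])                              # -> 1 (bot)
score = weighted_average([0.9, 0.2, 0.6], [0.5, 0.3, 0.2])    # -> 0.63
```

Making the weights a function of the input, rather than fixed constants, is the step toward the mixture-of-experts idea mentioned above.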
Step 3: Train and evaluate the model
Train the model
Before launching into this question, review the definitions again to show the interviewer you understand the concepts well. This doesn’t need to be too in depth, but simply mentioning the purpose of a test/validation split will help to start the conversation and may also help you form your own thoughts.
Test/validation splits ensure we get an accurate estimate of our model’s performance on unseen data. A simple validation split would work for most tasks, but the class imbalance means we must be more careful. Here are some things we need to consider when selecting a validation split:
- How much data do we have? If we have very little data in the minority class, then a larger validation split means we have a smaller training split and less training data. Selecting data for each split randomly could also affect how much of the minority class ends up in each one, so we might want to select examples on a class-by-class basis.
- Are there any subpopulations we care about? If we think there is underlying structure in our data, we can either sample proportionally from each subpopulation (stratified sampling), or train on some subpopulations and evaluate on held-out ones to test how well the model generalizes.
- Have we flagged accounts multiple times? We might need to de-duplicate our data to ensure there is no train-test dataset leakage, which would overestimate our model’s performance.
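A class-by-class split like the one described above might be sketched as follows, assuming in-memory lists; the `stratified_split` helper is hypothetical:

```python
import random

def stratified_split(examples, labels, val_fraction=0.2, seed=0):
    """Split each class separately so the validation set preserves class ratios."""
    rng = random.Random(seed)
    train, val = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_val = max(1, int(len(idx) * val_fraction))  # keep >=1 minority example
        val.extend((examples[i], cls) for i in idx[:n_val])
        train.extend((examples[i], cls) for i in idx[n_val:])
    return train, val

# 95 humans + 5 bots: the validation set gets 19 humans and exactly 1 bot,
# matching the 5% bot rate, instead of a random (possibly bot-free) sample.
train, val = stratified_split(list(range(100)), [0] * 95 + [1] * 5)
```

A purely random 20% split of this dataset would contain zero bots about a third of the time, which would make our minority-class metrics meaningless.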
An interviewer may follow up on the de-duplication method, the effect of train-test leakage on the trained model and evaluation, subpopulation shift and covariate shift, differences in sampling strategies, k-fold cross-validation, and more.
Evaluate the model
The main concept to emphasize here is the tradeoff between a false positive and a false negative and the relative costs of each. Remember to connect these to the original problem. For example, you could discuss how a false negative may be more costly to the company depending on how severe the bot accounts are.
As we discussed before, accuracy is a poor choice for class-imbalanced datasets, since a naïve classifier can achieve very high accuracy. There are many metrics to discuss here, but most of them hinge on the values in the confusion matrix, a 2x2 matrix that counts the true positives, false positives, true negatives, and false negatives.
- True positive and true negative examples are ones that we classify correctly as a bot or a human, respectively
- False positive examples are ones that we classify as a bot, but are actually human
- False negative examples are ones that we classify as human, but are actually a bot
We focus on false positives and false negatives, because these are the examples that our model fails on. We should consider how each of these types of errors affects what happens afterward. For example, missing a bot user could have huge consequences if they end up harming the real users on our platform. On the other hand, misclassifying a real person as a bot wastes time in investigations or can accidentally close a real user’s account. We can measure different aspects of this tradeoff with the following metrics:
- Recall and precision: Recall is the fraction of actual bots we manage to classify correctly, and precision is the fraction of flagged accounts that were actually bots. A model with high recall and low precision makes many false positive mistakes, whereas a model with high precision but low recall makes many false negative mistakes. One might be more costly than the other, depending on the problem setting.
- F1 score: This is the harmonic mean of the precision and recall score and aims to characterize a combination of the two metrics. We can average F1 scores across classes or over the entire dataset.
- AUROC: For a model that outputs a probability, we can vary the decision threshold to trade off the true positive rate (TPR) against the false positive rate (FPR). The plot of TPR against FPR is called the receiver operating characteristic (ROC) curve, and the area under this curve is the AUROC. A value of 0.5 corresponds to an essentially random classifier, and a value of 1.0 to a perfect one.
- FPR@95: This is the false positive rate achieved when we adjust our decision threshold to classify 95% of the actual positives correctly (95% recall). A low value indicates high precision, even when we require high recall.
A combination of these metrics and more traditional metrics like accuracy and loss should give us a good sense of the performance of our model.
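These confusion-matrix metrics are simple to compute by hand. A minimal sketch for the positive (bot) class, with a hypothetical helper name:

```python
def confusion_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (bot=1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One bot caught, one bot missed, one human wrongly flagged:
p, r, f = confusion_metrics([1, 1, 0, 0], [1, 0, 1, 0])  # -> 0.5, 0.5, 0.5
```

Note that the "always human" baseline from earlier scores 0.0 on all three of these, which is exactly the failure that raw accuracy hides.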
Here we tailor some metrics for the task by specifying the bot severity and integrating this notion into the metrics. The interviewer could ask follow-up questions about the nuances of AUROC and exactly how it's computed, or questions about the confusion matrix and Type I/II errors.
Step 4: Deploy the model
This question zooms out from the technical details and focuses on a real-world engineering scenario. Discuss topics that are specific to the problem, such as the business needs of the bot detection system. The more context you have, the better, so asking questions can guide the discussion in a more focused direction.
Monitoring a machine learning model once it has been deployed is essential to ensure that it performs well and can adapt to changes in real-world data. Distributions often change quickly in the real world, especially as malicious agents attempt to circumvent the new detection system.
A deployment monitoring system can be generally broken down into two parts: detecting changes in performance and adapting the model.
- Detecting changes in performance: Performance is left generally vague here because different metrics may be used to measure different performance indicators. In addition to overall metrics, we should consider metrics on individual groups or regions, metric change over time, and even A/B comparisons of different models. Although some change or degradation is inevitable, large changes due to distribution shift may signal that we need to adjust our model.
- Adapting the model: The simplest fix for a poorly performing model is to collect more data and retrain. Depending on the model, we may be able to incorporate new information and training data without retraining from scratch. We should collect the specific data needed to address the shortcomings surfaced by the monitoring systems, so that the model improves efficiently.
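As one illustration of the detection side, a crude drift check might compare the recent flag rate in a sliding window against a baseline rate. The `drift_alert` helper and the 2x threshold are assumptions for this sketch, not a production design:

```python
def drift_alert(baseline_rate, recent_flags, threshold=2.0):
    """Alert when the recent bot-flag rate deviates from the baseline rate
    by more than `threshold` times in either direction.

    recent_flags is a window of recent binary model decisions (1 = flagged).
    """
    recent_rate = sum(recent_flags) / len(recent_flags)
    if baseline_rate == 0:
        return recent_rate > 0
    ratio = recent_rate / baseline_rate
    return ratio > threshold or ratio < 1 / threshold

# Baseline flag rate is 5%; a window where 15% of accounts get flagged
# (e.g., a new bot farm, or a broken feature pipeline) trips the alert:
drift_alert(0.05, [1] * 15 + [0] * 85)  # -> True
drift_alert(0.05, [1] * 5 + [0] * 95)   # -> False
```

In practice we would replace the fixed ratio with a proper statistical test and track the rate per region or subpopulation, but the structure is the same: compare a monitored statistic against a reference and trigger investigation or retraining when it drifts.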
For an engineering or data science role, monitoring and profiling the behavior of models is crucial. The interviewer may follow up on statistical testing for A/B tests, developing region-specific models, determining a threshold for retraining, or data pipelining for gathering data over time.
Other considerations
With additional time, you could connect the answers back to the original setup in more detail. Although most interviewers are looking to assess technical ability in a ML system design interview, maintaining high-level context shows that you are thinking both about the small details and the bigger picture.