Evaluating a Model for ML Systems
When selecting and training a model, it’s important to evaluate it along a variety of dimensions. Real-world deployments of ML models often run into challenges that accuracy-based metrics alone can’t capture.
To tackle this part of the interview, consider the settings in which your model will be used and the different ways an incorrect prediction may negatively impact the downstream user.
In this lesson, we describe when to use each of the evaluation techniques highlighted below:
- Offline evaluation
- Online evaluation
- Bias
- Calibration
- Sensitivity
- Comparisons against baselines
Offline evaluation
Offline evaluation measures the quality and effectiveness of a machine learning model based on historical or simulated data. Offline evaluation is usually cheaper and simpler to run, but it may not capture the true behavior of users.
Precision, recall, F1, and AUC-ROC are examples of accuracy-based offline evaluation metrics. It’s standard to evaluate models with these metrics, but it’s also important to consider which datasets you evaluate them on. Models often perform differently on in-distribution vs. out-of-distribution data, as well as across different classes or segments of the data.
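To make these metrics concrete, here is a minimal, dependency-free sketch that computes precision, recall, and F1 by hand on a tiny hypothetical label set (in practice you would typically use a library such as scikit-learn):

```python
# Hypothetical ground-truth labels and thresholded model predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same computation on each class or data segment separately is an easy way to surface the per-segment performance differences mentioned above.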
Online evaluation
Online evaluation measures the quality and effectiveness of a machine learning model based on its interaction with real users and data in a live system. Online evaluation can provide more realistic and actionable feedback, but it may be costly and complex to perform. Examples of online evaluation metrics include: clickthrough rate (CTR), engagement rate, and revenue lift.
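As a sketch of how an online metric might be reported, the snippet below compares the clickthrough rate of a control model against a candidate model in a hypothetical A/B test (all counts are made up for illustration):

```python
# Hypothetical impression/click counts from two arms of an A/B test.
control_impressions, control_clicks = 10_000, 200
treatment_impressions, treatment_clicks = 10_000, 230

ctr_control = control_clicks / control_impressions        # baseline CTR
ctr_treatment = treatment_clicks / treatment_impressions  # candidate CTR

# Relative lift: how much better (or worse) the candidate is, as a fraction.
relative_lift = (ctr_treatment - ctr_control) / ctr_control

print(f"control CTR={ctr_control:.3f}, treatment CTR={ctr_treatment:.3f}, "
      f"lift={relative_lift:+.1%}")
```

A real experiment would also require a statistical significance test before concluding that the lift is genuine rather than noise.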
Bias
If your model processes any kind of human data or makes decisions that affect humans, it’s important to evaluate how that model performs across different demographics of people. Group fairness is a common way to address this concern. It asserts that model error should be equal across groups and that predictions should be independent of a group’s sensitive characteristics, given the true label. Many infamous examples of biased ML models exist, such as the racist COMPAS recidivism prediction algorithm, Amazon’s discriminatory résumé screener, and the Tay chatbot’s racist Tweets.
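One simple way to operationalize a group-fairness check is to compare error rates across groups, as in this minimal sketch (the records and group labels below are entirely hypothetical):

```python
# Hypothetical records: (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 1, 0), ("B", 0, 1),
]

def error_rate(group):
    """Fraction of this group's predictions that were wrong."""
    rows = [(t, p) for g, t, p in records if g == group]
    return sum(1 for t, p in rows if t != p) / len(rows)

# A large gap between groups signals a potential fairness problem.
gap = abs(error_rate("A") - error_rate("B"))
print(f"group A error={error_rate('A'):.2f}, "
      f"group B error={error_rate('B'):.2f}, gap={gap:.2f}")
```

Beyond raw error rates, the same per-group breakdown can be applied to false positive and false negative rates, which is closer to the "independent of sensitive characteristics, given the true label" criterion described above.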
Calibration
In settings where humans must decide whether to trust a model’s decision, it’s helpful for the model to be well-calibrated. A model is well-calibrated when its predicted probabilities match the empirical frequency of correctness: of all predictions made with 80% confidence, roughly 80% should be correct. If a model is well-calibrated, then we can interpret its output probabilities as a robust estimate of the model’s uncertainty and predict how likely it is that the model is incorrect.
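The check below is a minimal sketch of this idea: bucket hypothetical predicted probabilities by confidence and compare each bucket’s mean confidence to its empirical accuracy. For a well-calibrated model the two numbers should be close; here the toy model is slightly overconfident.

```python
# Hypothetical (predicted probability of class 1, true label) pairs.
preds = [
    (0.9, 1), (0.9, 1), (0.9, 1), (0.9, 0),  # "high confidence" bucket
    (0.6, 1), (0.6, 0), (0.6, 1), (0.6, 0),  # "medium confidence" bucket
]

def bucket_stats(bucket_probs):
    """Return (mean confidence, empirical accuracy) for one bucket."""
    rows = [(p, y) for p, y in preds if p in bucket_probs]
    mean_conf = sum(p for p, _ in rows) / len(rows)
    accuracy = sum(y for _, y in rows) / len(rows)
    return mean_conf, accuracy

high = bucket_stats({0.9})  # confidence 0.9 vs. accuracy 0.75: overconfident
mid = bucket_stats({0.6})   # confidence 0.6 vs. accuracy 0.5: overconfident
print(high, mid)
```

Averaging the gap between confidence and accuracy across many such buckets gives a standard calibration summary (expected calibration error).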
Sensitivity
The sensitivity, or robustness, of a model describes how much its predictions change in response to minor changes in the input or its weights. A model can be sensitive in multiple ways; changing a word in the input to a synonym or adding a benign object to an image’s background can change a model’s prediction. This sensitivity often isn’t desirable, so your models should be robust to minor changes that don’t affect the label. Excessive sensitivity can also make a model vulnerable to adversarial attacks, in which attackers perturb the input to force the model into an incorrect prediction.
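A simple sensitivity check is to perturb each input slightly and measure how often the prediction flips. The sketch below uses a toy threshold classifier as a stand-in for a trained model; both the model and the inputs are hypothetical.

```python
def model(x):
    """Toy stand-in for a trained binary classifier."""
    return 1 if x >= 0.5 else 0

inputs = [0.1, 0.49, 0.51, 0.9]
epsilon = 0.05  # size of the "minor change" to apply

flips = 0
for x in inputs:
    base = model(x)
    # Does nudging the input in either direction change the prediction?
    if any(model(x + d) != base for d in (-epsilon, epsilon)):
        flips += 1

flip_rate = flips / len(inputs)  # fraction of inputs near a decision boundary
print(f"flip rate under ±{epsilon} perturbation: {flip_rate:.2f}")
```

A high flip rate under label-preserving perturbations suggests the model is too sensitive for deployment, and adversarial perturbations are simply the worst-case version of this test.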
Comparisons against baselines
Last but not least, ensure that your model is actually better than baseline techniques. If your model doesn’t perform better than a random baseline, then it isn’t providing any meaningful value-add. To create appropriate baseline methods, consider the simplest possible ways of modeling the problem. For example, you could compare your model to a bag-of-words model for a language understanding task or a logistic model for a binary classification task. You should also include at least one random baseline to show that your model performs better than random sampling, and one human baseline as a loose upper bound on performance.
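As a minimal sketch of a baseline comparison, the snippet below pits a hypothetical model’s predictions against two simple baselines: a majority-class predictor and a seeded random predictor (all labels here are made up for illustration):

```python
import random

# Hypothetical ground truth and stand-in model predictions.
y_true = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
y_model = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

def accuracy(y_hat):
    return sum(1 for t, p in zip(y_true, y_hat) if t == p) / len(y_true)

# Baseline 1: always predict the majority class.
majority = max(set(y_true), key=y_true.count)
y_majority = [majority] * len(y_true)

# Baseline 2: predict uniformly at random (seeded for reproducibility).
rng = random.Random(0)
y_random = [rng.randint(0, 1) for _ in y_true]

print(f"model={accuracy(y_model):.2f}, "
      f"majority={accuracy(y_majority):.2f}, random={accuracy(y_random):.2f}")
```

If the model can’t clear both baselines by a meaningful margin, its added complexity isn’t buying anything.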
For quick tips and tricks on how to thoroughly evaluate an ML model, check out this guide.
Now that we’ve reviewed evaluation metrics for ML models, we’ll cover best practices for deployment in Deploying an ML Model.