Supervised Model Evaluation

Evaluating machine learning models is crucial for understanding their performance and effectiveness. Different tasks, such as classification and regression, require specific metrics to assess how well a model performs. In classification, we often look at metrics like accuracy, precision, and recall, while in regression, metrics such as Mean Squared Error (MSE) and R-squared are commonly used. This lesson will delve into these evaluation metrics, helping you understand their applications and implications.

In supervised learning, models are typically evaluated using a set of standard metrics such as accuracy, precision, recall, F1 score, and AUC-ROC, which directly compare predicted labels to true labels. These metrics are well defined because supervised models are trained on labeled data, providing a clear ground truth for evaluation. This is why they can all be summarized in this one lesson.

In contrast, the evaluation of unsupervised models is more varied and often depends on the specific type of model and the task at hand. Unsupervised learning deals with unlabeled data, so there is no direct ground truth to compare against. We will provide evaluation methods for each unsupervised model within their respective lessons.

Overview

| Task | Metric | Description | Caveats/Notes |
| --- | --- | --- | --- |
| Classification | Accuracy | Ratio of correctly predicted instances to total instances. | Not suitable for imbalanced data. |
| | Precision | Ratio of true positive predictions to total predicted positives. | Useful in cases where false positives are costly. |
| | Recall | Ratio of true positive predictions to total actual positives. | Important in scenarios with high false negatives. |
| | F1 Score | Harmonic mean of precision and recall. | Balances precision and recall. |
| | Macro F1 | F1 score averaged across classes. | Treats all classes equally. |
| | Micro F1 | F1 score calculated globally. | More influenced by larger classes. |
| | AUC-ROC | Area under the ROC curve, which plots true positive rate vs. false positive rate. | Good for comparing models. |

A confusion matrix is a table showing the counts of true positives, false positives, true negatives, and false negatives. It is a useful summary from which metrics such as precision and accuracy can be calculated.

| Task | Metric | Description | Caveats/Notes |
| --- | --- | --- | --- |
| Regression | MSE | Mean of the squared differences between predicted and actual values. | Sensitive to outliers. |
| | MAE | Mean of the absolute differences between predicted and actual values. | More robust to outliers. |
| | R-squared | Proportion of variance in the dependent variable explained by the model. | Values range from 0 to 1. |

Classification metrics

The confusion matrix is a powerful tool for visualizing the performance of a classification model. It tabulates the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions, providing a detailed breakdown of the model's performance. From the confusion matrix, various classification metrics can be computed.
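
As a concrete illustration, here is a minimal sketch using scikit-learn; the toy binary labels below are made up for this example:

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are actual classes and columns are
# predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```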

Accuracy

Accuracy measures the proportion of correctly classified instances out of the total number of instances. It is calculated as the ratio of the sum of true positive and true negative predictions to the total number of predictions:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

Note that accuracy should not be used with imbalanced data: a model that always predicts the majority class can still achieve a high score.
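
Continuing with the same toy labels, a minimal sketch:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# (TP + TN) / total = (3 + 3) / 8
print(accuracy_score(y_true, y_pred))  # 0.75
```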

Precision

Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It quantifies the model's ability to avoid false positive predictions and is calculated as the ratio of true positives to the sum of true positives and false positives:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$

Recall

Recall, also known as sensitivity, measures the proportion of true positive predictions out of all actual positive instances in the dataset. It quantifies the model's ability to capture all positive instances and is calculated as the ratio of true positives to the sum of true positives and false negatives:

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

F1 score

The F1 score is the harmonic mean of precision and recall, providing a balanced measure of a classifier's performance. It combines both precision and recall into a single metric, offering a comprehensive assessment of the model's effectiveness in both minimizing false positives and capturing true positives:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
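
Continuing with the same toy labels, a minimal scikit-learn sketch of all three metrics:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP) = 3 / 4
print(precision_score(y_true, y_pred))  # 0.75
# Recall = TP / (TP + FN) = 3 / 4
print(recall_score(y_true, y_pred))     # 0.75
# F1 = harmonic mean of precision and recall
print(f1_score(y_true, y_pred))         # 0.75
```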

There are two types of F1 scores typically used:

Macro F1: Macro F1 calculates the F1 score for each class individually and then averages them, treating all classes equally. It is suitable for datasets with class imbalance, as it gives equal weight to each class's performance. Mathematically, macro F1 is calculated as the unweighted mean of F1 scores across all classes.

Micro F1: Micro F1 calculates the F1 score globally by pooling the total numbers of true positives, false positives, and false negatives across all classes. Because it aggregates counts over all instances, it reflects overall performance but is more influenced by the larger classes in an imbalanced dataset, in contrast to macro F1.
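
As a sketch, scikit-learn exposes both variants through the `average` parameter; the three-class labels below are made up for illustration:

```python
from sklearn.metrics import f1_score

# Toy 3-class labels
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

# Macro F1: per-class F1 scores averaged with equal weight per class
print(f1_score(y_true, y_pred, average="macro"))
# Micro F1: computed from the global pooled TP/FP/FN counts
print(f1_score(y_true, y_pred, average="micro"))
```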

AUC-ROC

The area under the receiver operating characteristic curve (AUC-ROC) is a widely used evaluation metric for binary classification models. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds. The AUC-ROC metric quantifies the model's discriminative power across all possible threshold settings, with higher values indicating better performance: a score of 1 signifies perfect discrimination, while a score of 0.5 suggests random guessing. Because it summarizes the model's ability to distinguish between positive and negative instances independently of any single threshold, it is particularly useful for comparing models.
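
A minimal sketch with scikit-learn; note that `roc_auc_score` expects predicted scores or probabilities rather than hard labels (the values below are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
# Predicted probability of the positive class from some classifier
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6, 0.1]

# Area under the ROC curve, aggregated over all thresholds
print(roc_auc_score(y_true, y_score))  # 0.9375
```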

Regression metrics

Evaluating regression models requires specific metrics to understand their accuracy and effectiveness. These metrics help in quantifying the difference between predicted and actual values, offering insights into the model's performance. The most commonly used regression metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). Each of these metrics provides a different perspective on the model's predictive power and error characteristics.

Mean squared error (MSE)

Mean squared error (MSE) is a commonly used metric for evaluating the performance of regression models. It measures the average of the squares of the errors, where the error is the difference between the predicted value and the actual value. Mathematically, MSE is expressed as:

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^{m} \left(y_{\text{true}}^{(i)} - y_{\text{pred}}^{(i)}\right)^2$$

Two related quantities, used below for R-squared, are the mean of the targets and the total variance:

$$y_{\text{mean}} = \frac{1}{m}\sum_{i=1}^{m} y_{\text{true}}^{(i)}, \qquad \text{Total variance} = \frac{1}{m}\sum_{i=1}^{m} \left(y_{\text{mean}} - y_{\text{true}}^{(i)}\right)^2$$

where $y_{\text{true}}^{(i)}$ is the actual value, $y_{\text{pred}}^{(i)}$ is the predicted value, and $m$ is the number of data points.

MSE is sensitive to outliers because the errors are squared, which can significantly impact the overall metric.
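
A minimal NumPy sketch, with made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of squared residuals; squaring amplifies large errors,
# which is why MSE is sensitive to outliers
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```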

Mean absolute error (MAE)

Mean absolute error (MAE) is another metric used to evaluate regression models. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the mean of the absolute differences between the predicted values and actual values. Mathematically, MAE is calculated as:

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m} \left|y_{\text{true}}^{(i)} - y_{\text{pred}}^{(i)}\right|$$

Unlike MSE, MAE is more robust to outliers, as it does not square the errors.
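
Using the same made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean of absolute residuals; errors contribute linearly,
# so outliers have less influence than in MSE
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.5
```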

R-squared

A key metric for evaluating regression models (e.g., linear regression) is $R^2$, also known as the coefficient of determination. $R^2$ is the ratio of the variance explained by the model to the total variance in the data, and it is calculated as:

$$R^2 = 1 - \frac{\text{MSE}}{\text{Total variance}}$$

The $R^2$ value typically ranges from 0 to 1, where a value closer to 1 indicates that a larger proportion of the variance in the dependent variable is predictable from the independent variables. (It can be negative when a model fits worse than simply predicting the mean.)
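
Computing $R^2$ from the definitions above, with the same made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
total_variance = np.mean((y_true - np.mean(y_true)) ** 2)

# Fraction of the variance in y_true explained by the model
r2 = 1 - mse / total_variance
print(r2)  # ~0.949
```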

Senior candidates are expected to understand how goodness-of-fit measures are influenced by the number of parameters, particularly AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). While you may not need to know the exact algebraic formulation, you should grasp the concepts. Model evaluation using AIC or BIC adjusts for the parameter count and the number of observations. AIC and BIC are closely related algebraically, with BIC computed as:

$$\text{BIC} = n \cdot \log\left(\frac{\text{RSS}}{n}\right) + k \cdot \log(n)$$

where $n$ is the number of samples, $\text{RSS}$ is the residual sum of squares, and $k$ is the number of parameters.

AIC/BIC is useful for comparing two models with similar $R^2$ values; a lower AIC/BIC score indicates a better model. While $R^2$ assesses the goodness of fit (on the data) alone, AIC/BIC adds a penalty that grows with the number of parameters. Intuitively, a model with more parameters incurs a larger penalty and scores higher in AIC/BIC, which counts against it unless the extra parameters meaningfully improve the fit.
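
A minimal sketch applying the BIC formula above to compare two hypothetical fits; the sample size and RSS values are made up:

```python
import numpy as np

def bic(n, rss, k):
    """BIC = n * log(RSS / n) + k * log(n)."""
    return n * np.log(rss / n) + k * np.log(n)

n = 100  # number of samples
# Hypothetical fits: the second model adds 5 parameters
# for only a small reduction in RSS
print(bic(n, rss=40.0, k=3))  # ~ -77.8
print(bic(n, rss=38.0, k=8))  # ~ -59.9  (higher => worse model)
```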

The diagram below presents the statistical output from a regression analysis. Data scientists are generally expected to have a comprehensive understanding of the various fields and metrics reported in such regression results. For machine learning engineers, the focus is typically on a subset of these metrics:

- R-squared value: indicates the model's goodness of fit.
- Coefficients: represent the estimated relationship between each predictor variable and the outcome.
- Standard errors: measure the uncertainty or variability in the coefficient estimates.
- p-values: assess the statistical significance of each coefficient.
- Confidence intervals: provide a range of plausible values for each coefficient.

Statistical output from regression analysis