Supervised Model Evaluation
Evaluating machine learning models is crucial for understanding their performance and effectiveness. Different tasks, such as classification and regression, require specific metrics to assess how well a model performs. In classification, we often look at metrics like accuracy, precision, and recall, while in regression, metrics such as Mean Squared Error (MSE) and R-squared are commonly used. This lesson will delve into these evaluation metrics, helping you understand their applications and implications.
In supervised learning, models are typically evaluated using a set of standard metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, which directly compare predicted labels to true labels. These metrics are well-defined because supervised models are trained with labeled data, providing a clear ground truth for evaluation. Thus we can summarize them in this one lesson.
In contrast, the evaluation of unsupervised models is more varied and often depends on the specific type of model and the task at hand. Unsupervised learning deals with unlabeled data, so there is no direct ground truth to compare against. We will cover evaluation methods for each unsupervised model within their respective lessons.
Overview
A confusion matrix is a table showing true positives, false positives, true negatives, and false negatives. It is a useful summary from which metrics such as precision and accuracy can be calculated.
Classification metrics
The confusion matrix is a powerful tool for visualizing the performance of a classification model. It tabulates the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions, providing a detailed breakdown of the model's performance. From the confusion matrix, various classification metrics can be computed.
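As a minimal sketch, the four counts can be tallied directly from lists of binary labels (the labels below are illustrative; scikit-learn's `sklearn.metrics.confusion_matrix` provides the same functionality as a 2x2 array):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) counts for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Illustrative labels, not from any real dataset
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
```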
Accuracy
Accuracy measures the proportion of correctly classified instances out of the total number of instances. It is calculated as the ratio of the sum of true positive and true negative predictions to the total number of predictions. Mathematically, accuracy is expressed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Note that accuracy can be misleading on imbalanced data: a model that always predicts the majority class can still achieve a high accuracy score.
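The caveat about imbalanced data can be seen in a small sketch. The counts below are illustrative: a model that never predicts the positive class still reaches 95% accuracy when positives are rare.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions: (TP + TN) / total."""
    return (tp + tn) / (tp + tn + fp + fn)

# Imbalanced case: 5 positives out of 100, and the model predicts "negative" every time.
# It finds zero positives, yet accuracy still looks excellent.
print(accuracy(tp=0, tn=95, fp=0, fn=5))  # 0.95
```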
Precision
Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It quantifies the model's ability to avoid false positive predictions and is calculated as the ratio of true positives to the sum of true positives and false positives. Mathematically, precision is calculated as:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall
Recall, also known as sensitivity, measures the proportion of true positive predictions out of all actual positive instances in the dataset. It quantifies the model's ability to capture all positive instances and is calculated as the ratio of true positives to the sum of true positives and false negatives. Mathematically, recall is calculated as:

$$\text{Recall} = \frac{TP}{TP + FN}$$
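Both definitions follow directly from the confusion-matrix counts. A sketch with illustrative counts:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are found: TP / (TP + FN)."""
    return tp / (tp + fn)

# Illustrative counts: 30 true positives, 10 false positives, 20 false negatives.
print(precision(30, 10))  # 0.75
print(recall(30, 20))     # 0.6
```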
F1 score
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of a classifier's performance. It combines both precision and recall into a single metric, offering a comprehensive assessment of the model's effectiveness in both minimizing false positives and capturing true positives. Mathematically, the F1 score is calculated as:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
There are two types of F1 scores typically used:
Macro F1: Macro F1 calculates the F1 score for each class individually and then averages them, treating all classes equally. It is suitable for datasets with class imbalance, as it gives equal weight to each class's performance. Mathematically, macro F1 is calculated as the unweighted mean of F1 scores across all classes.
Micro F1: Micro F1 calculates the F1 score globally by aggregating the total numbers of true positives, false positives, and false negatives across all classes and computing a single F1 score from those totals. It reflects overall performance; in a multi-class, single-label setting it equals accuracy. Because frequent classes dominate the aggregated totals, micro F1 can mask poor performance on rare classes, which is why macro F1 is often preferred when class imbalance matters.
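The difference between the two averaging schemes can be sketched on a small, illustrative 3-class example:

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from raw counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_micro_f1(y_true, y_pred, classes):
    counts = {c: [0, 0, 0] for c in classes}  # per-class [TP, FP, FN]
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # a wrong prediction is a FP for the predicted class
            counts[t][2] += 1  # ... and a FN for the true class
    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1_from_counts(*counts[c]) for c in classes) / len(classes)
    # Micro: pool the counts across classes, then compute one F1 score.
    totals = [sum(counts[c][i] for c in classes) for i in range(3)]
    micro = f1_from_counts(*totals)
    return macro, micro

# Illustrative multi-class labels
macro, micro = macro_micro_f1([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 2, 2], classes=[0, 1, 2])
print(macro, micro)
```

Note that the micro F1 here equals the accuracy (4 correct out of 6), as expected for single-label multi-class data.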
AUC-ROC
The area under the receiver operating characteristic curve (AUC-ROC) is a widely used evaluation metric for binary classification models. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds. The AUC-ROC metric quantifies the overall discriminative power of the model across all possible threshold settings, with higher values indicating better performance. An AUC-ROC score of 1 signifies perfect discrimination, while a score of 0.5 suggests random guessing. AUC-ROC provides a comprehensive assessment of the model's ability to distinguish between positive and negative instances, making it a valuable tool for evaluating classification models in various domains.
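One way to compute AUC-ROC without tracing the curve uses its rank interpretation: the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A sketch with illustrative scores (this O(n²) pairwise version is for clarity only; `sklearn.metrics.roc_auc_score` is the practical choice):

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted scores
print(auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```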
Regression metrics
Evaluating regression models requires specific metrics to understand their accuracy and effectiveness. These metrics help in quantifying the difference between predicted and actual values, offering insights into the model's performance. The most commonly used regression metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). Each of these metrics provides a different perspective on the model's predictive power and error characteristics.
Mean squared error (MSE)
Mean squared error (MSE) is a commonly used metric for evaluating the performance of regression models. It measures the average of the squares of the errors, where the error is the difference between the predicted value and the actual value. Mathematically, MSE is expressed as:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points.
MSE is sensitive to outliers because the errors are squared, which can significantly impact the overall metric.
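A direct implementation of the formula, with illustrative values:

```python
def mse(y_true, y_pred):
    """Mean of squared residuals between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative actual and predicted values
print(mse([3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 4.0, 8.0]))  # (0.25 + 0 + 4 + 1) / 4 = 1.3125
```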
Mean absolute error (MAE)
Mean absolute error (MAE) is another metric used to evaluate regression models. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the mean of the absolute differences between the predicted values and actual values. Mathematically, MAE is calculated as:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Unlike MSE, MAE is more robust to outliers, as it does not square the errors.
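A sketch with illustrative values, showing that a single outlier contributes to MAE only linearly (for MSE the same residual would enter squared):

```python
def mae(y_true, y_pred):
    """Mean of absolute residuals between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# The third prediction is off by 10; MAE grows linearly with that residual,
# whereas MSE would grow with its square (100).
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 13.0]))  # 10/3
```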
R-squared
A key metric for evaluating regression models, e.g. linear regression, on the supplied data is the $R^2$ metric, also known as the coefficient of determination. $R^2$ is the ratio of the variance explained by the model to the total variance in the data, and it is calculated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of the actual values. The $R^2$ value ranges from 0 to 1, where a value closer to 1 indicates that a larger proportion of the variance in the dependent variable is predictable from the independent variables.
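A direct implementation, with illustrative values:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - rss / tss

# Illustrative actual and predicted values; predictions track the data closely.
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))  # ~0.98
```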
Senior candidates are expected to understand how model goodness parameters are influenced by the number of parameters, particularly AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). While you may not need to know the exact algebraic formulation, you should grasp the concepts. Model evaluation using AIC or BIC adjusts for the parameter count and the number of observations. AIC and BIC are closely related algebraically, with BIC (for a least-squares fit, up to an additive constant) computed as:

$$\text{BIC} = n \ln\!\left(\frac{RSS}{n}\right) + k \ln(n)$$

where $n$ is the number of samples, $RSS$ is the residual sum of squares, and $k$ is the number of parameters.
AIC/BIC is useful for comparing two models with similar $R^2$ values. A lower AIC/BIC score indicates a better model. While $R^2$ assesses the goodness of fit on the data alone, AIC/BIC adds a penalty that grows with the number of parameters. Intuitively, a model with more parameters scores higher (worse) in AIC/BIC unless its extra parameters improve the fit enough to offset the penalty.
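Assuming the least-squares form of BIC given above (constant terms identical across models are dropped), a comparison between two hypothetical models might look like this. All the numbers below are illustrative:

```python
import math

def bic(n, rss, k):
    """BIC for a least-squares fit, up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical: model B fits slightly better (lower RSS) but uses 5 extra parameters.
bic_a = bic(n=100, rss=50.0, k=3)  # simpler model
bic_b = bic(n=100, rss=48.0, k=8)  # more complex model
print(bic_a, bic_b)  # bic_a is lower, so the simpler model is preferred here
```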
The diagram below presents the statistical output from a regression analysis. Data scientists are generally expected to have a comprehensive understanding of the various fields and metrics reported in such regression results. However, for machine learning engineers, the focus is typically on understanding a subset of these metrics, including the R-squared value (which indicates the model's goodness of fit), the coefficients (which represent the estimated relationship between each predictor variable and the outcome), the standard errors (which measure the uncertainty or variability in the coefficient estimates), the p-values (which assess the statistical significance of each coefficient), and the confidence intervals (which provide a range of plausible values for each coefficient).
