
Logistic Regression Concepts

In the real world, predicting continuous values is certainly a common problem, but it is perhaps even more common to face a situation where a prediction must be made on a binary outcome, known as a classification task. Binary questions simply come up in a huge number of ways. Some examples include:

  • Predicting whether a customer will convert (yes/no)
  • Determining if an email is spam (yes/no)
  • Classifying whether a transaction is fraudulent (fraud/no fraud)
  • Detecting whether an image contains a particular object (cat/no cat)
  • Diagnosing a disease (cancer/no cancer)

For such problems, logistic regression is the most commonly used type of model. In logistic regression, the target variable can take only one of two values (0 or 1), as opposed to a continuous value that can be any number. While linear regression is used to predict such continuous values (the price of a house, the salary of a baseball player), logistic regression estimates the probability that an outcome will occur, making it ideal for classification.

Logistic regression model overview

In linear regression, we use the following formula to model the relationship between the independent variables and the dependent variable:

Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n

Where

  • Y is the continuous dependent variable (e.g., sales)
  • X_1, X_2, \dots, X_n are the independent variables (predictors)
  • \beta_0, \beta_1, \dots, \beta_n are the coefficients (parameters)

In logistic regression, the dependent variable is binary, so the linear model is modified. Instead of predicting Y directly, we predict the log-odds of the outcome:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n

Where:

  • p is the probability of the outcome (e.g., conversion)
  • \frac{p}{1-p} is the odds of the outcome (think 3:1 odds or 2:1 odds; see the worked example after this list)
  • The rest of the equation remains the same, with X_1, X_2, \dots, X_n as the predictors and \beta_0, \beta_1, \dots, \beta_n as the coefficients
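
To make the odds and log-odds concrete, here is a minimal Python sketch; the probability value of 0.75 is just an illustrative assumption:

```python
import math

p = 0.75                   # assumed probability of the outcome, e.g., conversion
odds = p / (1 - p)         # 0.75 / 0.25 = 3.0, i.e., 3:1 odds
log_odds = math.log(odds)  # natural log of the odds, roughly 1.0986

# Inverting the transformation recovers the original probability
p_recovered = odds / (1 + odds)  # 3.0 / 4.0 = 0.75
print(odds, round(log_odds, 4), p_recovered)
```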

Relationship to the sigmoid function

Working from the logistic regression formula above, we can solve for the probability p by exponentiating both sides, using the relationship between Euler’s number e and the natural logarithm. Doing so, we arrive at the following formula, known as the sigmoid function:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n)}}

Graphing the sigmoid function generates the following chart:

[Figure: The sigmoid function]

There are a few important things to note here.

Firstly, our initial linear combination of features \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n has no upper or lower bound. Depending on the sizes of the feature values and coefficients, a given prediction can fall anywhere from negative infinity to positive infinity. This is fine for predicting continuous values, like the price of a house, but doesn’t work if we want to predict binary outcomes. The sigmoid function changes this by transforming our linear combination of features into a predicted value bounded between 0 and 1, which represents the probability of a given outcome.

Secondly, the sigmoid function is an S-shaped curve with a few important characteristics:

  • Midpoint behavior: The midpoint of the sigmoid function occurs when the log-odds equal zero, which corresponds to a probability of 0.5. At this point, the model is essentially uncertain about the outcome—it’s equally likely to predict either class (0 or 1). The curve around this point is steep, meaning that small changes in the log-odds lead to rapid changes in predicted probability. This steep transition ensures that cases near the decision boundary (e.g., whether a user converts or not) are very sensitive to small shifts in input values.
  • Closer to the extremes: As the log-odds increase far above 0 or fall far below 0, the sigmoid function flattens out. This means that for very high positive or negative log-odds, the predicted probabilities approach 1 or 0, respectively, but they do so slowly. As the probability nears 0 or 1, the model becomes more confident in its prediction, and additional changes in the predictors have diminishing effects on the probability.

This transformation ensures that logistic regression outputs valid probability values for binary classification problems.
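
As a quick illustration of this bounding behavior, here is a minimal NumPy sketch of the sigmoid function applied to an unbounded linear combination of features; the coefficient and feature values are made-up assumptions:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed coefficients: beta_0 = -1.0, beta_1 = 0.8, beta_2 = 2.5
beta = np.array([-1.0, 0.8, 2.5])
x = np.array([1.0, 3.0, -0.5])  # first entry is the intercept term (always 1)

z = beta @ x     # unbounded linear combination of features (the log-odds)
p = sigmoid(z)   # bounded probability between 0 and 1
print(z, p)      # 0.15 -> roughly 0.537

# Extreme log-odds are squashed toward 0 or 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # roughly [0.0000454, 0.5, 0.99995]
```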

Assumptions of logistic regression

Like linear regression, logistic regression comes with its own set of assumptions:

  1. Linearity of independent variables and log-odds: In linear regression, the assumption is that the independent variables have a linear relationship with the dependent variable. In logistic regression, this assumption changes slightly—the independent variables should have a linear relationship with the log-odds of the dependent variable.
  2. Independence of observations: Both logistic and linear regression assume that observations are independent of each other.
  3. Absence of multicollinearity: Both models assume that the predictors are not highly correlated with one another. High multicollinearity can make it difficult to interpret the impact of individual predictors (a common check for this is sketched after this list).
  4. Large sample size: Logistic regression, like linear regression, benefits from large sample sizes. This ensures that the parameter estimates are reliable, particularly for small probabilities.
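
As one example of checking these assumptions in practice, the multicollinearity assumption is often assessed with variance inflation factors (VIFs). Below is a minimal sketch using statsmodels; the DataFrame X and its columns are made-up assumptions, and in practice you would pass your own feature matrix:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Assumed example predictors; in practice X would be your own feature DataFrame
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "ad_spend": rng.normal(size=200),
    "site_visits": rng.normal(size=200),
})
X["clicks"] = 0.9 * X["site_visits"] + 0.1 * rng.normal(size=200)  # nearly collinear on purpose

X_const = add_constant(X)  # VIFs are usually computed with an intercept column included
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)  # VIFs well above roughly 5-10 suggest problematic multicollinearity
```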

Evaluating model performance

In logistic regression, the goal is not to predict continuous values but to classify outcomes into one of two categories (e.g., conversion or no conversion). To evaluate a logistic regression model’s performance, we need classification-specific metrics. The traditional linear regression metrics treat “error” as the distance between predicted and actual values, which doesn’t make much sense for a binary classification. More specifically:

  • R-squared: Measures the proportion of variance in a continuous dependent variable that is explained by the independent variables. Logistic regression does not predict continuous values but probabilities, which are then classified as 0 or 1, so R-squared is not meaningful.
  • RMSE: Measures the average distance between predicted and observed continuous values. Since logistic regression deals with probabilities and classification, there is no continuous “distance” to measure in the same way as in linear regression.

Instead, in classification there are two different types of “errors” we are concerned with, as well as two different ways in which a prediction can be correct. The metrics we use take into account the four possible categories a prediction can fall into (see the code sketch after this list):

  1. True Positives (TP): These are cases where the model correctly predicts the positive class. For example, predicting that a user will convert (1) when they actually do convert.
  2. False Positives (FP): These are cases where the model incorrectly predicts the positive class. For example, predicting that a user will convert (1) when they actually do not convert (0). False positives are important to consider, as they can lead to wasted efforts or resources (e.g., targeting a user who won’t convert).
  3. True Negatives (TN): These are cases where the model correctly predicts the negative class. For example, predicting that a user will not convert (0) when they actually don’t convert.
  4. False Negatives (FN): These are cases where the model incorrectly predicts the negative class. For example, predicting that a user will not convert (0) when they actually do convert (1). False negatives are critical to consider because they represent missed opportunities, such as failing to identify users who might have converted.
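
Here is a minimal scikit-learn sketch that tallies these four counts; the true labels and predictions are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Assumed ground-truth labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=4, FN=2
```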

Metrics for model evaluation

Once we understand the classification outcomes, we can use them to calculate various metrics that help evaluate the model’s performance:

| Metric | Definition | Equation | Best Used When |
| --- | --- | --- | --- |
| Accuracy | Proportion of correctly predicted instances (both true positives and true negatives) out of total predictions. | \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} | Overall correctness is important and the data is not imbalanced. |
| Balanced Accuracy | Average of the accuracy for each class, accounting for class imbalance. | \text{Balanced Accuracy} = \frac{1}{2} \left( \frac{\text{TP}}{\text{TP} + \text{FN}} + \frac{\text{TN}}{\text{TN} + \text{FP}} \right) | The dataset is imbalanced and both classes are equally important to predict. |
| Precision | Proportion of positive predictions that are correct (focuses on minimizing false positives). | \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} | False positives are costly or more critical than false negatives. |
| Recall | Proportion of actual positives that are correctly predicted (focuses on minimizing false negatives). | \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} | False negatives are costly or more critical than false positives. |
| F1-Score | Harmonic mean of precision and recall, balancing the two metrics. | \text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} | Both false positives and false negatives are important to minimize. |
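
Continuing with the counts from the confusion-matrix sketch above, the table’s formulas can be computed directly (scikit-learn also provides helpers such as accuracy_score, precision_score, recall_score, and f1_score):

```python
# Counts from the earlier sketch: TP=3, FP=1, TN=4, FN=2
tp, fp, tn, fn = 3, 1, 4, 2

accuracy = (tp + tn) / (tp + fp + tn + fn)                   # 7 / 10 = 0.7
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))  # 0.5 * (0.6 + 0.8) = 0.7
precision = tp / (tp + fp)                                   # 3 / 4 = 0.75
recall = tp / (tp + fn)                                      # 3 / 5 = 0.6
f1 = 2 * precision * recall / (precision + recall)           # roughly 0.667

print(accuracy, balanced_accuracy, precision, recall, f1)
```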

Common pitfalls & how to avoid them

  • Overfitting: Logistic regression models can overfit the data if too many predictors are included. As with linear regression, feature selection is important for improving model performance and preventing overfitting. In logistic regression, regularization techniques such as L1 (Lasso) and L2 (Ridge) are often used to shrink or eliminate less important features, helping the model generalize better to new data.
  • Misinterpreting coefficients: In linear regression, a coefficient represents the direct change in the dependent variable for a one-unit change in a predictor. In logistic regression, coefficients represent changes in the log-odds, so converting them to odds ratios helps with interpretation (see the sketch after this list).
  • Imbalanced datasets: Logistic regression models can struggle with imbalanced datasets, where one class is much more common than the other. Metrics like precision, recall, and F1-score are more informative than accuracy in such cases.
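
To illustrate the coefficient-interpretation point, a fitted model’s coefficients can be exponentiated into odds ratios. The sketch below uses scikit-learn’s LogisticRegression, which applies L2 regularization by default (touching on the overfitting point as well); the data and coefficient values are made-up assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed toy data: two predictors and a binary conversion label
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))                           # e.g., columns for visits and ad spend
true_log_odds = -0.5 + 1.2 * X[:, 0] - 0.7 * X[:, 1]    # assumed true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))   # sample binary outcomes

# penalty="l2" is the default; penalty="l1" with solver="liblinear" gives Lasso-style shrinkage
model = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

odds_ratios = np.exp(model.coef_[0])
# Each value is the multiplicative change in the odds for a one-unit increase in that feature
print(odds_ratios)
```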