Logistic Regression
Logistic regression is a supervised machine learning algorithm that is used for classification. It takes a linear combination of inputs and produces probabilities over one or multiple classes, which can be used to categorize data points. It’s a simple, yet effective algorithm that is often used to create a baseline and/or interpret the effect of input variables on the output.
By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:
- How do you interpret the coefficients in logistic regression?
- Suppose you have a logistic regression model trained to classify spam emails. How would you evaluate this classifier?
- What’s the relationship between the cross entropy loss function and maximum likelihood?
- Why use minibatch stochastic gradient descent in logistic regression? What’s the effect of changing the batch size?
We'll provide the answers at the end of this lesson.
Overview
Algorithm
The following steps outline the process of training a logistic regression model:
- Initialize the weights (one per input feature) and a bias term, typically to small random values
- Minimize the cross entropy loss on the training set with respect to the weights and bias; the loss compares the model's predicted probabilities against the ground truth labels.
- This is done through multiple iterations of gradient descent; the number of iterations per epoch depends on how many training examples you have and the batch size.
- At each iteration, use the gradients of the cross entropy loss with respect to the weights and bias to perform a gradient descent update.
- Stop the training process based on one or more of the following criteria: a fixed number of epochs, convergence of the loss function, or a maximum allowed training time (this is a non-exhaustive list). The choice of stopping criterion is determined by the engineer or domain expert.
To predict a class from new data points:
- Once the weights and bias have been learned, plug in new data points to the forward pass and use the activation function (sigmoid for binary classification or softmax for multiclass classification) to output new probabilities for one or multiple classes.
- Predict the class that has the highest probability.
The training process for linear regression and logistic regression models shares many similarities, including weight initialization, optimization through gradient descent, and iterative weight updates based on the gradients of the loss function.
However, they differ primarily in their loss functions and final activation functions. Linear regression minimizes Mean Squared Error (MSE) to predict continuous values directly, while logistic regression minimizes Cross-Entropy Loss (log loss) to predict probabilities for binary or multiclass classification.
Logistic regression additionally employs a sigmoid or softmax activation function to transform the linear combination of inputs into probabilities. Despite these differences, both models follow a comparable iterative training process aimed at minimizing prediction errors and optimizing model parameters.
Equations
These equations below work together in an iterative process:
- Make predictions (forward pass)
- Evaluate performance (loss function)
- Calculate gradients (backward pass)
- Update parameters (gradient descent)
This cycle repeats until the model converges or a specified number of iterations is reached. At this point, your logistic regression model's parameters should be optimized.
Vectorized equation (forward pass): This equation computes the linear combination of input features and weights, plus a bias term:

$$z = XW + b$$

Where
- $W$: weight matrix for the forward pass
- $b$: bias term
Cross entropy loss function: This function measures the difference between predicted probabilities and actual labels across all training examples:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$
Binary cross entropy is the loss function minimized to find optimal weights in logistic regression. It assigns higher penalties to predictions that deviate more from the true labels. Consider the following graph:
[Figure: binary cross entropy loss as a function of the predicted probability, plotted for true labels y = 1 and y = 0]
The graph illustrates two key properties:
- When the true label is 1: The loss approaches infinity as the predicted probability nears 0, and decreases as it approaches 1.
- When the true label is 0: The loss increases as the predicted probability nears 1, and decreases as it approaches 0.
This behavior aligns with our intuition: predictions far from the true label should incur higher losses.
Binary cross entropy is theoretically grounded:
- It's equivalent to maximizing the likelihood of the observed data, finding model parameters with the highest likelihood of producing the data.
- It's related to Kullback-Leibler (KL) divergence, measuring the difference between predicted and true probability distributions.
Practically, this loss function is advantageous because:
- It's convex for logistic regression, ensuring a single global minimum.
- It's smooth and differentiable, making it suitable for gradient descent optimization.
These properties make binary cross entropy an effective and widely used loss function for logistic regression.
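The penalty behavior described above can be checked numerically. The example probabilities below are made up for illustration:

```python
import numpy as np

def bce(y_true, y_pred):
    """Binary cross entropy for a single prediction."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# True label is 1: a confident correct prediction is cheap,
# while a confident wrong prediction is heavily penalized.
print(bce(1, 0.99))  # ~0.01
print(bce(1, 0.50))  # ~0.69
print(bce(1, 0.01))  # ~4.61
```

Note how the loss grows rapidly, not linearly, as the prediction moves away from the true label.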
Derivatives/gradients for backward pass: These equations calculate the gradients of the loss with respect to weights and bias, used for updating model parameters:

$$\frac{\partial L}{\partial W} = \frac{1}{N}\sum_{i=1}^{N} x_i(\hat{y}_i - y_i) = \frac{1}{N}X^T(\hat{y} - y)$$

$$\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)$$

Where:
- $N$: number of training examples
- $x_i$: each training example
- $y_i$: ground truth label
- $\hat{y}_i$: predicted probability for each training example
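One way to sanity-check the gradient formula is to compare it against a finite-difference approximation of the loss. This is a quick sketch on random data, not part of the training algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=(N, 1)).astype(float)
W = rng.normal(size=(D, 1))
b = 0.0

def loss(W, b):
    """Binary cross entropy as a function of the parameters."""
    y_pred = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Analytic gradient from the formula above
y_pred = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))
analytic = X.T.dot(y_pred - y) / N

# Central finite-difference approximation for the first weight
eps = 1e-6
W_plus = W.copy(); W_plus[0, 0] += eps
W_minus = W.copy(); W_minus[0, 0] -= eps
numeric = (loss(W_plus, b) - loss(W_minus, b)) / (2 * eps)

print(abs(analytic[0, 0] - numeric))  # should be tiny
```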
Gradient descent step equation: This equation updates the weights and bias using the computed gradients and a learning rate to minimize the loss function:

$$W' = W - \alpha \frac{\partial L}{\partial W}$$

$$b' = b - \alpha \frac{\partial L}{\partial b}$$

Where
- $W'$: updated weight
- $b'$: updated bias
- $\alpha$: learning rate for gradient descent
Activation functions
Activation functions transform the raw output of a neuron or layer into a format suitable for the task at hand, typically introducing non-linearity into the model. In logistic regression, the sigmoid function is used for binary classification tasks, while the softmax function is employed for multi-class classification problems.
Sigmoid activation function: The sigmoid function transforms the linear output into a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid function is crucial in logistic regression as it transforms the unbounded linear output (logits) into a probability between 0 and 1. This transformation is essential because it allows the model to interpret its predictions as probabilities for binary classification tasks.
[Figure: the S-shaped sigmoid curve, mapping unbounded logits to probabilities between 0 and 1]
The sigmoid function's S-shaped curve provides a smooth transition between classes. As the input approaches positive infinity, the output nears 1 (high confidence in the positive class), while inputs approaching negative infinity yield outputs near 0 (high confidence in the negative class). At an input of 0, the output is 0.5, representing maximum uncertainty. This behavior makes sigmoid ideal for translating the linear combination of features into meaningful probability estimates for class membership.
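These limiting behaviors are easy to verify with a minimal check:

```python
import numpy as np

def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5: maximum uncertainty
print(sigmoid(10))   # ~0.99995: high confidence in the positive class
print(sigmoid(-10))  # ~0.00005: high confidence in the negative class
```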
Softmax activation function: While the sigmoid function is used for binary classification, the softmax is used for multiclass classification. The softmax function is defined as:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $K$ is the number of classes.
In practice, this will take the logits from logistic regression and output a number of probabilities equal to the number of classes to be predicted. The behavior of the softmax ensures that all the probabilities will sum to 1. The model will predict the class with the highest probability.
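A minimal softmax sketch, using hypothetical logits for three classes (subtracting the max logit is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into probabilities that sum to 1."""
    z = z - np.max(z)  # for numerical stability; doesn't change the result
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
probs = softmax(logits)
print(probs)             # approximately [0.659, 0.242, 0.099]
print(np.sum(probs))     # 1.0
print(np.argmax(probs))  # 0: predict the class with the highest probability
```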
Code
From the outlined algorithm, we can implement logistic regression in Python. Note that in the code below, the stopping criterion is the number of epochs.
```python
import numpy as np

def sigmoid(z):
    """
    Sigmoid activation function.

    Args:
        z (numpy.ndarray): Linear combination of input features.

    Returns:
        numpy.ndarray: Probability of the positive class.
    """
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, W, b):
    """
    Make predictions using logistic regression model.

    Args:
        X (numpy.ndarray): Input features (NxD matrix).
        W (numpy.ndarray): Weights vector (Dx1).
        b (float): Bias term.

    Returns:
        numpy.ndarray: Predicted probabilities for positive class (Nx1).
    """
    z = X.dot(W) + b
    return sigmoid(z)

def calculate_gradients(X, y, W, b):
    """
    Compute gradients of the loss function w.r.t. weights and bias.

    Args:
        X (numpy.ndarray): Input features (NxD matrix).
        y (numpy.ndarray): Ground truth labels (Nx1).
        W (numpy.ndarray): Weights vector (Dx1).
        b (float): Bias term.

    Returns:
        numpy.ndarray: Gradients of weights.
        float: Gradient of bias.
    """
    N = len(X)
    y_pred = predict(X, W, b)
    weights_grad = X.T.dot(y_pred - y) / N
    bias_grad = np.mean(y_pred - y)
    return weights_grad, bias_grad

def binary_cross_entropy(y_pred, y):
    """
    Compute binary cross-entropy loss.

    Args:
        y_pred (numpy.ndarray): Predicted probabilities for positive class (Nx1).
        y (numpy.ndarray): Ground truth labels (Nx1).

    Returns:
        float: Binary cross-entropy loss.
    """
    eps = 1e-15  # clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

def train(X, y, W, b, learning_rate, batch_size, n_epochs):
    """
    Train logistic regression model using mini-batch gradient descent.

    Args:
        X (numpy.ndarray): Input features (NxD matrix).
        y (numpy.ndarray): Ground truth labels (Nx1).
        W (numpy.ndarray): Initial weights vector (Dx1).
        b (float): Initial bias term.
        learning_rate (float): Learning rate for gradient descent.
        batch_size (int): Size of mini-batch for stochastic gradient descent.
        n_epochs (int): Number of epochs for training.
    """
    N = len(X)  # Total number of training examples
    for epoch in range(n_epochs):
        for batch_start in range(0, N, batch_size):
            # Extract the features and labels for the current batch
            batch_end = min(batch_start + batch_size, N)
            X_batch = X[batch_start:batch_end]
            y_batch = y[batch_start:batch_end]
            # Gradients of the loss w.r.t. the weights and bias for the batch
            weights_grad, bias_grad = calculate_gradients(X_batch, y_batch, W, b)
            # Gradient descent update with the specified learning rate
            W -= learning_rate * weights_grad
            b -= learning_rate * bias_grad
        # After processing all batches, report the loss on the entire dataset
        y_pred = predict(X, W, b)
        epoch_loss = binary_cross_entropy(y_pred, y)
        print(f"Epoch {epoch + 1}/{n_epochs}, loss: {epoch_loss:.4f}")

# Example usage on random data
N = 100
D = 5
X = np.random.randn(N, D)
y = np.random.randint(0, 2, size=(N, 1))
W = np.random.randn(D, 1)
b = np.random.randn()

train(X, y, W, b, learning_rate=0.1, batch_size=16, n_epochs=10)
```
Evaluation
The metrics used to assess the logistic regression model's performance are standard classification metrics applied across various classification models, such as decision trees. See our lesson on model evaluation to learn more.
Limitations
- Overfitting. Logistic regression, like other cross-entropy minimizing algorithms, is prone to overfitting. L1/L2 regularization can mitigate this by adding a weight penalty to the loss term, encouraging smaller weights. This reduces variance and smoothens the decision boundary, improving generalization.
- Reliance on key assumptions. Logistic regression assumes a linear relationship between inputs and log odds of prediction, absence of multicollinearity, independent observations, and consistent predictor-outcome relationships. Violating these assumptions can lead to biased estimates and unreliable predictions. When assumptions aren't met, alternative methods like decision trees or ensemble algorithms may be more suitable.
- Sensitivity to imbalanced data. Imbalanced training data can cause logistic regression models to perform poorly on unseen data or yield biased results. This can be addressed by undersampling the majority class, oversampling the minority class, or applying weights during loss calculation to prioritize certain examples, improving the model's ability to learn from imbalanced distributions.
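For the imbalanced-data remedy, scikit-learn exposes loss weighting via the `class_weight` parameter of `LogisticRegression`. A sketch on synthetic imbalanced data (the data and class ratio are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic imbalanced dataset: 950 negatives, 50 positives
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

# 'balanced' reweights each class inversely to its frequency in the loss,
# so mistakes on the rare positive class cost more
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.score(X, y))
```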
Employing logistic regression
Using scikit-learn (Python):
```python
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load dataset (using the Iris dataset as an example)
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int)  # Binary classification: setosa (1) vs. not-setosa (0)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize logistic regression model
log_reg = LogisticRegression()

# Train the model
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
```
Common questions
Q: How do you interpret the coefficients in logistic regression?
A: The coefficients in logistic regression indicate the change in the log-odds of the outcome for a one-unit increase in the predictor variable. Positive coefficients increase the likelihood of the outcome, while negative coefficients decrease it.
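Because the coefficients act on the log-odds, exponentiating a coefficient converts it into an odds ratio, which is often easier to communicate. The coefficient value here is hypothetical:

```python
import numpy as np

coef = 0.7  # hypothetical coefficient for one feature
odds_ratio = np.exp(coef)
print(odds_ratio)  # ~2.01: a one-unit increase roughly doubles the odds
```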
Q: Suppose you have a logistic regression model trained to classify spam emails. How would you evaluate this classifier?
A: We can use the precision, recall, and the f1 score to see how your classification model is performing. This will give us a sense of how well the model is classifying spam emails, rather than just non-spam emails. Since the vast majority of examples tend to be non-spam, using accuracy to evaluate this classifier will give a false sense of how “good” the model is - you can have a model achieve 99% accuracy if you just classified all of them as non-spam, but this certainly doesn’t mean the classifier is performing well.
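The accuracy trap described above can be demonstrated with scikit-learn's metrics. The label counts below are synthetic, chosen to mimic a 99% non-spam distribution:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 990 non-spam (0) and 10 spam (1) emails
y_true = np.array([0] * 990 + [1] * 10)
# A useless classifier that predicts "non-spam" for everything
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                 # 0.99: looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0: catches no spam
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```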
Q: What’s the relationship between the cross entropy loss function and maximum likelihood?
A: Finding the parameters that minimize the cross entropy loss function is equivalent to maximizing the log-likelihood of the model’s probability distribution given some observed data. The maximum likelihood formulation is given as:

$$\max_{W,\,b} \; \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$

Since this is a product, we can turn this into a sum of the log of these terms:

$$\max_{W,\,b} \; \sum_{i=1}^{N} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$

We can see that this is equivalent to minimizing the negative of this value, which gets us the general cross entropy loss function (also known as the log loss):

$$\min_{W,\,b} \; -\sum_{i=1}^{N} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$
Q: Why use minibatch stochastic gradient descent in logistic regression? What’s the effect of changing the batch size?
A: In practice, using minibatch SGD offers a few advantages:
- Smaller batches of a larger training set can more easily fit into the CPU/GPU memory
- The noisier gradient estimates from smaller batches act as a mild form of regularization
- It can help the model converge more quickly, as compared to computing gradients over the entire training set for each update.
Using larger batch sizes will give more accurate gradient estimates, and so may result in faster convergence as compared to smaller batch sizes. As mentioned above, using smaller batch sizes has a regularization effect and can help the model generalize better to unseen data.