
Linear Regression

Linear regression is a machine learning algorithm that predicts a scalar value from a set of feature values by computing their weighted combination plus a bias term. Mathematically, the output is a linear combination of the features, hence the name linear regression.

The weights and bias can be learned in several ways depending on the size of the data. For small datasets, the weights can be learned through direct application of pseudo-inverses using scientific linear algebra libraries. For larger datasets, we use stochastic gradient descent (SGD) or related techniques, where the weights are learned incrementally through batch (or mini-batch) updates.

Linear regression is extremely fast to compute and is used in various applications, including real estate property valuation and marketing sales prediction. It is widely embedded in utility modules for numerous tasks, such as data imputation, financial forecasting, and risk assessment.

By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:

  1. What are some plots that can be run on your data to check whether linear modeling is going to be useful?
  2. What techniques can be used to determine if a linear model applied to a dataset violates any of its preconditions or requirements?
  3. What are some helpful ways to transform your data after doing a linear regression, before doing another linear regression?
  4. Why does L1 regularization in linear regression lead to sparsity in the resulting model in comparison to L2 regularization?
  5. Typically in a linear regression model, coefficient magnitudes are a proxy for feature importances. When is this supposition incorrect? How do we compute a proxy to feature importances in such cases?

We'll provide the answers at the end of this lesson.

Overview

Linear Regression
Supervised/Unsupervised: Supervised
Input: A dataset where each data point is represented by a set of features or attributes. These features can be continuous or categorical, but they must be numerical in nature.
Output: A scalar value that is the prediction (or estimation) from the linear regression model.
Use cases: Prediction of scalar values, e.g., rental prices, real estate property values.

Level: Concepts to master
Junior: R-squared metric, correlated features, regularization
Mid-level: R-squared metric and its relationship to MSE, high condition number, loss functions (L1, L2), details of regularization (lasso, ridge)
Senior: Variance inflation factor, model selection through BIC and AIC, contours of the loss function, stability

Algorithm

The training process is about learning the model parameters, the $\beta$ values (weights):

  1. The dataset contains multiple rows ($m$ rows) of labeled values. Each row is $(y, x_1, x_2, \ldots, x_n)$. Establish the linear formula for predicting $y$ as $\tilde{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$
  2. Initialize the weights, $\beta$, to random values
  3. Compute $\tilde{y}$ for each row
  4. Compute the loss for each row as $(y - \tilde{y})^2$ (squared loss). Other losses are possible, such as the absolute difference $|y - \tilde{y}|$ or a weighted combination of quadratic and absolute-value terms. Additionally, we typically add a regularization term to the loss value. The loss value is used to determine whether we have converged (see step 8).
  5. Compute the gradient of the loss with respect to each parameter:
    • Gradient of the loss with respect to $\beta_0$ is $2(y - \tilde{y})(-1)$
    • Gradient of the loss with respect to $\beta_k$ is $2(y - \tilde{y})(-x_k)$
  6. Update each of the weights in the direction opposite to its gradient: $\beta_k := \beta_k + 2\eta\,(y - \tilde{y})\,x_k$, where $\eta$ is a single global learning rate hyperparameter (specified by the engineer based on domain expertise or trial and error)
  7. Perform steps 3 to 6 for every row of data
  8. Repeat step 7 until convergence (the average loss over all $m$ data points falls below a predefined tolerance) or for a fixed number of repetitions (epochs) of the algorithm

Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function. This penalty term discourages the model from becoming too complex by heavily weighting specific features, thereby promoting simpler models that generalize better to new data.

When predicting the outcome for a new set of features $(x_1, x_2, \ldots, x_n)$, it is a simple and quick step:

  1. Compute the prediction as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$

You may have noticed that the above learning is akin to solving a system of linear equations iteratively. Consequently, for small datasets, where all of the data can be held in memory, we can achieve the same outcome of learning the model parameters by computing matrix inverses. For overdetermined and underdetermined systems, we would include regularization terms in the loss.

More formally,

$$
\begin{pmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)}
\end{pmatrix}
\underset{\text{parameters}}{\underbrace{\begin{pmatrix}\beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n\end{pmatrix}}}
\approx
\begin{pmatrix}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{pmatrix}
$$

The above can be notated as:

$$X\beta \approx y$$

where $X$ is the matrix of observations (also called features), $y$ is the vector of known labels, and $\beta$ is the vector of model parameters. Note that each row of $X$ is an observed data point. The standard algebraic way to estimate the weights is to solve the normal equations, which involves inverting the matrix $X^TX$. So, we get $\beta$ as:

$$(X^TX)^{-1}X^Ty$$

It is possible to prove that this gives the estimator of $\beta$ that minimizes the quadratic loss $\|y - X\beta\|^2$.
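
To make the closed-form route concrete, here is a minimal sketch using NumPy on a small synthetic dataset; the data and the use of np.linalg.lstsq as a numerically safer alternative are illustrative assumptions, not part of the lesson's prescribed method:

Python
import numpy as np

# Closed-form least squares on a small synthetic dataset.
# The feature matrix X includes a leading column of ones for the bias term.
np.random.seed(0)
m, n = 50, 2
X = np.hstack([np.ones((m, 1)), np.random.rand(m, n)])
true_beta = np.array([1.0, 3.0, -2.0])
y = X @ true_beta + 0.1 * np.random.randn(m)

# Normal-equations estimate: beta = (X^T X)^{-1} X^T y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq solves the same least-squares problem, but more stably.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)  # the two estimates should agree closely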

Equations

Equation for computation of parameters (for small datasets):

$$\beta = (X^TX)^{-1}X^Ty$$

Equation for computation of parameters (for large datasets) using stochastic gradient descent:

$$\beta_k := \beta_k + 2\eta\,(y - \tilde{y})\,x_k$$

where $\eta$ is a global learning rate hyperparameter.

When building a linear regression model, we need to choose a loss function. Here are some common options:

  1. Squared loss: Measures the squared difference between predictions and actual values
  2. Absolute error: Measures the absolute difference between predictions and actual values
  3. Huber loss: A combination of squared loss and absolute error
  4. Ridge loss: Squared loss plus a term that penalizes large parameter values
  5. Lasso loss: Squared loss plus a term that encourages some parameters to be zero

The loss function we choose significantly affects how our model learns its parameters. Let's look at two important examples:

Ridge regression: This uses ridge loss. It tends to make all model parameters smaller, which can help prevent overfitting.

Ridge regression modifies the standard linear regression equation. Its formula can be written as:

$$\beta = (X^TX + \lambda I)^{-1}X^Ty$$

It adds a small value ($\lambda$) to the diagonal of the matrix $X^TX$. This addition helps solve some common problems in linear regression:

  • It allows us to find solutions even when we have more features than data points (underdetermined systems).
  • It helps when our features are highly correlated (multicollinearity).
  • It stabilizes calculations when $X^TX$ is poorly conditioned.

The added $\lambda I$ term ensures that we can always find a solution, even in challenging data situations. This approach results in a model that's more stable and often performs better on new data.
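
Below is a minimal sketch of this closed form, compared against scikit-learn's Ridge; the synthetic data and the centering step (so that the intercept is not penalized, matching scikit-learn's default) are illustrative assumptions:

Python
import numpy as np
from sklearn.linear_model import Ridge

# Ridge regression via the closed form (X^T X + lambda * I)^{-1} X^T y on synthetic data.
np.random.seed(0)
X = np.random.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)
lam = 0.1

# Closed form on centered data, so the intercept is handled separately and not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta = np.linalg.inv(Xc.T @ Xc + lam * np.eye(X.shape[1])) @ Xc.T @ yc

model = Ridge(alpha=lam).fit(X, y)
print(beta)
print(model.coef_)  # should be close to the closed-form estimate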

Lasso regression: This uses lasso loss. It often reduces less important parameters to zero, resulting in a simpler model that focuses on the most important features.

Lasso regression uses a different approach to regularization. While it doesn't have a closed-form solution like ridge regression, we can understand its effects as follows:

$$\text{Minimize: } \|y - X\beta\|^2 + \lambda\|\beta\|_1$$

where $\|\beta\|_1$ is the L1 norm (sum of absolute values) of the coefficients.

Lasso adds a penalty term to the standard linear regression equation based on the absolute values of the coefficients. This penalty encourages some coefficients to become exactly zero, effectively selecting a subset of the most important features.

This approach helps with:

  • Feature selection: It automatically identifies the most important predictors.
  • Creating simpler models: By setting some coefficients to zero, it produces more interpretable models.
  • Handling high-dimensional data: It's particularly useful when we have many features compared to the number of observations.

Lasso is often preferred when we believe only a few features are truly important for our prediction task.

By choosing between these loss functions, we can control whether our model has many small parameters (ridge) or fewer, more significant parameters (lasso).
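
The sketch below illustrates this contrast on synthetic data (an assumption): only two of the ten features matter, and lasso zeroes out most of the irrelevant coefficients while ridge merely shrinks them:

Python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first two of ten features influence the target.
np.random.seed(0)
X = np.random.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * np.random.randn(200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))  # many small but non-zero coefficients
print(np.round(lasso.coef_, 3))  # most coefficients driven exactly to zero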

Pseudocode

From the outlined algorithm, we can construct pseudocode for an iterative approach to linear regression using stochastic gradient descent (SGD) as follows:

Pseudocode
import random

class MyLinearRegressor:
    def __init__(self, X, y, regularizer=0.1, lr=0.01):
        self.reg, self.lr = regularizer, lr
        self.X, self.y = X, y
        # X has m rows, n features; first column of X is 1, for the bias term
        self.m, self.n = len(X), len(X[0])
        self.params = [random.random() for _ in range(self.n)]

    def predict(self, row):
        return sum(x * beta for x, beta in zip(row, self.params))

    def compute_loss(self, row, y_true):
        # regularized squared loss, used to monitor convergence
        error = y_true - self.predict(row)
        return error * error + self.reg * sum(b * b for b in self.params)

    def iterative_train(self, epochs=100):
        # repeat until the epoch budget runs out (or until the loss converges)
        for _ in range(epochs):
            for row, y_true in zip(self.X, self.y):
                error = y_true - self.predict(row)
                for k in range(self.n):
                    # gradient of the regularized squared loss, then update the weight
                    grad = -2 * error * row[k] + 2 * self.reg * self.params[k]
                    self.params[k] -= self.lr * grad

Requirements

There are several data conditions and requirements that must hold true for a linear regression model to be a useful predictor of the target variable:

  • Linearity: There should be a linear relationship between the features and the labels.
  • Normality of Residuals: The residuals (differences between observed and predicted values) should be normally distributed.
  • Non-collinearity: The features should not be highly collinear, meaning they should not be highly correlated with each other.
  • Homoscedasticity: The residuals should have constant variance at every level of the independent variables.
  • Similar Scales: The feature variables should be on similar scales.
  • Independence: Observed data rows should be independent of each other (this also reduces the chances of collinear rows of data)

Let’s dive into each of these requirements in more detail.

Linearity

We might assume that linearity is a given since it underpins the linear regression model. However, a priori, we cannot know whether a linear relationship truly exists between the input features and the target variable. This knowledge comes through studying the data, running a linear regression model, gathering the errors (also called residuals), and analyzing their distributions.

In the simplest case (2D data), we can plot the data and visually confirm whether there appears to be a linear relationship between the observations and targets. However, it's not as easy to visualize data at higher dimensions (>3D), so we will have to run linear regression models to confirm or deny this hypothesis.

Linearity

Normality of residuals

If the difference between a linear estimate and the target is non-normal, it indicates that applying a linear model may be suboptimal. This is illustrated below, where the horizontal black line represents a linear predictor. We can see that the residuals are not normally distributed around each predicted data point:

Normality of residuals

Non-collinearity

Multicollinearity occurs when two or more features are linearly correlated, such as when feature2 is consistently 10 times feature1. This situation affects linear regression by producing coefficients that are unreliable and/or unstable. Consider the following: the coefficient associated with feature1 can be adjusted by any amount as long as the coefficient for feature2 is correspondingly adjusted. As a result, it becomes challenging to discern the true relationship between each feature and the target variable, leading to potentially misleading model interpretations and predictions.

$$\beta_1 x_1 + \beta_2 x_2 = (\beta_1 + a)x_1 + (\beta_2 - a/10)x_2$$

where $x_1$ and $x_2$ refer to feature1 and feature2, respectively.

Multicollinearity, in matrix algebra, can lead to non-invertible (singular) matrices and numerical issues. This arises because the presence of multicollinearity results in highly correlated columns in the design matrix, making it difficult or impossible to compute the matrix inverse accurately. Consequently, this can introduce numerical instability and inaccuracies in parameter estimation, posing challenges in solving the linear equations fundamental to linear regression.
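
A common numerical diagnostic for this condition is the variance inflation factor (VIF), listed among the senior-level concepts above. Below is a minimal sketch using statsmodels' variance_inflation_factor on synthetic data in which feature2 is constructed to be roughly 10 times feature1 (the data and the rule-of-thumb threshold in the comment are illustrative assumptions):

Python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data where feature2 is (almost) a multiple of feature1.
np.random.seed(0)
feature1 = np.random.rand(100)
feature2 = 10 * feature1 + 0.01 * np.random.randn(100)
feature3 = np.random.rand(100)
X = sm.add_constant(np.column_stack([feature1, feature2, feature3]))

# A VIF well above roughly 5-10 flags strong collinearity with the other columns.
for i in range(1, X.shape[1]):
    print(f"feature{i}: VIF = {variance_inflation_factor(X, i):.1f}")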

Heteroscedasticity

In the two portions of data depicted below, the residuals (errors between the target and predictions) appear to vary differently than in the rest of the data. This phenomenon is known as heteroscedasticity. In instances of heteroscedasticity, the variability of the residuals is not constant across the range of predicted values. Consequently, linear regression may perform poorly in those regions, as it assumes constant variance of the errors. Addressing heteroscedasticity is crucial for ensuring the reliability and accuracy of linear regression predictions.

Heteroscedasticity
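
One way to go beyond visual inspection is a statistical test such as the Breusch-Pagan test. Below is a minimal sketch using statsmodels' het_breuschpagan on synthetic data whose noise deliberately grows with the feature (an illustrative assumption):

Python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data where the error spread grows with x (heteroscedastic by construction).
np.random.seed(0)
x = np.random.rand(200)
y = 2 * x + x * np.random.randn(200)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # a small p-value suggests heteroscedasticity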

Similar scales

When our feature variables have vastly different scales, it can cause problems for linear regression. For example, imagine 'feature1' has values around 1 billion ($10^9$), while 'feature2' has values around one billionth ($10^{-9}$). This extreme difference can lead to unreliable and unstable results when we use numerical libraries to process these values in a linear function. To address this issue, we typically pre-process our input features to bring them to roughly similar scales. We can use techniques like standard scaling or min-max scaling for this purpose. While this scaling problem affects many machine learning algorithms, linear regression is particularly sensitive to it. In linear algebra terms, this issue manifests as a high condition number, which often leads to unstable or unreliable outcomes. By scaling our features appropriately, we can avoid these numerical issues and improve the performance and reliability of our linear regression models.
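
As a quick illustration, here is a minimal sketch using scikit-learn's StandardScaler and MinMaxScaler on two synthetic features with the extreme scales described above (the data is an illustrative assumption):

Python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two synthetic features on wildly different scales.
np.random.seed(0)
X = np.column_stack([
    1e9 * np.random.rand(100),   # feature1 around 10^9
    1e-9 * np.random.rand(100),  # feature2 around 10^-9
])

X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)      # each column rescaled to [0, 1]

print(X_standard.std(axis=0))                   # approximately [1, 1]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))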

Evaluation

A key metric for evaluating linear regression on the supplied data is the $R^2$ metric, also known as the coefficient of determination. The $R^2$ metric is the ratio of the variance explained by the model to the total variance in the data, and it is calculated as:

$$R^2 = 1 - \frac{\text{MSE}}{\text{Total variance}}$$

where MSE is the mean squared error of the model's predictions. The $R^2$ value ranges from 0 to 1, where a value closer to 1 indicates that a larger proportion of the variance in the dependent variable is predictable from the independent variables. The equations to calculate MSE and total variance are listed below:

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^m \left(y_{\text{true}}^{(i)} - y_{\text{pred}}^{(i)}\right)^2$$

$$y_{\text{mean}} = \frac{1}{m}\sum_{i=1}^m y_{\text{true}}^{(i)}$$

$$\text{Total variance} = \frac{1}{m}\sum_{i=1}^m \left(y_{\text{mean}} - y_{\text{true}}^{(i)}\right)^2$$
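
To tie these formulas together, here is a minimal sketch computing $R^2$ both from the MSE and total-variance definitions above and with scikit-learn's r2_score; the toy predictions are an illustrative assumption:

Python
import numpy as np
from sklearn.metrics import r2_score

# Toy true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

mse = np.mean((y_true - y_pred) ** 2)
total_variance = np.mean((y_true - y_true.mean()) ** 2)
r2_manual = 1 - mse / total_variance

print(r2_manual, r2_score(y_true, y_pred))  # the two values should match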

Senior candidates are expected to understand how measures of model quality are influenced by the number of parameters, particularly the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). While you may not need to know the exact algebraic formulation, you should grasp the concepts. Model evaluation using AIC or BIC adjusts for the parameter count and the number of observations. AIC and BIC are closely related algebraically, with BIC computed as:

$$\text{BIC} = n \cdot \log\left(\frac{\text{RSS}}{n}\right) + k \cdot \log(n)$$

where $n$ is the number of samples, $\text{RSS}$ is the residual sum of squares, and $k$ is the number of parameters.

AIC/BIC is useful for comparing two models with similar $R^2$ values. A lower AIC/BIC score indicates a better model. While $R^2$ assesses the goodness of fit (on the data) alone, AIC/BIC adds a penalty that grows with the number of parameters (scaled, for BIC, by the logarithm of the number of samples). Intuitively, a model with more parameters scores higher on AIC/BIC, suggesting a poorer trade-off between fit and complexity.
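
As a concrete illustration, below is a small helper implementing the BIC formula above; the sample count, RSS values, and parameter counts in the comparison are hypothetical. When fitting with statsmodels, the fitted results also expose aic and bic attributes computed from the likelihood, so the absolute values can differ from this RSS-based form, though model rankings typically agree.

Python
import numpy as np

def bic_from_rss(rss: float, n: int, k: int) -> float:
    """BIC = n * log(RSS / n) + k * log(n), per the formula above."""
    return n * np.log(rss / n) + k * np.log(n)

# Hypothetical comparison on the same n = 500 samples: model B fits slightly better
# (lower RSS) but uses more parameters, so its BIC penalty is larger.
print(bic_from_rss(rss=1200.0, n=500, k=4))   # model A; the lower BIC wins
print(bic_from_rss(rss=1150.0, n=500, k=12))  # model B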

The diagram below presents the statistical output from a regression analysis. Data scientists are generally expected to have a comprehensive understanding of the various fields and metrics reported in such regression results. For machine learning engineers, however, the focus is typically on a subset of these metrics:

  • The R-squared value, which indicates the model's goodness of fit.
  • The coefficients, which represent the estimated relationship between each predictor variable and the outcome.
  • The standard errors, which measure the uncertainty or variability in the coefficient estimates.
  • The p-values, which assess the statistical significance of each coefficient.
  • The confidence intervals, which provide a range of plausible values for each coefficient.

Statistical output from regression analysis

Limitations

  1. Linear regression assumes a straightforward relationship between input features and the outcome, which can be limiting in complex scenarios. Stock price prediction exemplifies this limitation. Stocks are influenced by numerous factors (historical prices, trading volume, economic indicators, market sentiment) that interact non-linearly. They also exhibit time-related patterns (trends, seasonality, autocorrelation) and volatility clustering. Linear models struggle to capture these complex, dynamic behaviors, often resulting in poor predictive performance in financial markets.
  2. Linear regression doesn't inherently consider interactions between features. In house price prediction, for example, a linear model might not capture how neighborhood quality interacts with house size to affect price. However, we can improve linear models through feature engineering. This involves creating new features that capture interactions (like combining neighborhood data with square footage) or transforming existing ones (such as using inverse square of distance for location). These techniques can significantly enhance a linear model's predictive power. In essence, complex neural networks can be viewed as performing multiple layers of automated feature engineering to improve linear prediction at the final stage.

Linear regression forms the basis of the classification model called logistic regression.

Employing linear regression

Here are a few prevalent methods and libraries for utilizing linear regression in code:

Using scikit-learn (Python)

Python
import numpy as np
from sklearn.linear_model import Ridge

# Generate random data
np.random.seed(42)
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)

# Create and fit Ridge regression model (scikit-learn fits the intercept automatically)
model = Ridge(alpha=0.1)
model.fit(X, y)

# Print model coefficients
print(model.coef_)

Using statsmodel (Python)

Python
import statsmodels.api as sm
import numpy as np

# Generate random data
np.random.seed(42)
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)

# Add constant term to X
X = sm.add_constant(X)

# Create and fit linear regression model
model = sm.OLS(y, X).fit()

# Print statistical summary
print(model.summary())

statsmodels is preferred in traditional fields (statistics, social sciences, health sciences, etc.) when practitioners choose to use Python libraries. Otherwise, such fields have traditionally relied on packages written in R and custom-built statistical software tools, with a heavy reliance on statistical tests and a frequentist perspective on modeling.

Common questions

Q: What are some plots that can be run on your data to check whether linear modeling is going to be useful?

A: There are several EDA (exploratory data analysis) plots that one might run. Correlation plots can immediately surface collinear feature variables.

example 1 correlation plot

This image shows a correlation heatmap for various features of Android devices from a Kaggle dataset. The heatmap uses colors to represent how strongly pairs of features are related to each other and to the target variable (price). Red indicates a strong positive correlation (close to 1.0), while blue shows a strong negative correlation (close to -1.0). Looking at the map, we can see some interesting relationships:

  • storage and ram have a strong positive correlation (0.81).
  • ram is also strongly correlated with main_camera_mp and front_camera_mp.
  • There's a weak negative correlation between storage and reviews (-0.096).

If we focus on just the device features (excluding the price, which is what we're trying to predict), the top three most correlated pairs are:

  • storage and ram (0.81)
  • battery_capacity and display_size (0.8)
  • display_size and rating (0.72)

As a data scientist or machine learning engineer, these strong correlations present an interesting challenge. We might consider creating several models, each time removing one feature from these highly correlated pairs. We could then compare these models based on their $R^2$ scores (a measure of how well the model fits the data) and their performance on new, unseen data. This approach helps us understand which features are most important for predicting price and avoids potential issues caused by including too many closely related features in our model.
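
A minimal sketch of producing such a correlation heatmap with pandas and seaborn is shown below; the DataFrame and its column names are stand-ins for the Kaggle device dataset discussed above (an illustrative assumption):

Python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in data mimicking a devices dataset; storage and ram are made deliberately correlated.
np.random.seed(0)
storage = np.random.rand(200)
ram = 0.8 * storage + 0.2 * np.random.rand(200)
df = pd.DataFrame({
    "storage": storage,
    "ram": ram,
    "battery_capacity": np.random.rand(200),
    "display_size": np.random.rand(200),
    "price": np.random.rand(200),
})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heatmap")
plt.show()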

An additional correlation plot of interest is between the target variable, price, and the features:

example 1 correlation plot 2

Another helpful plot displays the distribution of the target (price, in this case):

example 1 price distribution

The non-normal distribution of price as seen above should cause us to investigate whether the conditions of linear models will be violated after fitting.

Q: What techniques can be used to determine if a linear model applied to a dataset violates any of its preconditions or requirements?

A: There are several plots one might use. A few important ones include: residual vs. fitted values, QQ plots of residuals, histogram and density plots of the target variable (price), residuals vs. feature (for each feature).

First, let’s look at the linear model summary:

example 2 regression analysis

R-squared values in the 60% range imply that there is signal that is explainable by a linear model. The “notes” suggest the presence of multicollinearity (i.e., linearly related features). A large condition number is indicative of the model's sensitivity to small changes in the observed data. Numerical problems could arise from the scales of the feature values (one set of features being orders of magnitude larger than another) or from strong collinearity among a few features.

Second, a plot of the residuals against the fitted values tells us visually whether independence of residuals and homoscedasticity are exhibited by the data after applying a linear model:

example 2 residual vs fitted

As is visually obvious, the distribution of residuals at a fitted value of 10,000 is different from the distribution around 30,000. For a homoscedastic distribution of price, the spread of residuals must be the same at each fitted value. The residuals are also unequally distributed around the zero line (one would expect roughly as many negative residuals as positive ones), so the residuals are not independently distributed. The curved line (known as the lowess curve) suggests that there may be non-linear relationships and/or interactions between features that are not captured in the linear model.

Third, checking the distribution of the target variable is helpful, even though the target itself is not required to be normally distributed:

example 2 price distribution

Fourth, after we fit a linear model, we can inspect the residuals and compare them to the normal distribution expected of residuals (using a QQ plot):

example 2 qq plot

When the quantiles do not line up with the diagonal (red) line, that is an indication of non-normality of the residuals, and in turn an indication that linear modeling is a poor choice for this dataset. A few possible avenues of exploration here include transforming the outcome variable (price) with a log, power, or Box-Cox transformation; feature engineering (polynomial features, feature interactions); and alternate modeling techniques (spline models, decision trees, random forests, etc.).

Fifth, residual vs. predictor plots can be useful indicators regarding linearity, independence, homoscedasticity, and outliers. For instance, here we notice that the residuals have a longer tail in the positive direction than in the negative direction across several features, which suggests a non-linear relationship between the target and those features. A varying spread of residuals across the values of a feature, say RAM, directly impacts the estimated coefficients, making them less stable and reliable. Large outliers in the residuals will unduly affect the learned coefficients of the linear model, leading to a biased model; this suggests the use of regularized models (ridge, lasso, i.e., L2 and L1 regularization) as a remediation.

example 2 residual plots
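
For reference, below is a minimal sketch of how two of these diagnostics, residuals vs. fitted values and a QQ plot of the residuals, might be produced with statsmodels and matplotlib; the synthetic data is an illustrative assumption:

Python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit an OLS model on synthetic data, then plot residual diagnostics.
np.random.seed(0)
X = sm.add_constant(np.random.rand(200, 2))
y = X @ np.array([1.0, 3.0, -2.0]) + np.random.randn(200)
results = sm.OLS(y, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(results.fittedvalues, results.resid, alpha=0.5)  # residuals vs. fitted values
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

sm.qqplot(results.resid, line="45", fit=True, ax=ax2)  # residual quantiles vs. normal quantiles
plt.tight_layout()
plt.show()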

Q: What are some helpful ways to transform your data after doing a linear regression, before doing another linear regression?

A: Several options are possible. One could run a Box-Cox analysis to identify a transformation of the output variable so that it exhibits normality; a few such transformations of the output variable, $y$, include the square root, log, or inverse transforms. The specific choice is based on lookup tables against the outcome of the Box-Cox procedure. With the Box-Cox transformation, we see that the ideal transformation here is $(y^\lambda - 1)/\lambda$, with $\lambda = -0.1228349765682389$.

example 3 price distribution

With this transformation, we can better fit the data. Here, we notice model improvements (note the improved $R^2$ score and AIC/BIC scores):

example 3 regression analysis

The residuals are still not normally distributed, but they are better behaved than in the previous model:

example 3 qq plot

In such an instance, one could clearly explore several other modeling techniques, such as splines, hierarchical modeling, decision trees, and random forests, along with higher-order features, interaction features, and scaled features.
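
For completeness, here is a minimal sketch of estimating a Box-Cox transformation of the target with SciPy's stats.boxcox; the synthetic, right-skewed "price" values are an illustrative assumption:

Python
import numpy as np
from scipy import stats

# Synthetic, strictly positive and right-skewed "price" values (Box-Cox requires positive data).
np.random.seed(0)
price = np.exp(np.random.randn(500) * 0.8 + 9)

price_transformed, lam = stats.boxcox(price)  # maximum-likelihood estimate of lambda
print(f"estimated lambda: {lam:.4f}")

# The transformed values can serve as the target in a second regression; predictions must be
# mapped back with the inverse transform (y * lambda + 1) ** (1 / lambda) when lambda != 0.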

Q: Why does L1 regularization in linear regression lead to sparsity in the resulting model in comparison to L2 regularization?

A: In L1 regularization, we use the absolute values of the weights as the regularization term. Consequently, the loss function becomes larger than without regularization, driving the model weights to be smaller. In contrast to L2 regularization, L1 regularization tends to drive the model parameters more drastically toward zero.

This can be understood intuitively by plotting the contours of the loss function with respect to the weights. With L1 regularization, the contour lines of the penalty are composed of line segments, forming a diamond (a cross-polytope in higher dimensions) whose corners lie on the coordinate axes. The optimum is therefore much more likely to land on an axis, where some coefficients are exactly zero, than with the smooth hypersphere formed by the L2 regularization contours. As a result, L1 regularization drives smaller coefficients (weights) that do not contribute significantly to the model fit to zero.

Below is a pictorial depiction to aid intuitive understanding. The hyperparameter $\alpha$ is varied over 0.1, 10, and 100. The left-hand column shows L1 regularization, and the right-hand column shows L2 regularization. During the learning phase, the optimization engine takes a path from the lighter regions to the darkest (where the loss is smallest). In doing so, the optimization engine encounters more occurrences of zero weights in the L1 regime than in the L2 regime.

Symbolically, one can observe this by noting that the absolute value function is not differentiable at zero and has a constant-magnitude gradient everywhere else, which keeps pushing small weights until they land exactly at zero. In contrast, with L2 regularization the penalty's gradient is proportional to the weight and shrinks as the weight shrinks, so the weights settle at small but non-zero values.

example 4 regularization

Q: Typically in a linear regression model, coefficient magnitudes are a proxy for feature importances. When is this supposition incorrect? How do we compute a proxy to feature importances in such cases?

A: Coefficient magnitudes reflect feature importances only when the features are on comparable scales. This can be enforced in a linear model by scaling all the features using a standard scaler or min-max scaler, etc. However, in instances where this is not feasible, or where a linear model has already been fit and we are interested in the feature importances, we can compute the t-statistic (the coefficient value divided by the standard error of the coefficient). The magnitude of the t-statistic reflects the feature's importance in the linear model.
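
Below is a minimal sketch of reading both coefficient magnitudes and t-statistics from a statsmodels fit on synthetic data with deliberately mismatched feature scales (the data is an illustrative assumption):

Python
import numpy as np
import statsmodels.api as sm

# Two features on very different scales: x1 is around 10^6, x2 is around 1.
np.random.seed(0)
x1 = 1e6 * np.random.rand(200)
x2 = np.random.rand(200)
y = 2e-6 * x1 + 3 * x2 + 0.5 * np.random.randn(200)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

print(results.params)   # the coefficient for x1 is tiny purely because of its scale
print(results.tvalues)  # t-statistics (params / bse) give a scale-free importance proxy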