Linear Regression Concepts
Data scientists apply linear regression in predictive modeling and data analysis to understand the relationship between different variables or trends in the data. Interviewers are evaluating your ability to interpret and apply regression analyses within business contexts.

In this lesson, we review the main regression concepts you should understand well when preparing for interviews.
These concepts include:
- Ordinary least squares
- Feature selection
- Model interpretation
If you’re applying to a role that focuses heavily on building machine learning models, refer to Selecting a Model for ML Systems.
Ordinary least squares
Ordinary least squares (OLS) is a method for estimating the unknown parameters in a linear regression model. It is the most common method for fitting a linear regression model to data.
In a linear regression model, the relationship between the dependent variable (Y) and one or more independent variables (X) is represented as:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

Where
- Y = dependent variable (response variable)
- X₁, X₂, …, Xₚ = independent variables (predictor variables)
- β₀, β₁, …, βₚ = regression coefficients (parameters)
- ε = error term (residuals)
The goal of OLS is to find the "best-fitting" line that minimizes the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear model, also known as the sum of squared errors (SSE), mathematically represented as:

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where
- n = the number of observations
- yᵢ = the observed value of the dependent variable for observation i
- ŷᵢ = the predicted value of the dependent variable for observation i
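To make this concrete, here is a minimal sketch that fits an OLS model on synthetic data, first via the closed-form normal equations and then with statsmodels (the data, coefficients, and variable names are all illustrative):

```python
# Minimal OLS sketch on synthetic data (all values illustrative).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))                      # two independent variables
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

X_design = sm.add_constant(X)                    # prepend a column of 1s for β₀

# Closed-form solution to the normal equations: β̂ = (XᵀX)⁻¹ Xᵀy
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

model = sm.OLS(y, X_design).fit()                # same fit via statsmodels
print(beta_hat)                                  # ≈ [1.0, 0.5, -2.0]
print(model.params)                              # matches the closed-form estimate
print(model.ssr)                                 # the SSE that OLS minimizes
```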
Key assumptions of OLS
- Linearity: The relationship between the independent variables and the dependent variable is linear. This means that the change in the dependent variable for a one-unit change in an independent variable is constant.
- Independence of errors: The errors (residuals) are independent of each other. In other words, the error term for one observation does not predict the error term for another observation.
- Homoscedasticity: The variance of the errors is constant across all levels of the independent variables. This ensures that the spread of the residuals is consistent throughout the range of the independent variables.
- Normality of errors: The errors are normally distributed. While the normality assumption is not strictly necessary for large sample sizes due to the central limit theorem, it is important for small sample sizes to ensure the validity of statistical tests and confidence intervals.
- Absence of severe multicollinearity: Multicollinearity refers to a situation in which two or more independent variables in a regression model are highly correlated with each other, making it difficult for the model to differentiate the individual effects of each variable on the dependent variable.
When using regression analysis, it is important to test whether the assumptions of OLS hold; various diagnostic tests and graphical methods can be employed for this. Some common techniques you should know (sketched in code after this list) include:
- Residual analysis:
  - Plot residuals vs. fitted values: Residuals should be randomly scattered around zero with no discernible pattern. A pattern may indicate a violation of the independence or homoscedasticity assumption.
  - Plot residuals against each independent variable: Again, no clear pattern should emerge, indicating no violation of linearity or independence.
  - Q-Q (quantile-quantile) plot of residuals: This plot compares the distribution of residuals against the expected normal distribution. Deviation from a straight line indicates non-normality of residuals.
- Cook's distance: Cook's distance identifies influential observations that may unduly influence the regression results. Observations with large Cook's distances may bias OLS results.
- Variance inflation factor (VIF): VIF measures the severity of multicollinearity among independent variables. High VIF values (usually greater than 10) suggest multicollinearity issues, which can lead to an inaccurate estimation and interpretation of the regression model.
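A sketch of how these diagnostics can be produced with statsmodels and matplotlib, assuming the fitted `model` and design matrix `X_design` from the earlier example:

```python
# Diagnostic checks for the OLS fit above (continues the earlier sketch).
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import (
    OLSInfluence,
    variance_inflation_factor,
)

# Residuals vs. fitted values: look for random scatter around zero
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Q-Q plot: points near the reference line suggest normally distributed residuals
sm.qqplot(model.resid, line="s")
plt.show()

# Cook's distance: flags observations with outsized influence on the fit
cooks_d, _ = OLSInfluence(model).cooks_distance

# VIF per predictor (skip the intercept); values above ~10 suggest multicollinearity
vifs = [variance_inflation_factor(X_design, i) for i in range(1, X_design.shape[1])]
```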
Feature selection
Feature selection in linear regression is the process of choosing a subset of relevant independent variables, or features, to include in the regression model while excluding irrelevant or redundant ones.
Effective feature selection can improve model interpretability, reduce overfitting, and enhance predictive performance.
Common techniques include the following (a code sketch follows the list):
- Univariate feature selection: Univariate feature selection methods evaluate the relationship between each independent variable and the dependent variable independently. Some examples of this method are:
  - F-test: assesses the significance of each feature's contribution to the model's performance.
  - Mutual information: measures the amount of information gained about the dependent variable by knowing the value of each feature.
  - Chi-square test: assesses the independence of categorical features with respect to the dependent variable.
- Stepwise selection: Stepwise selection methods iteratively add or remove features from the model based on certain criteria (e.g., forward selection, backward elimination, bidirectional elimination). These methods use statistical tests or performance metrics to decide which features to include or exclude at each step.
- Dimensionality reduction: Techniques like principal component analysis (PCA) and singular value decomposition (SVD) transform the original features into a lower-dimensional space while preserving most of the variance in the data. The transformed components can be used as predictors in the regression model.
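The sketch below shows one scikit-learn counterpart for each approach, on made-up data (the feature counts, `k`, and number of components are arbitrary choices for illustration):

```python
# Three feature-selection approaches sketched with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import (
    SelectKBest,
    SequentialFeatureSelector,
    f_regression,
    mutual_info_regression,
)
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # 10 candidate features
y = 3.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# Univariate selection: score each feature independently, keep the best k
X_f = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)
mi_scores = mutual_info_regression(X, y)           # information-based alternative

# Stepwise (forward) selection: greedily add the feature that helps most
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
).fit(X, y)
print(sfs.get_support())                           # boolean mask of kept features

# Dimensionality reduction: regress on principal components instead
X_pca = PCA(n_components=2).fit_transform(X)
```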
Model interpretation
Interpreting a linear regression model involves a combination of understanding the coefficients, assessing their significance and direction, and evaluating the overall fit and validity of the model. It's essential to interpret the results in the context of the specific research question and the characteristics of the data.
Here are the key steps to interpreting a model:
Interpreting coefficients
Each coefficient βᵢ represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding all other variables constant. Example: if β₁ = 0.5, then for every one-unit increase in X₁, Y is expected to increase by 0.5 units, assuming all other variables remain constant.
Assessing the significance of coefficients
A significant coefficient indicates that the independent variable has a statistically significant effect on the dependent variable. Use hypothesis tests, such as a t-test on each coefficient or an F-test on the model as a whole, to make this assessment.
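Assuming the statsmodels `model` fitted in the earlier sketch, these tests are available directly on the results object:

```python
# Significance tests reported by the fitted statsmodels results object.
print(model.tvalues)      # t-statistic for each coefficient
print(model.pvalues)      # corresponding p-values; small values suggest significance
print(model.conf_int())   # 95% confidence intervals for the coefficients
print(model.f_pvalue)     # p-value of the overall F-test
```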
Evaluating the magnitude and direction of coefficients
A positive coefficient indicates a positive relationship between the independent variable and the dependent variable, while a negative coefficient indicates a negative relationship. The magnitude of the coefficient indicates the strength of the relationship. Larger coefficients suggest a stronger effect of the independent variable on the dependent variable. The intercept term β₀ represents the expected value of the dependent variable when all independent variables are zero.
Assessing the model fit
Assessing model fit involves evaluating how well the model explains the variation in the dependent variable and how accurately it predicts outcomes.
Common techniques, illustrated in the code sketch after this list, include:
- R-squared (coefficient of determination): measures the proportion of variance in the dependent variable that is explained by the independent variables in the model. It ranges from 0 to 1, where 1 indicates that the model explains all the variability in the dependent variable, and 0 indicates that the model explains none of the variability. Higher R-squared values indicate a better fit of the model to the data, although a high value does not necessarily mean that the model has predictive power.
- Adjusted R-squared: a modified version of R-squared that adjusts for the number of independent variables in the model. It penalizes the inclusion of irrelevant variables and helps prevent overfitting. Adjusted R-squared is always less than or equal to R-squared, and the gap widens as more weak predictors are added to the model.
- Root mean squared error (RMSE): measures the average deviation between observed and predicted values. It provides a measure of the absolute fit of the model to the data, with lower values indicating better fit. RMSE is useful for assessing the predictive accuracy of the model, particularly when comparing different models or evaluating the performance on new data.
- F-statistic: tests the overall significance of the regression model by comparing the fit of the full model with the fit of a null model (intercept-only model). A significant F-statistic indicates that the regression model as a whole is useful for predicting the dependent variable.
- Akaike information criterion (AIC) and Bayesian information criterion (BIC): penalize the inclusion of additional variables in the model. Lower values of AIC or BIC indicate better fit, with the tradeoff between model complexity and fit.
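Again assuming the statsmodels `model` from the earlier sketch, these statistics can be read off (or computed from) the fitted results:

```python
# Fit statistics from the statsmodels results object.
import numpy as np

print(model.rsquared)                    # R-squared
print(model.rsquared_adj)                # adjusted R-squared
print(np.sqrt(model.ssr / model.nobs))   # RMSE: root of the mean squared residual
print(model.fvalue, model.f_pvalue)      # F-statistic and its p-value
print(model.aic, model.bic)              # information criteria (lower is better)
```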
Overfitting occurs when a model learns to capture noise or random fluctuations in the training data rather than the underlying patterns or relationships. As a result, an overfitted model performs well on the training data but generalizes poorly to new, unseen data. It can occur when the model includes too many predictors (independent variables) relative to the amount of data available. Including unnecessary predictors may lead to spurious relationships being captured by the model.
Bias-variance tradeoff
There’s usually a tradeoff between bias and variance. Bias refers to the error introduced by approximating a real-world problem with a simplified model. A high-bias model may underfit the data by oversimplifying the underlying relationships.
Variance refers to the model's sensitivity to small fluctuations in the training data. A high-variance model may overfit the data by capturing noise rather than the underlying patterns.
In linear regression, increasing the complexity of the model (e.g., adding more predictors) tends to decrease bias but increase variance, while decreasing model complexity tends to decrease variance but increase bias.
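One way to see the tradeoff is to fit models of increasing complexity and compare training error with held-out error. Below is a hedged sketch using polynomial features on synthetic data (the degrees and noise level are arbitrary):

```python
# Bias-variance illustration: training error keeps falling with complexity,
# while held-out error typically bottoms out at moderate complexity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in (1, 3, 10):
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    fit.fit(x_train, y_train)
    print(degree,
          mean_squared_error(y_train, fit.predict(x_train)),   # falls with degree
          mean_squared_error(y_test, fit.predict(x_test)))     # may rise again
```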
Feature selection methods can help identify the most relevant predictors and reduce the risk of overfitting by excluding irrelevant or redundant variables from the model.
Regularization techniques, such as ridge regression and lasso regression, can help mitigate overfitting by penalizing large coefficients and reducing model complexity.
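A sketch with scikit-learn, reusing the synthetic `X` and `y` from the feature-selection example above (the alpha penalty strengths are illustrative, not tuned):

```python
# Regularized linear regression: ridge (L2) shrinks coefficients toward zero,
# lasso (L1) can set some exactly to zero, acting as feature selection.
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)

print(lasso.named_steps["lasso"].coef_)   # lasso often zeros out weak features
```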
Watch Exponent’s mock interviews to see how these concepts get assessed in interviews.