Modeling Linear Regression
You are modeling marketing return on investment (ROI). You have each month’s revenue on the Y axis and spend on the X axis.

You decide to use a simple linear regression model to evaluate whether spending more would generate more revenue. You find your linear intercept (b) is $1.5MM and gradient (a) is 2.1. Your residual standard error is 79.1 and your adjusted R-squared is 0.72 with a p-value of 1.09e-9.
a. How much of your data’s variance has your model explained and can the result be called significant?
b. Our problem requires more accuracy in modeling the data. How can we alter the linear equation to better fit the data? What regression model would you pick and why?
c. Your new model explains 98% of the data variance. How would you determine if your model is overfitting? How would you evaluate the model overall fit and parameters fit?
The p-value is very small. What does that tell you?
What would increase the complexity of the model?
Make sure to check that the features aren’t collinear.
How do the residuals look?
For part (b): Let's say our model failed because of multicollinearity. What model would you choose to get more accuracy?
For part (c): Can you name more than five checks you would use to determine if the model is overfitting?
Part A
a. How much of your data’s variance has your model explained and can the result be called significant?
The variance explained is how much of the spread of the response around its mean the model accounts for: in other words, how accurate the model was at forecasting revenue from spend. The standard measure of this is R-squared, also known as the coefficient of determination: a statistical measure of how close the data are to the fitted regression line. Here the adjusted R-squared is 0.72, so the model explains 72% of the variance. An R-squared of 100% would indicate that the model explains all the variability of the response data around its mean.

[Figure: two scatter plots with fitted lines; the one on the left has roughly 4x the R-squared of the one on the right.]
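As a minimal sketch of how R-squared falls out of a least-squares fit (the spend/revenue numbers below are made up for illustration, not the data from the question):

```python
import numpy as np

# Hypothetical monthly spend (X, $MM) and revenue (y, $MM); illustrative only.
spend = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
revenue = 1.5 + 2.1 * spend + np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2, -0.2])

# Ordinary least squares: fit gradient (a) and intercept (b).
a, b = np.polyfit(spend, revenue, deg=1)
pred = a * spend + b

# R-squared = 1 - SS_residual / SS_total: the share of the variance
# of revenue around its mean that the fitted line accounts for.
ss_res = np.sum((revenue - pred) ** 2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An R-squared near 1 here just reflects that the fake data was generated from a nearly straight line; the 0.72 in the question means about a quarter of the revenue variance is left unexplained.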

Significance means the result is unlikely to have arisen from random sampling error alone: if you fit the same model to fresh data, you would expect to see a similar relationship. A common choice is a 95% (or 99%) confidence level, i.e. a p-value threshold of 0.05 (or 0.01). A p-value as small as 1.09e-9 means there is an extremely low chance that the observed relationship is due solely to random sampling error. Yes, let's call this one significant.
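One way to make "unlikely under chance alone" concrete is a permutation test (a sketch on made-up data, not the question's dataset): shuffle the response to destroy any real association, and count how often a random pairing produces a slope as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a genuine linear relationship; illustrative only.
spend = np.linspace(0.5, 4.0, 24)
revenue = 1.5 + 2.1 * spend + rng.normal(0, 0.5, spend.size)

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

observed = slope(spend, revenue)

# Shuffle revenue to break the association, then see how often a
# random pairing yields a slope at least this large in magnitude.
null_slopes = [slope(spend, rng.permutation(revenue)) for _ in range(2000)]
p_value = np.mean(np.abs(null_slopes) >= abs(observed))
print(p_value)
```

A p-value near zero here plays the same role as the 1.09e-9 in the question: the observed slope is far outside what shuffling alone produces.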
Part B
b. Our problem requires more accuracy in modeling the data. How can we alter the linear equation to better fit the data? What regression model would you pick and why?
The short answer: pick a more complex model, since a straight line can only capture a linear combination of the independent variables. Stepwise, polynomial, or ridge regression models would probably all work well.
The better answer: our goal is to introduce higher-order terms to capture more of the variance in the data, and to pick the model type based on an analysis of why the original linear model failed. Say it failed because of multicollinearity (which we checked using the variance inflation factor (VIF) to identify correlation between independent variables and the strength of that correlation): we would then choose ridge regression to shrink the correlated coefficients and produce a better result. We choose ridge regression here over lasso regression because we want to reduce the impact of the correlation without removing any of the features, as lasso does.

This is also the best time to go back and evaluate whether any features were left out that would provide useful signal for the model to pick up. For example, the month of the year and the number of holidays per month are two simple features that could be added alongside spend to help predict revenue. Once you have a list of features that all have low p-values under a t-test of significance, see if composite features help, such as the number of holidays in the month divided by the number of weekends in the month to weight the numerator. Be creative here to make certain you don't leave out helpful features. Remember, when adding composite features, to recheck the residuals (e.g. with a kernel density plot of the residuals) to make certain you haven't introduced a correlated feature.
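A minimal sketch of the VIF check and the ridge fix, on deliberately collinear made-up features (the data and the penalty strength `lam` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two deliberately collinear hypothetical features; illustrative only.
x1 = rng.normal(0, 1, 100)
x2 = x1 + rng.normal(0, 0.1, 100)          # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.0 * x2 + rng.normal(0, 0.5, 100)

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing feature j on the others."""
    others = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])  # both far above 5

# Ridge closed form: beta = (X'X + lam*I)^-1 X'y. The penalty shrinks the
# unstable correlated coefficients instead of dropping a feature (as lasso would).
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ridge)
```

The VIFs blow past the usual threshold of 5, flagging the collinearity; the ridge coefficients stay near the combined true effect while splitting it stably across the two near-duplicate features.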
Part C
c. Your new model explains 98% of the data variance. How would you determine if your model is overfitting? How would you evaluate the model overall fit and parameters fit?
To determine whether the model is overfitting, we check that it neither has high bias nor is overly sensitive to individual data points (high variance). Then we can run an analysis of variance, or ANOVA test: a statistical method that separates the observed variance into components, which we can use in further tests to see whether the extra terms are actually benefiting the model. Next, check the F-test to verify your model is better than the intercept-only model, and check your residual standard error to see how large the typical residual is. (RSE: the closer to zero the better; R-squared: the higher the better; F-statistic: the higher the better.)
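The F-test against the intercept-only model can be computed directly from the two residual sums of squares (a sketch on made-up data, not the question's dataset):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30

# Hypothetical data with a real linear signal; illustrative only.
x = np.linspace(0.5, 4.0, n)
y = 1.5 + 2.1 * x + rng.normal(0, 0.6, n)

# Full model: intercept + slope. Reduced model: intercept only.
X_full = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
ss_full = np.sum((y - X_full @ beta) ** 2)
ss_reduced = np.sum((y - y.mean()) ** 2)

# F = ((SS_reduced - SS_full) / extra params) / (SS_full / (n - p_full))
p_full, p_reduced = 2, 1
f_stat = ((ss_reduced - ss_full) / (p_full - p_reduced)) / (ss_full / (n - p_full))
print(round(f_stat, 1))
```

A large F-statistic means the slope term reduces the residual variance far more than chance would, i.e. the model clearly beats the intercept-only baseline. Beyond that, run through a checklist: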
- Check the errors per feature. Plot the residuals (e.g. as a kernel density plot) to check that they are not correlated.
- Check the t-statistic and p-value of each coefficient (beta) to make sure it is significant.
- Check whether the adjusted R-squared goes up with each new feature, and stop adding features when it stops increasing.
- Check that you minimize your residual standard error.
- Check that each feature's VIF is below 5 to show your features aren't collinear.
- Try cross-validation to see how the model behaves on data it wasn't trained on.
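The cross-validation check above can be sketched as follows, comparing a straight line to an overly flexible polynomial on made-up linear data (the degrees and fold count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical monthly data: the true relationship is linear plus noise.
x = np.linspace(0, 4, 40)
y = 1.5 + 2.1 * x + rng.normal(0, 0.6, x.size)

def fit_mse(x, y, degree):
    """Training MSE of a polynomial fit of the given degree."""
    coef = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coef, x) - y) ** 2)

def cv_mse(x, y, degree, k=5):
    """k-fold cross-validated MSE for a polynomial fit of the given degree."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)

# Adding terms always lowers the *training* error; if the cross-validated
# error does not improve to match, the extra terms are fitting noise.
# That gap between training error and CV error is the overfitting signal.
train_lin, cv_lin = fit_mse(x, y, 1), cv_mse(x, y, 1)
train_high, cv_high = fit_mse(x, y, 9), cv_mse(x, y, 9)
print("degree 1:", round(train_lin, 3), round(cv_lin, 3))
print("degree 9:", round(train_high, 3), round(cv_high, 3))
```

The degree-9 fit will always post a lower training error than the straight line; it is the held-out folds that reveal whether that gain was real signal or memorized noise.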