How to Answer Regression Questions

In this lesson, we'll teach you how to answer conceptual and applied regression interview questions.

Conceptual questions

Conceptual questions test your ability to apply regression concepts in real-world scenarios. The key is to communicate clearly and use practical examples to illustrate the theory.

Imagine you’re asked, “What’s the best way to handle multicollinearity in a regression analysis?” The answer would involve specifying the purpose of the regression:

  • If it is a statistical analysis to determine which features are most predictive of a given outcome, you would need to remove collinear features that conflict with the important features you are testing for significance.
  • If the purpose is solely prediction, you might leave in collinear features and simply test different feature sets and see which feature sets give the best cross-validated result.

Applied questions

Applied statistics interview questions allow interviewers to evaluate your approach and execution of statistical techniques and formulas. In addition to mathematical accuracy, interviewers are assessing your ability to communicate your approach in an organized way.

Below, we describe a 6-step framework for answering numerical and applied questions.

  • Step 1: Define the problem. Ask clarifying questions and present the problem statement.
  • Step 2: Identify assumptions and variables. Identify relevant outcomes and any conditions or criteria.
  • Step 3: Apply the relevant regression model, and check assumptions. Implement a regression model to analyze the relationship.
  • Step 4: Examine the coefficients. Determine the effect of the features on the target variable.
  • Step 5: Relate back to the business case. Interpret the result in the context of the problem.
  • Step 6: Check in with the interviewer. Be open to feedback and constructive criticism.

[Figure: Regression Framework (updated)]

As you practice with this framework, remember to review the Rubric for Statistics & Experimentation Questions to understand how interviewers evaluate your answer.

Say you’re given the following interview question:

“Imagine you're working for a tech company that wants to analyze the relationship between website metrics (page load time, traffic source, and time on site) and conversion rate.

How would you analyze this and identify opportunities to improve conversions? Each row in the dataset would represent a single user interaction with the site.”

Step 1: Define the problem

When appropriate, ask clarifying questions and define the scope before starting to think about the solution. Defining the scope is particularly important to clarify what you will be focusing on, and what will be deprioritized.

In our example question, you could say,

“My goal is to understand how various website metrics—such as page load time, traffic source, and time on site—impact the conversion rate. Once I understand the influence of these metrics on conversion rate, I can suggest strategies to improve it. This analysis will focus solely on the relationship between these metrics and conversion rate, without extending to other areas like tracking how these metrics have changed over time. Does this approach seem appropriate?”

Step 2: Identify assumptions and variables

Identify the relevant variables and any underlying assumptions.

When answering regression case questions, it is best practice to state the assumptions inherent in a regression analysis, as shown in the example response below.

Some other assumptions might relate to the problem at hand, for example, what the business priority is, or what metric we are trying to optimize for. If you're unclear at this step, be sure to clarify with your interviewer as you shape your answer.

You could say,

“Our independent variables here are page load time, traffic source, and time on site. Our dependent variable is conversion rate.

Since we are using multiple features to predict our target, we’ll have four main assumptions for our regression:

  1. We assume a linear relationship between the predictors and the target (for a logistic regression, between the predictors and the log-odds of the target)
  2. We assume all observations are independent
  3. We assume homoscedasticity (for a linear regression)
  4. We assume a normal distribution of the errors (for a linear regression)

Furthermore, we assume that the predictors are not multicollinear with one another, and we can check this using a Pearson’s correlation test for continuous variables. If we do find multicollinearity, we would need to run regression analyses separately for these predictors.”
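The correlation check mentioned in the sample answer can be sketched in a few lines. This is a minimal illustration rather than a full diagnostic: the sample values, the variable names, and the 0.8 cutoff are all hypothetical assumptions, not part of the original question.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: page load time (ms) and time on site (min) for six visits.
page_load_ms = [850, 1200, 640, 2100, 980, 1500]
time_on_site_min = [4.2, 3.1, 5.0, 1.2, 3.8, 2.5]

r = pearson_r(page_load_ms, time_on_site_min)

# The 0.8 cutoff is a rule of thumb chosen for illustration, not a fixed standard.
THRESHOLD = 0.8
flagged = abs(r) > THRESHOLD
print(f"r = {r:.3f}, flag for review: {flagged}")
```

A pairwise check like this only catches collinearity between two predictors at a time, so it is a starting point for the discussion, not a complete test.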

Step 3: Apply the regression model, and check assumptions

Implement a regression model to analyze the relationship between the variables.

If the target variable is continuous, you can use a linear regression. If the target variable is categorical, you are best off using a logistic regression. While both are common, logistic regression may come up more often since many targets are categorical.

You could say,

“As conversion is categorical (binary) and not continuous, a logistic regression would be appropriate.

Assuming there are three different traffic sources (Organic Search, Paid Ads, and Referral Traffic), the regression formula is:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Page Load Time}) + \beta_2(\text{Time on Site}) + \beta_3(\text{Organic Search}) + \beta_4(\text{Paid Ads}) + \beta_5(\text{Referral Traffic})$$

where $p$ represents the probability of conversion.

Using this formula, we can determine the effect that each feature has on our target variable.”
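To make the formula concrete, here is a small sketch of how one row of data could be turned into the log-odds on its left-hand side. The beta values are made-up placeholders rather than fitted estimates, and the function and key names are hypothetical.

```python
# Placeholder coefficients for illustration only; these are NOT fitted values.
BETAS = {
    "intercept": -3.0,
    "page_load_time": -0.01,   # per 100 ms
    "time_on_site": 1.0,       # per minute
    "organic_search": 0.1,
    "paid_ads": -1.5,
    "referral_traffic": 2.0,
}

TRAFFIC_SOURCES = ("organic_search", "paid_ads", "referral_traffic")

def encode_row(page_load_time, time_on_site, traffic_source):
    """Turn one user interaction into the numeric features used by the formula."""
    if traffic_source not in TRAFFIC_SOURCES:
        raise ValueError(f"unknown traffic source: {traffic_source}")
    features = {
        "page_load_time": page_load_time,
        "time_on_site": time_on_site,
    }
    # Indicator (0/1) variable for each traffic source.
    for source in TRAFFIC_SOURCES:
        features[source] = 1.0 if source == traffic_source else 0.0
    return features

def log_odds(features, betas=BETAS):
    """Evaluate beta_0 + sum(beta_i * x_i), the right-hand side of the formula."""
    return betas["intercept"] + sum(
        betas[name] * value for name, value in features.items()
    )

# A hypothetical visit: 1,200 ms load time (12 units of 100 ms), 2.5 minutes, referral.
row = encode_row(page_load_time=12, time_on_site=2.5, traffic_source="referral_traffic")
print(log_odds(row))
```

One caveat: most fitting routines need one traffic source left out as the reference category, since three indicators plus an intercept are perfectly collinear. The sketch keeps all three only to mirror the formula as written.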

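Lastly, examine the residuals to ensure assumptions are met, and check a goodness-of-fit measure to assess the model's explanatory power: R-squared for a linear regression, or a pseudo R-squared (such as McFadden's) for a logistic regression.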

In logistic regression, the equation predicts the probability that a given data point will result in a conversion (a binary outcome). However, because probabilities are bounded between 0 and 1, directly modeling them using a linear equation can lead to predictions outside this range.

To handle this, logistic regression uses the log-odds, or logit function, $\log\left(\frac{p}{1-p}\right)$, where $p$ is the probability of conversion. This transformation maps the probability to a continuous scale from $-\infty$ to $+\infty$, allowing the regression model to output a value that can be converted back into a probability between 0 and 1.

The logit function ensures that the relationship between the predictors and the probability of conversion is modeled appropriately, without violating the constraints of probability values.
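The round trip between probabilities and log-odds can be sketched as follows. The function names are my own, and the example probabilities are arbitrary values chosen to show the symmetry around 0.5.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds in (-inf, +inf)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map log-odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.01, 0.5, 0.99):
    z = logit(p)
    print(f"p = {p:.2f} -> log-odds = {z:+.3f} -> back to p = {inverse_logit(z):.2f}")
```

Whatever value the linear part of the model produces, `inverse_logit` returns something strictly between 0 and 1, which is the property the text describes. This simple version does not guard against overflow for extremely negative inputs.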

Step 4: Examine the coefficients

After running the regression, examine your coefficients and determine the predictive power of each feature. This is the heart of the question.

These coefficients indicate which features are most and least important, and they explain how each feature affects the underlying business case. Keep that in mind as you examine them, and use them as the basis for the business recommendations you'll share at the end.

If you are using linear regression, and the coefficients are significant without multicollinearity, you can infer that a 1-unit increase in the feature is associated with a coefficient-sized change in the target, holding the other features constant.

If you are using logistic regression, you must calculate the Odds Ratio to determine the effect of a feature on the target variable.

Let’s say our logistic regression yields the following results.

Feature                    Coef      Std Error   z        P>|z|
Intercept                  -3.202    0.323       -9.045   0.000
Page Load Time (100 ms)    -0.0113   0.002       8.175    0.000
Time On Site (min)         1.198     0.482       2.487    0.013
Organic Search             0.0809    0.0208      6.564    0.000
Paid Ads                   -1.480    0.305       4.912    0.009
Referral Traffic           2.242     1.964       10.12    0.000

First, we notice that all of these features appear to have a statistically significant effect on conversion while controlling for all of the other features.
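As a sanity check on a results table like this, the z statistic is the coefficient divided by its standard error, and the two-sided p-value follows from the standard normal distribution. The sketch below assumes the usual Wald test and uses the Time On Site row (coefficient 1.198, standard error 0.482) as input. The function names are hypothetical.

```python
import math

def wald_z(coef, std_err):
    """Wald z statistic: the estimate measured in standard errors."""
    return coef / std_err

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

z = wald_z(1.198, 0.482)
p = two_sided_p_value(z)
print(f"z = {z:.3f}, p = {p:.3f}")
```

The result lands close to the z and p-value reported for that row, which is the kind of quick consistency check you can mention when walking an interviewer through the output.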

Next, we can see that our most positive coefficient is Referral Traffic, and Paid Ads is the most negative among the features.

But what does that mean more specifically? To answer that, we would look at the odds ratio.

The odds ratio for a given feature tells us that a one-unit increase in the feature multiplies the odds of the target outcome by that factor, holding the other features constant. The formula for the odds ratio is:

$$\text{Odds Ratio} = e^{\text{coef}}$$

So in our example, we can calculate the odds ratio for each of the features in our regression. This would return the following table:

Feature                    Coef      Odds Ratio (e^coef)
Intercept                  -3.202    0.041
Page Load Time (100 ms)    -0.0113   0.989
Time On Site (min)         1.198     3.313
Organic Search             0.0809    1.084
Paid Ads                   -1.48     0.228
Referral Traffic           2.242     9.412
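The odds ratio column can be reproduced by exponentiating each coefficient. The sketch below copies the coefficients from the table above. The dictionary layout and the percent-change column are my own additions for illustration.

```python
import math

# Coefficients copied from the odds ratio table above.
coefficients = {
    "Intercept": -3.202,
    "Page Load Time (100 ms)": -0.0113,
    "Time On Site (min)": 1.198,
    "Organic Search": 0.0809,
    "Paid Ads": -1.48,
    "Referral Traffic": 2.242,
}

odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    # (ratio - 1) * 100 is the percent change in the odds per one-unit increase.
    print(f"{name:<26} {ratio:7.3f}  ({(ratio - 1) * 100:+.1f}% odds)")
```

A ratio above 1 means the odds of conversion go up with the feature, and a ratio below 1 means they go down.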

As we can see, the odds of conversion for traffic that comes from a referral are almost 10x as high. Furthermore, for every additional 100 ms of page load time, the odds of conversion drop by just over 1%, which quickly adds up! We also notice that for every additional minute spent on the site, the odds of conversion are more than 3x as high.
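Because an odds ratio scales the odds rather than the probability, the resulting change in conversion rate depends on where you start. The sketch below applies the referral odds ratio of 9.412 to a few baseline conversion rates. The baselines are hypothetical values chosen for illustration, and the function name is my own.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)

REFERRAL_ODDS_RATIO = 9.412  # from the odds ratio table

for baseline in (0.01, 0.05, 0.20):  # hypothetical baseline conversion rates
    new_prob = apply_odds_ratio(baseline, REFERRAL_ODDS_RATIO)
    print(
        f"baseline {baseline:.0%} -> {new_prob:.1%} "
        f"({new_prob / baseline:.1f}x the probability)"
    )
```

At low baselines the probability multiplier is close to the odds ratio, and it shrinks as the baseline grows, so it is safer to phrase the result in terms of odds unless you have a baseline rate to anchor it.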

Step 5: Relate back to the business case

Interpret the results in relation to the initial problem. Confirm that your analysis answers the original question.

You might say,

“The initial business problem focused on how to improve conversions. From our analysis here, we see that the 3 biggest levers we can pull are: increasing our referral traffic, reducing page load time, and increasing time on site.

Surprisingly, our paid ads tend to lead to a significantly lower conversion rate, so I would recommend looking into the ads providers we are using to see if we can improve that conversion rate, as well as shifting our marketing spend more into referral traffic.

Page load time also may be a problem. To know that, I’d need to do further analysis on the variance in page load time, but it does seem that if it isn’t optimal it could be driving down conversions significantly.

Lastly, we see that increasing time on site can have a large effect on conversion rate as well. Perhaps making the site more engaging, or increasing the offers available to give people more to look at, would improve our conversions as well.

It should be noted that this could be related to a confounding factor such as the time it takes to purchase, so we would need to do more analysis and monitor the effect of time on site on conversion as we make these changes to confirm it is indeed working as intended.”

Step 6: Check in with the interviewer

Be prepared to discuss your findings and consider alternative approaches or further refinements. For instance, you could explore additional metrics that might affect conversions or consider non-linear relationships if warranted by the data.