How to Answer Hypothesis Test Questions

Hypothesis testing is a statistical method used to infer population parameters based on sample data. You formulate two competing hypotheses, commonly referred to as “null” and “alternative” hypotheses, about the population parameter of interest, which is typically the population mean μ. You then use sample data to assess the evidence in favor of one hypothesis over the other.

This lesson will cover:

  • How to answer interview questions about designing a hypothesis test
  • Important factors to consider when choosing a significance level
  • An example test design that shows how to apply the answer framework

When asked to design a complete hypothesis test, follow the 7-step framework below to guide your answer. Articulate each step to share your thought process with the interviewer.

  • Step 1: Formulate the hypotheses. Specify both the null hypothesis and alternative hypothesis.
  • Step 2: Select the statistical test. Select the test most compatible with the nature of the data and the hypothesis being tested.
  • Step 3: Choose a significance level. Consider the scenario and discuss with the interviewer to pick a reasonable α.
  • Step 4: Calculate the test statistic. Gather relevant data from the sample, and compute the value of the chosen test statistic.
  • Step 5: Determine the critical region. Determine the region(s) where the null hypothesis will be rejected.
  • Step 6: Check evidence against the null hypothesis. Use the critical region or the p-value to assess the evidence against the null hypothesis.
  • Step 7: Interpret results and make recommendations. Explain the implications of the findings in the context of the research question or problem being investigated.

Hypothesis Test Framework

Below, we’ll elaborate on each step and provide an example interview question to show how you’d use the framework to structure your answer.

Let’s say your interviewer gives you the following prompt: “A manufacturer of pain relief medication claims that its new formula provides faster pain relief than the existing formula. Design a hypothesis test to understand if the claim is true.”

The interviewer also provides this context: “The trial involves two groups of 50 participants each, randomly assigned: one group receiving the new formula and the other receiving the existing formula. The manufacturer wants to determine if there is a statistically significant difference in the mean time to pain relief between the two groups.

For the group receiving the new formula, the mean time to pain relief is found to be $\bar{x}_{\text{new}}$ = 15 minutes with a standard deviation of $s_{\text{new}}$ = 4 minutes. For the group receiving the existing formula, the mean time to pain relief is found to be $\bar{x}_{\text{existing}}$ = 20 minutes with a standard deviation of $s_{\text{existing}}$ = 5 minutes.”

Step 1: Formulate the hypotheses

Specify both the null hypothesis and alternative hypothesis.

The null hypothesis ($H_0$) is the default assumption or the hypothesis of no effect. It states that there is no significant difference or relationship.

The alternative hypothesis ($H_1$ or $H_a$) is the statement that contradicts the null hypothesis. It suggests that there is a significant difference or relationship.

In the example interview question, you could say,

“The null hypothesis ($H_0$) is that the mean time to pain relief for the new formula is the same as the mean time to pain relief for the existing formula.

$H_0 : \mu_{\text{new}} = \mu_{\text{existing}}$

The alternative hypothesis ($H_a$) is that the mean time to pain relief for the new formula is different from the mean time to pain relief for the existing formula.

$H_a : \mu_{\text{new}} \ne \mu_{\text{existing}}$”

Step 2: Select the statistical test

The statistical test depends on the nature of the data and the hypothesis being tested. The most commonly used statistical tests include:

  • Z-test
  • T-test
  • Chi-square test
  • Analysis of variance (ANOVA)

Refer to the table below for the uses, statistical formulas, and examples of these statistical tests.

Statistical Tests

In the example interview question, you could say,

“Since we want to compare the means of two independent groups and the population standard deviations are unknown (we only have the sample standard deviations), we will use a two-sample T-test.”

Step 3: Choose a significance level (α)

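The significance level is the probability of rejecting the null hypothesis when it is actually true (the Type I error rate). It sets the threshold for how much evidence is required to reject the null hypothesis. Common significance levels are 0.05 (5%) and 0.01 (1%).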

Consider the scenario and discuss it with the interviewer to pick a reasonable α. Refer to the “Choosing a significance level” section below to guide this discussion.

In the example interview question, you could say,

“The significance level (α) is set at 0.05, indicating that we're willing to accept a 5% chance of incorrectly rejecting the null hypothesis when it's actually true.”
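To make the meaning of α concrete, here is a minimal simulation sketch. It assumes a two-sided z-test with a known standard deviation (a simplification of the lesson's T-test), and the function name, group size, and number of trials are illustrative choices. Both groups are drawn from the same distribution, so every rejection is a Type I error, and the share of rejections should land near α.

```python
import random
from statistics import NormalDist, mean

def false_positive_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a two-sided z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    sigma = 5.0  # known population standard deviation (simulation assumption)
    se = (sigma**2 / n + sigma**2 / n) ** 0.5
    rejections = 0
    for _ in range(trials):
        # Both groups come from the SAME distribution, so H0 is true by construction.
        a = [rng.gauss(20, sigma) for _ in range(n)]
        b = [rng.gauss(20, sigma) for _ in range(n)]
        z = (mean(a) - mean(b)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(f"Share of true nulls rejected: {rate:.3f}")
```

Lowering `alpha` in the call should lower the share of false rejections accordingly.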

Step 4: Calculate the test statistic

Gather relevant data from the sample, and compute the value of the chosen test statistic using the formula from the table above.

In the example interview question, you’d make the following calculations:

Sample mean difference: $\bar{x}_{\text{new}} - \bar{x}_{\text{existing}} = 15 - 20 = -5$ minutes

Standard error:

$SE = \sqrt{\frac{s^2_{\text{new}}}{n_{\text{new}}}+\frac{s^2_{\text{existing}}}{n_{\text{existing}}}} = \sqrt{\frac{4^2}{50}+\frac{5^2}{50}} = \sqrt{0.82} \approx 0.906$ minutes

Test statistic (with $\mu_0 = 0$, the difference in means under the null hypothesis):

$t=\frac{(\bar{x}_{\text{new}}-\bar{x}_{\text{existing}})-\mu_0}{SE} = \frac{-5-0}{0.906} \approx -5.52$
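The same calculation can be sketched in a few lines of Python, which is a handy way to check the arithmetic. The function names are illustrative, and the inputs are the summary statistics from the prompt with 50 participants per group, as used in the formulas above.

```python
import math

def standard_error(s1, n1, s2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def t_statistic(mean1, mean2, se, mu0=0.0):
    """Difference in sample means, minus the hypothesized difference, in SE units."""
    return ((mean1 - mean2) - mu0) / se

# New formula: mean 15, sd 4. Existing formula: mean 20, sd 5. n = 50 per group.
se = standard_error(4, 50, 5, 50)
t = t_statistic(15, 20, se)
print(f"SE = {se:.3f} minutes, t = {t:.2f}")
```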

Step 5: Determine the critical region

Based on the significance level (α) and the chosen test statistic, determine the critical region(s) where the null hypothesis will be rejected.

Critical regions are regions of extreme values of the test statistic that would lead to rejection of the null hypothesis.

Critical values are specific threshold values that define the boundary of the critical region.

Critical Region Null Hypothesis

In the example interview question, you could say,

“Let’s determine the critical values:

Degrees of freedom (df) = $n_{\text{new}} + n_{\text{existing}} - 2$ = 50 + 50 − 2 = 98

The critical values from the t-distribution table with df = 98 and α/2 = 0.025 in each tail are: $t_{\text{critical, upper}}$ = 1.984 and $t_{\text{critical, lower}}$ = −1.984.”
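If no t-table is at hand, the critical value can be approximated numerically. The sketch below uses only the Python standard library: it integrates the t-density with Simpson's rule and finds the quantile by bisection. The helper names are illustrative, and a statistics library would normally provide this directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using the symmetry of the distribution."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(alpha, df, two_tailed=True):
    """Upper critical value, found by bisection on the CDF."""
    target = 1 - (alpha / 2 if two_tailed else alpha)
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 50 + 50 - 2
upper = t_critical(0.05, df)
print(f"df = {df}, critical values = -{upper:.3f} and +{upper:.3f}")
```

Passing `two_tailed=False` puts all of α in one tail, which gives a smaller critical value for the same α.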

Step 6: Check evidence against the null hypothesis

The critical region and the p-value are both ways to assess the evidence against the null hypothesis.

  1. Compare the calculated test statistic to the critical value(s) from the distribution under the null hypothesis: If the calculated test statistic falls within the critical region, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
  2. Compare the p-value with the significance level: Use the appropriate probability distribution associated with your test statistic to find the p-value. If the p-value is smaller than the significance level (α), the result is statistically significant and we can reject the null hypothesis. Otherwise, fail to reject the null hypothesis.

The p-value is more commonly used when sharing results with stakeholders because it is always on the same 0-to-1 scale and can be compared directly with α, whereas critical values depend on the distribution of the test statistic.

In the example interview question, you could say,

“Since the test statistic calculated in Step 4 is far below the lower critical value of −1.984, it falls in the critical region (outside the range [−1.984, 1.984]), so we reject the null hypothesis. There is sufficient evidence to conclude that there is a statistically significant difference in the mean time to pain relief between the new formula and the existing formula.”
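The p-value route leads to the same decision. As a rough sketch using only the standard library, the two-sided p-value below is computed as twice the upper-tail area of the t-distribution beyond |t|, integrated numerically. The function names and the truncation width are assumptions made for illustration, and the truncation is only reasonable for moderately large degrees of freedom such as the 98 used here.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, width=60.0, steps=20000):
    """Twice the upper-tail area beyond |t|, via Simpson's rule on [|t|, |t| + width]."""
    a = abs(t)
    h = width / steps
    total = t_pdf(a, df) + t_pdf(a + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return min(1.0, 2 * total * h / 3)

alpha = 0.05
df = 98
# Test statistic from the prompt's summary statistics (n = 50 per group).
t_stat = (15 - 20) / math.sqrt(4**2 / 50 + 5**2 / 50)
p = two_sided_p_value(t_stat, df)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p:.2e}, decision: {decision}")
```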

Step 7: Interpret the results and make recommendations

Explain the implications of the findings in the context of the research question or problem being investigated. Discuss any follow-up analysis, limitations, or assumptions of the hypothesis test.

In the example interview question, you could say,

“We can conclude that the mean time to pain relief differs between the two formulas. In this sample, the new formula reduced the mean time to pain relief by 5 minutes (from 20 to 15), a 25% reduction, which is consistent with the manufacturer's claim of faster relief. A follow-up analysis could include looking at metrics for certain participant segments of interest, such as people who have pain due to chronic diseases, to see if the effects are similar.”

Choosing a significance level (α)

While 0.05 and 0.01 are widely used, it’s good practice to discuss the multiple factors that can affect which significance level to choose. The main factors to consider are:

  • Type I and Type II errors
  • One-tailed vs. two-tailed tests

Type I and Type II errors

Data scientists should carefully consider the consequences of both types of errors and choose appropriate significance levels to minimize the risk of both types of errors based on the specific context.

Type I error (false positive) occurs when the null hypothesis ($H_0$) is rejected even though it is actually true. The probability of committing a Type I error is denoted by α (alpha), which is the significance level of the test. This error concludes that there is a significant effect or difference when, in reality, there is none.

Example: In a criminal trial, convicting an innocent person (rejecting the null hypothesis of innocence) when the person is actually innocent.

Type II error (false negative) occurs when the null hypothesis ($H_0$) is not rejected even though it is actually false. The probability of committing a Type II error is denoted by β (beta). This error fails to detect a real effect or difference when one actually exists.

Example: In a medical test, failing to diagnose a disease (not rejecting the null hypothesis of the absence of disease) when the person actually has the disease.

For a fixed sample size, Type I and Type II errors are inversely related: decreasing the probability of one typically increases the probability of the other. For example, in medical testing, lowering the threshold for diagnosing a disease (reducing the occurrence of Type II errors) may lead to an increase in false positives (Type I errors).
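The trade-off can be illustrated numerically. The sketch below assumes a two-sided z-test (a normal approximation, not the lesson's exact T-test) and a hypothetical true difference of 2 minutes with a standard error of 0.9. Both numbers are made up for illustration. With the sample size and effect held fixed, β should rise as α is tightened.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect, se):
    """Beta for a two-sided z-test when the true difference equals `effect`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / se  # true difference measured in standard errors
    # Power = probability the test statistic lands in either rejection tail.
    power = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    return 1 - power

# Hypothetical scenario: true difference of 2 minutes, standard error of 0.9.
for alpha in (0.10, 0.05, 0.01, 0.001):
    beta = type_ii_error(alpha, effect=2.0, se=0.9)
    print(f"alpha = {alpha:<6} beta = {beta:.3f}")
```

The usual way to reduce both error rates at once is to collect more data, which shrinks the standard error.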

One-tailed vs. two-tailed tests

One-tailed tests: Also known as directional tests, these are used when the hypothesis specifies the direction of the effect (e.g. greater than, less than). A one-tailed test determines whether a sample statistic is significantly greater than or significantly less than a population parameter, but not both. The critical region is located entirely in one tail of the distribution. It is typically used when there is a clear directional prediction, or when only one direction of effect is of interest.

A light bulb manufacturer claims that its new manufacturing process results in longer-lasting bulbs compared to the previous process. The company wants to verify this claim. Since the manufacturer is specifically interested in whether the new process results in longer-lasting bulbs, a one-tailed test is appropriate.

Two-tailed tests: Also known as nondirectional tests, these are used when the hypothesis does not specify the direction of the effect. A two-tailed test determines whether a sample statistic is significantly different from a population parameter, regardless of the direction of the difference. The critical region is split between both tails of the distribution. It is typically used when you want to detect any significant difference, regardless of whether it's an increase or decrease.

A researcher is investigating whether a new teaching method affects students' test scores, and doesn't have a specific hypothesis about whether the new method will lead to higher or lower scores; they just want to determine if there is any significant difference. In this case, since the researcher wants to determine if there is any significant difference in test scores, regardless of whether it's an increase or decrease, a two-tailed test is appropriate.

One vs. Two Tailed Test

A one-tailed test has higher power than a two-tailed test at the same significance level (α), provided the true effect lies in the hypothesized direction. The choice between one-tailed and two-tailed tests depends on the situation and the directional prediction of the hypothesis being tested.
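A short sketch can make the power comparison concrete. It assumes z-tests (a normal approximation) and a hypothetical standardized effect of 2 standard errors. The function names and the effect size are illustrative only. It also shows what happens when the true effect runs opposite to the one-tailed prediction.

```python
from statistics import NormalDist

def power_one_tailed(alpha, shift):
    """Power of an upper-tailed z-test; `shift` is the true effect in SE units."""
    z = NormalDist()
    return z.cdf(shift - z.inv_cdf(1 - alpha))

def power_two_tailed(alpha, shift):
    """Power of a two-sided z-test: probability of landing in either tail."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

alpha = 0.05
shift = 2.0  # hypothetical standardized effect (true difference / standard error)

print(f"Effect in predicted direction: one-tailed {power_one_tailed(alpha, shift):.3f}, "
      f"two-tailed {power_two_tailed(alpha, shift):.3f}")
print(f"Effect in opposite direction:  one-tailed {power_one_tailed(alpha, -shift):.4f}, "
      f"two-tailed {power_two_tailed(alpha, -shift):.3f}")
```

Under these assumptions, the one-tailed test gains power in the predicted direction but has almost no chance of flagging an effect in the opposite direction, while the two-tailed test performs the same either way.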

In general, a two-tailed test is preferred. A one-tailed test provides more power to detect an effect in the predicted direction, but it cannot detect an effect in the opposite direction. For example, a researcher investigating whether a new medication reduces anxiety levels in patients may hypothesize that it will significantly decrease anxiety levels. By only considering the possibility of a decrease, the researcher may overlook the possibility that the new medication increases anxiety levels. If it does, a one-tailed test would fail to detect this, and the result would look the same as if the medication had no effect.

Common pitfalls

  • Misinterpreting the p-value. The actual definition of p-value can be convoluted and difficult to understand, making it susceptible to misinterpretation. P-value is not the probability that the null hypothesis is true or the probability of rejecting the null hypothesis. P-value is the probability of obtaining test results at least as extreme as the result actually observed, assuming that the null hypothesis is correct. In simpler terms, a smaller p-value indicates stronger evidence against the null hypothesis, while a larger p-value suggests weaker evidence against it.
  • Senior candidates should better identify relevant information (by drawing on past experience and domain knowledge) and formulate reasonable hypotheses. For example, given a case study focusing on improving the effectiveness of the company's recommendation algorithm, they should easily be able to identify (multiple) relevant population parameters of interest for a hypothesis test and explain the pros and cons of each.
  • Senior candidates should demonstrate more experience in interpreting statistical results in the context of real-world problems and sharing results with stakeholders. They should provide more insightful interpretations of findings and articulate their implications for decision-making or further analysis. For example, in addition to the overall average impact, they should include findings about relevant user segments and provide recommendations for each segment when analyzing a hypothesis test.
  • Senior candidates should be better equipped to handle complex scenarios or data situations that might require adjustments to standard hypothesis testing procedures. For example, if a dataset has the potential for heterogeneous treatment effects, i.e. the feature being tested has significantly different effects on different user segments, they should discuss advanced statistical techniques such as propensity score matching.