Bias
In data science, bias refers to systematic errors or inaccuracies in data that lead to incorrect or misleading conclusions. Bias can arise in the data collection, preprocessing, analysis, and interpretation stages of a project.
What to expect
Example questions include:
- What are the types of biases that can occur during sampling?
- When sampling data for analysis, what would you consider to verify that the samples are good?
- How do you evaluate if the samples are biased or not?
- What hypotheses would you use to detect bias? How would you test them?
- What happens to regression coefficients if you have omitted variable bias?
This lesson will explain common types of biases and provide examples for each.
Selection bias
Selection bias occurs when the process of selecting data introduces non-randomness, leading to a sample that is not representative of the population.
For example, an online survey can only reach people who have internet access, so groups with limited access are underrepresented and the results are biased.
Sampling bias
Sampling bias occurs when the method of sampling results in a sample that does not accurately reflect the population of interest.
For example, if a study samples patients only from urban areas, its findings may not generalize to rural populations.
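One way to test whether a sample is biased, assuming the true population proportions are known from an external source such as a census, is a chi-square goodness-of-fit test. The null hypothesis is that the sample was drawn with the population proportions. The sketch below is a minimal illustration, and the counts, proportions, and function name are hypothetical.

```python
def chi_square_gof(observed_counts, population_props):
    """Chi-square goodness-of-fit statistic for sample counts vs. known proportions."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, count in observed_counts.items():
        expected = total * population_props[group]  # count expected under no bias
        stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical numbers: the population is 60% urban, but the sample is 90% urban.
population_props = {"urban": 0.6, "rural": 0.4}
sample_counts = {"urban": 450, "rural": 50}

stat = chi_square_gof(sample_counts, population_props)

# Critical value of the chi-square distribution with 1 degree of freedom
# (two groups minus one) at the 5% significance level.
CRITICAL_DF1_ALPHA_05 = 3.841
looks_biased = stat > CRITICAL_DF1_ALPHA_05

print(f"chi-square statistic: {stat:.1f}, reject 'representative' hypothesis: {looks_biased}")
```

A statistic above the critical value suggests the sample composition differs from the population by more than chance would explain. The test only covers the variables compared, so a sample can pass it and still be biased on a variable that was not checked.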
Response bias
Response bias occurs when the responses provided by participants in a study are influenced by factors other than the variables of interest.
For example, survey respondents may give answers they believe are socially acceptable rather than truthful.
Confirmation bias
Confirmation bias in data science refers to the tendency of data scientists to favor information or evidence that confirms their existing beliefs, hypotheses, or expectations, while disregarding or downplaying contradictory evidence. It can lead to distorted interpretations of data and biased conclusions, ultimately undermining the validity and reliability of research findings.
For example, an analyst might apply transformations or filters that highlight the trends confirming a hypothesis while suppressing the trends that contradict it.
Survivorship bias
Survivorship bias in data science refers to the error that arises when only the observations that have ‘survived’ some process or selection criterion are analyzed, while the unsuccessful or eliminated instances are ignored. Because the failures never enter the dataset, conclusions drawn from it are skewed toward whatever the survivors have in common.
For example, a churn analysis that only considers customers who are still active subscribers and ignores those who have already churned may lead to biased conclusions about the factors influencing churn.
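The effect of analyzing only survivors can be shown with a small simulation. The sketch below uses entirely synthetic data and assumes, for illustration only, that a customer's probability of churning falls as a hypothetical satisfaction score rises.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Synthetic customers: each has a satisfaction score in [0, 1].
n_customers = 10_000
satisfaction = [random.random() for _ in range(n_customers)]

# Assumption for illustration: churn probability is 1 - satisfaction,
# so unhappy customers are more likely to leave the dataset.
still_active = [s for s in satisfaction if random.random() < s]

mean_all = sum(satisfaction) / len(satisfaction)
mean_survivors = sum(still_active) / len(still_active)

print(f"mean satisfaction, all customers:    {mean_all:.2f}")
print(f"mean satisfaction, active customers: {mean_survivors:.2f}")
```

Under these assumptions the active-only average overstates satisfaction, because the customers who would have pulled it down have already left the data.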
Omitted variable bias
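Omitted variable bias, which is closely related to confounding, occurs in statistical analysis when a relevant variable that should be included in a model is left out. If the omitted variable affects the outcome and is correlated with a variable that is included, the included variable's coefficient absorbs part of the omitted effect, so the estimate is biased in a direction that depends on the signs of those two relationships.
For example, a model of customer satisfaction that excludes a key variable such as ‘customer support quality’ will attribute part of its effect to whichever included variables are correlated with it.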
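To see what happens to a regression coefficient when a relevant variable is omitted, the sketch below simulates synthetic data and compares the slope from a regression that leaves the variable out with the slope after controlling for it. All variable names and coefficients (a true effect of 1.0 for usage and 2.0 for support quality) are hypothetical, and controlling for support quality is done by residualizing usage on it first, which gives the same slope as including both variables in one regression.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    return covariance(x, y) / covariance(x, x)

random.seed(0)
n = 20_000

# Synthetic data: support quality drives both product usage and satisfaction.
support = [random.gauss(0, 1) for _ in range(n)]
usage = [0.8 * s + random.gauss(0, 1) for s in support]
satisfaction = [1.0 * u + 2.0 * s + random.gauss(0, 1)
                for u, s in zip(usage, support)]

# Regression that omits support quality: the usage slope absorbs its effect.
slope_omitted = slope(usage, satisfaction)

# Control for support quality by removing its linear effect from usage,
# then regress satisfaction on what is left.
gamma = slope(support, usage)
mean_usage, mean_support = mean(usage), mean(support)
usage_resid = [u - mean_usage - gamma * (s - mean_support)
               for u, s in zip(usage, support)]
slope_controlled = slope(usage_resid, satisfaction)

print(f"usage slope, support omitted:  {slope_omitted:.2f}")
print(f"usage slope, support included: {slope_controlled:.2f}")
```

In this setup the omitted variable is positively related to both usage and satisfaction, so the slope from the regression that omits it is inflated well above the true value of 1.0, while the controlled slope lands close to it.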