
Data Cleaning


Data cleaning is the process of identifying and correcting issues within a dataset. To assess your skillset, the interviewer may ask you about a scenario or present you with a sample dataset to evaluate. In a statistics interview round, you’re expected to verbally walk through the data cleaning approach you’d take. In contrast, you’re more likely to write code to clean the dataset in a technical/coding interview round.

What to expect

Example questions include:

  • Imagine you're analyzing a dataset about customer transactions for an e-commerce platform. You discover that some of the records have missing values for key variables such as purchase amount, product category, and customer demographics. How would you handle these missing values?
  • Suppose you're analyzing a dataset about financial transactions for a banking institution. During your analysis, you notice some transactions with unusually large or small amounts, compared to the rest of the data. How would you identify these outliers, and what strategies would you use to address them effectively?

This lesson will discuss:

  • Handling missing data
  • Outlier detection

For each topic, we’ll provide a brief description of each issue and list common mitigation methods.

Handling missing data

Assessing the missing-data mechanism helps minimize bias and preserve the validity of your analysis.

The main types of missing data are:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
Missing Completely at Random (MCAR)

  • Description: The probability of data being missing is random and unrelated to any observed or unobserved variable in the dataset.
  • Example: A survey on customer satisfaction where some respondents accidentally skip questions due to page flipping.
  • Handling strategy: Remove observations with missing values (if <20% of total observations), or use any imputation method (simple: mean/median/mode imputation; advanced: regression, K-nearest neighbors (KNN), multiple imputation), since the missingness is unrelated to the observed data.

Missing at Random (MAR)

  • Description: The probability of data being missing may depend on observed variables but not on the missing values themselves; the missingness can be systematically related to other variables.
  • Example: A study on income where respondents with higher income levels are less likely to disclose their earnings.
  • Handling strategy: Use multiple imputation or model-based imputation methods, incorporating information from other observed variables to impute missing values. Apply domain knowledge to guide the choice of imputation method and to assess the suitability of imputed values. Create indicator variables to flag missing values, allowing models to account for the missingness as a separate category.

Missing Not at Random (MNAR)

  • Description: The probability of data being missing depends on the missing values themselves, even after accounting for observed variables.
  • Example: A clinical trial where participants with severe side effects from a medication are more likely to drop out of the study, leading to missing data on adverse reactions.
  • Handling strategy: MNAR data are the most challenging to handle; advanced techniques such as pattern mixture models or selection models may be required to account for the missingness mechanism.
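As a minimal sketch of the simple strategies above (assuming plain Python lists with None marking missing entries), mean imputation combined with a missingness indicator might look like:

```python
import statistics

def impute_mean_with_flag(values):
    """Fill None entries with the mean of the observed values and
    return an indicator column flagging which entries were missing.

    Mean imputation is only defensible under MCAR; for MAR data,
    prefer model-based or multiple imputation.
    """
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    filled = [mean if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical purchase amounts with two missing records
purchases = [20.0, None, 35.0, 25.0, None]
filled, flags = impute_mean_with_flag(purchases)
```

In practice you would reach for pandas (`fillna`) or scikit-learn's `SimpleImputer`/`KNNImputer` rather than hand-rolling this; the sketch only shows the mechanics.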

Outlier detection

Outlier detection is the process of identifying observations or data points that deviate significantly from the majority of the data in a dataset.

Outliers can arise due to various reasons such as measurement errors, data entry mistakes, natural variability, or rare events. Outlier detection is an essential step in data analysis and modeling that ensures the accuracy, reliability, and robustness of insights derived from the data.

Common outlier types include:

  • Univariate
  • Multivariate
  • Contextual
  • Collective

Univariate

Outliers in a single variable (e.g. anomaly in temperature sensor readings).

Detection methods

  • Visual inspection:
    • Box plots
    • Histograms
    • Scatter plots
  • Statistical methods:
    • Z-score: Calculates the number of standard deviations an observation is away from the mean. Observations with a z-score above a certain threshold (e.g., 3) are considered outliers.
    • Modified Z-Score: Similar to the z-score but more robust to outliers. It uses the median and median absolute deviation (MAD) instead of the mean and standard deviation.
    • IQR (Interquartile Range): Defines the range between the first quartile (Q1) and the third quartile (Q3). Observations outside a certain multiple of the IQR (e.g., 1.5 times the IQR) from the quartiles are considered outliers.
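The z-score and IQR rules can be sketched in plain Python (using `statistics.quantiles` with its default quartile method; the thresholds 3 and 1.5 are the conventional defaults mentioned above):

```python
import statistics

def zscore_outliers(xs, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [x for x in xs if abs(x - mu) / sd > threshold]

def iqr_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

Note that on small samples an extreme value inflates the standard deviation and can mask itself under the z-score rule, which is one reason the IQR rule (and the MAD-based modified z-score) are considered more robust.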

Treatment methods

  • Data transformation:
    • Winsorization: Replace outliers with nearest non-outlier value.
    • Trimming: Remove extreme values beyond a certain percentile.
    • Logarithmic transformation: Useful for reducing the influence of large outliers on skewed data distributions.
  • Imputation: For datasets with missing values, outliers can be treated as missing data and imputed using appropriate techniques mentioned above. Imputation can help retain valuable information from outliers while filling in missing values and preserving the overall structure of the data.
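Winsorization and trimming can be illustrated with an IQR-fence variant (a common alternative is clamping at fixed percentiles, e.g. the 5th/95th); this is a sketch, not the only convention:

```python
import statistics

def winsorize_iqr(xs, k=1.5):
    """Clamp values beyond the IQR fences to the fence values
    instead of discarding them."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [min(max(x, lo), hi) for x in xs]

def trim_iqr(xs, k=1.5):
    """Drop values beyond the IQR fences entirely."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if lo <= x <= hi]
```

Winsorizing preserves the sample size (each outlier becomes the nearest fence value), while trimming shrinks it; which is appropriate depends on whether the extreme values carry signal.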

Multivariate

Outliers involving multiple variables (e.g. an anomaly in credit card transactions).

Detection methods

  • Distance-based methods:
    • Mahalanobis distance: Measure distance from centroid.
    • K-nearest neighbors (KNN): Identify points with unusually large distances from neighbors.
  • Machine Learning Algorithms:
    • Isolation forest: Constructs an ensemble of decision trees to isolate outliers efficiently by partitioning the feature space. Outliers are expected to require fewer partitions to be isolated.
    • One-class SVM (Support Vector Machine): Trains a model on the majority class (normal data) to define the region of normality. Observations lying outside this region are considered outliers.
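The Mahalanobis-distance idea can be sketched for the two-dimensional case, where the 2×2 covariance matrix can be inverted by hand (for real data and higher dimensions, use numpy/scipy, or `sklearn.ensemble.IsolationForest` for the tree-based approach):

```python
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample centroid.

    Unlike plain Euclidean distance, this accounts for the scale and
    correlation of the two variables.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # closed-form inverse of the 2x2 covariance matrix
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy
        dists.append(d2 ** 0.5)
    return dists

# The last point breaks the x/y correlation followed by the others
points = [(1, 1), (2, 2), (3, 3), (4, 4), (10, 0)]
dists = mahalanobis_2d(points)
```

Points with a distance above a chosen threshold (often derived from a chi-squared quantile) are flagged as multivariate outliers.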

Treatment methods

  • Clustering: Group similar data points together to isolate outliers, then perform the analysis separately for each cluster.
  • Robust methods: Use algorithms less sensitive to outliers.

Contextual

Outliers based on context or domain knowledge (e.g. a sudden surge in website traffic).

Detection methods

  • Expert judgment: Consult domain experts to identify unusual data points.
  • Time-series analysis: Detect anomalies based on temporal patterns.
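A very simple temporal-pattern check compares each point against a trailing window of recent history; this is only a stand-in for proper time-series methods (e.g. STL decomposition or ARIMA residual analysis), and the window size and threshold here are illustrative:

```python
import statistics

def rolling_anomalies(series, window=5, threshold=3.0):
    """Return indices of points that deviate from the trailing-window
    mean by more than `threshold` trailing-window standard deviations.
    """
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sd = statistics.stdev(hist)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged

# Hypothetical hourly traffic with one sudden surge
traffic = [100, 102, 101, 99, 100, 500, 101]
```

A value like the surge above is contextual: 500 might be perfectly normal in another part of the series, which is why the comparison is made against local history rather than global statistics.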

Treatment methods

  • Domain-specific treatment: Handle outliers based on domain-specific rules or requirements.
  • In some cases, outliers may be valid data points that represent rare or extreme events. If the outliers are genuine observations and do not significantly affect the analysis, it may be appropriate to leave them in the dataset.
  • One straightforward approach is to remove outliers from the dataset. However, this approach should be used with caution as removing outliers can lead to a loss of valuable information and potentially bias the analysis if the outliers are not truly erroneous.

Collective

Groups of outliers occurring together (e.g. a cluster of defective products).

Detection methods

  • Clustering: Identify clusters of data points deviating from the norm. For example, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters of data points based on density; points that do not belong to any cluster are flagged as outliers.
  • Association rule mining: Identify patterns of co-occurring outliers.
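A minimal, illustrative DBSCAN (in practice, use `sklearn.cluster.DBSCAN`; `eps` and `min_pts` below are arbitrary example values) shows how density-based clustering separates a dense group from isolated points:

```python
def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: returns one label per point; -1 marks noise,
    i.e. points that belong to no dense cluster (potential outliers).
    """
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # not dense enough: noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                # expand the cluster outward
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:   # only core points keep expanding
                queue.extend(jn)
    return labels

# Four nearby points form a cluster; the fifth is isolated
labels = dbscan([(0, 0), (0.5, 0), (0, 0.5), (0.3, 0.3), (10, 10)])
```

For collective outliers the interesting output is often the small, separate cluster itself (a group of defective products), not just the -1 noise labels.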

Treatment methods

  • Investigate root cause: Determine if outliers are due to data collection errors or genuine anomalies.