Data Cleaning
Data cleaning is the process of identifying and correcting issues within a dataset. To assess your skill set, the interviewer may describe a scenario or present a sample dataset for you to evaluate. In a statistics interview round, you're expected to verbally walk through the data cleaning approach you'd take. In contrast, you're more likely to write code to clean the dataset in a technical/coding interview round.
What to expect
Example questions include:
- Imagine you're analyzing a dataset about customer transactions for an e-commerce platform. You discover that some of the records have missing values for key variables such as purchase amount, product category, and customer demographics. How would you handle these missing values?
- Suppose you're analyzing a dataset about financial transactions for a banking institution. During your analysis, you notice some transactions with unusually large or small amounts, compared to the rest of the data. How would you identify these outliers, and what strategies would you use to address them effectively?
This lesson will discuss:
- Handling missing data
- Outlier detection
For each topic, we’ll provide a brief description of each issue and list common mitigation methods.
Handling missing data
Before choosing a handling strategy, assess the missing data mechanism: it determines which methods minimize bias and keep your analysis valid.
The main types of missing data are:
- Missing completely at random (MCAR): missingness is unrelated to any variable, observed or unobserved.
- Missing at random (MAR): missingness depends only on observed variables.
- Missing not at random (MNAR): missingness depends on the unobserved values themselves.
Common mitigation methods include listwise deletion (appropriate mainly under MCAR), simple imputation (mean, median, or mode), and model-based approaches such as multiple imputation.
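A minimal sketch of how this assessment and handling might look with pandas (the column names, toy values, and chosen strategies are illustrative assumptions, not a fixed recipe):

```python
import numpy as np
import pandas as pd

# Hypothetical e-commerce transactions with missing values
df = pd.DataFrame({
    "purchase_amount": [25.0, np.nan, 40.0, 12.5, np.nan],
    "product_category": ["books", "toys", None, "books", "toys"],
})

# Quantify missingness per column before picking a strategy
missing_share = df.isna().mean()

# If MCAR and the gaps are few, dropping rows may be acceptable
dropped = df.dropna()

# Under MAR, impute from observed data, e.g. the column median
df["purchase_amount"] = df["purchase_amount"].fillna(
    df["purchase_amount"].median()
)
```

In an interview, the key point is justifying the choice: deletion is simple but discards information, while imputation keeps rows at the cost of added assumptions.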
Outlier detection
Outlier detection is the process of identifying observations or data points that deviate significantly from the majority of the data in a dataset.
Outliers can arise due to various reasons such as measurement errors, data entry mistakes, natural variability, or rare events. Outlier detection is an essential step in data analysis and modeling that ensures the accuracy, reliability, and robustness of insights derived from the data.
Common outlier types include:
- Univariate
- Multivariate
- Contextual
- Collective
Univariate
Outliers in a single variable (e.g., an anomaly in temperature sensor readings).
Detection methods
- Visual inspection:
- Box plots
- Histograms
- Scatter plots
- Statistical methods:
- Z-score: Calculates the number of standard deviations an observation is away from the mean. Observations with a z-score above a certain threshold (e.g., 3) are considered outliers.
- Modified Z-Score: Similar to the z-score but more robust to outliers. It uses the median and median absolute deviation (MAD) instead of the mean and standard deviation.
- IQR (Interquartile Range): Defines the range between the first quartile (Q1) and the third quartile (Q3). Observations outside a certain multiple of the IQR (e.g., 1.5 times the IQR) from the quartiles are considered outliers.
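The three statistical methods above can be sketched in NumPy; the sample values and thresholds here are illustrative assumptions:

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Z-score: standard deviations from the mean. Note that a large
# outlier inflates the mean and std, which can mask the outlier itself.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# Modified z-score: median and MAD are robust to the outlier,
# so masking is avoided (0.6745 rescales MAD to the std under normality)
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad
mz_outliers = values[np.abs(modified_z) > 3.5]

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

On this sample the plain z-score with a threshold of 3 misses the extreme point, while the robust methods flag it, which is why the modified z-score is often preferred for small or contaminated samples.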
Treatment methods
- Data transformation:
- Winsorization: Replace outliers with the nearest non-outlier value.
- Trimming: Remove extreme values beyond a certain percentile.
- Logarithmic transformation: Useful for reducing the influence of large outliers on skewed data distributions.
- Imputation: Outliers can be treated as missing values and imputed using standard techniques (e.g., median or model-based imputation). This retains the record while limiting the outlier's influence and preserving the overall structure of the data.
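The transformation methods above can be sketched with NumPy; the sample data and the 5th/95th percentile cutoffs are illustrative assumptions:

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Winsorization: cap values at chosen percentiles instead of removing them
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)

# Trimming: drop values beyond the same percentile bounds
trimmed = values[(values >= low) & (values <= high)]

# Log transform: compresses the right tail of skewed, positive data
# (log1p handles values near zero safely)
logged = np.log1p(values)
```

Winsorization keeps the sample size intact, trimming shrinks it, and the log transform changes the scale of every value, so the choice depends on whether downstream analysis needs the original units.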
Multivariate
Outliers involving multiple variables jointly (e.g., an anomaly in credit card transactions).
Detection methods
- Distance-based methods:
- Mahalanobis distance: Measures each point's distance from the data centroid while accounting for correlations between variables.
- K-nearest neighbors (KNN): Identify points with unusually large distances from neighbors.
- Machine Learning Algorithms:
- Isolation forest: Constructs an ensemble of decision trees to isolate outliers efficiently by partitioning the feature space. Outliers are expected to require fewer partitions to be isolated.
- One-class SVM (Support Vector Machine): Trains a model on the majority class (normal data) to define the region of normality. Observations lying outside this region are considered outliers.
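As a sketch of the isolation forest approach with scikit-learn (the synthetic 2-D data and the `contamination` value are assumptions you would tune for real data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# A normal 2-D cloud plus one obvious multivariate outlier appended last
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([normal, [[8.0, 8.0]]])

# contamination is the expected share of outliers -- a modeling assumption
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # -1 marks outliers, 1 marks inliers

outlier_idx = np.where(labels == -1)[0]
```

Because the forest isolates points by random splits, the far-away point requires very few partitions and receives the most anomalous score, matching the intuition described above.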
Treatment methods
- Clustering: Group similar data points together so that outliers are isolated, then analyze each cluster separately.
- Robust methods: Use algorithms and estimators that are less sensitive to outliers (e.g., median-based statistics, robust regression).
Contextual
Outliers that are anomalous only in a particular context or in light of domain knowledge (e.g., a sudden surge in website traffic).
Detection methods
- Expert judgment: Consult domain experts to identify unusual data points.
- Time-series analysis: Detect anomalies based on temporal patterns.
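A simple time-series sketch with pandas, using a trailing rolling window so each point is judged against its recent context (the traffic values, window size, and threshold are illustrative assumptions):

```python
import pandas as pd

# Hypothetical hourly website traffic with one sudden surge
traffic = pd.Series([100, 103, 98, 101, 99, 102, 100, 500, 101, 97],
                    dtype=float)

# Compare each point to the trailing window that precedes it
# (shift(1) ensures a point is not compared against itself)
rolling_mean = traffic.rolling(window=5, min_periods=3).mean().shift(1)
rolling_std = traffic.rolling(window=5, min_periods=3).std().shift(1)
score = (traffic - rolling_mean) / rolling_std
anomalies = traffic[score.abs() > 3]
```

A value of 500 is not extreme in every domain; it is the deviation from the local temporal pattern that makes it a contextual outlier here.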
Treatment methods
- Domain-specific treatment: Handle outliers based on domain-specific rules or requirements.
- In some cases, outliers may be valid data points that represent rare or extreme events. If the outliers are genuine observations and do not significantly affect the analysis, it may be appropriate to leave them in the dataset.
- One straightforward approach is to remove outliers from the dataset. However, this approach should be used with caution as removing outliers can lead to a loss of valuable information and potentially bias the analysis if the outliers are not truly erroneous.
Collective
Groups of outliers occurring together (e.g., a cluster of defective products).
Detection methods
- Clustering: Identify clusters of data points deviating from the norm. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters of data points based on density; outliers are data points that do not belong to any cluster.
- Association rule mining: Identify patterns of co-occurring outliers.
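A sketch of the DBSCAN approach with scikit-learn (the synthetic clusters, `eps`, and `min_samples` are tuning assumptions; a small group of joint deviants is appended last):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense clusters plus a small group deviating together
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
deviants = np.array([[10.0, 10.0], [10.1, 10.2], [9.9, 10.1]])
X = np.vstack([cluster_a, cluster_b, deviants])

# Points that cannot reach min_samples neighbors within eps get label -1
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_idx = np.where(labels == -1)[0]
```

Note that the three deviant points are close to each other but too few to form a cluster of their own, so DBSCAN labels the whole group as noise, which is exactly the collective-outlier pattern described above.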
Treatment methods
- Investigate root cause: Determine if outliers are due to data collection errors or genuine anomalies.
