Data Transformation
Data transformation helps convert raw datasets into usable, uniform formats for improved analysis and insights. Answering these interview questions effectively requires a solid understanding of how and when different methods are implemented.
What to expect
Example questions include:
- Explain how scaling and normalization affect the distribution and scale of the data.
- When would you use Box-Cox transformation over other types of transformations?
- When can one-hot encoding be a problem?
This lesson will discuss:
- Scaling, standardization, and normalization
- Transformation
- Encoding categorical variables
For each topic, we’ll provide a brief description and list common methods along with when to use them.
Scaling, standardization, and normalization
Scaling, standardization, and normalization are data preprocessing techniques used to rescale and transform the features of a dataset to a common scale.
Scaling
Scaling rescales the features to a specific range, such as [0, 1] or [-1, 1]. Scaling ensures that all features contribute equally to the analysis and prevents features with larger magnitudes from dominating the model.
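As a minimal sketch, min-max scaling can be done directly with NumPy (the array values here are illustrative; `sklearn.preprocessing.MinMaxScaler` implements the same idea):

```python
import numpy as np

# Toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling to [0, 1], applied column-wise:
# x' = (x - min) / (max - min)
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)
# Each column now spans exactly [0, 1].
```

After scaling, the large-magnitude second column no longer dominates distance computations.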
Standardization
Standardization transforms the features to have a mean of 0 and a standard deviation of 1. This centers and scales the data but does not make the distribution Gaussian.
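A quick sketch of z-score standardization with NumPy (again with illustrative values; `sklearn.preprocessing.StandardScaler` is the usual library tool):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization (z-score): subtract the column mean, divide by
# the column standard deviation. Result has mean 0 and std 1 per column.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma
```

Note that the shape of each column's distribution is unchanged; only its location and spread are rescaled.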
Normalization
Normalization is often used interchangeably with min-max scaling to a range such as [0, 1], so the term overlaps with scaling. In other contexts, “normalization” refers to rescaling each data point (vector) so that its length is 1, for example by dividing by its L2 norm. Either way, it ensures that features or samples contribute proportionally, which is especially important for algorithms that measure distances or angles between data points, such as k-nearest neighbors, clustering, or working with embeddings.
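The vector sense of normalization can be sketched with NumPy (values are illustrative; `sklearn.preprocessing.Normalizer` does the same row-wise rescaling):

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# L2-normalize each row: divide by its Euclidean length
# so every sample lies on the unit sphere.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
# e.g. [3, 4] has length 5, so it becomes [0.6, 0.8].
```

After this step, cosine similarity between rows reduces to a plain dot product, which is why it is common with embeddings.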
Transformation
Data transformation involves converting the original data into a different format or representation to make it more suitable for analysis or modeling. Common examples include the log, square root, reciprocal, and Box-Cox transformations, each of which reshapes a feature's distribution, most often to reduce skew.
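As a small illustration with made-up values, a log transform turns multiplicative gaps in right-skewed data into additive ones (Box-Cox, available as `scipy.stats.boxcox`, generalizes this by fitting a power parameter λ, and requires strictly positive inputs):

```python
import numpy as np

# Right-skewed, strictly positive data (e.g. incomes): each value
# is 10x the previous one.
x = np.array([1.0, 10.0, 100.0, 1000.0])

# Log transform: equal ratios become equal differences.
x_log = np.log10(x)  # -> [0., 1., 2., 3.]
```

After the transform the values are evenly spaced, which many models and statistical tests handle far better than the original heavy-tailed scale.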
Encoding categorical variables
Encoding categorical variables involves converting categorical data, which represents categories or labels, into numerical representations that can be used in machine learning algorithms.
Categorical variables can be of two types: ordinal and nominal.
Ordinal variables have a natural order or ranking among their categories. For example, a variable representing educational attainment might have categories like ‘High School Diploma’, ‘Bachelor's Degree’, and ‘Master's Degree’, which have a clear order from lowest to highest.
Nominal variables do not have a natural order or ranking among their categories. For example, a variable representing colors might have categories like ‘Red’, ‘Blue’, etc., which do not have a meaningful order. Common techniques for encoding include:
- Label encoding: assigns a unique integer to each category of the categorical variable. This is suitable for ordinal variables, but should be used with caution for nominal variables, as it may inadvertently introduce order where none exists.
- One-hot encoding: creates binary dummy variables for each category of the categorical variable. Each category is represented by a column, and a value of 1 indicates the presence of that category, while a value of 0 indicates its absence. One-hot encoding is suitable for both ordinal and nominal variables and avoids the issue of introducing unintended order.
- Dummy encoding: similar to one-hot encoding but creates n−1 dummy variables for a variable with n categories, dropping one reference category. This helps avoid multicollinearity issues in regression models while still capturing all the necessary information.
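The three techniques above can be sketched with pandas (the column names and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Red", "Green"]})

# One-hot encoding: one indicator column per category.
onehot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: drop the first category, leaving n-1 columns,
# to avoid multicollinearity in regression models.
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Label encoding for an ordinal variable: an explicit mapping
# preserves the intended order instead of an arbitrary one.
edu = pd.Series(["High School", "Master's", "Bachelor's"])
order = {"High School": 0, "Bachelor's": 1, "Master's": 2}
edu_encoded = edu.map(order)
```

Using an explicit mapping for the ordinal case (rather than letting an encoder assign integers alphabetically) is what keeps the "order where none exists" problem from appearing in reverse: here the order does exist, and the encoding should respect it.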