How to Answer ML Data Handling Questions
Overview
Data handling questions assess your knowledge of preparing data for machine learning. Some sample questions include:
- What are some common transformations for categorical data?
- What is an imbalanced dataset? Can you list some ways to deal with it?
- Describe what you do when you have erroneous or corrupted data.
To prepare for these types of questions, review the terms under the “Data handling” section of our ML Interview Glossary.
How to answer
You don’t need to know all the different ways to deal with categorical variables or all the different imputation methods. Instead, you should know how data handling methods solve real-world problems. Additionally, you should know how to explain the main methods, their pitfalls, and their advantages.
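As one illustration of explaining a method together with its trade-offs, the sketch below contrasts two common encodings for categorical data. The helper names (`one_hot`, `ordinal`) and the sample values are hypothetical and chosen only for this example; in practice you would likely use a library encoder.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category.

    Advantage: implies no order between categories.
    Pitfall: the column count grows with the number of distinct values.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal(values, order):
    """Map each value to its position in an explicitly supplied order.

    Advantage: a single compact column.
    Pitfall: implies a ranking, so it only suits categories that have one.
    """
    index = {c: i for i, c in enumerate(order)}
    return [index[v] for v in values]


# Hypothetical data: colors have no natural order, sizes do.
colors = ["red", "green", "blue", "green"]
color_columns, color_encoded = one_hot(colors)

sizes = ["small", "large", "medium"]
size_encoded = ordinal(sizes, order=["small", "medium", "large"])
```

Being able to say when each encoding fits, and what goes wrong when it is misapplied, matters more than listing every available encoder.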
The role often targets a specific area, such as natural language processing (NLP), computer vision (CV), or a particular business function (e.g., risk or finance). Try to mention the problems and datasets the team is actually working with. Many companies run blogs where you can learn about their problems and methods. For example, Stitch Fix has a public blog that discusses its projects in machine learning, software engineering, and data science. Uber, DoorDash, Airbnb, and many other tech companies also maintain public engineering blogs.
Let’s say your interviewer asks, “What is imbalance in a dataset? How does it impact your machine learning models?”
You could say,
“Imbalance is when one class of a categorical variable (often a binary target) appears far more often than the others. For example, when predicting whether a recipient converts on an email, we might see that 5% convert and 95% do not. This is imbalanced, given how far it is from a 50-50 split.
Imbalanced data often calls for a different optimization metric for our model and careful consideration of how we fit and train the model to meet company needs. When training on highly imbalanced data, the algorithm tends to learn to classify the majority class well while handling data points from the minority class poorly. Applying oversampling or undersampling techniques to the training set is one way to combat this issue.”
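To make the sample answer concrete, here is a minimal sketch, assuming the 5%/95% email example above. It shows why accuracy alone can mislead on imbalanced data and how random oversampling rebalances a training set. The data, the always-"no" baseline, and the helper names are hypothetical.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one.

    Intended for the training split only, never validation or test data.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical email data: 5 conversions out of 100 sends.
y_train = [1] * 5 + [0] * 95
X_train = [[i] for i in range(100)]  # placeholder features

# A baseline that always predicts the majority class ("does not convert").
always_no = [0] * len(y_train)
baseline_accuracy = accuracy(y_train, always_no)  # high, despite finding no converters
baseline_recall = recall(y_train, always_no)      # zero: every converter is missed

# Rebalance the training set by duplicating minority-class rows.
X_bal, y_bal = random_oversample(X_train, y_train)
balanced_counts = Counter(y_bal)
```

The baseline looks strong on accuracy yet never identifies a converter, which is why a metric such as recall or precision is often a better fit here. Oversampling duplicates existing rows rather than adding new information, so it can encourage overfitting to the minority examples.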
Common pitfalls
- Rambling off-topic. In many of these cases, you will receive a specific problem. Stick to solving the problem in front of you. For example, if you don’t have text data, don’t bring up those methods. If there are no categorical variables, don’t discuss those methods.
- Failing to clarify the question. For example, if you’re asked how you’d handle missing data, consider clarifying the type of data, the availability, and the context of the problem. These are critical factors that would impact the method you choose.
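The sketch below shows one way the type of data can change how you handle missing values: a median fill for a numeric column and a most-frequent fill for a categorical one, each paired with a flag recording which entries were originally missing. The column names, values, and the `impute_column` helper are hypothetical, and these are only two of many possible strategies.

```python
from statistics import median, mode


def impute_column(values, kind):
    """Fill None entries and return (filled_values, was_missing_flags).

    kind="numeric" fills with the median of the observed values;
    any other kind fills with the most frequent observed value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed) if kind == "numeric" else mode(observed)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags


# Hypothetical customer data with gaps in both columns.
ages = [34, None, 29, 41, None, 38]
plans = ["basic", "pro", None, "basic", "basic", None]

ages_filled, ages_missing = impute_column(ages, "numeric")
plans_filled, plans_missing = impute_column(plans, "categorical")
```

In a real pipeline, the fill value would typically be computed on the training split and reused for validation and test data, so that no information leaks between splits.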
Senior candidates
Expectations differ slightly for senior candidates. The more senior the role, the more you’re expected to demonstrate your ability to:
- Build the model infrastructure from end to end
- Gauge the pros and cons of using a particular algorithm
- Integrate your domain knowledge from previous roles
- Describe your experience productionizing ML models in previous roles
- Work cross-functionally with both technical and non-technical stakeholders