Explain “Training” and “Testing” Data
In this mock interview, Angie asks Raj (MLE @ Snapchat) to “explain the terms ‘training data’ and ‘testing data’ in the context of machine learning.” Below is a supplemental written solution that shows how to approach the question and its follow-up questions.
Answer
Training data generally refers to the portion of the data that a machine learning algorithm uses to learn patterns. For example, the parameters of a logistic regression model can be chosen such that the error is minimized on the training set. The testing set is data that the algorithm does not see during training and is used purely to gauge the algorithm's performance on new examples.
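The split described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the toy one-dimensional dataset, the 80/20 split, and the helper names (make_data, fit_logistic, accuracy) are all assumptions made for this example, and the logistic regression is fit with plain batch gradient descent.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical toy data: x is uniform on [-3, 3] and the label is 1 when x > 0."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, 1 if x > 0 else 0) for x in xs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, lr=0.1, epochs=200):
    """Choose w and b by gradient descent on the average log loss of `rows`."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            err = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(rows)
        b -= lr * grad_b / len(rows)
    return w, b

def accuracy(params, rows):
    """Fraction of rows whose predicted class (threshold 0.5) matches the label."""
    w, b = params
    correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in rows)
    return correct / len(rows)

rng = random.Random(0)           # fixed seed so the split is reproducible
data = make_data(200, rng)
rng.shuffle(data)
train, test = data[:160], data[160:]   # assumed 80/20 split

params = fit_logistic(train)           # parameters are learned from the training set only
train_acc = accuracy(params, train)
test_acc = accuracy(params, test)      # the test set is used only to measure performance
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The key point is that fit_logistic never receives the test rows, so the test accuracy is an estimate of how the model behaves on data it has not seen.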
Suppose your interviewer continues the conversation with a follow-up question. For example, you might be asked, “You mentioned that you want to minimize the error on the training data set. What if your algorithm involves parameters (e.g., number of layers, size of your neural network, or learning rate) that you need to tune?”
A strong response would be, “Generally those are called hyperparameters. People will typically take out a portion of the training data and call it a validation set. Then they will tune those hyperparameters to maximize the performance on the validation set.”
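As a rough sketch of that workflow, the example below holds out part of the training data as a validation set and uses it to pick a learning rate. Everything specific here is an assumption for illustration: the toy data, the 60/20/20 split, the candidate learning rates, and the helper names (fit_logistic, accuracy). The model is again a one-feature logistic regression fit with plain gradient descent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, lr, epochs=50):
    """Fit w and b by gradient descent; `lr` is the hyperparameter being tuned."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(rows)
        b -= lr * grad_b / len(rows)
    return w, b

def accuracy(params, rows):
    w, b = params
    correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in rows)
    return correct / len(rows)

rng = random.Random(0)
# Hypothetical toy data: the label is 1 when x > 0.5, so the bias term matters.
data = [(x, 1 if x > 0.5 else 0)
        for x in (rng.uniform(-3.0, 3.0) for _ in range(300))]

# The rows are already in random order, so slicing gives a random split.
train, val, test = data[:180], data[180:240], data[240:]

# Tune the hyperparameter: fit on train, score on validation, keep the best.
candidate_lrs = [0.001, 0.01, 0.1, 1.0]
val_scores = {lr: accuracy(fit_logistic(train, lr), val) for lr in candidate_lrs}
best_lr = max(val_scores, key=val_scores.get)

# Touch the test set once, after the hyperparameter has been chosen.
final_params = fit_logistic(train, best_lr)
test_acc = accuracy(final_params, test)
print(f"best learning rate: {best_lr}, test accuracy: {test_acc:.2f}")
```

Because the learning rate is selected using only the validation rows, the test set remains untouched until the final measurement and still gives an honest estimate of performance.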
What makes this answer effective
The answer clearly distinguishes between the training set and the testing set. It provides a high-level description of the training process to explain what the training set is used for. It gives a relevant example using logistic regression. It explicitly mentions that the test set is not used for training but to evaluate performance.