Discuss Batch, Mini-Batch, Stochastic Gradient Descent
In this mock interview, Angie asks Raj (MLE @ Snapchat) to discuss “the differences between batch gradient descent, mini-batch gradient descent, and stochastic gradient descent.” Below is a supplemental written solution that shows how to approach the question and follow-up questions.
Answer
Gradient descent is an optimization technique used to find the minimum of a loss function. Specifically, the gradient is calculated by taking the derivative of the loss with respect to the parameters of a particular model. Since the gradient points in the direction of steepest ascent, stepping in the opposite direction lets you take gradual steps toward the minimum of that loss function.
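To make the update rule concrete, here is a minimal sketch of a single gradient-descent step; the function names (`gradient_step`, `grad_fn`) and the toy loss are illustrative assumptions, not part of the interview answer:

```python
# Generic gradient-descent update (sketch): step AGAINST the gradient.
# w: parameter value, grad_fn: gradient of the loss at w, lr: learning rate.
def gradient_step(w, grad_fn, lr=0.1):
    return w - lr * grad_fn(w)

# Toy example: minimize f(w) = (w - 4)^2, whose gradient is 2 * (w - 4).
w = 0.0
for _ in range(100):
    w = gradient_step(w, lambda w: 2 * (w - 4))

print(round(w, 3))  # converges toward 4.0, the minimizer of f
```

Repeating the step shrinks the distance to the minimizer geometrically, which is the "gradual steps" intuition in the answer above.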
The terms refer to different but related ways of dividing up the training set, computing the gradient, and performing the parameter updates. Batch gradient descent uses the entire training set in one go: you compute the gradient over all examples and then perform a single update step. Mini-batch gradient descent divides the training set into what are called mini-batches, where you typically choose a batch size; you then compute the gradient separately on each mini-batch and take an update step for each one.
Stochastic gradient descent is related to both batch and mini-batch gradient descent, but its defining feature is randomness: you shuffle the training set, and in the strictest sense you compute the gradient and perform a parameter update one example at a time. In practice the term is often used loosely for shuffled mini-batch updates as well.
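A minimal sketch of strict (single-example) stochastic gradient descent with per-epoch shuffling follows; the noiseless toy data and hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noiseless toy data: y = 3x + 1 (illustrative assumption).
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0
Xb = np.hstack([X, np.ones((200, 1))])  # append a bias column

w_sgd = np.zeros(2)
lr = 0.05
for epoch in range(20):
    order = rng.permutation(len(y))      # reshuffle every epoch
    for i in order:                      # one update PER EXAMPLE
        grad = 2.0 * (Xb[i] @ w_sgd - y[i]) * Xb[i]
        w_sgd -= lr * grad

print(w_sgd)  # approaches the true parameters [3, 1]
```

The reshuffle at the top of each epoch is what removes any ordering in the original data from the sequence of updates.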
Let’s say your interviewer wants to continue the conversation through a follow-up question. For example, assume you’re asked, “Why are there different techniques, and when might you choose one over the other?”
A strong response would be, “People typically choose to split up the data into batches because of memory requirements. If you have a data set with millions of data points, for example, you usually cannot fit all of it into memory when doing gradient descent. In practice, people divide the data into mini-batches so each batch fits into GPU memory, and then you can compute these updates batch by batch and gradually lower the loss. Mini-batch gradient descent is also used as a regularizer to prevent overfitting on the training set, because it adds a little bit of noise to the gradient that you're computing on each mini-batch. For stochastic gradient descent, suppose your training set has patterns underlying the order of its examples. You don't want to overfit the training of your model to any order that could be present in your training set, so people use stochastic gradient descent to ensure that the shuffling takes the order of the training data out as a variable."
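The memory argument can be sketched with a batching helper that materializes only one mini-batch of indices at a time; `batch_stream` is a hypothetical helper name, and in a real pipeline the indexed rows would be loaded from disk rather than sliced from an in-memory array:

```python
import numpy as np

def batch_stream(n_rows, batch_size, rng):
    """Yield shuffled index batches one at a time, so only one
    batch of data needs to sit in (GPU) memory at once. Hypothetical helper."""
    order = rng.permutation(n_rows)
    for i in range(0, n_rows, batch_size):
        yield order[i:i + batch_size]

rng = np.random.default_rng(2)

# Noiseless toy data: y = 3x + 1 (illustrative assumption).
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0
Xb = np.hstack([X, np.ones((1000, 1))])

w_stream = np.zeros(2)
for epoch in range(30):
    for idx in batch_stream(len(y), 64, rng):
        xb, yb = Xb[idx], y[idx]          # real pipelines would load this chunk from disk
        grad = 2.0 * xb.T @ (xb @ w_stream - yb) / len(yb)
        w_stream -= 0.05 * grad

print(w_stream)  # approaches the true parameters [3, 1]
```

Because the generator yields indices rather than data, the full data set never needs to be resident in accelerator memory, which is the practical motivation the answer describes.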
What makes this answer effective
The answer gives a comprehensive overview of the methods the interviewer asked about. It introduces the terms by simply explaining the concept of gradient descent without going too deep into the mathematics, which is generally not expected for a more conversational ML concepts interview. It further clarifies the context by stating that the terms refer to different ways of performing gradient descent, and it correctly identifies the relevant differences between all of the methods.