Stochastic Gradient Descent and Mini-Batch Gradient Descent

Original Source: https://www.coursera.org/learn/machine-learning

In batch gradient descent we’ve seen before, we used all of our training data to perform one gradient descent.

But this is slow and inefficient if number of training data grows.

To solve this problem, we can scan through training examples and perform gradient descent for every $n$ examples.

If we used all of our examples once, we say that one ‘epoch’ has passed. So in batch gradient descent we perform one gradient descent for every epoch, and in the gradient descents that we will look here we perform multiple gradient descents for every epoch.

Stochastic Gradient Descent

We call ‘stochastic gradient descent’ when $n$ is 1.

We perform the following for each epoch.

For $i = 1\dots m$:

\[\Theta_j := \Theta_j - \alpha (h_{\Theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j\]

This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first.

Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough. One strategy for trying to actually converge at the global minimum is to slowly decrease α over time.

Mini-Batch Gradient Descent

Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b. Typical value for b is around 100.

Mini-batch gradient descent can sometimes be even faster than stochastic gradient descent as we can use vectorized implementations over the b examples.

For example, with b=10 and m=1000, we perform the following for each epoch.

For $i = 1,11,21,31,\dots,991$:

\[\theta_j := \theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}\]

Leave a Comment