Bias/Variance and Regularization/Dropout

Original Source: https://www.coursera.org/specializations/deep-learning

Diagnosing and Solving Bias/Variance

Diagnosing

High train set error -> High bias

If dev set error is much higher than train set error -> High variance
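As a rough sketch of this rule of thumb (assuming a near-zero Bayes error baseline and an illustrative 5% threshold for "high"):

```python
# Minimal sketch of the diagnosis rule, assuming the optimal (Bayes) error is
# roughly 0% and that "high" means more than an illustrative 5% threshold.
def diagnose(train_error, dev_error, threshold=0.05):
    problems = []
    if train_error > threshold:               # train error far above the baseline -> high bias
        problems.append("high bias")
    if dev_error - train_error > threshold:   # large train/dev gap -> high variance
        problems.append("high variance")
    return problems or ["looks fine"]

print(diagnose(train_error=0.01, dev_error=0.11))  # ['high variance']
print(diagnose(train_error=0.15, dev_error=0.16))  # ['high bias']
```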

Solving

High bias

  • bigger network
  • train more epochs

High variance

  • more data
  • regularization

Regularization Techniques

Dropout


Dropout is a common regularization technique in deep learning.

We set a hyperparameter $p$ (the dropout rate), typically a value between 0 and 0.5.

Then, during training, we randomly deactivate roughly (number of units) $\times\ p$ units in a particular layer.

This improves generalization because it forces the layer to learn the same concept with different subsets of units.

The number of active units in that layer has decreased, so we multiply the remaining units by $\frac{1}{1-p}$ to compensate (this is known as inverted dropout).
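A minimal NumPy sketch of this inverted-dropout step for a single layer's activations (the array `a`, the drop rate `p`, and the function name are all illustrative):

```python
import numpy as np

def inverted_dropout(a, p, training=True):
    """Apply inverted dropout to a layer's activations `a` with drop rate `p`."""
    if not training or p == 0.0:
        return a                               # no dropout at test time
    mask = (np.random.rand(*a.shape) > p)      # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)                # scale survivors by 1 / (1 - p)

a = np.random.randn(4, 5)                      # activations of one layer (4 units, 5 examples)
a_dropped = inverted_dropout(a, p=0.2)
```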

L2 (Frobenius) Regularization

L2 regularization is a common regularization technique in machine learning.

Frobenius Norm

\[\|W^{[l]}\|_F = \sqrt{\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left(W_{ij}^{[l]}\right)^2}\]

\[J(w,b)=\frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^L{\|W^{[l]}\|_F}^2\]

Then, during backpropagation, the term $\frac{\lambda}{m}W^{[l]}$ is added to our original $dW^{[l]}$.
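A minimal sketch of both pieces, assuming `weights` holds the $W^{[l]}$ matrices and that the unregularized cross-entropy cost and gradients `dWs` are computed elsewhere:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum of squared Frobenius norms, added to the cost."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_grad_terms(weights, lam, m):
    """(lambda / m) * W[l], added to each dW[l] during backprop."""
    return [(lam / m) * W for W in weights]

# Usage sketch (cross_entropy, dWs, X are assumed to exist elsewhere):
# cost = cross_entropy + l2_penalty(weights, lam=0.7, m=X.shape[1])
# dWs  = [dW + reg for dW, reg in zip(dWs, l2_grad_terms(weights, lam=0.7, m=X.shape[1]))]
```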

L1 Regularization

You don’t use L1 regularization much in practice.

L1 Norm

\[\|W^{[l]}\|_1 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left|W_{ij}^{[l]}\right|\]

The rest is the same as for L2 regularization.
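A corresponding sketch for L1, keeping the same $\frac{\lambda}{2m}$ scaling used for L2 above (scaling conventions for the L1 term vary):

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """(lambda / 2m) * sum of absolute values of all weights, added to the cost."""
    return (lam / (2 * m)) * sum(np.sum(np.abs(W)) for W in weights)

def l1_grad_terms(weights, lam, m):
    """(lambda / 2m) * sign(W[l]), added to each dW[l] (subgradient at 0)."""
    return [(lam / (2 * m)) * np.sign(W) for W in weights]
```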

Why do L2 and L1 regularization prevent overfitting?

Regularization decays the weights (pushes them close to 0), thus restraining the impact of each unit.

Also, it makes $z$ close to 0. Since the tanh and sigmoid functions are approximately linear near 0, this reduces the effective complexity of the network.
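A quick numerical check of that near-linearity claim, using tanh:

```python
import numpy as np

z_small = np.array([-0.1, -0.05, 0.05, 0.1])
z_large = np.array([-3.0, -2.0, 2.0, 3.0])

print(np.max(np.abs(np.tanh(z_small) - z_small)))  # ~3e-4: nearly linear near 0
print(np.max(np.abs(np.tanh(z_large) - z_large)))  # ~2.0: clearly nonlinear far from 0
```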

Other ways to reduce variance

Data Augmentation

In an image recognition task, flipping or rotating images and adding them to the training set will reduce variance.
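A minimal NumPy sketch of such augmentations for a single image (the image shape and the chosen transforms are illustrative):

```python
import numpy as np

def augment(image):
    """Return simple flipped / rotated copies of `image` (an H x W x C array)."""
    return [
        np.fliplr(image),          # horizontal flip
        np.rot90(image, k=1),      # rotate 90 degrees
        np.rot90(image, k=3),      # rotate 270 degrees
    ]

image = np.random.rand(32, 32, 3)  # stand-in for a real training image
augmented = augment(image)         # these copies get added to the training set
```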

Early Stopping

Training for fewer epochs might reduce dev set error. However, Andrew Ng doesn’t recommend early stopping. He says that with early stopping, the training set error is not pushed sufficiently low, so the problem of high bias arises. He recommends addressing high variance and high bias separately, rather than sacrificing one to improve the other.
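For reference, the mechanism itself is straightforward; a minimal sketch of dev-set-based early stopping (`train_one_epoch` and `dev_error` are hypothetical placeholders):

```python
def train_with_early_stopping(model, max_epochs=100, patience=5):
    """Stop when dev error has not improved for `patience` epochs."""
    best_dev, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)     # hypothetical: one pass over the training set
        err = dev_error(model)     # hypothetical: error on the dev set
        if err < best_dev:
            best_dev, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break              # stop before train error is fully minimized
    return model
```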
