Initialization of Weights

Original Source: https://www.coursera.org/specializations/deep-learning

We need initial weights to start gradient descent, just like we need to be somewhere on a mountain to descend from it.

First Problem

Let’s think of one layer in a neural network: let $W$ be the weight matrix of that layer and $n$ the number of units in that layer.

Say we initialize the weights so that all the columns $W_i$ ($i$th column of $W$), $i=1,\dots,n$, are the same.

Then, by forward propagation $Z^{[l]} = A^{[l-1]}W^{[l]}+b^{[l]}$, all the columns $Z^{[l]}_i$ ($i$th column of $Z^{[l]}$) will be the same.

All the columns $A^{[l]}_i$ will be the same, too.

With backpropagation,

\[dA^{[l]} = dZ^{[l+1]}{W^{[l+1]}}^T\]
\[dZ^{[l]} = dA^{[l]}*{g^{[l]}}'(Z^{[l]})\]
\[dW^{[l]} = \frac{1}{m}{A^{[l-1]}}^T dZ^{[l]}\]

the columns of $dW^{[l]}$ will be duplicates (here $*$ denotes the elementwise product). Note that column $i$ of $dA^{[l]}$ depends on row $i$ of $W^{[l+1]}$, so this also requires the rows of $W^{[l+1]}$ to be identical, i.e., the duplicated units must have identical outgoing weights as well.

Thus even after gradient descent updates, the $W_i$ remain the same, and so do the $A_i$: all $n$ units of the layer compute exactly the same function. This is a waste of units.
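Here is a quick NumPy sketch of this argument (my own illustration, not from the course; the data, the tanh hidden layer, and the constant value $0.5$ are arbitrary choices): with every entry of the weight matrices initialized to the same constant, the hidden activations and the columns of $dW^{[1]}$ come out identical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_x, n_1 = 8, 4, 3                      # examples, input features, hidden units

X = rng.normal(size=(m, n_x))              # rows are examples, as in Z = X W + b
Y = rng.integers(0, 2, size=(m, 1))        # binary labels

W1 = np.full((n_x, n_1), 0.5)              # every entry identical -> duplicate columns
b1 = np.zeros((1, n_1))
W2 = np.full((n_1, 1), 0.5)                # identical outgoing weights as well
b2 = np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward propagation (tanh hidden layer, sigmoid output)
Z1 = X @ W1 + b1
A1 = np.tanh(Z1)
Z2 = A1 @ W2 + b2
Y_hat = sigmoid(Z2)

# backpropagation for the cross-entropy loss
dZ2 = Y_hat - Y
dW2 = A1.T @ dZ2 / m
dA1 = dZ2 @ W2.T
dZ1 = dA1 * (1 - A1 ** 2)                  # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dZ1 / m

print(np.allclose(A1, A1[:, [0]]))         # True: all hidden activations identical
print(np.allclose(dW1, dW1[:, [0]]))       # True: all columns of dW1 identical
```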

Example

Let’s assume a binary classification neural network with one hidden layer: number of examples $m=3$, number of features $n=2$, and number of units in the hidden layer $n^{[1]}=3$.

\[Z^{[1]} = XW^{[1]} + b^{[1]}\]
\[A^{[1]} = g^{[1]}(Z^{[1]})\]
\[Z^{[2]} = A^{[1]}W^{[2]} + b^{[2]}\]
\[\hat{Y} = \sigma(Z^{[2]})\]
\[X=\left[ \begin{array}{cc} 1 & 2\\ 0 & 3\\ 4 & 1\\ \end{array} \right]\]

Now let’s initialize the weights so that the columns of $W^{[1]}$ are duplicates and all entries of $W^{[2]}$ are equal (so the hidden units also have identical outgoing weights).

\[W^{[1]} = \left[ \begin{array}{ccc} 1 & 1 & 1\\ 2 & 2 & 2\\ \end{array} \right]\]
\[W^{[2]} = \left[ \begin{array}{c} 2\\ 2\\ 2\\ \end{array} \right]\]
\[b^{[1]} = b^{[2]} = 0\]

As the activation function $g$ is applied element-wise, it doesn’t break the duplicate-column structure, so let’s just assume $g$ is the identity function.

We forward propagate.

\[Z^{[1]}=A^{[1]}= \left[ \begin{array}{ccc} 5 & 5 & 5\\ 6 & 6 & 6\\ 6 & 6 & 6\\ \end{array} \right]\]
\[Z^{[2]} = \left[ \begin{array}{c} 30\\ 36\\ 36\\ \end{array} \right]\]
\[\hat{Y}= \left[ \begin{array}{c} \sigma(30)\\ \sigma(36)\\ \sigma(36)\\ \end{array} \right]\]
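We can check these numbers with a few lines of NumPy (a sketch with my own variable names; $g$ is taken to be the identity as assumed above):

```python
import numpy as np

X  = np.array([[1, 2], [0, 3], [4, 1]])
W1 = np.array([[1, 1, 1], [2, 2, 2]])      # duplicate columns
W2 = np.array([[2], [2], [2]])             # equal entries

Z1 = X @ W1          # b = 0
A1 = Z1              # g is the identity
Z2 = A1 @ W2

print(Z1)            # rows: [5 5 5], [6 6 6], [6 6 6]  (= A1)
print(Z2)            # rows: [30], [36], [36]
```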

We backpropagate. We know that $\hat{Y}$ is composed of different values, so $dZ^{[2]}$ will be composed of different values as well. Let’s just assume the values below, and drop the $\frac{1}{m}$ factor from $dW$ to keep the numbers simple.

\[dZ^{[2]} = \left[ \begin{array}{c} 1\\ 2\\ 3\\ \end{array} \right]\]
\[dW^{[2]} = {A^{[1]}}^T dZ^{[2]} = \left[ \begin{array}{ccc} 5 & 6 & 6\\ 5 & 6 & 6\\ 5 & 6 & 6\\ \end{array} \right] \left[ \begin{array}{c} 1\\ 2\\ 3\\ \end{array} \right] = \left[ \begin{array}{c} 35\\ 35\\ 35\\ \end{array} \right]\]
\[dA^{[1]} = dZ^{[2]}{W^{[2]}}^T = \left[ \begin{array}{c} 1\\ 2\\ 3\\ \end{array} \right] \left[ \begin{array}{ccc} 2 & 2 & 2 \end{array} \right] = \left[ \begin{array}{ccc} 2 & 2 & 2\\ 4 & 4 & 4\\ 6 & 6 & 6\\ \end{array} \right]\]

Since $g$ is the identity, ${g^{[1]}}'(Z^{[1]})$ is a matrix of ones, so the elementwise product leaves $dA^{[1]}$ unchanged:

\[dZ^{[1]} = dA^{[1]}*{g^{[1]}}'(Z^{[1]}) = \left[ \begin{array}{ccc} 2 & 2 & 2\\ 4 & 4 & 4\\ 6 & 6 & 6\\ \end{array} \right]\]
\[dW^{[1]} = X^T dZ^{[1]} = \left[ \begin{array}{ccc} 1 & 0 & 4\\ 2 & 3 & 1\\ \end{array} \right] \left[ \begin{array}{ccc} 2 & 2 & 2\\ 4 & 4 & 4\\ 6 & 6 & 6\\ \end{array} \right] = \left[ \begin{array}{ccc} 26 & 26 & 26\\ 22 & 22 & 22\\ \end{array} \right]\]

We can see that the columns of $dW^{[1]}$ are duplicates and the entries of $dW^{[2]}$ are all equal, so after the gradient update the weights stay symmetric.
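The corresponding NumPy check for the backward pass (again a sketch, with the $\frac{1}{m}$ factor dropped as above):

```python
import numpy as np

X  = np.array([[1, 2], [0, 3], [4, 1]])
W1 = np.array([[1, 1, 1], [2, 2, 2]])
W2 = np.array([[2], [2], [2]])

Z1 = X @ W1
A1 = Z1                              # g is the identity
dZ2 = np.array([[1], [2], [3]])      # assumed values

dW2 = A1.T @ dZ2                     # 1/m factor dropped
dA1 = dZ2 @ W2.T
dZ1 = dA1 * np.ones_like(Z1)         # elementwise product with g'(Z1), all ones
dW1 = X.T @ dZ1

print(dW2)   # rows: [35], [35], [35]
print(dA1)   # rows: [2 2 2], [4 4 4], [6 6 6]
print(dW1)   # rows: [26 26 26], [22 22 22]
```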

Second Problem

If we initialize the weights to large values, $Z$ will be large in magnitude, so the gradients of the sigmoid or tanh activation functions will be close to 0 (the functions saturate). If the gradients of the activation functions are close to 0, gradient descent will be slow.
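A small numerical illustration (my own, with arbitrary values of $z$): the derivatives $\sigma'(z)=\sigma(z)(1-\sigma(z))$ and $\tanh'(z)=1-\tanh^2(z)$ shrink rapidly as $z$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [1.0, 5.0, 20.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  sigmoid'={s * (1 - s):.2e}  tanh'={1 - np.tanh(z) ** 2:.2e}")
# z=  1.0  sigmoid'=1.97e-01  tanh'=4.20e-01
# z=  5.0  sigmoid'=6.65e-03  tanh'=1.82e-04
# z= 20.0  sigmoid'=2.06e-09  tanh'=0.00e+00
```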

Conclusion

To prevent these two problems, we initialize the weights to small random values: random to break the symmetry of the first problem, and small so that the sigmoid or tanh activations don’t saturate. Biases can simply be initialized to zero.
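A minimal sketch of such an initialization (the helper name and the $0.01$ scale are illustrative assumptions; schemes like Xavier or He initialization choose the scale based on the layer sizes):

```python
import numpy as np

def initialize_parameters(layer_dims, scale=0.01, seed=1):
    """Small random weights break symmetry; zero biases are fine."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # shape (n^{[l-1]}, n^{[l]}) to match Z = A W + b used in this post
        params[f"W{l}"] = rng.normal(size=(layer_dims[l - 1], layer_dims[l])) * scale
        params[f"b{l}"] = np.zeros((1, layer_dims[l]))
    return params

params = initialize_parameters([2, 3, 1])      # the example network above
print(params["W1"].shape, params["W2"].shape)  # (2, 3) (3, 1)
```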
