
Initialization Techniques for Neural Networks


In this blog, we will look at some initialization techniques used in deep learning. Anyone with even a little background in machine learning knows that a model's parameters (its weights) have to be learned from data, and that these parameters govern how well the algorithm performs on unseen data. To learn the model we need to initialize the parameters, apply a loss function and then optimize it. In this blog, we will focus on the initialization part of the network.

If you have ever built a machine learning algorithm, you have probably heard that we need to "randomly" initialize the weights as a starting point and then begin the learning process. The word "random" is quite vague on its own. We will see what actually goes on behind this word and what the different initialization techniques are.

1. Zero Initialization

This is one of the easiest ways of initializing the weights: simply set all of them to zero. Let us visualize the implications of this technique using a simple two-layered network.

[Figure: a simple two-layered network with three inputs, three hidden units and one output]

As we have set all the weights to zero, i.e. w = 0, we can easily see that during the forward pass:

a1 = w1*x1 + w2*x2 + w3*x3 , h1 = g(w1*x1 + w2*x2 + w3*x3)

a2 = w4*x1 + w5*x2 + w6*x3, h2 = g(w4*x1 + w5*x2 + w6*x3)

a3 = w7*x1 + w8*x2 + w9*x3, h3 = g(w7*x1 + w8*x2 + w9*x3)

y = g(h1*w10 + h2*w11 + h3*w12)

It is clear that a1 = a2 = a3 = 0 and hence h1 = h2 = h3. -> (1)

Now let’s see what happens in backpropagation

∇w1 = ∂L/∂w1 = ∂L/∂y * ∂y/∂h1 * ∂h1/∂a1 * ∂a1/∂w1 = ∂L/∂y * ∂y/∂h1 * ∂h1/∂a1 * x1

∇w4 = ∂L/∂w4 = ∂L/∂y * ∂y/∂h2 * ∂h2/∂a2 * x1

∇w7 = ∂L/∂w7 = ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 * x1

Thus we can see that ∇w1 = ∇w4 = ∇w7 (from (1)). Similarly, ∇w2 = ∇w5 = ∇w8 and ∇w3 = ∇w6 = ∇w9.

From the above argument we can see that at each step the change in every weight of a layer is the same. Hence all the nodes in the hidden layer learn the same parameters with respect to the input, which causes redundancy and makes our network less flexible and therefore less accurate. This is also called the symmetry problem. Thus zero initialization is not a good technique.
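To make the symmetry problem concrete, here is a minimal numpy sketch that runs one forward and one backward pass through the two-layered network above with zero weights. The layer sizes match the network in the figure, but the concrete input values and the squared-error loss are my own illustrative choices, not from the article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])   # three inputs x1, x2, x3 (arbitrary values)
t = 1.0                          # target output (arbitrary)
W1 = np.zeros((3, 3))            # weights w1..w9 into the hidden layer
W2 = np.zeros(3)                 # weights w10..w12 into the output

# forward pass
a = W1 @ x                       # a1, a2, a3 -> all 0
h = sigmoid(a)                   # h1, h2, h3 -> all 0.5
y = sigmoid(W2 @ h)

# backward pass for the squared-error loss L = 0.5 * (y - t)**2
dy = (y - t) * y * (1 - y)       # dL/d(pre-activation of the output)
dW2 = dy * h                     # gradients for w10, w11, w12 -> identical entries
dh = dy * W2                     # dL/dh -> all 0, because W2 = 0
da = dh * h * (1 - h)            # dL/da
dW1 = np.outer(da, x)            # gradients for w1..w9 -> identical rows (all 0 here)

print(dW2)                       # three identical numbers
print(dW1)                       # identical rows: every hidden unit gets the same update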

2. Initialization with the same Random value

In this technique, we initialize all the weights with the same random value. You can probably already see the problem: it is quite similar to zero initialization, except that we use a random constant instead of zero, so the same symmetry problem persists and the weights are again updated in lockstep. Hence this technique is not used either.
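A one-line variation of the sketch above (again purely illustrative) makes the point: replace the zero matrices with a shared random constant and nothing changes.

import numpy as np

c = np.random.randn()        # a single random scalar
W1 = np.full((3, 3), c)      # every weight into the hidden layer is the same value
W2 = np.full(3, c)
# a1 = a2 = a3 and h1 = h2 = h3 still hold, so the gradients for w1, w4, w7
# (and likewise w2, w5, w8 and w3, w6, w9) remain identical and symmetry is never broken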

3. Initialization with small Random values

In this technique, we initialize all the weights randomly from a univariate Gaussian (normal) distribution with mean 0 and variance 1 and multiply them by a small constant (a negative power of 10, e.g. 0.01) to make them small. We can do this in Python using numpy as follows:

import numpy as np

W = np.random.randn(input_layer_neurons, hidden_layer_neurons) * 0.01

By plotting the values of the gradients, we can expect a normal curve similar to the one below.

[Figure: distribution of the gradients at initialization]

Now we will run the learning algorithm and see how the distribution changes over different epochs.

[Figure: gradient distribution after 10 epochs]
[Figure: gradient distribution after 20 epochs]
[Figure: gradient distribution after 50 epochs]

From the above plots one can easily see that the variance is decreasing and the gradients are saturating towards 0. This is known as the vanishing gradient problem. One can also see why: each gradient is the result of a chain of multiplications of derivatives, and with each factor much smaller than 1 the product vanishes to zero.

When such small pre-activations are forward propagated through neurons with sigmoid activation, the output of a neuron is close to 0.5, since sigmoid(0) = 0.5, while in the case of tanh the outputs are centered at 0, just like the gradient plot.

[Figure: neuron outputs after applying sigmoid activation]

Thus we can conclude that if we take small random values, the gradient vanishes under repeated chain multiplication and the neurons saturate to a value of 0.5 in the case of sigmoid and 0 in the case of tanh. Hence we cannot use small random values for initialization.
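A quick way to see this numerically is to push data through a stack of layers initialized with small random weights and watch the spread of the activations shrink layer by layer. This is only a sketch: the 10-layer, 500-unit architecture and the tanh activation are arbitrary choices for illustration, not from the article.

import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)            # a batch of normalized inputs
h = x
for layer in range(10):
    W = np.random.randn(500, 500) * 0.01  # small random initialization
    h = np.tanh(h @ W)
    print(f"layer {layer + 1}: std of activations = {h.std():.6f}")
# the standard deviation collapses towards 0, and the gradients flowing back
# through these layers (which get multiplied by these activations) vanish as well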

4. Initialization with large Random values

We just saw that in the case of small random values the gradient vanishes. Now let us see what happens when we initialize the weights with large random values. We can do this in Python using numpy as follows:

W = np.random.randn(input_layer_neurons, hidden_layer_neurons)

When we initialize the weights with large values, the absolute value of the weighted sum ΣWiXi will be very large and the neurons saturate at the extremes during the forward pass, as shown below.

[Figure: saturation with sigmoid activation]
[Figure: saturation with tanh activation]

The figure below shows that at saturation the derivative of the sigmoid vanishes to 0. A similar argument can be made for tanh as well.

[Figure: the derivative of the sigmoid vanishing at saturation]

Now when we backpropagate through the network, these derivatives tend to zero and hence the gradient vanishes in this case as well. So if you were expecting that large initial weights would make the gradient explode rather than vanish, that is not what happens with sigmoid or tanh activations, because they saturate at large values.
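The opposite regime can be checked the same way (again a sketch with arbitrary sizes, not from the article): with weights drawn straight from a unit Gaussian, the pre-activations become large, tanh outputs pile up near ±1 and the sigmoid derivative collapses towards 0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
x = np.random.randn(1000, 500)
W = np.random.randn(500, 500)              # large (unscaled) initialization
a = x @ W                                  # pre-activations with std around sqrt(500)

print("fraction of |tanh(a)| > 0.99:", np.mean(np.abs(np.tanh(a)) > 0.99))
s = sigmoid(a)
print("mean sigmoid derivative:", np.mean(s * (1 - s)))   # close to 0 at saturation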

The above two arguments show that in both cases, whether we initialize the weights with small values or with large ones, the gradients tend to vanish. With small values the gradient vanishes because of repeated chain multiplication, while with large values it vanishes because the derivative of the activation itself becomes zero. Hence neither can be used.

Before trying a new approach, let us build some mathematical intuition for why this is happening.

Let's talk about the inputs to the network. You probably know that we normalize the input before feeding it into the network, so for the sake of argument let us assume that the input comes from a normal distribution with mean 0 and variance 1. We generalize the equation of a1 from the above network to n inputs as

a1 = w1*x1 + w2*x2 + w3*x3 + … + wn*xn

Now we will calculate the variance of a1

Var(a1) = Var(∑WiXi)

= ΣVar(WiXi)

= Σ[ (E[Wi])²Var(Xi) + (E[Xi])²Var(Wi) + Var(Wi)Var(Xi) ]

Since both the inputs and the weights are zero-mean, the first two terms cancel out:

= ΣVar(Wi)Var(Xi)

Since all WiXi are identically distributed we can write

=nVar(Wi)Var(Xi)

We found that Var(a1) = (nVar(Wi))Var(Xi), i.e. the variance of the input Xi is scaled by a factor of nVar(Wi). With a little more maths we could show that at the k-th hidden layer, Var(ak) = (nVar(Wi))^k Var(Xi). The practical significance of this is that the variance of a neuron in a hidden layer is nVar(Wi) times that of the previous layer's output, so if we plot the distributions we will find that Var(ak) is much more spread out than Var(Xi).

[Figure: the distribution of ak is much more spread out than that of Xi]
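This scaling of the variance is easy to verify empirically. Below is a small sanity-check sketch; the choice n = 500 and the sample size are mine, picked only for illustration.

import numpy as np

np.random.seed(0)
n = 500
X = np.random.randn(10000, n)     # inputs with Var(Xi) = 1
W = np.random.randn(n)            # weights with Var(Wi) = 1
a1 = X @ W                        # a1 = sum_i Wi*Xi for each sample

print("Var(a1) =", a1.var())      # roughly n * Var(Wi) * Var(Xi) = 500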

Now let's see what happens to (nVar(Wi))^k for different values of nVar(Wi):

If nVar(Wi) >> 1, the gradient will explode.

If nVar(Wi) << 1, the gradient will vanish.

Thus our job is to keep nVar(Wi) = 1, which avoids the problem of exploding or vanishing gradients and keeps the spread of the variance constant throughout the network.

(nVar(Wi)) = 1

Var(Wi) = 1/n

So if we scale the weights obtained from a Gaussian distribution with mean 0 and variance 1 by 1/√n, then we have

nVar(Wi) = nVar(W/√n)

= n * 1/n Var(W)

=1

So finally our task is to initialize the weights from a normal distribution with variance 1 and scale them by 1/√n, where n is the number of nodes in the previous layer. In Python we can do this using:

W = np.random.randn(input_layer_neurons, hidden_layer_neurons) * np.sqrt(1 / input_layer_neurons)

This is also known as Xavier Initialization or Glorot Initialization http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

In the case of the ReLU activation function, we scale by √(2/n) instead (i.e. use variance 2/n) to account for the negative half (x < 0), which does not contribute any variance. This is also known as He initialization and was proposed in https://arxiv.org/pdf/1502.01852v1.pdf

W = np.random.randn(input_layer_neurons, hidden_layer_neurons) * np.sqrt(2 / input_layer_neurons)

Some other variants of Xavier initialisation instead divide by the sum of the number of neurons in the previous (input) layer and in the current hidden layer, i.e.

Var(Wi) = 2 / (input_layer_neurons + hidden_layer_neurons)
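A sketch of this fan-in/fan-out variant in numpy, reusing the placeholder layer-size names from the snippets above (the variable names fan_in and fan_out are mine):

import numpy as np

fan_in, fan_out = input_layer_neurons, hidden_layer_neurons
# zero-mean Gaussian with variance 2 / (fan_in + fan_out)
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))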

TensorFlow implementation docs: https://www.tensorflow.org/api_docs/python/tf/contrib/layers/xavier_initializer

High-level APIs such as Keras also use Glorot initialisation by default, though the underlying distribution can be either Gaussian or uniform (see the initializers module in the Keras repository on GitHub).
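As a usage sketch (assuming the tf.keras API), the initializer can be selected per layer; glorot_uniform is the default for Dense layers, and He initialization can be requested explicitly:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="tanh",
                          kernel_initializer="glorot_uniform"),  # Xavier/Glorot (the default)
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal"),       # He initialization for ReLU
    tf.keras.layers.Dense(1, activation="sigmoid"),
])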

Let's Summarise

If you were able to follow along with some of the mind-boggling maths, awesome. We first saw that we can't use zero or identical initialization, because all the weights are then updated by the same amount, which hinders learning. We also saw that if we initialize the weights with values that are too small or too large, the neurons saturate and the gradients fall to 0. Thus we need to initialize the weights so that the variance across neurons in a hidden layer stays constant, and Xavier initialization lets us do exactly that, which makes it a natural default choice for most networks.

There are also techniques such as Batch Normalization, which normalizes the activations at each hidden layer before propagating them to the next layer, just as we normalize the inputs before feeding them into the network. This reduces the strong dependence on weight initialization and lets us be a little less careful about it.
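As an illustration (again a sketch, not from the original article), a batch-normalized block in Keras looks like this:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),   # normalize this layer's activations
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])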

