
Neural Networks: From Zero to Hero

Learn about neural networks’ most important parameters!

Feb 8 · 15 min read

[Image: superhero brain cartoon — source: https://www.vectorstock.com/royalty-free-vector/super-hero-brain-cartoon-vector-17919284]

Introduction

This article covers the following topics:

  • Optimization theory: Gradient Descent and its variations
  • Learning Rate & Batch Size
  • Loss & Activation functions
  • Weight initialization

Optimization theory: Gradient Descent and its variations

If you want a detailed explanation of gradient descent, I recommend that you check out this article, which contains an in-depth study of the mathematics on which neural networks are based.

In summary, gradient descent calculates the error of every sample in the training set and then updates the weights in the direction opposite to the gradient, the direction that reduces the error.

In other words, for every epoch, we need to:

  1. Calculate every prediction (forward pass).
  2. Calculate every error.
  3. Propagate the errors backward, to evaluate how much each weight contributes to that error.
  4. And finally, update the weights accordingly.

Let’s imagine that we have:

  • A dataset of 100,000 samples
  • Each forward pass takes 2 ms
  • Each error calculation takes 1 ms
  • Each backpropagation takes 3 ms

If we do the calculation: 100,000 samples × (2 + 1 + 3) ms = 600,000 ms, that is, 10 minutes per epoch.

A regular neural network may need hundreds, or even thousands, of epochs to converge properly. Let’s assume we need 100 epochs, which is a low number.

How long will it take to train our neural network? 100 epochs × 10 minutes per epoch = 1,000 minutes, roughly 16.7 hours.

This is a lot of time. And we were being nice by assuming we only had 100,000 samples. ImageNet, for example, consists of 1.2 million images; at 6 ms per sample, that is 2 h per epoch, or in other words, 8.3 days. More than a week to see the behavior of a network.

One way to drastically reduce the time needed to train a neural network would be to use a single sample chosen randomly every time we want to update the weights.

This method is called Stochastic Gradient Descent (SGD). With SGD, we would simply have to calculate the prediction, error, and backpropagation of one sample to update the weights.

This would reduce the time per weight update to just 2 + 1 + 3 = 6 ms, regardless of the dataset size.

This is a huge improvement. But this method has one very important disadvantage.

Out of these two paths, which one do you think is followed by Gradient Descent? And which one by Stochastic Gradient Descent?

[Image: gradient descent vs. stochastic gradient descent paths — source: https://www.fromthegenesis.com/gradient-descent/]

The red path is the one followed by gradient descent. It computes the gradient (the descending direction) using all the samples of the dataset, so its updates are consistent and always point in the direction that minimizes the error.

The purple path is the one followed by the SGD. What’s going on here? Each weight update is done to minimize the error by taking into account only one sample, so what we minimize is the error for that particular sample.

That’s why it behaves more chaotically and takes longer to converge, although, in return, each update runs much faster: in the time GD needs to run one epoch, SGD can run thousands of updates.

It seems like the best option would be a balance between the two approaches. If we take a look at the previous picture, the best path is the green line.

To calculate this path, let’s review the methods discussed up to now:

  • A method that calculates the predictions and errors of all the elements of our training set: (Vanilla) Gradient Descent
  • A method that calculates the predictions and errors of 1 randomly chosen element from our training set: Stochastic Gradient Descent

What if instead of 1 element, we chose K elements? In this way:

  • We increase the stability of the algorithm, since we no longer look at a single element but at K (that is, we reduce the abrupt and chaotic changes of direction of the purple line)
  • We decrease the execution time relative to traditional gradient descent, since we go from the N samples of our training set down to K, where K << N

This method is known as Mini-batch Stochastic Gradient Descent and is the most popular one in practice.

K is usually chosen to be a power of 2, as this allows you to take advantage of optimizations that GPUs implement for these sizes. A typical value might be K = 32, but in the end, it is limited by the memory of the GPU.

The lower K is, the more the method resembles pure SGD: it will need more epochs to converge, although it will also compute each of them faster.

On the other hand, the higher K is, the more it resembles pure GD: each epoch will be more expensive to compute, but fewer epochs will be needed to converge.
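To make this concrete, here is a minimal sketch of one epoch of mini-batch SGD for a toy linear model; the data, model, K and learning rate are all illustrative assumptions, not values from this article:

import numpy as np

# Toy setup: N samples, D features, batch size K, learning rate lr
N, D, K, lr = 100_000, 10, 32, 0.01
X, y = np.random.randn(N, D), np.random.randn(N)
w = np.zeros(D)

indices = np.random.permutation(N)   # shuffle once per epoch
for start in range(0, N, K):
    batch = indices[start:start + K]
    Xb, yb = X[batch], y[batch]
    preds = Xb @ w                    # 1. forward pass
    error = preds - yb                # 2. error calculation
    grad = Xb.T @ error / len(batch)  # 3. backward pass (MSE gradient)
    w -= lr * grad                    # 4. weight update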

Learning Rate & Batch Size

Learning rate and batch size are two parameters directly related to the gradient descent algorithm.

Learning Rate

As you may know (or if you don’t, you can check here), the way to update a neural network’s weights is through this formula:

w_n⁺ = w_n − η · ∂E_total/∂w_n

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

The factor η multiplying ∂E_total/∂w_n is the learning rate. The learning rate indicates the importance we give to the error when updating each weight; that is, how fast or how abrupt the changes in the weights are.

Thus, a very high η will change the weights in huge steps from one iteration to the next, which can cause it to skip over the minimum.

[Image: a too-high learning rate overshooting the minimum — source: https://www.quora.com/In-neural-networks-why-would-one-use-many-learning-rates-in-decreasing-steps-rather-than-one-smooth-learning-rate-decay]

Another possibility is to set a very low η, which would make our network need too many epochs to reach an acceptable minimum. We would also risk getting trapped in a worse minimum than the best one we could achieve with a higher η.

[Image: a too-low learning rate converging slowly — source: https://www.researchgate.net/publication/226939869_An_Improved_EMD_Online_Learning-Based_Model_for_Gold_Market_Forecasting/figures?lo=1]

Let’s talk about minima: what we achieve with a neural network is normally not the global minimum of our function, but a local minimum good enough to correctly perform the task at hand.

What we want is an optimal learning rate, which allows us to reduce the error as time goes by, until we reach our minimum. In the graph, this learning rate would be the red line.

[Image: loss curves for too-high, too-low, and good learning rates — source: https://towardsdatascience.com/useful-plots-to-diagnose-your-neural-network-521907fa2f45]

And to make our learning rate optimal, we can apply a decay to it. This decay decreases the learning rate over time, so that by the time we approach the minimum it is small enough to avoid skipping it.

[Image: learning rate decay schedule — source: https://www.pinterest.es/pin/825918019139095502/]

Thus, we avoid both the long wait to converge caused by a very low learning rate and the risk of skipping our minimum, because the closer we get to it, the smaller the steps we take towards it.
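As an illustration, here is a minimal sketch of a step-decay schedule; the concrete numbers are assumptions for the example, not values from the article:

# Halve the learning rate every 10 epochs so steps shrink over time
initial_lr, drop, epochs_per_drop = 0.01, 0.5, 10

def decayed_lr(epoch):
    return initial_lr * drop ** (epoch // epochs_per_drop)

print(decayed_lr(0), decayed_lr(10), decayed_lr(25))  # 0.01 0.005 0.0025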

Batch Size

Recalling the previous section, SGD is just Mini-batch SGD with K = 1.

In Mini-batch SGD, K indicates the number of samples used to update the weights each time. It is not a critical parameter, and it is usually set to the maximum number of samples that fit in our GPU.

Say we have a GPU with 8 GB of memory: how many samples can we fit if each image occupies 1 MB?

Well, it’s not that easy! It depends on the architecture of the network. Dense or Fully Connected layers (the traditional ones, in which every neuron is connected to every neuron in the next layer) are the ones with the most parameters, and therefore the ones that occupy the most memory.

We also have convolutional layers, pooling layers, dropout layers, and many other types. So in practice, it is difficult to calculate by hand the maximum number of samples we can use.

What we do in practice is try batch sizes that are powers of 2 and decrease them if we hit a memory error. For example, we would start with 512, and on error go down to 256, 128, 64, 32, 16, 8, 4, 2 and even 1, as in the sketch below.
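Here is a hedged sketch of that fallback loop; train is a hypothetical stand-in for one training run, and the 64-sample limit is invented for the demonstration (none of these names come from the article):

def train(batch_size):
    # Hypothetical: pretend batches above 64 samples don't fit in memory
    if batch_size > 64:
        raise MemoryError
    print(f"training with batch size {batch_size}")

batch_size = 512
while batch_size >= 1:
    try:
        train(batch_size)
        break                # it fit: keep this batch size
    except MemoryError:
        batch_size //= 2     # fall back: 512 -> 256 -> 128 -> 64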

Depending on the architecture of your network, you may have to use K = 1, and therefore SGD. Although it is often preferable to reduce the image size, for example from 512x512 to 256x256 or 128x128 pixels, and use a larger K.

Learning Rate & Batch Size Relationship

It is very important to keep in mind that the learning rate is related to the batch size.

As we approach K = 1, we must lower the learning rate so that each weight update carries less importance, since we get closer to SGD, that is, to gradients computed from single random samples.

So, in summary: if we use a lower batch size, it is recommended to use a lower learning rate, but we should also increase the number of epochs, since these conditions make our neural network take longer to converge.
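One common heuristic that captures this relationship is to scale the learning rate linearly with the batch size relative to a reference setup; this rule and its concrete numbers are an assumption I am adding for illustration, not something stated in the article:

# Scale the learning rate linearly with batch size, relative to a
# reference configuration (illustrative values)
base_lr, base_batch = 0.01, 32

def scaled_lr(batch_size):
    return base_lr * batch_size / base_batch

print(scaled_lr(8))    # 0.0025 -- smaller batch, smaller learning rate
print(scaled_lr(128))  # 0.04   -- larger batch tolerates a larger rate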

Loss & Activation Functions

Loss functions

The loss function is the one that tells us how wrong our predictions have been.

Imagine that we have to guess how much a house costs just by looking at a picture. Our neural network would have as input the pixels of the photo and as output a number indicating the price.

For example, let’s say we want to predict the price of a house that is in our training set. When its picture passes through the network, a prediction is calculated: $323,567. The truth is that the house costs $600,000, so it seems obvious that a sensible loss could be the difference between the two values: 600,000 − 323,567 = 276,433.

Taking this into account, the most common loss functions are:

For regression problems

  • Mean Squared Error
  • Mean Absolute Error

For classification problems

  • Binary Cross-Entropy
  • Categorical Cross-Entropy

As I wrote in a previous article focused on regression problems, let’s take a look at each of them:

  • Mean Squared Error:

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

https://towardsdatascience.com/supervised-learning-basics-of-linear-regression-1cbab48d0eba

Mean Squared Error, or MSE, is the average of the squared differences between the real data points and the predicted outcomes. It penalizes predictions more heavily the bigger the distance is, and it is the standard in regression problems.

  • Mean Absolute Error

MAE = (1/n) · Σᵢ |yᵢ − ŷᵢ|

https://towardsdatascience.com/supervised-learning-basics-of-linear-regression-1cbab48d0eba

Mean Absolute Error or MAE, is the average of the absolute difference between the real data points and the predicted outcome. If we take this as the strategy to follow, each step of the gradient descent would reduce the MAE.
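Both losses are one-liners in NumPy; here is a minimal sketch applied to the house-price example from above:

import numpy as np

y_true = np.array([600_000.0])
y_pred = np.array([323_567.0])

mse = np.mean((y_true - y_pred) ** 2)   # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))  # linear penalty: 276,433 here

print(mse, mae)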

What is Cross-Entropy?

We first need to understand what entropy is. Let’s try to illustrate it with a couple of examples:

Example 1

Imagine that we are playing a game: we have a bag with different colored balls, and the goal of the game is to guess which color is the one that a volunteer draws with the minimum number of questions.

In this case, we have a blue ball, a red ball, a green ball, and an orange ball:

[Image: four equally likely ball colors — source: https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy]

This means every ball has a 1/4 chance of being drawn.

One of the best strategies would be to first ask whether the drawn ball is blue or red. If it is, we would then ask if it is blue. If not, we would ask if it is green. Either way, we need 2 questions.

[Image: question tree for the uniform bag — source: https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy]

Example 2

This time we have a bag in which 1/2 of the balls are blue, 1/4 are red, 1/8 are green and 1/8 are orange. Now the optimal strategy is to ask if it’s blue first, since that is the most likely outcome. If it is, we’re done. If not, we ask if it’s red, the next most likely class. If it is, we’re done. If not, we ask if it’s green (or, equivalently, if it’s orange).

[Image: question tree for the skewed bag — source: https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy]

Now, half of the time (1/2) the ball is blue, and it costs us 1 question to guess it. A quarter of the time (1/4) it is red, and it costs us 2 questions. 1/8 of the time it is green, which costs 3 questions, and the same if it is orange.

Therefore, the expected number of questions to guess a ball is: 1/2⋅1 (blue) + 1/4⋅2 (red) + 1/8⋅3 (green) + 1/8⋅3 (orange) = 1.75

Example 3

Imagine now that we have a bag full of blue balls. How many questions do we need to find out what color ball they take out? None, 0.

From these examples, we can come up with an expression that allows us to calculate the number of questions depending on the probability of the ball. Thus, a ball with probability p costs log2(1/p) questions.

For example, for the ball with p = 1/8, we need n_quest = log2(1/(1/8)) = log2(8) = 3

So the expected number of questions in total is:

H = Σᵢ pᵢ · log2(1/pᵢ)

In the end, one way to understand entropy is the following:

If we were to follow the optimal strategy, what is the expected number of questions that would allow us to guess the color of the ball?

So, the more complicated the game is, the higher the entropy. In this case, Example 1 > Example 2 > Example 3.
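As a sanity check, here is a small sketch that computes the entropy of the three example bags:

import numpy as np

def entropy(p):
    # Expected number of questions under the optimal strategy
    p = np.asarray(p)
    p = p[p > 0]  # zero-probability colors never need a question
    return np.sum(p * np.log2(1 / p))

print(entropy([1/4, 1/4, 1/4, 1/4]))  # Example 1: 2.0
print(entropy([1/2, 1/4, 1/8, 1/8]))  # Example 2: 1.75
print(entropy([1.0]))                 # Example 3: 0.0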

Okay, so now that we know what entropy is, let’s see what cross-entropy is.

Cross-Entropy

Imagine that we had followed the strategy of Example 1 for Example 2:

[Image: Example 1’s question tree applied to Example 2’s bag — source: https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy]

So, we would always need two questions, whatever the color. If we calculate the expected number of questions, taking into account that each ball now has a different probability, we get:

n_total_quest = 1/2⋅2 (blue) + 1/4⋅2 (red) + 1/8⋅2 (green) + 1/8⋅2 (orange) = 2

So this strategy is worse than the optimal one we followed in Example 2 (2 > 1.75 questions).

In the end, intuitively, entropy is the number of questions expected using the best possible strategy, and cross entropy is the number of questions expected when you do not use the best possible strategy.

For this reason, what we try to do is to minimize the cross-entropy.

Formally, it is defined as:

H(p, p̂) = Σᵢ pᵢ · log2(1/p̂ᵢ)

Where:

  • pᵢ is the actual probability of the balls (in our example, 1/2 for blue, 1/4 for red, and 1/8 for green and orange)
  • p̂ᵢ is the probability assumed by our strategy, which in this case treats all balls as equally probable (1/4 for all colors)

One way to remember what goes where in the formula is that we want to find out how many questions are needed following our strategy, which is p̂ᵢ, so p̂ᵢ goes inside the log2.
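A sketch of that computation for our example, confirming the 2-questions figure above:

import numpy as np

def cross_entropy(p, p_hat):
    # Expected questions when guessing with strategy p_hat on truth p
    p, p_hat = np.asarray(p), np.asarray(p_hat)
    return np.sum(p * np.log2(1 / p_hat))

p     = [1/2, 1/4, 1/8, 1/8]  # actual ball probabilities (Example 2)
p_hat = [1/4, 1/4, 1/4, 1/4]  # the suboptimal Example 1 strategy

print(cross_entropy(p, p_hat))  # 2.0 questions, vs. an entropy of 1.75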

Activation Functions

If we didn’t have activation functions we would have the following:

[Image: a network with no activation function — source: https://towardsdatascience.com/neural-representation-of-logic-gates-df044ec922bc]

We would have y(x) = Wx + b. This is a linear combination, which would be unable to solve even a problem like XOR.

[Image: the XOR problem is not linearly separable — source: http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html]
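To see why depth alone doesn’t help, here is a small sketch (with illustrative shapes) showing that two stacked linear layers collapse into a single linear map:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

x = rng.standard_normal(2)
deep = W2 @ (W1 @ x + b1) + b2          # "two-layer" linear network
flat = (W2 @ W1) @ x + (W2 @ b1 + b2)   # equivalent single layer

print(np.allclose(deep, flat))  # True: no extra expressive power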

Therefore, we need a way to introduce non-linearity, and that is what the activation function does. In the following image you can see some of the most typical ones, and where they intervene in the network:

[Image: where activation functions act inside the network — source: https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f]

Here you can see the most used ones:

[Image: the most used activation functions — source: https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044]

It is difficult to know with which of them our network will behave better, but there is one that usually gives good results almost always: the ReLU.

Therefore, whenever we start, we will use the ReLU, and once we get results we consider good, we can try the Leaky ReLU or any other we want. New ones come out every day, and a simple Google search can lead you to some interesting ones, like SELU for example (https://towardsdatascience.com/selu-make-fnns-great-again-snn-8d61526802a9).
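For reference, here is a sketch of the two activations just mentioned; the 0.01 slope for the Leaky ReLU is a typical choice, not a value from the article:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps gradients alive for x < 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.     0.     0.    1.5  ]
print(leaky_relu(x))  # [-0.02  -0.005  0.    1.5  ]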

Many of these activation functions need specific weight initialization methods, so that the activations lie within a specific range of values and gradient descent works properly.

In the case of the output layers, the softmax activation function is the one most used, since it is capable of giving a probability to each class, making all of them add up to 1.
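A minimal, numerically stable softmax sketch (shifting by the maximum logit before exponentiating is a standard trick to avoid overflow):

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # stability shift; result is unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # one probability per class, summing to 1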

As this may seem a little complicated, find below the recommended recipe as a summary of all the above:

Recipe

  1. Start using the ReLU with a learning rate of 0.01 or 0.001, and watch what happens.
  2. If the network trains (converges) but is slow, try increasing the learning rate a little.
  3. If the network does not converge and behaves chaotically, decrease the learning rate.
  4. Once you have your network up and running, try the Leaky ReLU, Maxout or ELU.
  5. Do not use the sigmoid; in practice it does not usually give good results.

Weight Initialization

As you have seen before, weights and biases initialization is very important to achieve the convergence of our network to an adequate minimum. So let’s look at some ways to initialize the weights.

If we work with the MNIST dataset (as we did in this article), our weight matrix would be 784 (inputs) x 10 (outputs).

Constant Initialization

We can pre-set our weights to:

  • Zero: W = np.zeros((784, 10))
  • One: W = np.ones((784, 10))
  • A constant C: W = np.ones((784, 10)) * C

Normal and uniform distribution

We can also initialize the weights using a uniform distribution, where a range [lower_bound, upper_bound] is defined and every number within the range has the same probability of being chosen.

For example, for a distribution between [-0.2, 0.2]:

W = np.random.uniform(low=-0.2, high=0.2, size=(784, 10))

With this instruction, we initialize the weight matrix W with values drawn from the range [-0.2, 0.2], where all values have the same probability of being drawn.

We can also do it with a normal or Gaussian distribution, which is defined as

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where, as you know:

  • μ is the mean
  • σ is the standard deviation, and σ² the variance

So we could initialize our weights with a normal distribution with μ = 0 and σ = 0.2, for example:

W = np.random.normal(0.0, 0.2, size=(784, 10))

Initialization: LeCun normal and uniform

Another, more advanced, method is the LeCun method, also known as “Efficient backprop”.

This method defines 3 parameters:

  • F_in: the number of inputs to the layer (in our example, 784)
  • F_out: the number of outputs of the layer (in our example, 10)
  • limit: a bound computed from F_in; for the uniform variant, limit = √(3 / F_in)

The code for initializing W by this method using a uniform distribution would be:

limit = np.sqrt(3 / float(F_in))
W = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

And for a normal one (note that np.random.normal takes a mean and a standard deviation, not bounds; the LeCun normal variant uses σ = √(1 / F_in)):

W = np.random.normal(0.0, np.sqrt(1 / float(F_in)), size=(F_in, F_out))

Initialization: Glorot/Xavier normal and uniform

This is perhaps the most widely used method for initializing weights and biases. It’s the default when using Keras.

In this case, the same parameters are defined as with LeCun, but the calculation of the limit varies:

limit = √(6 / (F_in + F_out))

The code to initialize W using this method mirrors LeCun’s.

For a uniform distribution it would be:

limit = np.sqrt(6 / float(F_in + F_out))
W = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

And for a normal one (the Glorot normal variant uses σ = √(2 / (F_in + F_out))):

W = np.random.normal(0.0, np.sqrt(2 / float(F_in + F_out)), size=(F_in, F_out))

Initialization: He et al./Kaiming/MSRA normal and uniform

This method is named after Kaiming He, the first author of Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

Normally, this method is used when we are training very deep neural networks that use a ReLU-like activation, such as the Parametric ReLU.

The code in the case of the uniform is:

limit = np.sqrt(6 / float(F_in))
W = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

And in the case of the normal one, which draws from a Gaussian with σ = √(2 / F_in) rather than from a bounded range:

stddev = np.sqrt(2 / float(F_in))
W = np.random.normal(0.0, stddev, size=(F_in, F_out))

Recipe:

The initialization of the weights is usually not a determining factor in training a network, but sometimes it can prevent the network from training at all because it fails to converge.

Therefore, the recommended advice is to use Glorot’s, and if one day you feel lucky and want to see if you can improve your accuracy, try the others.

Final Words

As always, I hope you enjoyed the post and that you are now a pro at neural networks!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence, follow me on Medium and stay tuned for my next posts!

