
Deep Learning Made Easy: Activation Functions, Parameters and Hyperparameters an...

source link: https://towardsdatascience.com/deep-learning-made-easy-activation-functions-parameters-and-hyperparameters-and-weight-c7bcfeb9af24?gi=af1c275ad87a

This is the third part of the series Deep Learning Made Easy. Check out the previous article here.

In Part 2, I introduced binary classification (0 vs 1), Logistic Regression, the Cost and Loss functions, Gradient Descent, and Forward and Backward Propagation. In this part of the series, we'll be discussing:

  1. Activation Functions
  2. Random Initialization
  3. Parameters vs Hyperparameters

The aim of this article is to help you apply a variety of activation functions in a neural network and understand the role of hyperparameters in deep learning.

#Activation Functions :

A neural network without an activation function is just a linear regression model

Activation functions are a crucial component of deep learning. They determine the output of a deep learning model, its accuracy, and the computational efficiency of training, which can make or break a large-scale neural network. An activation function is a mathematical function attached to each neuron in the network; it determines whether the neuron should be activated or not, based on whether that neuron's input is relevant for the model's prediction. Activation functions also help normalize the output of each neuron, typically to a range between 0 and 1 or between -1 and 1.

#Linear Activation Function :

A linear activation function takes the form: A = CX

A linear function is just a polynomial of degree one. Linear functions are easy to work with, but they are limited in their complexity and have little power to learn complex functional mappings. With a linear activation function it is not really possible to use backpropagation (gradient descent) to train the model: the derivative of the function is a constant and has no relation to the input X, so there is no way to go back and understand which weights in the input neurons would give a better prediction.

Also, with linear activation functions all the layers of a neural network collapse into one: no matter how many layers the network has, the last layer is a linear function of the first (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.
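
To make this concrete, here is a minimal numpy sketch (the layer shapes are arbitrary and chosen only for illustration) showing that two stacked linear layers compute exactly the same function as a single linear layer:

import numpy as np

W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)  # first "linear layer"
W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)  # second "linear layer"
x = np.random.randn(3, 1)                              # a single input column

out_stacked = W2 @ (W1 @ x + b1) + b2  # two linear layers applied in sequence

W = W2 @ W1        # collapse the stack into one weight matrix
b = W2 @ b1 + b2   # and one bias vector
out_single = W @ x + b

print(np.allclose(out_stacked, out_single))  # True: the stack is just one linear map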

#Non-linear Activation Functions :

Non-linear functions are those with a degree greater than one; they have a curvature when plotted on a plane. Their benefit is that they allow backpropagation, because their derivative depends on the input, and they allow "stacking" of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.

The most common activation functions are:

  1. Sigmoid
  2. Tanh
  3. ReLU
  4. Leaky ReLU

#Sigmoid :


Fig 2: Sigmoid Function, Source — Wikipedia

The sigmoid function and its derivative:

g(z) = 1 / (1 + np.exp(-z))

g'(z) = g(z) * (1 - g(z))
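
As a quick sanity check, here is a small numpy sketch (the function names are my own) implementing the sigmoid and its derivative, showing that the gradient peaks at z = 0 and shrinks in the tails:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    g = sigmoid(z)
    return g * (1 - g)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))             # roughly [0.12, 0.5, 0.88]
print(sigmoid_derivative(z))  # roughly [0.10, 0.25, 0.10]: largest at z = 0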

Advantages

  • Smooth gradient
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions: For X above 2 or below -2, the sigmoid pushes the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.

Disadvantages

  • Vanishing gradient: For very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can make the network too slow to reach an accurate prediction during backpropagation.
  • Outputs are not zero centred.
  • They can be computationally expensive

#Tanh :


Fig 3: Tanh function, Source — Wikipedia

The tanh function and its derivative:

g(z) = (e^z - e^-z) / (e^z + e^-z)

g'(z) = 1 - tanh(z)^2 = 1 - g(z)^2
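
A short numpy sketch (illustrative values only) showing that tanh outputs are centred around zero while its gradient still vanishes for large |z|:

import numpy as np

z = np.array([-2.0, 0.0, 2.0])
g = np.tanh(z)
g_prime = 1 - np.tanh(z) ** 2

print(g)        # roughly [-0.96, 0.0, 0.96]: outputs are centred around zero
print(g_prime)  # roughly [0.07, 1.0, 0.07]: the gradient still vanishes in the tails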

Advantages

  • Outputs are zero centred: Making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  • Otherwise, it’s the same as the Sigmoid function.

Disadvantages

  • Same as the Sigmoid function

#ReLU :


Fig 4: ReLU function, Source — Wikipedia

The ReLU function and its derivative:

g(z) = np.maximum(0,z)

g'(z) = 0 if z < 0

g'(z) = 1 if z >= 0
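
A minimal numpy sketch (function names are my own, using the g'(0) = 1 convention above) of ReLU and its derivative, showing that no gradient flows for negative inputs:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # convention: g'(0) = 1, matching the piecewise form above
    return np.where(z >= 0, 1.0, 0.0)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))             # [0. 0. 0. 2.]
print(relu_derivative(z))  # [0. 0. 1. 1.]: no gradient for negative inputs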

Advantages

  • Computationally efficient: Allows the iterations to converge very quickly
  • Non-linear: Although it looks like a linear function, ReLU has a derivative function and allows for backpropagation

Disadvantages

  • The Dying ReLU problem: When inputs approach zero or are negative, the gradient of the function becomes zero, the network cannot perform backpropagation and cannot learn.

#Leaky ReLU :

Fig 5: Leaky ReLU function, Source — Wikipedia

The leaky ReLU function and its derivative:

g(z) = np.maximum(0.01 * z, z)

g'(z) = 0.01 if z < 0

g'(z) = 1 if z >= 0
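
The same sketch adapted for leaky ReLU (again with my own function names), showing the small but non-zero gradient on the negative side:

import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))             # [-0.03 -0.005 0. 2.]: a small response for negatives
print(leaky_relu_derivative(z))  # [0.01 0.01 1. 1.]: the gradient never dies completely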

Advantages

  • Prevents the Dying ReLU problem: This variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values
  • Otherwise same as ReLU

Disadvantages

  • Results not consistent: Leaky ReLU does not provide consistent predictions for negative input values.

Choosing an activation function requires an understanding of the problem at hand, as different functions work differently on different problems.

#Random Initialization :

In logistic regression (see the previous article), it wasn't important to initialize the weights randomly, but in a neural network we have to. If we initialize all the weights with zeros, the network won't work (initializing the bias with zero is fine). With zero initialization, all hidden units become completely identical (symmetric) and compute exactly the same function in every iteration, so on each gradient descent step all the hidden units update in the same way.

To solve this we initialize the weights W with small random numbers:

import numpy as np

W = np.random.randn(2, 2) * 0.01  # multiply by 0.01 to keep the weights small enough
b = np.zeros((2, 1))  # it's ok to have b as zeros; the bias does not cause the symmetry problem

Why initialize weights with small random numbers?

We need small values because with sigmoid (or tanh) activations, if the weights are too large you are likely to end up with very large values of Z at the very start of training. This saturates the activation function, where the gradient is close to zero, and slows down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.
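
A small illustrative experiment (the sizes and scales are my own choices) comparing the sigmoid gradient right after a small versus a large weight initialization:

import numpy as np

np.random.seed(0)
x = np.random.randn(100, 1)  # a hypothetical input vector

for scale in (0.01, 10.0):
    W = np.random.randn(1, 100) * scale
    z = W @ x                  # pre-activation of a single sigmoid unit
    g = 1 / (1 + np.exp(-z))
    grad = g * (1 - g)         # sigmoid gradient at z
    print(scale, grad.item())
# With scale 0.01 the gradient stays near its maximum of 0.25;
# with scale 10 the unit saturates and the gradient is essentially 0.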

#Parameters vs Hyperparameters :

Model Parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters.

Hyperparameters: These are adjustable parameters that must be tuned in order to obtain a model with optimal performance.

Parameters of a standard Neural Network are W and b.

Hyperparameters for the same network, however, include the following (a short sketch contrasting the two follows the list):

  • Learning rate
  • The number of iterations
  • The number of hidden layers L
  • The number of hidden units n
  • Choice of activation functions
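
As a rough sketch of the distinction (the dictionary layout and values are purely illustrative), the hyperparameters are fixed before training, while the parameters W and b are what gradient descent learns:

import numpy as np

# Hyperparameters: chosen by hand (or by search) before training starts
hyperparameters = {
    "learning_rate": 0.01,
    "num_iterations": 1000,
    "hidden_layers": 1,
    "hidden_units": 4,
    "activation": "relu",
}

# Parameters: learned from the training data during gradient descent
n_x, n_h, n_y = 5, hyperparameters["hidden_units"], 1
parameters = {
    "W1": np.random.randn(n_h, n_x) * 0.01,
    "b1": np.zeros((n_h, 1)),
    "W2": np.random.randn(n_y, n_h) * 0.01,
    "b2": np.zeros((n_y, 1)),
}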

That's all for this article, folks. If you've made it this far, please comment below about your experience reading it and provide feedback, and also add me on LinkedIn. Also, if you want me to write on some particular topic, you can suggest it in the comment box.

