
Deep Learning Made Easy: Activation Functions, Parameters and Hyperparameters an...

source link: https://towardsdatascience.com/deep-learning-made-easy-activation-functions-parameters-and-hyperparameters-and-weight-c7bcfeb9af24?gi=af1c275ad87a

This is the third part of the series Deep Learning Made Easy. Check out the previous article here.

In Part 2, I introduced binary classification (0 vs 1), Logistic Regression, the Cost and Loss functions, Gradient Descent, and Forward and Backward Propagation. In this part of the series, we'll be discussing:

  1. Activation Functions
  2. Random Initialization
  3. Parameters vs Hyperparameters

The aim of this article is to help you apply a variety of activation functions in a neural network and understand the role of hyperparameters in deep learning.

#Activation Functions :

A neural network without an activation function is just a linear regression model

Activation functions are a crucial component of deep learning. They determine the output of a deep learning model, its accuracy, and the computational efficiency of training, which can make or break a large-scale neural network. An activation function is a mathematical function attached to each neuron in the network; it determines whether the neuron should be activated or not, based on whether that neuron's input is relevant for the model's prediction. Activation functions also help normalize the output of each neuron, typically to a range between 0 and 1 or between -1 and 1.

#Linear Activation Function :

A linear activation function takes the form: A = CX

A linear function is just a polynomial of degree one. Linear functions are easy to work with, but they are limited in their complexity and have little power to learn complex functional mappings. With a linear activation function it is not really possible to use backpropagation (gradient descent) to train the model: the derivative of the function is a constant and has no relation to the input X, so there is no way to go back and understand which weights in the input neurons would give a better prediction.

Also, with linear activation functions all the layers of a neural network collapse into one: no matter how many layers the network has, the last layer is a linear function of the first (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.
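
To make this concrete, here is a minimal numpy sketch (the layer shapes are arbitrary and chosen only for illustration) showing that two stacked linear layers compute exactly the same function as a single linear layer:

import numpy as np

W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)  # first "linear layer"
W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)  # second "linear layer"
x = np.random.randn(3, 1)                              # a single input column

out_stacked = W2 @ (W1 @ x + b1) + b2  # two linear layers applied in sequence

W = W2 @ W1        # collapse the stack into one weight matrix
b = W2 @ b1 + b2   # and one bias vector
out_single = W @ x + b

print(np.allclose(out_stacked, out_single))  # True: the stack is just one linear map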

#Non-linear Activation Functions :

Non-linear functions are those with a degree greater than one; they have a curvature when plotted on a plane. Their benefit is that they allow backpropagation, because their derivative depends on the input, and they allow "stacking" of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.

The most common activation functions are:

  1. Sigmoid
  2. Tanh
  3. ReLU
  4. Leaky ReLU

#Sigmoid :


Fig 2: Sigmoid Function, Source — Wikipedia

The sigmoid function and its derivative:

g(z) = 1 / (1 + np.exp(-z))

g'(z) = g(z) * (1 - g(z))
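
As a quick sanity check, here is a small numpy sketch (the function names are my own) implementing the sigmoid and its derivative, showing that the gradient peaks at z = 0 and shrinks in the tails:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    g = sigmoid(z)
    return g * (1 - g)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))             # roughly [0.12, 0.5, 0.88]
print(sigmoid_derivative(z))  # roughly [0.10, 0.25, 0.10]: largest at z = 0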

Advantages

  • Smooth gradient
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions: For X above 2 or below -2, the sigmoid pushes the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.

Disadvantages

  • Vanishing gradient: For very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can make the network too slow to reach an accurate prediction during backpropagation.
  • Outputs are not zero centred.
  • They can be computationally expensive

#Tanh :


Fig 3: Tanh function, Source — Wikipedia

The tanh function and its derivative:

g(z) = (e^z - e^-z) / (e^z + e^-z)

g'(z) = 1 - tanh(z)^2 = 1 - g(z)^2
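
A short numpy sketch (illustrative values only) showing that tanh outputs are centred around zero while its gradient still vanishes for large |z|:

import numpy as np

z = np.array([-2.0, 0.0, 2.0])
g = np.tanh(z)
g_prime = 1 - np.tanh(z) ** 2

print(g)        # roughly [-0.96, 0.0, 0.96]: outputs are centred around zero
print(g_prime)  # roughly [0.07, 1.0, 0.07]: the gradient still vanishes in the tails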

Advantages

  • Outputs are zero centred: Making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
  • Otherwise, it’s the same as the Sigmoid function.

Disadvantages

  • Same as the Sigmoid function

#ReLU :


Fig 4: ReLU function, Source — Wikipedia

The ReLU function and its derivative:

g(z) = np.maximum(0,z)

g'(z) = 0 if z < 0

g'(z) = 1 if z >= 0
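
A minimal numpy sketch (function names are my own, using the g'(0) = 1 convention above) of ReLU and its derivative, showing that no gradient flows for negative inputs:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # convention: g'(0) = 1, matching the piecewise form above
    return np.where(z >= 0, 1.0, 0.0)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))             # [0. 0. 0. 2.]
print(relu_derivative(z))  # [0. 0. 1. 1.]: no gradient for negative inputs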

Advantages

  • Computationally efficient: Allows the iterations to converge very quickly
  • Non-linear: Although it looks like a linear function, ReLU has a derivative function and allows for backpropagation

Disadvantages

  • The Dying ReLU problem: When inputs approach zero or are negative, the gradient of the function becomes zero, the network cannot perform backpropagation and cannot learn.

#Leaky ReLU :

Fig 5: Leaky ReLU function, Source — Wikipedia

The leaky ReLU function and its derivative:

g(z) = np.maximum(0.01 * z, z)

g'(z) = 0.01 if z < 0

g'(z) = 1 if z >= 0
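
The same sketch adapted for leaky ReLU (again with my own function names), showing the small but non-zero gradient on the negative side:

import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))             # [-0.03 -0.005 0. 2.]: a small response for negatives
print(leaky_relu_derivative(z))  # [0.01 0.01 1. 1.]: the gradient never dies completely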

Advantages

  • Prevents the Dying ReLU problem: This variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values
  • Otherwise same as ReLU

Disadvantages

  • Results not consistent: Leaky ReLU does not provide consistent predictions for negative input values.

Choosing an activation function requires an understanding of the problem at hand, as different functions work differently on different problems.

#Random Initialization :

In logistic regression (see the previous article), it wasn't important to initialize the weights randomly, but in a neural network we have to. If we initialize all the weights with zeros, the network won't work (initializing the bias with zero is fine). With zero initialization, all hidden units become completely identical (symmetric) and compute exactly the same function in every iteration, so on each gradient descent step all the hidden units update in the same way.

To solve this we initialize the weights W with small random numbers:

import numpy as np

W = np.random.randn(2, 2) * 0.01  # multiply by 0.01 to keep the weights small enough
b = np.zeros((2, 1))  # it's ok to have b as zeros; the bias does not cause the symmetry problem

Why initialize weights with small random numbers?

We need small values because with sigmoid (or tanh) activations, if the weights are too large you are likely to end up with very large values of Z at the very start of training. This saturates the activation function, where the gradient is close to zero, and slows down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.
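
A small illustrative experiment (the sizes and scales are my own choices) comparing the sigmoid gradient right after a small versus a large weight initialization:

import numpy as np

np.random.seed(0)
x = np.random.randn(100, 1)  # a hypothetical input vector

for scale in (0.01, 10.0):
    W = np.random.randn(1, 100) * scale
    z = W @ x                  # pre-activation of a single sigmoid unit
    g = 1 / (1 + np.exp(-z))
    grad = g * (1 - g)         # sigmoid gradient at z
    print(scale, grad.item())
# With scale 0.01 the gradient stays near its maximum of 0.25;
# with scale 10 the unit saturates and the gradient is essentially 0.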

#Parameters vs Hyperparameters :

Model Parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters.

Hyperparameters: These are adjustable parameters that must be tuned in order to obtain a model with optimal performance.

Parameters of a standard Neural Network are W and b.

Hyperparameters for the same network, however, include the following (a short sketch contrasting the two follows the list):

  • Learning rate
  • The number of iterations
  • The number of hidden layers L
  • The number of hidden units n
  • Choice of activation functions
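
As a rough sketch of the distinction (the dictionary layout and values are purely illustrative), the hyperparameters are fixed before training, while the parameters W and b are what gradient descent learns:

import numpy as np

# Hyperparameters: chosen by hand (or by search) before training starts
hyperparameters = {
    "learning_rate": 0.01,
    "num_iterations": 1000,
    "hidden_layers": 1,
    "hidden_units": 4,
    "activation": "relu",
}

# Parameters: learned from the training data during gradient descent
n_x, n_h, n_y = 5, hyperparameters["hidden_units"], 1
parameters = {
    "W1": np.random.randn(n_h, n_x) * 0.01,
    "b1": np.zeros((n_h, 1)),
    "W2": np.random.randn(n_y, n_h) * 0.01,
    "b2": np.zeros((n_y, 1)),
}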

That's all for this article, folks. If you've made it this far, please comment below about your experience reading it and provide feedback, and also add me on LinkedIn. Also, if you want me to write on some particular topic, you can suggest it in the comment box.

