
Deep Learning Made Easy: Neural Networks with Gradient Descent

source link: https://towardsdatascience.com/deep-learning-made-easy-neural-networks-with-gradient-descent-1eff6ef2cd28?gi=30647db60daa

This is the second part of the series Deep Learning Made Easy. Check out Part 1 here.

In Part 1, I introduced topics such as what neural networks are, supervised and unsupervised learning, and why deep learning is becoming so popular. In this 2nd part of the series, we’ll be discussing:

  1. What binary classification is (0 vs 1)
  2. Logistic Regression
  3. Cost function and Loss function
  4. Gradient Descent
  5. Forward and Backward Propagation

The aim of this and the next article is to learn how to set up a basic machine learning problem with the help of a neural network architecture.

#What is Binary Classification :

Binary Classification generally falls under the domain of Supervised Learning since the training dataset is labelled.

Fig 1: Binary Classification

As the name suggests, it is simply a special case in which there are only two classes. For example, given some pens and pencils of different types and makes, we can easily separate them into two classes: pens and pencils.

#Logistic Regression :

I want to convey the idea of binary classification using logistic regression, to make it easier to understand. Logistic Regression is used when the dependent variable (y) is categorical.

For example, predicting whether an email is spam (1) or not (0). If we used linear regression for this problem, we would need to set a threshold on the predicted value to make the classification. Say the actual class is spam, the predicted continuous value is 0.4, and the threshold is 0.5: the data point would be classified as not spam, which can lead to erroneous predictions.

#Simple Logistic Regression Model :

Here are some notations which will be used in all the upcoming articles:
m is the number of training examples
Nx is the size of the input vector
Ny is the size of the output vector
x(1) is the first input vector
y(1) is the first output value
X = [x(1) x(2) ... x(m)]
Y = [y(1) y(2) ... y(m)]

Output = 0 or 1

Equations:

  • Simple equation: y = wx + b
  • If x is a vector: y = w(transpose)x + b
  • If we need y to be between 0 and 1 (a probability): y = sigmoid(w(transpose)x + b)

Where the parameter w is a vector of Nx dimensions and b is a real number.

Fig 2: Sigmoid Function, Source: Wikipedia

The above equation can be written as y = sigmoid(z), where z = w(transpose)x + b.

If z goes to infinity, the predicted y approaches 1, and if z goes to negative infinity, the predicted y approaches 0.

The output of the equation is an estimated probability. It tells us how confident we can be that the predicted value matches the actual value for a given input X.
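To make this concrete, here is a minimal sketch in plain Python (the function names `sigmoid` and `predict` are mine, not from the article):

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Logistic regression prediction: sigmoid(w(transpose)x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(sigmoid(0))            # 0.5 exactly at z = 0
print(sigmoid(10) > 0.99)    # True: large positive z gives y near 1
print(sigmoid(-10) < 0.01)   # True: large negative z gives y near 0
```

Note how the sigmoid saturates at the extremes, which is exactly the behaviour described above.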

#Cost Function and Loss Function :

Loss Function: It computes the error between the predicted value and the expected value of the output variable ‘y’ for a single training example.

Cost Function: It measures the performance of a machine learning model for a given set of data. The cost function computes the error between the predicted and expected values of the output ‘y’, averaged over the entire dataset, and represents it as a single real number. Depending on the problem, the cost function can be formed in many different ways.

Loss Function of Logistic Regression :

The function that we will use: L(y’,y) = -[y*log(y’) + (1-y)*log(1-y’)]

Where y = expected value, y’= predicted value

At first, it looks very complicated but let me simplify it for you.

  • If y = 1, then the (1-y) term becomes zero and the loss reduces to -y*log(y’)
  • If y = 0, then the first term becomes zero and the loss reduces to -(1-y)*log(1-y’)
To explain the function, let’s see:
if y = 1 ==> L(y’,1) = -log(y’) ==> we want y’ to be as large as possible ==> the largest value y’ can take is 1
if y = 0 ==> L(y’,0) = -log(1-y’) ==> we want 1-y’ to be as large as possible ==> we want y’ to be as small as possible, and the smallest value it can take is 0

The cost function of logistic regression will be: J(w,b) = (1/m) * Sum(L(y’[i], y[i]))
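In plain Python, the two functions look like this (a sketch; `loss` and `cost` are my own names for them):

```python
import math

def loss(y_hat, y):
    """Cross-entropy loss for one example: -[y*log(y') + (1-y)*log(1-y')]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Average of the per-example losses: J = (1/m) * Sum(L(y'[i], y[i]))."""
    m = len(ys)
    return sum(loss(yh, y) for yh, y in zip(y_hats, ys)) / m

# A confident correct prediction incurs a small loss,
# a confident wrong one a large loss.
print(round(loss(0.9, 1), 3))   # 0.105
print(round(loss(0.1, 1), 3))   # 2.303
```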

So in training the logistic regression model, we’re going to try to find parameters w and b that minimize the overall cost function J(w,b). So, you’ve just seen the setup for the logistic regression algorithm, the loss function for a training example, and the overall cost function for the parameters of your algorithm. It turns out that logistic regression can be viewed as a very small neural network. Let’s see how!

#Gradient Descent :

The gradient descent algorithm is an iterative process that takes us to the minimum of a function. Here we want to find the values of w and b that minimize the cost function. Let’s see how we can use the gradient descent algorithm to train, or to learn, the parameters w and b on our training set. A single update formula, given below, sums up the entire algorithm.

Intuition

Consider that you are on the graph below, at the green dot. Your aim is to reach the minimum, i.e. the red dot, but from your position you are unable to perceive where it is.

Function y = x², Source: Wikipedia

There are two decisions you must make to reach the red dot:

  1. Whether to go up or down (the direction)
  2. Whether to take a bigger step or a smaller step (the step size)

Gradient Descent helps us make these decisions effectively with the use of derivatives. A derivative is the slope of the graph at a particular point, described by drawing a tangent line to the graph at that point (here, the green dot). So, if we can compute this tangent line, we can compute the direction in which to move to reach the minimum.

First, we initialize w and b to 0 (or initialize them to random values, using weight-initialization techniques) and then iteratively improve them until the cost reaches its minimum.

The gradient descent algorithm repeats: w = w - alpha * dw

where alpha is the learning rate (the size of the step we take towards the red dot) and dw is the derivative of the cost with respect to w (the change to apply to w). The derivative is the slope at the current value of w and hence gives us the direction to move in to improve our parameters. More on the significance of the learning rate in the next section, but first, here is the algorithm for gradient descent.

The actual equations we will implement:

  • w = w - alpha * dJ(w,b)/dw (how much the cost slopes in the w direction)
  • b = b - alpha * dJ(w,b)/db (how much the cost slopes in the b direction)
New weight = old weight - derivative * learning rate

#The Learning rate(alpha) :

The size of the steps taken to reach the minimum is called the Learning Rate. We can cover more distance with larger steps/a higher learning rate, but we risk overshooting the minimum. On the other hand, small steps/a smaller learning rate will consume a lot of time to reach the lowest point.
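The overshooting risk is easy to demonstrate on the y = x² curve from the intuition section (a sketch; the starting point, step counts, and learning rates here are arbitrary choices of mine):

```python
def gradient_descent(x0, alpha, steps):
    """Minimize y = x**2 starting from x0; the derivative is dy/dx = 2*x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # w = w - alpha * dw
    return x

# A moderate learning rate converges towards the minimum at x = 0 ...
print(abs(gradient_descent(5.0, 0.1, 50)) < 0.001)   # True
# ... while a learning rate that is too large overshoots and diverges.
print(abs(gradient_descent(5.0, 1.1, 50)) > 1000)    # True
```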

There are two very important and widely used terms in Neural Networks without which no Deep Learning class can be complete; these are Forward Propagation and Backward Propagation. Forward Propagation is how a NN passes an input through its layers to produce a prediction, and Backward Propagation adjusts each weight in the network in proportion to how much it contributed to the overall error. Let’s learn more about them.

#Forward Propagation:

The input X provides the initial information that then propagates to the hidden units at each layer and finally produces the output y^. The architecture of the network entails determining its depth, width, and activation functions used on each layer. Depth is the number of hidden layers and width is the number of units (nodes) on each hidden layer.

#Backward Propagation :

While designing a Neural Network, we initialize the weights with some random values. It’s not guaranteed that the weight values we have selected are correct, or that they fit our model best. So we need to change the parameters (weights) so that the error becomes minimal, i.e. we need to train our model. One way to train our model is called Backpropagation.

The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule, or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.
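For logistic regression, the backward pass can be written in a few lines of plain Python. One simplification worth remembering (a standard result, not stated in the article): for the sigmoid output paired with the cross-entropy loss, the derivative of the loss with respect to z is simply y’ - y. The function name `gradients` is my own:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, X, Y):
    """One backward pass for logistic regression.

    For cross-entropy loss with a sigmoid output, the gradients simplify to:
    dz = y' - y, dw[j] = mean(dz * x[j]), db = mean(dz).
    """
    m = len(Y)
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(X, Y):
        # forward pass for this example
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        dz = y_hat - y  # derivative of the loss with respect to z
        for j, xj in enumerate(x):
            dw[j] += dz * xj / m
        db += dz / m
    return dw, db
```

These are exactly the dJ(w,b)/dw and dJ(w,b)/db terms plugged into the gradient descent updates above.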

Let me summarize all the above steps for building a NN in a few lines:

Define the model structure (such as number of input features and outputs)

Initialize the model’s parameters.

In Loop {

1. Calculate current loss (forward propagation)

2. Calculate current gradient (backward propagation)

3. Update parameters (gradient descent)

}
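The loop above can be sketched end-to-end in plain Python for logistic regression (the toy dataset and the function name `train` are mine, for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, Y, alpha=0.5, iterations=1000):
    """Logistic regression trained with batch gradient descent."""
    n, m = len(X[0]), len(Y)
    w, b = [0.0] * n, 0.0  # initialize the model's parameters
    for _ in range(iterations):
        # 1. forward propagation: current predictions
        preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
        # 2. backward propagation: gradients of the cost
        dw = [sum((p - y) * x[j] for p, y, x in zip(preds, Y, X)) / m
              for j in range(n)]
        db = sum(p - y for p, y in zip(preds, Y)) / m
        # 3. gradient descent: update the parameters
        w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]
        b -= alpha * db
    return w, b

# Toy 1-D binary classification: points below 2 are class 0, above are class 1.
X, Y = [[0.0], [1.0], [3.0], [4.0]], [0, 0, 1, 1]
w, b = train(X, Y)
print(sigmoid(w[0] * 0.0 + b) < 0.5)   # True: x = 0 classified as 0
print(sigmoid(w[0] * 4.0 + b) > 0.5)   # True: x = 4 classified as 1
```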

