
Minimizing the cost function: Gradient descent


So we know gradient descent is an optimization algorithm to find the minimum of a function. How can we apply the algorithm to our linear regression?

To apply gradient descent, the key tool is the derivative.

  1. Take the cost function and compute its partial derivatives with respect to theta zero and theta one, which look like this:
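For the standard squared-error cost that the case-study code later in this note also uses, with hypothesis h(x) = theta_0 + theta_1 * x and m training examples, the cost and its partial derivatives work out to:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}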

To take the partial derivative, we hold all of the other variables constant. Let’s say, we want to take the partial derivative with respect to theta zero, we just treat theta one as a constant and vice versa.

But why do we use partial derivatives in the equation? Because they give us a way of measuring how well our hypothesis function fits the data. We need to estimate the parameters (theta zero and theta one) in the hypothesis function, that is, we want to know the rate of change of the cost with respect to theta zero and theta one. In calculus, a partial derivative represents the rate of change of a function as one variable changes while the others are held constant. We apply the partial derivatives with respect to theta zero and theta one to the cost function to point us toward the lowest point.

2. Plug them back into our gradient descent algorithm

To find the minimum, we repeatedly apply this update, producing new values for theta zero and theta one at every step. In other words, we repeat the steps until convergence.

Finding the optimal values for theta zero and theta one therefore comes down to repeatedly moving the parameters in the direction opposite to the gradient until the cost stops decreasing.

Hence, to solve for the gradient at the next step of the iteration, we iterate through our data points using our updated theta zero and theta one values and compute their partial derivatives. This new gradient tells us the slope of our cost function at our current position and the direction we should move to update our parameters. The size of our update is controlled by the learning rate.
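Written out with a learning rate \alpha, one simultaneous update of both parameters (repeated until convergence) is:

\theta_0 := \theta_0 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}

This is the same rule the gradient_descent function implements in the case study below.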

Pros and cons of gradient descent

  • A simple algorithm that is easy to implement and each iteration is cheap; we just need to compute a gradient
  • However, it’s often slow because many interesting problems are not strongly convex
  • Cannot handle non-differentiable functions (biggest downside)

Gradient Descent variants

There are three types of gradient descent methods based on the amount of data used to calculate the gradient:

  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent

Batch Gradient Descent

  • In batch gradient descent, to calculate the gradient of the cost function, we calculate the error for each example in the training dataset and then take the sum. The model is updated only after all examples have been evaluated.
  • What if we have 1,000 samples, or in a worst-case scenario, one million samples? The error would have to be computed for all one million samples before a single parameter update could be made. So batch gradient descent is not a good fit for large datasets.

As we can see, batch gradient descent is not an optimal solution here. It requires a large amount of computation, and the entire dataset needs to remain in memory. If we just need to move a single step towards the minimum, should we really have to calculate the error a million times? A rough sketch of a single batch update follows.
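As a minimal, illustrative sketch (not the article's implementation; it assumes X and y are NumPy arrays and alpha is the learning rate):

import numpy as np

def batch_gd_step(X, y, theta_0, theta_1, alpha):
    # One batch update: the error is accumulated over ALL m examples
    # before the parameters are touched.
    m = len(X)
    error = (theta_1 * X + theta_0) - y                 # prediction error on the whole dataset
    theta_0 = theta_0 - alpha * error.sum() / m         # average gradient w.r.t. theta_0
    theta_1 = theta_1 - alpha * (error * X).sum() / m   # average gradient w.r.t. theta_1
    return theta_0, theta_1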

Stochastic Gradient Descent (SGD)

  • In SGD, we use a single training sample at each iteration instead of summing over the whole dataset for every step; that is, SGD performs a parameter update for each observation. So instead of looping over all observations before updating, it only needs one example to perform a parameter update (a minimal sketch appears after this list).

Note: In SGD, before looping over the examples, we need to randomly shuffle the training set.

  • SGD is usually faster than batch gradient descent, but its path to the minimum is noisier, since SGD uses only one example at a time. That is fine, as we are indifferent to the exact path, as long as it reaches the minimum in a shorter training time.
  • SGD is widely used for training on larger datasets: it is computationally faster per update and can allow for parallel model training.
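Here is a minimal, illustrative sketch of one SGD epoch under the same assumptions (the helper name sgd_epoch is made up for this example):

import numpy as np

def sgd_epoch(X, y, theta_0, theta_1, alpha):
    # Shuffle the examples first, then update the parameters after EACH one.
    indices = np.random.permutation(len(X))
    for i in indices:
        error = (theta_1 * X[i] + theta_0) - y[i]
        theta_0 = theta_0 - alpha * error
        theta_1 = theta_1 - alpha * error * X[i]
    return theta_0, theta_1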

Mini-Batch Gradient Descent

  • Mini-batch gradient descent is a combination of both batch gradient descent and stochastic gradient descent.
  • Mini-batch gradient descent uses a small batch of n data points (instead of one sample, as in SGD) at each iteration; see the sketch below.
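And a minimal, illustrative sketch of one mini-batch epoch; the batch_size of 32 is just a common default, not something prescribed by the method:

import numpy as np

def minibatch_epoch(X, y, theta_0, theta_1, alpha, batch_size=32):
    # Update the parameters once per mini-batch of `batch_size` examples.
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        error = (theta_1 * X[batch] + theta_0) - y[batch]
        theta_0 = theta_0 - alpha * error.mean()
        theta_1 = theta_1 - alpha * (error * X[batch]).mean()
    return theta_0, theta_1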

A case study:

We have learned all we need to implement Linear Regression. Now it’s time to see how it works on a dataset. I have learned so much by implementing a simple linear regression in Python. I hope you will learn a thing or two after reading my note.

I downloaded the Boston weather reports from the National Oceanic and Atmospheric Administration. You can also search Kaggle for competitions, datasets, and other solutions. Our dataset contains information on weather conditions recorded each day at a weather station. The columns include average temperature (TAVG), cooling degree days season to date (CDSD), extreme maximum temperature for the period (EMXT), heating degree days season to date (HDSD), maximum temperature (TMAX), and minimum temperature (TMIN). In this example, we want to predict the maximum temperature using the minimum temperature as the input feature.

Let’s get our hands dirty with Python, shall we?

  1. Import all the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
%matplotlib inline

2. Import the CSV dataset using pandas:

df = pd.read_csv('climate.csv')
df.dropna(inplace = True)

We use the dropna() function to remove missing values.

3. Check the number of rows and columns in our dataset.

df.shape

We should receive output as (903,9), which means our data contains 903 rows and 9 columns.

We can see the statistical details of our dataset by using the describe() function:

df.describe()

4. Visualize our dataset to see if we can spot any relationship in the data.

fig, (ax1) = plt.subplots(1, figsize=(12, 6))
ax1.scatter(df['TMIN'], df['TMAX'], s=8)  # X and y are defined in the next step, so plot the columns directly
plt.title('Min vs Max Temp')
plt.xlabel('TMIN')
plt.ylabel('TMAX')
plt.show()

5. Divide the data into “attributes” and “labels”.

Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted. In our dataset we use only two of the columns: we want to predict TMAX depending upon the TMIN recorded. Therefore our attribute set consists of the “TMIN” column, which is stored in the X variable, and the label is the “TMAX” column, which is stored in the y variable.

X = df['TMIN'].values.reshape(-1,1).astype('float32')
y = df['TMAX'].values.reshape(-1,1).astype('float32')

6. Split 80% of the data into the training set and the remaining 20% into the test set.

The test_size argument specifies the proportion of the data held out for the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

7. Train our algorithm.

For this, we need to import the LinearRegression class, instantiate it, and call the fit() method along with our training data.

h = LinearRegression()
h.fit(X_train,y_train)
print(h.intercept_) # to retrieve theta_0
print(h.coef_) # to retrieve theta_1

The result should be approximately 16.25 for theta_0 and 1.07 for theta_1.
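In other words, the fitted hypothesis is approximately h(x) = 16.25 + 1.07x, i.e. predicted TMAX ≈ 16.25 + 1.07 · TMIN, so each additional degree of minimum temperature adds about 1.07 degrees to the predicted maximum.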

8. Make some predictions.

To do so, we will use our test data and see how accurately our algorithm predicts the maximum temperature.

y_pred = h.predict(X_test)
compare = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
compare
(Table: comparison of actual and predicted values.)

As we can see from the table above, the predicted values are close to the actual ones. Let's plot a straight line against the test data:

fig, (ax1) = plt.subplots(1, figsize=(12, 6))
ax1.scatter(X_test, y_test, s=8)
plt.plot(X_test, y_pred, color='black', linewidth=2)
plt.show()

The fitted line tracks the test data closely, which indicates that the prediction errors are small.
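If we want a number to back this up, scikit-learn's metrics module provides standard regression error metrics; this quick check is optional and not needed for the rest of the walkthrough:

from sklearn import metrics

# Standard regression error metrics on the held-out test set
print('MAE: ', metrics.mean_absolute_error(y_test, y_pred))
print('MSE: ', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))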

9. Implement linear regression with gradient descent from scratch.

# Pick some random values to start with
theta_0 = np.random.random()
theta_1 = np.random.random()

def hypothesis(theta_0, theta_1, X):
    return theta_1 * X + theta_0

def cost_function(X, y, theta_0, theta_1):
    m = len(X)
    summation = 0.0
    for i in range(m):
        summation += ((theta_1 * X[i] + theta_0) - y[i]) ** 2
    return summation / (2 * m)

def gradient_descent(X, y, theta_0, theta_1, learning_rate):
    t0_deriv = 0
    t1_deriv = 0
    m = len(X)

    # Accumulate the partial derivatives over all training examples
    for i in range(m):
        t0_deriv += (theta_1 * X[i] + theta_0) - y[i]
        t1_deriv += ((theta_1 * X[i] + theta_0) - y[i]) * X[i]

    # Simultaneous update of both parameters
    theta_0 -= (1/m) * learning_rate * t0_deriv
    theta_1 -= (1/m) * learning_rate * t1_deriv

    return theta_0, theta_1

def training(X, y, theta_0, theta_1, learning_rate, iters):
    cost_history = [0]
    t0_history = [0]
    t1_history = [0]

    for i in range(iters):
        theta_0, theta_1 = gradient_descent(X, y, theta_0, theta_1, learning_rate)
        t0_history.append(theta_0)
        t1_history.append(theta_1)
        cost = cost_function(X, y, theta_0, theta_1)
        cost_history.append(cost)
        if i % 10 == 0:
            print("iter={}, theta_0={}, theta_1={}, cost={}".format(i, theta_0, theta_1, cost))

    return t0_history, t1_history, cost_history

We choose a learning rate of 0.01, run 2,000 iterations, and plot our cost function J:

t0_history, t1_history, cost_history = training(X, y, theta_0, theta_1, 0.01, 2000)

# Plot the cost function
plt.title('Cost Function J')
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.plot(cost_history)
plt.ylim(bottom=0)
plt.xlim(left=0)
plt.show()

I found a cool way to visualize our data using Animations with Matplotlib. It takes 449 iterations for the model to come quite close to the best fit line.

import matplotlib.animation as animation
from IPython.display import HTML

# Set up the plot
fig = plt.figure()
ax = plt.axes()
plt.ylabel("TMAX")
plt.xlabel("TMIN")
plt.title('Linear Regression')
plt.scatter(X, y, color='gray', s=8)
line, = ax.plot([], [], lw=2)
annotation = ax.text(-4, 35, '')  # text object that will display the current cost
plt.close()

# Generate the animation data
def init():
    line.set_data([], [])
    annotation.set_text('')
    return line, annotation

# Animation function. This is called sequentially
def animate(i):
    x = np.linspace(-5, 20, 1000)
    y = t1_history[i] * x + t0_history[i]   # line for the parameters at iteration i
    line.set_data(x, y)
    annotation.set_text('Cost = %.2f' % float(cost_history[i]))
    return line, annotation

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=np.arange(1, 400), interval=40, blit=True)
HTML(anim.to_html5_video())

That’s it!

In this note, we studied one of the most fundamental optimization algorithms in machine learning: gradient descent. We implemented a simple linear regression with the help of the scikit-learn machine learning library. In the next note, we will focus on multiple linear regression.

The hardest part of any endeavor is the beginning, and you have passed that, so don’t stop!

Lucky for us, linear regression is well taught in almost every machine learning curriculum, and there are a decent number of solid resources out there to help us understand the different parts of a linear regression model, including the mathematics behind them. Below are some more resources if you find yourself wanting to learn even more.

  1. Optimizing learning rate
  2. Elimination of all bad local minima in deep learning (Cornell University)
  3. Elimination of all bad local minima in deep learning (MIT)
  4. What is convergence?
  5. Why Visualize Gradient Descent Optimization Algorithms?
  6. A case study — Moneyball — Linear Regression
  7. Split your data into training and testing (80/20)
  8. Partial derivative in gradient descent for two variables
  9. Free data to work on: Lionbridge AI, Dataset Search
