
End-to-End Machine Learning Project Tutorial — Part 1


The perpetual question about Data Science that I come across:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no alternative to working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate Exam, I'd say that no certificate or course proves your competency; only projects that showcase your research, programming skills, mathematical background, etc. can do that.

In my post on how to build an effective Data Science Portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.

Agenda

This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Major topics covered:

  • Pre-requisites and Resources
  • Data Collection and Problem Statement
  • Exploratory Data Analysis with Pandas and NumPy
  • Data Preparation using Sklearn
  • Selecting and Training a few Machine Learning Models
  • Cross-Validation and Hyperparameter Tuning using Sklearn
  • Deploying the Final Trained Model on Heroku via a Flask App

Let’s start building…

Pre-requisites and Resources

This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:

That’s it. Make sure you have an understanding of these concepts and tools, and you’re ready to go!

Data Collection and Problem Statement


The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), then the first step is to define the problem you want to solve. We don’t have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository (the download URL appears in the wget command below).

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.

  • You can also download the data into your project from the notebook using wget:
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"


  • The next step is to load this .data file into a pandas dataframe; for that, make sure you have pandas and the other general-purpose libraries installed. Import all the general-purpose libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  • Reading and loading the file into a dataframe using the read_csv() method (a loading sketch follows this list).
  • Looking at a few rows of the dataframe and reading the description of each attribute on the website helps you define the problem statement.
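Since the raw .data file has no header row, a minimal loading sketch might look like the following. The column names and parsing options here are my assumptions based on the UCI attribute description: the file is whitespace-separated, uses '?' for missing values, and appends the car name after a tab character.

##column names assumed from the UCI attribute description
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

##reading the whitespace-separated file; '?' marks missing values and
##the trailing car-name field (after a tab) is skipped via comment='\t'
data = pd.read_csv('./auto-mpg.data', names=cols, na_values='?',
                   comment='\t', sep=' ', skipinitialspace=True)

##looking at a few rows
data.head()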


Problem Statement — The data contains the MPG (miles per gallon) variable, which is continuous and tells us about the fuel efficiency of vehicles from the 70s and 80s.

Our aim here is to predict the MPG value for a vehicle given we have other attributes of that vehicle.

Exploratory Data Analysis with Pandas and NumPy

For this rather simple dataset, the exploration is broken down into a series of steps:

  1. Check the data types of the columns
##checking the data info
data.info()

2. Check for null values.

##checking for all the null values
data.isnull().sum()

ziM3Ify.png!web

The horsepower column has 6 missing values. We’ll have to study the column a bit more.

3. Check for outliers in horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])


Since there are a few outliers, we can use the median of the column to impute the missing values using the pandas median() method.

##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()

4. Look for the category distribution in categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)

data['Origin'].value_counts()

The two categorical columns are Cylinders and Origin, which have only a few categories of values. Looking at the distribution of values among these categories tells us how the data is distributed:


5. Plot for correlation

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")


The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.
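As a quick numeric check of that intuition (this snippet is my addition, not part of the original walkthrough), you can compute the correlation of MPG with the same columns:

##correlation of MPG with the plotted features
data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]].corr()["MPG"]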

6. Set aside the test data set

This is one of the first things we should do as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.

Stratified Sampling — We create homogeneous subgroups called strata from the overall population and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.

In task 4, we saw how the data is distributed over each category of the Cylinders column. We’re using the Cylinders column to create the strata:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Checking the distribution in the training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)

Testing set:

strat_test_set["Cylinders"].value_counts() / len(strat_test_set)

You can compare these results with the output of train_test_split() to find out which one produces better splits.
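For example, a plain random split for comparison might look like this (a sketch of the check described above, not code from the original article):

from sklearn.model_selection import train_test_split

##plain random split, to compare its Cylinders distribution with the stratified one
rand_train_set, rand_test_set = train_test_split(data, test_size=0.2, random_state=42)
rand_test_set["Cylinders"].value_counts() / len(rand_test_set)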

7. Checking the Origin Column

The Origin column holds information about the origin of the vehicle and has discrete values that look like country codes.

To add some complication and make it more explicit, I converted these numbers to strings:

##working on a copy of the stratified training set
train_set = strat_train_set.copy()

##converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
train_set.sample(10)


We’ll have to preprocess this categorical column by one-hot encoding these values:

##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()

8. Testing for new variables — Analyze the correlation of each variable with the target variable

## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)


We found acceleration_on_power and acceleration_on_cyl to be two new variables that turned out to be more positively correlated with MPG than the original variables.

This brings us to the end of the Exploratory Analysis. We are ready to proceed to the next step: preparing the data for our Machine Learning models.

