
End-to-End Machine Learning Project Tutorial — Part 1


The perpetual question about Data Science that I come across:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no alternative to working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate Exam, I'd say that no certificate or course proves your competency; only projects that showcase your research, programming skills, mathematical background, etc. can do that.

In my post on how to build an effective Data Science Portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.

Agenda

This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Major topics covered:

  • Pre-requisites and Resources
  • Data Collection and Problem Statement
  • Exploratory Data Analysis with Pandas and NumPy
  • Data Preparation using Sklearn
  • Selecting and Training a few Machine Learning Models
  • Cross-Validation and Hyperparameter Tuning using Sklearn
  • Deploying the Final Trained Model on Heroku via a Flask App

Let’s start building…

Pre-requisites and Resources

This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:

That’s it. Make sure you have an understanding of these concepts and tools, and you’re ready to go!

Data Collection and Problem Statement


The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), then the first step is to define the problem you want to solve. We don’t have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository (the download URL appears in the wget command below).

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.

  • You can also download the data into your project from the notebook using wget:
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"


  • The next step is to load this .data file into a pandas dataframe; for that, make sure you have pandas and the other general-purpose libraries installed. Import all the general-purpose libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  • Reading and loading the file into a dataframe using the read_csv() method (a loading sketch follows this list).
  • Looking at a few rows of the dataframe and reading the description of each attribute on the website helps you define the problem statement.
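Since the raw .data file has no header row, a minimal loading sketch might look like the following. The column names and parsing options here are my assumptions based on the UCI attribute description: the file is whitespace-separated, uses '?' for missing values, and appends the car name after a tab character.

##column names assumed from the UCI attribute description
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

##reading the whitespace-separated file; '?' marks missing values and
##the trailing car-name field (after a tab) is skipped via comment='\t'
data = pd.read_csv('./auto-mpg.data', names=cols, na_values='?',
                   comment='\t', sep=' ', skipinitialspace=True)

##looking at a few rows
data.head()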


Problem Statement — The data contains the MPG (miles per gallon) variable, which is continuous and tells us about the fuel efficiency of vehicles from the 70s and 80s.

Our aim here is to predict the MPG value for a vehicle given we have other attributes of that vehicle.

Exploratory Data Analysis with Pandas and NumPy

For this rather simple dataset, the exploration is broken down into a series of steps:

  1. Check the data types of the columns
##checking the data info
data.info()

2. Check for null values.

##checking for all the null values
data.isnull().sum()

ziM3Ify.png!web

The horsepower column has 6 missing values. We’ll have to study the column a bit more.

3. Check for outliers in horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])


Since there are a few outliers, we can use the median of the column to impute the missing values using the pandas median() method.

##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()

4. Look for the category distribution in categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)

data['Origin'].value_counts()

The two categorical columns are Cylinders and Origin, which have only a few categories of values. Looking at the distribution of values among these categories tells us how the data is distributed:


5. Plot for correlation

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")


The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.
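As a quick numeric check of that intuition (this snippet is my addition, not part of the original walkthrough), you can compute the correlation of MPG with the same columns:

##correlation of MPG with the plotted features
data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]].corr()["MPG"]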

6. Set aside the test data set

This is one of the first things we should do as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.

Stratified Sampling — We create homogeneous subgroups called strata from the overall population and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.

In task 4, we saw how the data is distributed over each category of the Cylinders column. We’re using the Cylinders column to create the strata:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Checking the distribution in the training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)

Testing set:

strat_test_set["Cylinders"].value_counts() / len(strat_test_set)

You can compare these results with the output of train_test_split() to find out which one produces better splits.
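For example, a plain random split for comparison might look like this (a sketch of the check described above, not code from the original article):

from sklearn.model_selection import train_test_split

##plain random split, to compare its Cylinders distribution with the stratified one
rand_train_set, rand_test_set = train_test_split(data, test_size=0.2, random_state=42)
rand_test_set["Cylinders"].value_counts() / len(rand_test_set)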

7. Checking the Origin Column

The Origin column holds information about the origin of the vehicle and has discrete values that look like country codes.

To add some complication and make it more explicit, I converted these numbers to strings:

##working on a copy of the stratified training set
train_set = strat_train_set.copy()

##converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
train_set.sample(10)


We’ll have to preprocess this categorical column by one-hot encoding these values:

##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()

8. Testing for new variables — Analyze the correlation of each variable with the target variable

## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)


We found acceleration_on_power and acceleration_on_cyl to be two new variables that turned out to be more positively correlated with MPG than the original variables.

This brings us to the end of the Exploratory Analysis. We are ready to proceed to the next step: preparing the data for our Machine Learning models.

