
Complete Machine Learning solution(Part 2|3): Create and Manage ML Model



“Errors using inadequate data are much less than those using no data at all.”

Charles Babbage, inventor and mathematician


This is the second blog post of this series. I am assuming that you have already gone through the first one (Complete Machine Learning solution(Part 1|3): Create Flask Application); if not, I would suggest going through that first. Here I am going to show you how we can develop our Machine Learning solution to make it usable by any web application. I am dividing this blog post into four sub-parts:

  1. Loading Data
  2. Data Pre-processing and Feature Extraction
  3. Model Building and Cross-validation
  4. Making Predictions

Loading Data

If you remember the Project Architecture section of our previous blog post, we planned to have a /Data directory at our project's root, so let's create this directory first. Since this is the place where our data sets are going to live, let's copy the test and train sets into it. You can download them from my GitHub repository.

Next, create a file /Src/utils/ClassificationModelBuilder.py and add the below content for now:

# Library Import
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import RidgeClassifier, RidgeClassifierCV, LogisticRegression, LogisticRegressionCV
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import f1_score, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
import dill as pickle
from .DataPreparation import *
# Ignoring all warnings
import warnings
warnings.filterwarnings("ignore")
import os
# read dataset
train_df = pd.read_csv('./Data/train.csv')
# global random state
rand_state_ = 42

In the above code, we imported all the required packages, suppressed warnings, created the DataFrame train_df containing the training data set, and defined a global random state that will stay constant throughout model building.
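If you want a quick, optional sanity check that the data loaded correctly, a few lines like the following (a sketch, not part of the final script) print the shape, a preview, and the missing-value counts:

# Optional sanity check of the loaded data
print(train_df.shape)           # (891, 12) for the standard Titanic train set
print(train_df.head())          # preview the first few rows
print(train_df.isnull().sum())  # missing-value counts per column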

Data Pre-processing and Feature Extraction

Our data needs to be prepared first; only then can we use it to train our classification models. I did some visualization and feature analysis (check out Src/notebook for details) and reached the following conclusions:

  • The features PassengerId and Ticket seem to be of no use because they don't convey any useful information.
  • Age and Fare are of type float64; it will be better to take their ceiling values and convert them to int64.
  • Cabin contains more than 77% missing values. Imputing that many values on our own isn't advisable, and even if we did, the feature would lose its significance anyway, so we had better discard it.
  • From Name, we can create a new feature called Title (see the short regex sketch after this list). It may help the model perform better. After that, we can get rid of Name too.
  • We have the SibSp and Parch features, which give a person's number of siblings/spouses and parents/children aboard. We can use them to derive two new features, FamilySize and IsAlone.
  • The newly created FamilySize feature can in turn be used to create another feature, FarePerHead.
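
To make the Title idea concrete, here is a minimal sketch of the lookbehind regex we will use below; the sample name is taken from the test set:

import re
# '(?<=, )\w+' grabs the first word right after ", " in a passenger name
name = 'Kelly, Mr. James'
title = re.search(r'(?<=, )\w+', name).group(0)
print(title)  # Mr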

Let’s create a class DataPreparation to perform all these operations:

/Src/utils/DataPreparation.py

# Library Import
import numpy as np   # linear algebra
import pandas as pd  # data processing
from sklearn.preprocessing import LabelEncoder
import re

class DataPreparation():
    def __init__(self):
        pass

    def preprocess(self, df):
        # Run the full pre-processing pipeline on a raw DataFrame
        df = self.fill_missing_values(df)
        df = self.feature_extraction(df)
        df = self.handle_categorical_variables(df)
        df = self.dimensionality_reduction(df)
        return df

    def fill_missing_values(self, df):
        # Impute Age/Fare with median/mean, Embarked with the mode
        df.Age = np.ceil(df.Age.fillna(df.Age.median())).astype(int)
        df.Embarked = df.Embarked.fillna(df.Embarked.mode()[0])
        df.Fare = np.ceil(df.Fare.fillna(df.Fare.mean())).astype(int)
        return df

    def feature_extraction(self, df):
        # Derive FamilySize, FarePerHead, IsAlone, AgeGroup and Title
        df['FamilySize'] = df.SibSp + df.Parch + 1
        df['FarePerHead'] = (df.Fare/df.FamilySize).astype(int)
        df['IsAlone'] = df.FamilySize.apply(lambda x: 1 if x == 1 else 0)
        df['AgeGroup'] = df.Age.apply(lambda x: 'kid' if x < 13 else 'teen' if x < 20 else 'adult' if x < 41 else 'old')
        df['Title'] = df.Name.apply(lambda x: re.search(r'(?<=, )\w+', x).group(0))
        # Collapse rare titles into the common ones
        df.Title.replace(to_replace=['Ms', 'Lady', 'the', 'Dona'], value='Mrs', inplace=True)
        df.Title.replace(to_replace=['Mme', 'Mlle'], value='Miss', inplace=True)
        df.Title.replace(to_replace=['Jonkheer', 'Sir', 'Capt', 'Don', 'Col', 'Major', 'Rev', 'Dr'], value='Mr', inplace=True)
        return df

    def handle_categorical_variables(self, df):
        # One-hot encode Sex/Embarked, label-encode AgeGroup and Title
        df = pd.get_dummies(df, drop_first=True, columns=['Sex', 'Embarked'])
        df.AgeGroup = LabelEncoder().fit_transform(df.AgeGroup)
        df.Title = LabelEncoder().fit_transform(df.Title)
        return df

    def dimensionality_reduction(self, df):
        # Drop features that carry no predictive signal
        return df.drop(labels=['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

The above class takes care of all the required operations and returns pre-processed data that can be used to train classification models. Note that we kept this class completely separate and generic so that we can use it to pre-process both the train and test data sets.
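
As a quick, optional sanity check, you can run the pipeline on a copy of the train set from a notebook or REPL with the class above in scope; this is just a sketch, not part of the application code:

# Optional sanity check, assuming DataPreparation is in scope
import pandas as pd
df = pd.read_csv('./Data/train.csv')
prepared = DataPreparation().preprocess(df.copy())
print(prepared.columns.tolist())  # the engineered feature set
print(prepared.dtypes)            # everything should be numeric now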

Model Building and Cross-validation

Just as we created a class for data pre-processing, we can create one for model training, cross-validation, and evaluation. However, we don't need a separate file for it; we can append it to the ClassificationModelBuilder.py file itself, since it will only be required at model-building time.

/Src/utils/ClassificationModelBuilder.py

class Modeling(object):
    def __init__(self, test_train_ratio):
        self.classifiers = {}
        # Split train_df into train/test feature sets and target variable
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            train_df.drop(['Survived'], axis=1), train_df.Survived,
            test_size=test_train_ratio, random_state=rand_state_)

    def evaluate_model(self, model_name, train_predictions, test_predictions):
        # Record train/test accuracy for later comparison
        self.classifiers[model_name] = {
            'TrainingAccuracy': accuracy_score(
                                self.y_train, train_predictions),
            'TestAccuracy': accuracy_score(
                            self.y_test, test_predictions)
        }

    def fit_and_predict_using_RandomSearchCV(self, classifier):
        # Search the hyper-parameter grid with 10-fold cross-validation
        random_cv_model = RandomizedSearchCV(
            estimator=classifier['instance'],
            param_distributions=classifier['param_grid'],
            cv=10)
        random_cv_model.fit(self.X_train, self.y_train)
        self.evaluate_model(classifier['name'],
                            random_cv_model.predict(self.X_train),
                            random_cv_model.predict(self.X_test))
        # Keep the refit estimator with the best hyper-parameters found
        self.classifiers[classifier['name']]['Estimator'] = random_cv_model.best_estimator_
        return self.classifiers[classifier['name']]

    def voting_classifier(self, classifier_names):
        # Combine the selected trained estimators into a hard-voting ensemble
        selected_classifiers = [(classifier_name, self.classifiers[classifier_name]['Estimator'])
                                for classifier_name in classifier_names]
        voting_classifier = VotingClassifier(
            estimators=selected_classifiers,
            voting='hard')
        voting_classifier.fit(self.X_train, self.y_train)
        self.evaluate_model(voting_classifier.__class__.__name__,
                            voting_classifier.predict(self.X_train),
                            voting_classifier.predict(self.X_test))
        self.classifiers[voting_classifier.__class__.__name__]['Estimator'] = voting_classifier
        return self.classifiers[voting_classifier.__class__.__name__]

In this class,

  • We split train_df into train/test feature sets and the target variable in the desired ratio.
  • The fit_and_predict_using_RandomSearchCV(self, classifier) method performs cross-validation on the specified classifier and finds the best hyper-parameters using RandomizedSearchCV. We prefer RandomizedSearchCV over GridSearchCV because it is much faster and usually yields similar, reliable hyper-parameters. Once we get the model instance with the best set of hyper-parameters, we make predictions with it and compute the accuracy on the train and test sets using the evaluate_model(self, model_name, train_predictions, test_predictions) method.
  • The voting_classifier(self, classifier_names) method uses a simple voting ensemble technique: it takes a set of trained models and builds a more effective model that performs 'hard' (majority) voting on each prediction to produce the final prediction. See the short sketch below.
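
To see what 'hard' voting means in isolation, here is a tiny illustrative sketch with toy predictions (not our data): each sample's final class is simply the majority class across the individual models' predictions.

import numpy as np
# Toy binary predictions from three models for four samples
preds = np.array([[1, 0, 1, 0],   # model A
                  [1, 1, 0, 0],   # model B
                  [1, 0, 0, 1]])  # model C
# Hard voting = per-sample majority class (2 of 3 votes)
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)  # [1 0 0 0]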

Let’s go ahead and start using these classes to build our final classifier.

/Src/utils/ClassificationModelBuilder.py

… …
… …
"""
    Processing Datasets
"""
data_preparation = DataPreparation()
train_df = data_preparation.preprocess(train_df)
"""
    Building and comparing Models
"""
model_ops = Modeling(3/10)
classifiers = [
    {
        'name': 'DecisionTreeClassifier',
        'instance': DecisionTreeClassifier(),
        'param_grid': {
            'splitter': ['best', 'random'],
            'criterion': ['gini', 'entropy'],
            'max_depth': [3, 4],
            'min_samples_split': [2, 3, 4],
            'max_features': ['sqrt'],
            'random_state': [rand_state_]
        }
    }, {
        'name': 'RandomForestClassifier',
        'instance': RandomForestClassifier(),
        'param_grid': {
            'n_estimators': [10, 30, 60, 90, 100],
            'criterion': ['gini', 'entropy'],
            'max_depth': [3, 4],
            'min_samples_split': [2, 3, 4],
            'max_features': ['sqrt'],
            'random_state': [rand_state_]
        }
    }, {
        'name': 'XGBClassifier',
        'instance': XGBClassifier(),
        'param_grid': {
            'max_depth': [3, 4, 5],
            'learning_rate': [.1, .06, .03, .01],
            'n_estimators': [80, 100, 120],
            'booster': ['gbtree', 'gblinear', 'dart'],
            'gamma': [0, 2, 4],
            'random_state': [rand_state_]
        }
    }, {
        'name': 'KNeighborsClassifier',
        'instance': KNeighborsClassifier(),
        'param_grid': {
            'n_neighbors': [5, 6, 7, 8, 9],
            'weights': ['uniform', 'distance'],
            'algorithm': ['ball_tree', 'kd_tree', 'brute'],
            'p': [1, 2]
        }
    }, {
        'name': 'ExtraTreesClassifier',
        'instance': ExtraTreesClassifier(),
        'param_grid': {
            'n_estimators': [20, 40, 80],
            'min_samples_split': [2, 3, 4],
            'criterion': ['gini', 'entropy'],
            'max_features': ['sqrt']
        }
    }, {
        'name': 'RidgeClassifierCV',
        'instance': RidgeClassifierCV(),
        'param_grid': {
            'alphas': [(0.05, 0.1, 0.5, 1, 2)]
        }
    }, {
        'name': 'AdaBoostClassifier',
        'instance': AdaBoostClassifier(),
        'param_grid': {
            'base_estimator': [ 
                # Decided hyper parameter values after RandomSearch Cross Validation
                DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=1, min_samples_split=3, random_state=rand_state_, splitter='best'),
                XGBClassifier(booster='dart', gamma=2, learning_rate=0.1, max_depth=3, n_estimators=100, random_state=rand_state_)
            ],
            'n_estimators': [50, 70, 90],
            'random_state': [rand_state_],
            'algorithm': ['SAMME', 'SAMME.R'],
            'learning_rate': [0.8, 1.0, 1.3]
        }
    }, {
        'name': 'BaggingClassifier',
        'instance': BaggingClassifier(),
        'param_grid': {
            'base_estimator': [
                # Decided hyper parameter values after RandomSearch Cross Validation
                DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=1, min_samples_split=3, random_state=rand_state_, splitter='best'),
                XGBClassifier(booster='dart', gamma=2, learning_rate=0.1, max_depth=3, n_estimators=100, random_state=rand_state_)
            ],
            'n_estimators': [10, 20, 30],
            'random_state': [rand_state_],
            'bootstrap': [True, False],
            'bootstrap_features': [True, False]
        }
    }, {
        'name': 'GradientBoostingClassifier',
        'instance': GradientBoostingClassifier(),
        'param_grid': {
            'loss': ['deviance', 'exponential'],
            'n_estimators': [100, 120, 150],
            'random_state': [rand_state_],
            'min_samples_split': [2, 3, 4],
            'max_depth': [3, 4, 5]
        }
    }
]
for classifier in classifiers:
    classifier_performance = model_ops.fit_and_predict_using_RandomSearchCV(classifier)
    print(f"{classifier['name']} Performance - \n{classifier_performance}")
# We are going to use the SimpleVoting Ensemble technique to make our final predictions.
voting_classifier = model_ops.voting_classifier(['AdaBoostClassifier', 'BaggingClassifier'])
print(f'VotingClassifier Performance - \n{voting_classifier}')
# Comparing performance of Classifiers 
score_df = pd.DataFrame([{'ModelName': name, 'Test Accuracy': props['TestAccuracy'], 'Training Accuracy': props['TrainingAccuracy']} for name, props in model_ops.classifiers.items()])
score_df = score_df.set_index('ModelName')
print(f'All Classifiers Performance Table - \n{score_df}')
# We are saving VotingClassifier trained instance, if you want you can save any other model/s as well.
filename = 'voting_classifier_v1.pk'
with open('./Src/ml-model/'+filename, 'wb') as file:
    pickle.dump(voting_classifier['Estimator'], file)

Below is the performance obtained after training and evaluating various classifiers.

[Image: train/test accuracy comparison table for all classifiers]

I am choosing AdaBoostClassifier and BaggingClassifier to build the ensemble voting_classifier, and will use it to make the final predictions. At last, we save voting_classifier to /Src/ml-model/voting_classifier_v1.pk using Python's dill package. dill is an extension of the pickle module and performs serialization and de-serialization of Python objects, in our case the trained model. In simple words, dill will keep the instance of our trained model in the voting_classifier_v1.pk file. Write in the comments if you face any difficulty understanding this class, and I will edit and elaborate more.
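If you want to verify the save worked before wiring it into the API, a minimal check (a sketch, assuming the paths and names from the script above) is to load the file back with dill and score it on the held-out split:

# Optional: verify the serialized model round-trips correctly
with open('./Src/ml-model/voting_classifier_v1.pk', 'rb') as file:
    restored_model = pickle.load(file)
print(accuracy_score(model_ops.y_test, restored_model.predict(model_ops.X_test)))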

Now, if you remember from the previous blog post, in the application.py file we commented out the import statement on the second line; let's uncomment it now. It will run the entire model-building code on server start and save the final classifier. From the second run onwards, since the model file will already exist from the first execution, you can comment this import statement out again.

application.py

… .. .
from Src.utils import ClassificationModelBuilder
… .. .
… .. .

Let's also create an empty file /Src/utils/__init__.py so that all classes in the utils directory are importable as a package.

Making Predictions

Now it's time to change the content of the predict.py file:

/Src/api/predict.py

from flask import Blueprint, jsonify, request
import pandas as pd
import dill as pickle
import json
from Src.utils.DataPreparation import *

predict_api = Blueprint("predict_api", __name__)

@predict_api.route('/predict', methods=['POST'])
def apicall():
    try:
        test_json_dump = json.dumps(request.get_json())
        test_df = pd.read_json(test_json_dump, orient='records')
        # Because of request processing, Age comes in as object; it needs to be numeric.
        test_df['Age'] = pd.to_numeric(test_df.Age, errors='coerce')
        # Getting the PassengerId separated out
        passenger_ids = test_df['PassengerId']
    except Exception as e:
        print(':::: Exception occurred while reading json content ::::')
        raise e

    if test_df.empty:
        return bad_request()
    else:
        # Load the saved model
        loaded_model = None
        with open('./Src/ml-model/voting_classifier_v1.pk', 'rb') as model:
            loaded_model = pickle.load(model)
        # Before we make any prediction, let's pre-process first.
        data_preparation = DataPreparation()
        test_df = data_preparation.preprocess(test_df)
        print(f'After pre-process test df - \n {test_df}')
        predictions = loaded_model.predict(test_df)
        prediction_series = list(pd.Series(predictions))
        final_predictions = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': prediction_series})
        responses = jsonify(predictions=final_predictions.to_json(orient='records'))
        responses.status_code = 200
        return responses

Here we simply parse the request body JSON into a DataFrame and, after pre-processing, use it to make predictions. For the predictions we use the same model that we saved inside Src/ml-model.

If you have done everything correctly, then for the POST endpoint and request parameters below:

POST API: http://localhost:5000/titanic-survival-classification-model/predict

[{"PassengerId":892, "Pclass":3, "Name":"Kelly, Mr. James", "Sex":"male", "Age":34.5, "SibSp":0, "Parch":0, "Ticket":330911, "Fare":7.8292, "Cabin":"", "Embarked":"Q"},
 {"PassengerId":893, "Pclass":3, "Name":"Wilkes, Mrs. James (Ellen Needs)", "Sex":"female", "Age":47, "SibSp":1, "Parch":0, "Ticket":363272, "Fare":7, "Cabin":"", "Embarked":"S"},
 {"PassengerId":894, "Pclass":2, "Name":"Myles, Mr. Thomas Francis", "Sex":"male", "Age":62, "SibSp":0, "Parch":0, "Ticket":240276, "Fare":9.6875, "Cabin":"", "Embarked":"Q"},
 {"PassengerId":895, "Pclass":3, "Name":"Wirz, Mr. Albert", "Sex":"male", "Age":27, "SibSp":0, "Parch":0, "Ticket":315154, "Fare":8.6625, "Cabin":"", "Embarked":"S"},
 {"PassengerId":896, "Pclass":3, "Name":"Hirvonen, Mrs. Alexander (Helga ELindqvist)", "Sex":"female", "Age":22, "SibSp":1, "Parch":1, "Ticket":3101298, "Fare":12.2875, "Cabin":"", "Embarked":"S"},
 {"PassengerId":897, "Pclass":3, "Name":"Svensson, Mr. Johan Cervin", "Sex":"male", "Age":14, "SibSp":0, "Parch":0, "Ticket":7538, "Fare":9.225, "Cabin":"", "Embarked":"S"},
 {"PassengerId":898, "Pclass":3, "Name":"Connolly, Miss. Kate", "Sex":"female", "Age":30, "SibSp":0, "Parch":0, "Ticket":330972, "Fare":7.6292, "Cabin":"", "Embarked":"Q"},
 {"PassengerId":899, "Pclass":2, "Name":"Caldwell, Mr. Albert Francis", "Sex":"male", "Age":26, "SibSp":1, "Parch":1, "Ticket":248738, "Fare":29, "Cabin":"", "Embarked":"S"},
 {"PassengerId":900, "Pclass":3, "Name":"Abrahim, Mrs. Joseph (Sophie Halaut Easu)", "Sex":"female", "Age":18, "SibSp":0, "Parch":0, "Ticket":2657, "Fare":7.2292, "Cabin":"", "Embarked":"C"},
 {"PassengerId":901, "Pclass":3, "Name":"Davies, Mr. John Samuel", "Sex":"male", "Age":21, "SibSp":2, "Parch":0, "Ticket":"A/4 48871", "Fare":24.15, "Cabin":"", "Embarked":"S"},
 {"PassengerId":902, "Pclass":3, "Name":"Ilieff, Mr. Ylio", "Sex":"male", "Age":"", "SibSp":0, "Parch":0, "Ticket":349220, "Fare":7.8958, "Cabin":"", "Embarked":"S"},
 {"PassengerId":903, "Pclass":1, "Name":"Jones, Mr. Charles Cresson", "Sex":"male", "Age":46, "SibSp":0, "Parch":0, "Ticket":694, "Fare":26, "Cabin":"", "Embarked":"S"}]

You should receive below response:

{
  "predictions": "[{\"PassengerId\":892,\"Survived\":1},{\"PassengerId\":893,\"Survived\":1},{\"PassengerId\":894,\"Survived\":0},{\"PassengerId\":895,\"Survived\":1},{\"PassengerId\":896,\"Survived\":1},{\"PassengerId\":897,\"Survived\":1},{\"PassengerId\":898,\"Survived\":0},{\"PassengerId\":899,\"Survived\":0},{\"PassengerId\":900,\"Survived\":1},{\"PassengerId\":901,\"Survived\":0},{\"PassengerId\":902,\"Survived\":1},{\"PassengerId\":903,\"Survived\":0}]"
}
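
If you prefer to test from Python instead of an API client, a minimal sketch using the requests library (assuming the server is running locally on port 5000 and requests is installed) looks like this:

import requests

# records = the full list of passenger dicts shown above
records = [{"PassengerId": 892, "Pclass": 3, "Name": "Kelly, Mr. James", "Sex": "male",
            "Age": 34.5, "SibSp": 0, "Parch": 0, "Ticket": 330911, "Fare": 7.8292,
            "Cabin": "", "Embarked": "Q"}]  # ...plus the remaining records from above
resp = requests.post(
    'http://localhost:5000/titanic-survival-classification-model/predict',
    json=records)
print(resp.status_code)
print(resp.json()['predictions'])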

Note that I took these request-body test records from our test data set. The request should have at least 10-12 records; otherwise, due to insufficient class values in the categorical features, the DataPreparation class won't be able to form the same number of feature columns it formed during training and will throw an error. A sketch of one way around this limitation follows.
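One way to remove this limitation (not part of the original code, and a fully robust fix would also persist the fitted encoders) is to save the training feature columns at model-building time and reindex every incoming frame against them, so missing dummy columns are added as zeros. The helper below is hypothetical:

# Sketch: align an incoming frame to the training feature columns.
# train_columns would be saved at model-building time, e.g.
# train_columns = list(train_df.drop(['Survived'], axis=1).columns)
def align_features(df, train_columns):
    # Adds any missing dummy columns as 0 and drops unseen ones
    return df.reindex(columns=train_columns, fill_value=0)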

Congrats! You have created a complete application. Now use the command pip freeze > requirements.txt to create a requirements.txt file that maintains the list of package dependencies; we will need this file during deployment.

You can also check out the complete project on my GitHub repository.

In the next blog post, we will proceed with the deployment of this application. Leave a comment if you face any issues or have any suggestions.

Thanks. Happy Learning ;)

