
The almost-5-minute Data Science Pipeline.


Introduction

So you guessed right. This pipeline will not take 5 minutes to develop, but it will not be far from it either. This post addresses the following problem:

“How can I minimize the development time of a Data Science Pipeline and check all required task boxes at the same time?”

Now is a great time to work on Data Science problems, since there are a lot of tools that can help you solve them using different approaches and focusing on different goals. There are tools that focus on achieving greater model performance and others that can assist your efforts towards minimizing development time. In this post, we focus on the latter.

The end goal of a firm is to offer products that maximize the utility of its consumers. By maximizing that utility, the firm can attract users and sustain a profitable business model. Out of all the ideas that will be examined, only a small sample will translate into a successful product; thus, by exploring more ideas over the same time period and using your time wisely, you can end up with more successful products.


While trying to minimize development time, we should be careful about which tasks we choose to cut time from. Below are some tasks that our pipeline should include:

  1. Data Profiling
  2. Data Cleaning
  3. Feature Engineering
  4. Modeling
  5. Application

Problem Definition

New knowledge is easier to comprehend through examples and real-life problems, so we will go through the aforementioned steps using a real (but sampled) dataset of user searches from the Property Finder portal.

Let’s define the problem:

Given a dataset of user searches from the Property Finder portal, build and deploy a model that predicts the expected price of a user for a property offered for sale, given a set of property characteristics, such as the time of search, property type, location, etc., as provided by the user.

They say that a good problem definition is half the solution, so only half of the problem is left. Let’s start!

Data Profiling

The first part of the pipeline is all about understanding the data. Distributions, data formats and missing values are some examples of data profiling tasks. To minimize the time spent on this task, we can utilize two Python libraries: pandas and pandas_profiling. For instance, we can use pandas to get the following results:

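The original screenshots are not reproduced here. A minimal sketch of what those five one-liners might look like (the exact calls and the input file name are assumptions):

import pandas as pd

# hypothetical input file; the original post loads a sampled searches dataset
df = pd.read_csv("searches_sample.csv")

df.head()                      # first few records
df.info()                      # column types and non-null counts
df.describe(include='all')     # summary statistics, including most frequent values
df.isnull().sum()              # missing values per column
df['domain_userid'].nunique()  # number of distinct users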

By running 5 one-liners we get a clear picture of the dataset. For example, we now know that:

  • There is a lot of missing information.
  • There are multiple searches per user.
  • Even though it looks like location has only a few missing values, the df.describe() method shows that its most frequent value is [].

We can get an even better grasp of our dataset by using the pandas_profiling library. This tool takes a pandas DataFrame as input and produces a report that can be viewed in a Jupyter Notebook or exported (e.g. as an HTML file), and it takes just a couple of lines of code!

import pandas_profiling
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile = "report.html")

The report is rich and provides various insights, such as per-variable statistics and distributions, correlations, and an overview of missing values.


Data Cleaning

In this part of the pipeline we address the data integrity issues. Some examples of data cleaning tasks are:

  • Dealing with missing values.
  • Changing the format of the features, so that they can be used by various Machine Learning algorithms effectively.
  • Changing the format of the features, so that their values comply with their actual nature (e.g. fixing negative prices).

Regarding missing information (and since we are aiming for a fast solution), one way would be to drop every row that contains at least one missing element. This is not a good practice, and below you can see why:

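A minimal sketch of the check (the original screenshot only showed the resulting shape):

print(df.shape)           # shape before dropping anything
print(df.dropna().shape)  # dropping every row with any missing value leaves only 698 records here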

So, by removing these rows we are left with only 698 records! We can still address the features’ missing values later in the process. What we can do now is get rid of the records that do not have at least one of the min/max price variables, since we will use them later to generate the target variable:

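One way to express this step, assuming the price columns are named min_price and max_price as in the rest of the post (the original screenshot is not reproduced):

# keep only rows where at least one of min_price / max_price is present
df = df.dropna(subset=['min_price', 'max_price'], how='all')
print(df.shape)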

We can also drop the duplicates and then check with how many records we are left:

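A sketch of the deduplication step (the original screenshot only showed the resulting record count):

df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)  # number of records left after removing duplicates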

This is also the right time to fix the issues in the location feature. As we saw in the Data Profiling section, the location feature contains lists of values, but we are actually interested in the values themselves. Some reasonable questions that emerge are:

  • What is the data type of the location feature values?
  • Is there more than one value contained in each list?

We can answer these questions using pandas:

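A sketch of one way to answer them; the results DataFrame and its list_elements_count column are assumptions, named to match the snippet further below:

# data type of the raw location values
print(df['location'].apply(type).value_counts())

# count how many comma-separated elements each location list holds
results = pd.DataFrame({
    'list_elements_count': df['location'].apply(
        lambda x: len(str(x).replace('[', '').replace(']', '').split(','))
    )
})
print(results['list_elements_count'].value_counts())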

So, by splitting each value on the “,” character and counting the elements, we see that the majority of lists contain a single value, but some records have lists with more than one. One way to deal with this is to create n records for the n locations contained in each list, keeping all the other feature values the same and assigning a different location to each new record. Of course, you can use pandas for this:

import numpy as np
import pandas as pd

# records whose location list holds more than one value
df_multi_loc = df[results.list_elements_count > 1].copy()
df_multi_loc_new = pd.DataFrame(columns=df_multi_loc.columns)
for j in range(df_multi_loc.shape[0]):
    temp_df = pd.DataFrame(columns=df_multi_loc.columns[:-1])
    # split the location list (the last column) into its individual values
    locs = pd.DataFrame(
        df_multi_loc.iloc[j, -1].replace("[", "").replace("]", "").split(","),
        columns=["location"]
    )
    temp_df = temp_df.append(df_multi_loc.iloc[j, :-1]).reset_index(drop=True)
    # duplicate the record once per location value
    for i in range(locs.shape[0]):
        temp_df["location"] = locs.iloc[i].values.astype('float64')
        df_multi_loc_new = df_multi_loc_new.append(temp_df)

# now we can process and append the single-location records and get the total dataset
df_single_loc = df[results.list_elements_count == 1].copy()
new_locs = df_single_loc["location"].apply(lambda x: x.replace("[", "").replace("]", ""))
new_locs = new_locs.replace('', '-99')                      # empty lists become a sentinel value...
new_locs = new_locs.astype('float64').replace(-99, np.nan)  # ...and then NaN
df_single_loc["location"] = new_locs
new_df = df_single_loc.append(df_multi_loc_new).reset_index(drop=True)

We can check again the first records of the pandas DataFrame:

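For example (the original screenshot is omitted):

new_df.head()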

Finally, we can also change the data types of our features. For example, we can convert property_type_id and location features to categorical variables using the following code:

new_df["property_type_id"] = new_df["property_type_id"].astype('category')
new_df["location"] = new_df["location"].astype('category')

Feature Engineering

If Data Cleaning is the most time-consuming part of the pipeline, then Feature Engineering can easily claim second place. The accuracy of the final model usually depends more on the quality of the input data and less on the choice of algorithm, which makes this part crucial. This will not be the case with this pipeline, though, thanks to a library called Featuretools. This framework is an implementation of Deep Feature Synthesis and is mostly used with relational tables. Since this specific problem starts with a single table, Featuretools may not seem like an obvious fit; on the other hand, the library provides a variety of feature transformations that are of real help here, and we save time by not having to construct them manually.

Before starting to generate new features, we can construct the target variable. We are trying to predict the expected price of a user’s search query, so it makes sense to start by replacing the min and max search prices with their mean value. Later we will group by each user to get our target variable:

new_df["target"] = new_df[['min_price','max_price']].mean(axis=1)
new_df.drop(["min_price","max_price"], axis=1, inplace=True)

Using Featuretools is really easy. You just have to define:

  • An entity set.
  • The entities inside that set.
  • The relationships between the entities of the entity set.

This translates to the following Python code:

import featuretools as ft

es = ft.EntitySet(id='searches')

# entity that holds the raw features (one row per search)
es.entity_from_dataframe(entity_id='features',
                         dataframe=new_df.drop(['target'], axis=1).reset_index())

# entity that holds one target value per user
target = new_df[['domain_userid', 'target']].groupby('domain_userid').agg('mean').reset_index()
es.entity_from_dataframe(entity_id='target', dataframe=target)

# one target record corresponds to many feature records (parent -> child)
relationship_target_features = ft.Relationship(es['target']['domain_userid'],
                                               es['features']['domain_userid'])
es.add_relationship(relationship_target_features)

feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='target',
                                       max_depth=1,
                                       verbose=1,
                                       n_jobs=3)

These commands represent the following processes:

  • Creating the entity set.
  • Adding an entity that contains only the features.
  • Creating a new DataFrame that contains only the domain_userid and the target we constructed above. Then, we group by the domain_userid and aggregate by taking the average of each user’s prices.
  • Adding another entity that contains only the target.
  • Defining the relationship between the two tables. One record of the target table corresponds to multiple records of the features table (one-to-many); thus, the target table is the parent and the features table is the child.
  • After adding the relationship to the entity set, we have created two relational tables out of a single table. Neat, right?
  • Finally, we use the dfs() method (Deep Feature Synthesis) to get a new DataFrame that contains the new features and the target variable. Starting with 7 features we now count 30. Some of the features generated are SUM(features.min_bedrooms), COUNT(features), NUM_UNIQUE(features.property_type_id).

We can now move to the fanciest (but actually easiest) part of all: Modeling.

Modeling

The modeling process usually includes:

  1. Missing Values Imputation
  2. Feature Selection
  3. Model Selection
  4. Hyper-Parameter Tuning

There is one issue that we didn’t resolve in the previous sections: missing values. We didn’t drop all of them, so we have to come up with a way to deal with this issue. We also created a lot of new features, some of which will be either correlated or useless. Below we solve all of the above using the scikit-learn Pipeline and GridSearchCV modules. The former provides the functionality for building pipelines that take care of preprocessing and model construction. The latter helps us select the optimal features, models and parameters according to the chosen criteria. As already mentioned, this is the easiest part, especially if you are familiar with how scikit-learn operates.

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer

# X and y come from the Featuretools output, e.g. (assumption):
# y = feature_matrix['target']
# X = feature_matrix.drop('target', axis=1)

pipeline = Pipeline(
    [
        ('impute', SimpleImputer()),
        ('feat_select', SelectFromModel(RandomForestRegressor(n_estimators=100))),
        ('rf', RandomForestRegressor(n_estimators=100))
    ]
)

grid_params = {}
grid_params['impute__strategy'] = ['median', 'most_frequent']
grid_params['feat_select__threshold'] = [0.1, 0.25]

grid_search = GridSearchCV(pipeline, grid_params, cv=5, n_jobs=-1)
grid_search.fit(X, y)

The code above implements the following steps:

  • Create a Pipeline that performs imputation, then selects the most important features according to some estimator (here RandomForestRegressor) and then creates a model using the input features and target variable.
  • Construct the parameters set over which we will search for optimal solution.
  • Fit multiple models with different parameters and use cross validation to get an accurate estimate of the model accuracy for each parameter combination.

We can get the accuracy of the best model and the optimal parameter set with the following commands:

print("Best cross validation score: {}".format(grid_search.best_score_))
   
print("Best parameter set: {}".format(grid_search.best_params_))

Finally, we will store the model, so that it can be used by the Application.

from sklearn.externals import joblib
# store the fitted GridSearchCV object; it exposes predict() through its best estimator
joblib.dump(grid_search, 'grid_search_model.joblib')

Application

In this final section, we will use Flask to create an application that provides predictions using our trained model. Everything up to this point was the model-building process; Jupyter Notebooks are a suitable tool for that process, but not for the one that follows. In order to serve our model, we will create two scripts.

  • The first one (api.py) will run the Flask app.
  • The second one (post.py) will be used to provide new data to our API and receive predictions. Below we present the two scripts:

api.py

from flask import Flask, request, jsonify
from sklearn.externals import joblib
import pandas as pd

app = Flask(__name__)
grid_search_best = joblib.load('grid_search_model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    data = pd.DataFrame(request.json)
    index = data.index
    prediction = grid_search_best.predict(data)
    return pd.Series(prediction, name='target', index=index).to_json()

if __name__ == '__main__':
    app.run(port=8080, debug=True)

post.py

import requests
import json
with open("X_sample.json", "r") as read_file:
    X_sample_loaded = json.load(read_file)
url = 'http://localhost:8080/predict'
r = requests.post(url,json=X_sample_loaded)
print(r.json())

In the api.py script we create the app object and the predict route. This creates an endpoint (‘/predict’) where a user can post data and get back predictions generated by the model, which is loaded back into memory. In the post.py script we read a sample of the features (in JSON format) that we used as input to the pipeline earlier and send them to the Flask app using requests. All we have to do now is:

  • Run the api.py file.
  • Run the post.py file.
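For completeness, here is a hedged sketch of how a payload like X_sample.json could be produced from the training features; this step is not shown in the original post:

# hypothetical: store a few feature rows as JSON so that post.py has something to send
X.sample(5, random_state=0).to_json("X_sample.json")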

Discussion

The goal of this article was to explore a fast but solid approach to the Data Science Pipeline. Here are some comments on the above process:

  1. This is just a prototype: In a real-life situation you might need a prototype to prove the necessity of your model. It is not efficient to spend a lot of time working on something that might never reach production, and this is why the aforementioned tools can be of great value.
  2. Replicating the pipeline for new data: Here, we used a sample of the training dataset to test the API. In general, the data will arrive unprocessed, and we will have to follow the exact same steps to get it into the same shape as the training data before it enters the scikit-learn pipeline. If, for example, you would like to add another processing step, you should either do it before training the scikit-learn pipeline and store the trained preprocessor, or introduce the preprocessor inside the scikit-learn Pipeline.

This was just a short demo of what we can achieve using these Python frameworks. You can now try to build more complex Pipeline objects by adding preprocessors and estimators, create custom transformers and add them to DFS, and so on. Check the Resources section for more details and examples of these tools.

Well that’s it! Happy pipelining!

Resources

  1. Pandas Profiling
  2. Pandas
  3. Featuretools
  4. scikit-learn Pipeline
  5. GridSearchCV
  6. Pandas Medium post
  7. Featuretools Medium post
  8. scikit-learn Medium post
  9. Flask/sklearn Medium post
