Data Discretization Using Sklearn In Machine Learning

Hello folks, hope this story finds you in good health! As we know, some clustering and classification algorithms (for example, rule-based algorithms) prefer working on ordinal data rather than data measured on a numerical scale.

Yes, we often hear that most ML algorithms need numerical input, and that is true too. It depends on the use case you are working on. This is where data discretization comes in. In layman’s terms, it is the process of grouping continuous data into discrete buckets.

Data Discretization – In Detail

Data discretization is the process of converting continuous numerical data into discrete bins.

This process limits the data to a fixed number of states rather than keeping it in continuous form. It works best when we have a lot of data spread over a wide range of values, where classifying or clustering without discretization becomes difficult.
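As a minimal sketch of the idea, here is what "limiting data to some states" can look like with plain NumPy. The ages and cut points are made up for illustration and have nothing to do with the loan data used later; `np.digitize` simply returns the index of the bucket each value falls into.

```python
import numpy as np

# Hypothetical continuous values and hand-picked cut points (assumptions for illustration).
ages = np.array([3, 17, 25, 40, 64, 80])
cut_points = [18, 40, 65]  # buckets: below 18, 18 to 39, 40 to 64, 65 and above

# For each value, np.digitize returns the index of the bucket it belongs to.
buckets = np.digitize(ages, cut_points)
print(buckets)  # [0 0 1 2 2 3]
```

Six different ages collapse into four states, which is the whole point of binning.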

Discretization is necessary because some rule-based algorithms, for example in clustering and classification, tend to work better on categorical data than on data on a numerical scale.
You may be reading this word for the first time, but don’t worry. It is also called data binning, and I am sure you have heard of that a hundred times 😛
There are 3 types of data discretization methods:

1. Quantile Transformation:

In this transformation, each bin has an equal number of values based on the percentiles.

2. Uniform Transformation:

In this transformation, each bin has the same width, and together the bins span the range of values in the attribute.

3. Kmeans Transformation:

In this transformation, k-means clusters are computed on the values of each attribute, and every value is assigned to the bin of its nearest cluster center.
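To get a feel for how the three methods differ, the sketch below runs all of them on the same column and counts how many values land in each bin. The data is a synthetic right-skewed sample, and `bin_counts` is a hypothetical helper written for this illustration, not part of sklearn.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic right-skewed column (an assumption; any numeric column would do).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1000).reshape(-1, 1)

def bin_counts(values, strategy, n_bins=5):
    """Discretize one column and return how many values land in each bin."""
    kbd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
    labels = kbd.fit_transform(values).ravel().astype(int)
    return np.bincount(labels, minlength=n_bins)

counts = {s: bin_counts(x, s) for s in ("quantile", "uniform", "kmeans")}
for strategy, c in counts.items():
    print(f"{strategy:>8}: {c}")
```

On skewed data like this, the quantile counts should come out nearly flat, while the uniform bins pile most values into the first bin. Depending on your sklearn version, the quantile strategy may print a FutureWarning about its default settings.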

Well, now let’s import the sklearn library and our data to see how to perform these data binning methods. Let’s roll!!!

Data For Our Implementation

For the data transformation, we need data, right? So, we are going to work on loan data, which is a pretty big dataset.

#data

import pandas as pd

df = pd.read_csv('loan_data.csv')
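KBinsDiscretizer expects numeric input and, as far as I know, raises an error when it meets text columns or missing values. The columns of loan_data.csv are not shown here, so the sketch below uses a small stand-in frame with hypothetical column names to show one way of preparing the data first.

```python
import numpy as np
import pandas as pd

# Stand-in for loan_data.csv; the column names and values are hypothetical.
df_demo = pd.DataFrame({
    "loan_amount": [5000.0, 12000.0, np.nan, 7500.0],
    "interest_rate": [7.5, 11.2, 9.9, 13.4],
    "purpose": ["car", "home", "car", "education"],
})

# Keep only numeric columns, then drop rows that still contain missing values.
numeric_df = df_demo.select_dtypes(include="number").dropna()
print(numeric_df.shape)  # (3, 2)
```

Dropping rows is only one option. Depending on the dataset, imputing the missing values may be the better choice.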

1. Quantile Transformation

The quantile transformation will bin the data records of each variable into k groups. Here, the number of records or values in each group will be the same or equal.

Let’s see how we can do this in Python using the scikit-learn package. The class we will be using from sklearn is KBinsDiscretizer.

#quantile transformation

#Import the class

from sklearn.preprocessing import KBinsDiscretizer

#Discretize the data (df should hold only numeric columns without missing values)

transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'quantile')

#fit transform

data = transf.fit_transform(df)

#Array to dataframe

from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data

data1.head(5)

Here –

We have imported the KBinsDiscretizer class from sklearn.
Discretized the data into 10 bins, grouped by the quantile method.
Then we fitted the transformer to the data and transformed it in one step.
The result is an array, which we convert to a dataframe using the Pandas DataFrame object as shown.
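Converting the array this way gives a dataframe whose columns are just numbered 0, 1, 2 and so on. If you would rather keep the original names, one option is to pass the columns and index of the input frame along. This is a sketch on a made-up two-column frame, since the real loan columns are not shown in this article.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical numeric frame standing in for the loan data.
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "loan_amount": rng.uniform(1_000, 50_000, size=200),
    "interest_rate": rng.uniform(5, 20, size=200),
})

transf = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
binned = pd.DataFrame(
    transf.fit_transform(df_demo),
    columns=df_demo.columns,  # reuse the original column names
    index=df_demo.index,      # and the original row labels
)
print(binned.head())
```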

     0    1    2    3    4
0  8.0  9.0  0.0  1.0  1.0
1  8.0  6.0  0.0  4.0  0.0
2  8.0  8.0  9.0  4.0  0.0
3  8.0  8.0  9.0  2.0  0.0
4  8.0  9.0  9.0  7.0  2.0

But wait! It’s cool to visualize this to get a better idea, right?

#visualize the data

import matplotlib.pyplot as plt

data1.hist()

[Figure: histograms of the discretized columns 0 to 4, one panel per column]

Inference –

Here, you can observe that all 10 bins or groups have a roughly equal number of values. That’s how quantile transformation works.
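The bin numbers on their own do not tell you which values ended up where. After fitting, the transformer stores the cut points in its `bin_edges_` attribute, so a label can be mapped back to a value range. Here is a small sketch on a synthetic column (an assumption, not the loan data) that prints each bin's range together with its count.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic column (an assumption, not the loan data).
rng = np.random.default_rng(2)
x = rng.normal(loc=50, scale=10, size=500).reshape(-1, 1)

kbd = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
labels = kbd.fit_transform(x).ravel().astype(int)

edges = kbd.bin_edges_[0]  # cut points for the first (and only) column
for k in range(len(edges) - 1):
    n_in_bin = int((labels == k).sum())
    print(f"bin {k}: {edges[k]:.2f} to {edges[k + 1]:.2f} -> {n_in_bin} values")
```

When a column has many repeated values, the counts may not come out equal, because tied values cannot be split across bins.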

2. Uniform Transformation

In the uniform transformation, each bin has the same width across the range of values in the variable. Let’s see how it works.

#uniform transformation

#Import the class

from sklearn.preprocessing import KBinsDiscretizer

#Discretize the data

transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'uniform')

#fit transform

data = transf.fit_transform(df)

#Array to dataframe

from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data

data1.head(5)

Here –

We have updated the strategy to “uniform”. This results in bins of equal width over the range of values in each variable.

Let’s visualize the data to interpret it better.

#visualize the data

import matplotlib.pyplot as plt

data1.hist()

[Figure: histograms of the discretized columns 0 to 4, one panel per column]

Inference –

Here, you can see that rather than having an equal number of values in each bin, the uniform transformation gives every bin the same width.
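The equal-width claim is easy to check from the fitted transformer. The sketch below, again on synthetic data rather than the loan file, reads the cut points from `bin_edges_` and takes the differences between neighbouring edges.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic skewed column (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=500).reshape(-1, 1)

kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
kbd.fit(x)

edges = kbd.bin_edges_[0]
widths = np.diff(edges)  # width of each bin
print("edges :", np.round(edges, 3))
print("widths:", np.round(widths, 3))
```

Every width should equal (max - min) / n_bins for the column, no matter how the values are distributed inside that range.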

3. KMeans Transformation

KMeans works quite differently from the previous transformations. Here, k-means tries to fit the values of each variable into the specified number of clusters. Let’s see how it works.

#Kmeans transformation

#Import the class

from sklearn.preprocessing import KBinsDiscretizer

#Discretize the data

transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'kmeans')

#fit transform

data = transf.fit_transform(df)

#Array to dataframe

from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data

data1.head(5)

Here –

We have again updated the strategy parameter, this time to “kmeans”. With this, each data value falls into one of the clusters.

Let’s visualize the data.

#visualize the data

import matplotlib.pyplot as plt

data1.hist()

[Figure: histograms of the discretized columns 0 to 4, one panel per column]

Inference –

You can observe that we got 3 clusters, and all the values were fitted into those clusters.
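The clustering behaviour is easiest to see on data with obvious groups. The sketch below uses nine hand-made values (hypothetical, not from the loan data) that sit in three tight clumps and asks for 3 bins with the kmeans strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Three hand-made, well-separated groups of values (hypothetical data).
x = np.array([0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 9.9, 10.0, 10.1]).reshape(-1, 1)

kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
labels = kbd.fit_transform(x).ravel().astype(int)

print(labels)             # [0 0 0 1 1 1 2 2 2]
print(kbd.bin_edges_[0])  # inner edges sit between neighbouring cluster centers
```

Each clump gets its own bin, and the inner cut points land halfway between neighbouring cluster centers rather than at equal widths or equal counts.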

Wrapping Up – Data Discretization

Data discretization is an essential step in data preprocessing, because some rule-based algorithms prefer dealing with qualitative data or bins. I hope you are now clear on these 3 methods for data binning. Make sure to feed the data to your model in the best form to get the best results.

That’s all for now. Happy Python!!!

Further reading: sklearn.preprocessing
