Feature Selection by Using the Xverse Package

Davis David (@davisdavid)

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in. One major reason it matters is that machine learning follows the rule of "garbage in, garbage out", so you need to be very deliberate about the features being fed to the model.

Having irrelevant features in your data can increase the computational cost of modeling, decrease model accuracy, and cause your model to learn from irrelevant signals. This means you need to select only the important features to present during model training.

Top reasons to apply feature selection:

  • It enables the machine learning algorithm to train faster.
  • It reduces the complexity of a model and makes it easier to interpret.
  • It improves the accuracy of a model if the right subset is chosen.
  • It reduces overfitting.

“I prepared a model by selecting all the features and I got an accuracy of around 65%, which is not great for a predictive model. After doing some feature selection and feature engineering, without making any logical changes in my model code, my accuracy jumped to 81%, which is quite impressive.” - Raheel Shaikh

Feature selection methods are intended to reduce the number of features to those believed to be most useful to a model in predicting the target feature. There are many feature selection methods available that can help you select important features.

You can read more about feature selection methods in the scikit-learn library documentation.

One challenge with these methods is deciding which one(s) to apply to your dataset. Each method has its own way of identifying important features, and they do not always agree: a feature selected as important by one method may not be selected by another.

The Xverse package can help you solve this problem.

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” — Pedro Domingos

What is Xverse?

Xverse stands for X Universe. It is a Python package for machine learning that assists data scientists with feature transformation and feature selection. Xverse was created by Sundar Krishnan.

How does it work?

Xverse applies a variety of techniques to select features. When an algorithm picks a feature, that feature receives a vote. In the end, Xverse tallies the total votes for each feature and then picks the best ones based on those votes. This way, we end up picking the best variables with minimal effort in the feature selection process; a small sketch of the idea follows the list below.

Xverse uses the following methods to select important features:

  • Information Value using Weight of Evidence.
  • Variable Importance using Random Forest.
  • Recursive Feature Elimination.
  • Variable Importance using Extra Trees classifier.
  • Chi-Square best variables.
  • L1-based feature selection.
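
To make the voting idea concrete, here is a minimal sketch of how such a tally could work. This is my own illustration, not Xverse's internal code: the synthetic dataset, the three scikit-learn selectors, and the threshold of 2 votes are all assumptions chosen for the example.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary-classification data; chi2 requires non-negative inputs
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo = MinMaxScaler().fit_transform(X_demo)
names = [f"f{i}" for i in range(X_demo.shape[1])]
votes = pd.Series(0, index=names)

# Each selector "votes" for the features it keeps
votes[SelectKBest(chi2, k=3).fit(X_demo, y_demo).get_support()] += 1
votes[RFE(LogisticRegression(), n_features_to_select=3).fit(X_demo, y_demo).get_support()] += 1
importances = RandomForestClassifier(random_state=0).fit(X_demo, y_demo).feature_importances_
votes[importances > np.median(importances)] += 1

# Keep the features that received at least 2 of the 3 votes
print(votes[votes >= 2].sort_values(ascending=False))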

Installation

The package requires NumPy, pandas, scikit-learn, SciPy, and statsmodels. In addition, the package is tested on Python version 3.5 and above.

Run the following command to install Xverse.

pip install xverse

I will use the Loan dataset to find the best features that can help achieve good accuracy when predicting whether a customer qualifies for a loan. You can download the dataset here.

Import the important packages for this problem.

import pandas as pd
import numpy as np                     
from xverse.ensemble import VotingSelector
from sklearn.preprocessing import StandardScaler, MinMaxScaler  
import warnings                        # To ignore any warnings
warnings.filterwarnings("ignore")

Load the Loan dataset.

data = pd.read_csv("data/loans_data.csv")

data.columns
Loan_ID
Gender
Married
Dependents
Education
Self_Employed
ApplicantIncome
CoapplicantIncome
LoanAmount
Loan_Amount_Term
Credit_History
Property_Area
Loan_Status

We have 12 independent features and a target (Loan_Status). You can read the description of each feature here.

I have created a simple Python function to handle missing data and feature engineering.

def preprocessing(data):

	# replace categorical labels with numerical values
	data['Dependents'].replace('3+', 3, inplace=True)
	data['Loan_Status'].replace('N', 0, inplace=True)
	data['Loan_Status'].replace('Y', 1, inplace=True)

	# handle missing data 
	data['Gender'].fillna(data['Gender'].mode()[0], inplace=True)
	data['Married'].fillna(data['Married'].mode()[0], inplace=True)
	data['Dependents'].fillna(data['Dependents'].mode()[0], inplace=True)
	data['Self_Employed'].fillna(data['Self_Employed'].mode()[0], inplace=True)
	data['Credit_History'].fillna(data['Credit_History'].mode()[0], inplace=True)
	data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].mode()[0], inplace=True)
	data['LoanAmount'].fillna(data['LoanAmount'].median(), inplace=True)

	# drop ID column
	data = data.drop('Loan_ID',axis=1)

	# scale numeric columns to the [0, 1] range
	for col in ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]:
		data[col] = MinMaxScaler().fit_transform(data[col].values.reshape(-1, 1))


	return data 

Let’s preprocess the loan dataset.

data = preprocessing(data)

Split into independent features and target.

X = data.drop('Loan_Status',axis = 1)
y = data.Loan_Status

Now it is time to call VotingSelector from Xverse and fit it to find the best features using the voting approach.

#train to find the best features

clf = VotingSelector(minimum_votes=2)
clf.fit(X, y)

I have set minimum_votes=2, which means a feature must receive at least 2 votes in total from the six feature selection methods in Xverse to be selected. Features with fewer than 2 votes will be discarded.
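
You can tighten or loosen the selection simply by changing this threshold. For example, a stricter pass could require at least 4 of the 6 methods to agree (the value 4 here is only an illustration, not a recommendation):

# a stricter selector: a feature needs votes from at least 4 of the 6 methods
strict = VotingSelector(minimum_votes=4)
strict.fit(X, y)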

After training, we can inspect the feature importance computed by each feature selection method used during training.

#show important features 

clf.feature_importances_

The output shows all the features and their importance values under each method.

Now let’s observe the votes from these feature selection methods.

# votes 
clf.feature_votes_

The output shows each variable name, the vote it received from each feature selection method, and, at the end, the total votes for each feature. Features are listed from the most votes to the fewest.

You can see that Credit_History has a total of 6 votes, which means it is a very important feature for this loan problem. Both the Gender and Self_Employed features have 0 votes, which means we can drop them because they contribute very little to predicting whether a customer qualifies for a loan.
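
To see how the minimum_votes=2 threshold plays out on totals like these, here is a tiny sketch with a hypothetical tally. Only the Credit_History, Gender, and Self_Employed totals come from the output above; the LoanAmount and Married counts are made up for illustration.

import pandas as pd

# hypothetical vote totals (Credit_History, Gender and Self_Employed match
# the output above; the other counts are invented for this example)
total_votes = pd.Series({"Credit_History": 6, "LoanAmount": 3,
                         "Married": 2, "Gender": 0, "Self_Employed": 0})
selected = total_votes[total_votes >= 2].index.tolist()
print(selected)  # ['Credit_History', 'LoanAmount', 'Married']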

Now we can transform our data to keep only the selected important features.

# transform your data into important features 

X = clf.transform(X)
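
As a quick follow-up, you can fit a model on the selected features. This is only a sketch under assumptions: any categorical columns that survived selection are one-hot encoded first, and the choice of logistic regression with 5-fold cross-validation is mine, not part of the original workflow.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# one-hot encode any remaining categorical columns (a no-op if all numeric)
X_model = pd.get_dummies(X)
model = LogisticRegression(max_iter=1000)
print("Mean CV accuracy: %.3f" % cross_val_score(model, X_model, y, cv=5).mean())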

Conclusion

Xverse is under active development. Currently, the xverse package handles only binary targets.

The code for this post is available on Github.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Feel free to leave a comment, too. Till then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.

Previously published here.
