Information Gain based Feature Selection in Python for Machine Learning Models

source link: https://pkghosh.wordpress.com/2022/11/30/information-gain-based-feature-selection-in-python-for-machine-learning-models/

There are various techniques for feature selection in classification and regression ML models. One category of techniques is founded on information theory, and the technique under discussion here, based on information gain, belongs to that category. Information gain (IG) is the reduction in entropy, or surprise, after the data is split at a certain value of a feature variable. It is the criterion used for training Decision Tree models. We will be using the same technique, but for feature selection.

The implementation is available in my Python package matumizi. Among other features, the package provides a data exploration class with more than 100 data exploration functions, including some information theory based feature selection algorithms. The code is available in the GitHub repository whakapai.

Information Gain

As mentioned earlier, information gain is the reduction in entropy or randomness after the data is split at a certain value of a feature variable. For numerical features the split is a value that partitions the data into two sets, one with feature values less than the split value and the other with feature values greater than or equal to it.
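To make this concrete, here is a minimal sketch, separate from the matumizi implementation, of computing the information gain for a single numerical split against a binary classification target.

    import numpy as np

    def entropy(labels):
        # Shannon entropy of the class label distribution
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def info_gain_numeric(feature, labels, threshold):
        # entropy before the split minus the weighted entropy of the two partitions
        feature, labels = np.asarray(feature), np.asarray(labels)
        left, right = labels[feature < threshold], labels[feature >= threshold]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        wl, wr = len(left) / len(labels), len(right) / len(labels)
        return entropy(labels) - (wl * entropy(left) + wr * entropy(right))

    # toy example: income feature against a binary approval label
    income = [35, 42, 55, 61, 78, 90]
    approved = [0, 0, 0, 1, 1, 1]
    print(info_gain_numeric(income, approved, 60))    # 1.0, the threshold separates the classes perfectly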

For categorical features the split is defined as a subset of the unique values of the categorical feature. The partitioning generates two data sets, one with categorical feature values belonging to the splitting subset and the other with values not belonging to it.

Splitting categorical features is more complex, as it requires generating the power set of the set of unique values. As the cardinality of the categorical variable increases, the power set grows exponentially, which can make the computation expensive. The power set size is 2^c, where c is the cardinality.
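The snippet below, an illustration rather than matumizi code, shows how the candidate splits for a categorical feature come from its power set and how the count grows as 2^c with the cardinality c.

    from itertools import chain, combinations

    def power_set(values):
        # all subsets of the unique categorical values, from the empty set to the full set
        values = list(values)
        return list(chain.from_iterable(combinations(values, r) for r in range(len(values) + 1)))

    marital = ["single", "married", "divorced"]
    splits = power_set(marital)
    print(len(splits))    # 2^3 = 8 candidate subsets

In practice only the proper, non empty subsets define useful splits, since the empty set and the full set leave one partition empty. With splits defined this way for both feature types, the feature selection algorithm, in pseudocode, is as follows.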

input:
  data for set of feature variables
  data for target variable
  no of splits
  no of desired features (n)
output:
  top n features
  
for each feature
  generate multiple splits
  for each split
    calculate information gain
    track max information gain
  set max information gain as the information gain for the feature
sort features by information gain in descending order
return top n features
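Below is a compact Python sketch of the pseudocode above, again an illustration rather than the matumizi implementation, restricted to numerical features for brevity. It evaluates a fixed number of evenly spaced split values per feature and ranks features by their best information gain.

    import numpy as np

    def entropy(labels):
        # same entropy helper as in the earlier sketch
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def max_info_gain(feature, labels, nsplit):
        # best information gain over nsplit evenly spaced split values of the feature
        feature, labels = np.asarray(feature), np.asarray(labels)
        base = entropy(labels)
        best = 0.0
        for t in np.linspace(feature.min(), feature.max(), nsplit + 2)[1:-1]:
            left, right = labels[feature < t], labels[feature >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            best = max(best, base - cond)
        return best

    def top_features_by_info_gain(features, labels, nsplit, nfeatures):
        # features: dict mapping feature name to a list of numerical values
        gains = {name: max_info_gain(vals, labels, nsplit) for name, vals in features.items()}
        return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:nfeatures]

    features = {"income": [35, 42, 55, 61, 78, 90], "age": [25, 52, 31, 47, 29, 60]}
    print(top_features_by_info_gain(features, [0, 0, 0, 1, 1, 1], nsplit=5, nfeatures=1))
    # expect income to rank above age on this toy data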

As alluded to earlier, for categorical variables it is necessary to generate the power set of the unique feature values. A power set is the set of all possible subsets of a set. Each subset in the power set of a categorical feature defines a split.

Feature Selection by Information Gain

The function shown below is implemented in the data exploration module daexp.py in matumizi. It can handle both numerical and categorical features.

	def getInfoGainFeatures(self, fdst, tdst, nfeatures, nsplit, nbins=20):
		"""
		get top n features based on information gain or entropy loss
		
		Parameters
			fdst: list of pair of data set name or list or numpy array and data type
			tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
			nsplit : num of splits
			nfeatures : desired no of features
			nbins : no of bins for numerical data
		"""	

As an example, we have used loan approval data for a classification task. This post contains more information about the data and also discusses some other information theory based feature selection techniques. The driver code for running the example is available in GitHub. Here is some sample output; the number next to each feature name is the information gain value.

{'selFeatures': [('debt', 0.49150853081120555), ('income', 0.49131832602027337), ('crscore', 0.4785632084171897)]}

Please refer to the tutorial for instructions on data generation and on running any of the feature selection techniques. Currently there are 5 information theory based feature selection techniques implemented.

Wrapping Up

Information gain is a powerful option among the various information theory based feature selection techniques. In this post we have seen how to apply it using the Python package matumizi.

