
Transformation & Scaling of Numeric Features: Intuition

source link: https://mc.ai/transformation-scaling-of-numeric-features-intuition/

The dataset is published to explain each applicant's capability of repaying a loan. Below is the distribution of the target feature and some of the independent features. The target feature has an imbalanced-data problem: the positive class makes up only 8% of the full data.

Target Feature: Loan Default

Below are some of the important numeric independent features and their histograms. They are picked just to explain the exercise. They come in different ranges and scales, e.g. AMT_ANNUITY is in the millions range but OWN_CAR_AGE goes at most up to 90.

AMT_ANNUITY; AMT_CREDIT; AMT_GOODS_PRICE; AMT_INCOME_TOTAL; DAYS_BIRTH; OWN_CAR_AGE;

Histogram of numeric features: Original Data

Below is the combined boxplot of the original data. It looks highly skewed because features with different scales and skewed distributions are plotted in one space.

Combined Box Plot: Original Data

Transformation

Many statistical algorithms assume normally distributed features. Deep learning and regression-type algorithms also benefit from normally distributed data.

Transformation is required to treat skewed features and make them normally distributed. Right-skewed features can be transformed towards normality with a Square Root, Cube Root, or Logarithm transformation, as sketched below.
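Here is a minimal sketch of these three transformations in Python, assuming the loan data is loaded into a pandas DataFrame named df (the file name application_train.csv is an assumption, not stated in the article):

```python
import numpy as np
import pandas as pd

# Assumed file name for the loan dataset; adjust to your local copy.
df = pd.read_csv("application_train.csv")

# All three transforms compress the right tail; log compresses it the most.
df["AMT_CREDIT_sqrt"] = np.sqrt(df["AMT_CREDIT"])
df["AMT_CREDIT_cbrt"] = np.cbrt(df["AMT_CREDIT"])
df["AMT_CREDIT_log"] = np.log1p(df["AMT_CREDIT"])  # log1p = log(1 + x), safe at zero
```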

As per the above histograms, AMT_ANNUITY, AMT_CREDIT, AMT_GOODS_PRICE, AMT_INCOME_TOTAL, and OWN_CAR_AGE are skewed numeric features, while DAYS_BIRTH is normally distributed.

Skewness can be due to one of two reasons:

  • Presence of extreme abnormal outliers, which may not be important to us.
  • The feature's natural distribution is skewed, and the tail is important to us. This is the situation in most real-life cases.

Introduction of log transformation: As the left graph exhibits, the output of the log function for positive values increases very slowly. So higher values are compressed much more strongly than the lower observations.

Effects of transformation: A skewed numeric feature may become normally distributed after a log transformation. For example, in the graph below, AMT_CREDIT is approximately normally distributed after the log transformation.

Before and After Log Transformation: AMT_CREDIT
  • Effect of log transformation on a skewed target feature (case of regression): the log transformation may bring the skewed feature to normality. If our target feature is normally distributed, the algorithm gives equal importance to all the samples, which also supports homoscedasticity. It is equivalent to treating the imbalanced-data problem of a categorical target feature, like the one in our given dataset. So it is good to have a normally distributed target feature.
  • Effect of log transformation on a skewed independent feature: the log transformation may bring the independent feature to normality, as above where AMT_CREDIT is nearly normally distributed after the log (a quick skewness check is sketched after this list). But it may not improve the relationship between the target and the independent features. So treating skewed independent features may or may not benefit modelling accuracy; it all depends on the original causal relationship between the two.
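A quick way to see this effect is to compare the skewness statistic before and after the log transform. This is only a sketch, assuming the same DataFrame df as above:

```python
import numpy as np

# Skewness close to 0 indicates an approximately symmetric distribution.
before = df["AMT_CREDIT"].skew()
after = np.log1p(df["AMT_CREDIT"]).skew()
print(f"AMT_CREDIT skewness before: {before:.2f}, after log: {after:.2f}")
```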

Scaling

Scaling rescales the data and is used when we want features to be compared on the same scale by our algorithm. When all features are on the same scale, it also helps the algorithm understand their relative relationships better.

If the features are transformed to normality, scaling should be applied after the transformation.

Which algorithms may benefit from scaling? Scaling is helpful for distance-based algorithms and also leads to faster convergence.

Linear & Logistic Regression, KMeans/KNN, Neural Networks, and PCA will benefit from scaling.

Which algorithms may not benefit from scaling? Some algorithms are independent of scaling. Entropy and information-gain based techniques are not sensitive to monotonic transformations.

Tree-based algorithms such as Decision Trees, Random Forests, and Boosted Trees (GBM, LightGBM, XGBoost) may not benefit from scaling.

For Scaling/Standardizing/Normalizing we will follow the sklearn vocabulary, so it is a good choice to use the general word Scaling instead of Standardizing or Normalizing.

The scaler fitted on the training data will be used to transform the test set. Never fit the scaler again on the test data, as sketched below.
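A minimal sketch of this pattern, assuming X_train and X_test are numeric feature matrices from a prior train/test split (hypothetical names):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max; never refit on test
```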

Sklearn primarily has the following four scalers:

1. Minmax scaler

2. Robust scaler

3. Standard Scaler

4. Normalizer.

MinMax scaler should be the first choice for scaling. For each feature, the minimum value of that feature is subtracted from each value, and the result is then divided by the range (original maximum minus minimum) of the same feature. Its default output range is [0, 1]. A sketch is shown below.
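A sketch of MinMaxScaler on the six features, assuming the DataFrame df from the earlier sketch; the equivalent column-wise formula is shown alongside for comparison:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

num_cols = ["AMT_ANNUITY", "AMT_CREDIT", "AMT_GOODS_PRICE",
            "AMT_INCOME_TOTAL", "DAYS_BIRTH", "OWN_CAR_AGE"]

mm = MinMaxScaler()  # default feature_range=(0, 1)
minmax_scaled = pd.DataFrame(mm.fit_transform(df[num_cols]), columns=num_cols)

# Equivalent column-wise formula: (x - min) / (max - min)
manual = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
```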

Below is the histogram of all six features after MinMax scaling. We haven't log-transformed any of the features before scaling. MinMaxScaler hasn't changed the internal distribution of any feature, yet it has brought them all onto the same scale.

Histogram after MinMax Scaling

Below is the combined box plot of all six features after scaling, and all are in the range [0, 1]. The internal spacing between each feature's values has been maintained, and their relative distribution also looks better compared to the original data.

Boxplot after minmax scaler

RobustScaler can be used when your data has large outliers and we want to subdue their effect. But unimportant outliers should be removed in the first place. RobustScaler subtracts the column's median and divides by the interquartile range.
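A sketch of RobustScaler on the same columns, reusing df and the num_cols list defined in the MinMax sketch:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Default behaviour: subtract the median, divide by the 25th-75th percentile range (IQR).
rs = RobustScaler()
robust_scaled = pd.DataFrame(rs.fit_transform(df[num_cols]), columns=num_cols)
```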

The following graph is a histogram of the features after RobustScaler. Though the histograms look similar to the original data distributions, the respective internal distances are not maintained as in the original data.

Histogram after Robust Scaler

Also, as seen in the box plot below, the range is no longer [0, 1], and the relative spacing between each feature's values is distorted. Using RobustScaler in this case will pass wrong information about your underlying data to the modelling process.

Boxplot after RobustScaler

StandardScaler rescales each column to have a mean of 0 and a standard deviation of 1. It standardizes a feature by subtracting the mean and dividing by the standard deviation. If the original distribution is not normal, it may distort the relative spacing among the features.
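A sketch of StandardScaler on the same columns, again reusing df and num_cols from the earlier sketches:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()  # subtract the mean, divide by the standard deviation, per column
std_scaled = pd.DataFrame(ss.fit_transform(df[num_cols]), columns=num_cols)

# Each column now has mean ~0 and standard deviation ~1.
print(std_scaled.mean().round(2))
print(std_scaled.std().round(2))
```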

Below is the histogram of the features after applying StandardScaler. The distributions look similar to the original data distributions, but they aren't the same: the respective internal distances between observations change during standard scaling.

The following is a combined boxplot of the features after standard scaling. As expected, it distorts the relative distances between the feature values, which looked better after MinMax scaling.

Box Plot After Standard Scaler

Normalizer is applied to rows, not columns, so sklearn users shouldn't get confused and generally should not use Normalizer for feature scaling. Some use cases for Normalizer involve comparing multiple entities over the same time series, e.g. stock movements of multiple stocks in a given period.
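The row-wise behaviour can be seen in a tiny sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 2.0, 2.0],
              [10.0, 0.0, 0.0]])

# Each row (sample) is rescaled to unit L2 norm; columns are not compared at all.
print(Normalizer(norm="l2").fit_transform(X))
# [[0.333... 0.667... 0.667...]
#  [1.       0.       0.      ]]
```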

Conclusion

  • A skewed target feature should be treated for normality before modelling, especially when the outliers are also important
  • The effect of treating a skewed independent feature should be understood during the analysis
  • MinMax scaler should be the first choice for scaling
  • Experiments and observations can help us further decide on the right approach

As always, I welcome your thoughts & feedback. I am also reachable on LinkedIn.

