
Boosting: Is It Always The Best Option?


Gradient boosting has become quite a popular technique in the area of machine learning. Given its reputation for achieving potentially higher accuracy than other models, it has become particularly popular as a “go-to” model for Kaggle competitions.

However, use of gradient boosting raises two questions:

  1. Does this technique really outperform others consistently irrespective of the data being examined?
  2. Even if this is the case, are gradient boosting techniques always a wise choice?

To answer these questions, I decided to compare gradient boosting techniques to logistic regression by attempting to classify patients as diabetic or non-diabetic. The dataset is available at the UCI Machine Learning Repository.

Essentially, the dataset provides us with several features that are used to predict the outcome variable (diabetes = 1, no diabetes = 0).
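To set the stage, the data can be loaded with pandas and split into a feature matrix and outcome vector. This is a minimal sketch, assuming the usual CSV layout of the Pima Indians Diabetes dataset with eight predictor columns and an "Outcome" column; the filename is an assumption:

import pandas as pd

# Assumed filename and column layout of the Pima Indians Diabetes CSV.
df = pd.read_csv("diabetes.csv")
x = df.drop(columns=["Outcome"]).values  # eight predictor columns
y = df["Outcome"].values                 # 1 = diabetes, 0 = no diabetes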

Firstly, feature extraction with an ExtraTreesClassifier was performed to identify the most important features for predicting the outcome variable.

Then, the following models were run:

  1. Logistic Regression
  2. Gradient Boosting Classifier
  3. LightGBM Classifier
  4. XGBoost Classifier
  5. AdaBoost Classifier

Feature Extraction

Feature extraction is used here to determine the most important features influencing the outcome variable, i.e. which features have the strongest relationship with diabetes incidence.

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
>>> print(model.feature_importances_)
[0.10696279 0.25816011 0.09378777 0.09258844 0.06920807 0.11396286
 0.12328806 0.1420419 ]

From the feature extraction, features 0 (Pregnancies), 1 (Glucose), 5 (BMI), 6 (DiabetesPedigreeFunction), and 7 (Age) showed the highest feature-importance scores, and these are the features included in the models to predict the outcome variable.

Feature                      Score
Pregnancies                  0.10696279
Glucose                      0.25816011
Blood Pressure               0.09378777
Skin Thickness               0.09258844
Insulin                      0.06920807
BMI                          0.11396286
Diabetes Pedigree Function   0.12328806
Age                          0.1420419
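For reference, the same scores can be paired with the column names and sorted in code. This is a small sketch, assuming the feature names below match the column order of x:

# Pair each feature name with its importance score and sort descending.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")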

These five features were then combined into xnew with a NumPy column stack, and the data was partitioned into training and validation sets with train_test_split.

import numpy as np

# Columns 0, 1, 5, 6 and 7: Pregnancies, Glucose, BMI,
# DiabetesPedigreeFunction and Age.
x0 = x[:, 0]
x1 = x[:, 1]
x5 = x[:, 5]
x6 = x[:, 6]
x7 = x[:, 7]
xnew = np.column_stack((x0, x1, x5, x6, x7))
xnew

from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val=train_test_split(xnew,y,random_state=0)

Logistic Regression vs. Boosting Classifiers

Having selected the relevant features and partitioned the data, a logistic regression was run alongside several boosting classifiers.

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression().fit(x_train, y_train)
logreg

# GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(x_train, y_train)

# LightGBM Classifier
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(learning_rate=0.001,
                               num_leaves=65,
                               n_estimators=100)
lgb_model.fit(x_train, y_train)
 
# XGBoost Classifier
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100)
xgb_model.fit(x_train, y_train)

# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    algorithm="SAMME.R", learning_rate=0.001)
ada_clf.fit(x_train, y_train)

As can be observed, n_estimators was set to 100 and the learning rate to 0.001 for each boosting model. Machine Learning Mastery offers more detail on how to implement gradient boosting techniques; in this case the learning rate (or shrinkage parameter) is set below 0.1 for better generalization error, while n_estimators (the number of trees) is set to 100, in line with the range of 100 to 500 recommended in the “Greedy Function Approximation: A Gradient Boosting Machine” paper.
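Rather than fixing these two hyperparameters up front, they could also be tuned with cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid is an illustrative assumption, not part of the original experiment:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; wider or finer grids may be appropriate in practice.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [100, 300, 500],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)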

When these models were run, the following training and validation set scores were obtained:

<strong>>>> print("Accuracy on training set: {:.3f}".format(logreg.score(x_train,y_train)))</strong>
Accuracy on training set: 0.766
<strong>>>> print("Accuracy on validation set: {:.3f}".format(logreg.score(x_val,y_val)))</strong>
Accuracy on validation set: 0.797

<strong>>>> print("Accuracy on training set: {:.3f}".format(gbrt.score(x_train, y_train)))</strong>
Accuracy on training set: 0.896
<strong>>>> print("Accuracy on validation set: {:.3f}".format(gbrt.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.792

<strong>>>> print("Accuracy on training set: {:.3f}".format(lgb_model.score(x_train, y_train)))</strong>
Accuracy on training set: 0.642
<strong>>>> print("Accuracy on validation set: {:.3f}".format(lgb_model.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.677

<strong>>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x_train, y_train)))</strong>
Accuracy on training set: 0.748
<strong>>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.750

<strong>>>> print("Accuracy on training set: {:.3f}".format(ada_clf.score(x_train, y_train)))</strong>
Accuracy on training set: 0.748
<strong>>>> print("Accuracy on validation set: {:.3f}".format(ada_clf.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.750
Model Training Accuracy Validation Accuracy Logistic Regression 0.766 0.797 Gradient Boosting Classifier 0.896 0.792 LightGBM Classifier 0.642 0.677 XGBoost Classifier 0.748 0.750 AdaBoost Classifier 0.748 0.750

From looking at the above results, two things are evident:

  1. Only the GradientBoostingClassifier yields a validation accuracy comparable to the logistic regression (0.792 vs. 0.797); all the other boosting models show lower validation accuracy.
  2. Moreover, the logistic regression's training accuracy is slightly lower than its validation accuracy, which suggests that overfitting is less of an issue for the logistic regression than for the boosting models.

Conclusion

Boosting models have become something of a "black box" and are increasingly relied upon for higher accuracy. However, these models do not necessarily give the best accuracy in all cases (as we have seen here), and the risk of overfitting must also be considered: here the GradientBoostingClassifier's training accuracy (0.896) was noticeably higher than its validation accuracy (0.792), which indicates overfitting.

Boosting works on the premise of combining many weak models (e.g. shallow decision trees) in order to increase accuracy, which is why boosting models are often referred to as ensemble models.
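As a rough illustration of that premise (a toy least-squares version, not the exact algorithm any of the libraries above implement), a boosting loop can be written by hand: each shallow tree is fit to the residuals of the current ensemble and added in with a small learning rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predictions(x_train, y_train, x_test, n_trees=100, lr=0.1):
    """Toy least-squares boosting: stack shallow trees fit to residuals."""
    pred_train = np.zeros(len(y_train), dtype=float)
    pred_test = np.zeros(len(x_test), dtype=float)
    for _ in range(n_trees):
        residuals = y_train - pred_train           # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(x_train, residuals)
        pred_train += lr * stump.predict(x_train)  # each weak tree nudges the fit
        pred_test += lr * stump.predict(x_test)
    return pred_test

# Threshold the boosted scores at 0.5 to obtain 0/1 class labels.
y_pred = (boosted_predictions(x_train, y_train, x_val) >= 0.5).astype(int)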

While boosting can be advantageous depending on the data one is working with, boosting models do carry an overfitting risk and should not simply be relied upon by default without considering the data in question and whether other models could prove more suitable.

