
Boosting: Is It Always The Best Option?


Gradient boosting has become quite a popular technique in the area of machine learning. Given its reputation for achieving potentially higher accuracy than other models, it has become particularly popular as a “go-to” model for Kaggle competitions.

However, use of gradient boosting raises two questions:

  1. Does this technique really outperform others consistently irrespective of the data being examined?
  2. Even if this is the case, are gradient boosting techniques always a wise choice?

To answer these questions, I decided to compare gradient boosting techniques to logistic regression by attempting to classify patients as diabetic or non-diabetic. The dataset is available at the UCI Machine Learning Repository.

Essentially, the dataset provides us with several features that are used to predict the outcome variable (diabetes = 1, no diabetes = 0).
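To set the stage, the data can be loaded with pandas and split into a feature matrix and outcome vector. This is a minimal sketch, assuming the usual CSV layout of the Pima Indians Diabetes dataset with eight predictor columns and an "Outcome" column; the filename is an assumption:

import pandas as pd

# Assumed filename and column layout of the Pima Indians Diabetes CSV.
df = pd.read_csv("diabetes.csv")
x = df.drop(columns=["Outcome"]).values  # eight predictor columns
y = df["Outcome"].values                 # 1 = diabetes, 0 = no diabetes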

Firstly, feature extraction with an ExtraTreesClassifier was performed to identify the most important features for predicting the outcome variable.

Then, the following models were run:

  1. Logistic Regression
  2. Gradient Boosting Classifier
  3. LightGBM Classifier
  4. XGBoost Classifier
  5. AdaBoost Classifier

Feature Extraction

Feature extraction is used here to determine the most important features influencing the outcome variable, i.e. which features have the strongest relationship with diabetes incidence.

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
>>> print(model.feature_importances_)
[0.10696279 0.25816011 0.09378777 0.09258844 0.06920807 0.11396286
 0.12328806 0.1420419 ]

From the feature extraction, features 0 (Pregnancies), 1 (Glucose), 5 (BMI), 6 (DiabetesPedigreeFunction), and 7 (Age) showed the highest feature-importance scores, and these are the features included in the models to predict the outcome variable.

Feature                      Score
Pregnancies                  0.10696279
Glucose                      0.25816011
Blood Pressure               0.09378777
Skin Thickness               0.09258844
Insulin                      0.06920807
BMI                          0.11396286
Diabetes Pedigree Function   0.12328806
Age                          0.1420419
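For reference, the same scores can be paired with the column names and sorted in code. This is a small sketch, assuming the feature names below match the column order of x:

# Pair each feature name with its importance score and sort descending.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")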

These five features were then combined into xnew with a NumPy column stack, and the data was partitioned into training and validation sets with train_test_split.

import numpy as np

# Columns 0, 1, 5, 6 and 7: Pregnancies, Glucose, BMI,
# DiabetesPedigreeFunction and Age.
x0 = x[:, 0]
x1 = x[:, 1]
x5 = x[:, 5]
x6 = x[:, 6]
x7 = x[:, 7]
xnew = np.column_stack((x0, x1, x5, x6, x7))
xnew

from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val=train_test_split(xnew,y,random_state=0)

Logistic Regression vs. Boosting Classifiers

Having selected the relevant features and partitioned the data, a logistic regression was run alongside several boosting classifiers.

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression().fit(x_train, y_train)
logreg

# GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(x_train, y_train)

# LightGBM Classifier
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(learning_rate=0.001,
                               num_leaves=65,
                               n_estimators=100)
lgb_model.fit(x_train, y_train)
 
# XGBoost Classifier
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100)
xgb_model.fit(x_train, y_train)

# AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    algorithm="SAMME.R", learning_rate=0.001)
ada_clf.fit(x_train, y_train)

As can be observed, n_estimators was set to 100 and the learning rate to 0.001 for each boosting model. Machine Learning Mastery offers more detail on how to implement gradient boosting techniques; in this case the learning rate (or shrinkage parameter) is set below 0.1 for better generalization error, while n_estimators (the number of trees) is set to 100, in line with the range of 100 to 500 recommended in the “Greedy Function Approximation: A Gradient Boosting Machine” paper.
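Rather than fixing these two hyperparameters up front, they could also be tuned with cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid is an illustrative assumption, not part of the original experiment:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; wider or finer grids may be appropriate in practice.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [100, 300, 500],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)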

When these models were run, the following training and validation set scores were obtained:

<strong>>>> print("Accuracy on training set: {:.3f}".format(logreg.score(x_train,y_train)))</strong>
Accuracy on training set: 0.766
<strong>>>> print("Accuracy on validation set: {:.3f}".format(logreg.score(x_val,y_val)))</strong>
Accuracy on validation set: 0.797

<strong>>>> print("Accuracy on training set: {:.3f}".format(gbrt.score(x_train, y_train)))</strong>
Accuracy on training set: 0.896
<strong>>>> print("Accuracy on validation set: {:.3f}".format(gbrt.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.792

<strong>>>> print("Accuracy on training set: {:.3f}".format(lgb_model.score(x_train, y_train)))</strong>
Accuracy on training set: 0.642
<strong>>>> print("Accuracy on validation set: {:.3f}".format(lgb_model.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.677

<strong>>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x_train, y_train)))</strong>
Accuracy on training set: 0.748
<strong>>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.750

<strong>>>> print("Accuracy on training set: {:.3f}".format(ada_clf.score(x_train, y_train)))</strong>
Accuracy on training set: 0.748
<strong>>>> print("Accuracy on validation set: {:.3f}".format(ada_clf.score(x_val, y_val)))</strong>
Accuracy on validation set: 0.750
Model Training Accuracy Validation Accuracy Logistic Regression 0.766 0.797 Gradient Boosting Classifier 0.896 0.792 LightGBM Classifier 0.642 0.677 XGBoost Classifier 0.748 0.750 AdaBoost Classifier 0.748 0.750

From looking at the above results, two things are evident:

  1. Only the GradientBoostingClassifier yields a validation accuracy comparable to the logistic regression (0.792 vs. 0.797); all the other boosting models show lower validation accuracy.
  2. Moreover, the logistic regression's training accuracy is slightly lower than its validation accuracy, which suggests that overfitting is less of an issue for the logistic regression than for the boosting models.

Conclusion

Boosting models have become something of a "black box" and are increasingly relied upon for higher accuracy. However, these models do not necessarily give the best accuracy in all cases (as we have seen here), and the risk of overfitting must also be considered: here the GradientBoostingClassifier's training accuracy (0.896) was noticeably higher than its validation accuracy (0.792), which indicates overfitting.

Boosting works on the premise of combining many weak models (e.g. shallow decision trees) in order to increase accuracy, which is why boosting models are often referred to as ensemble models.
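As a rough illustration of that premise (a toy least-squares version, not the exact algorithm any of the libraries above implement), a boosting loop can be written by hand: each shallow tree is fit to the residuals of the current ensemble and added in with a small learning rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_predictions(x_train, y_train, x_test, n_trees=100, lr=0.1):
    """Toy least-squares boosting: stack shallow trees fit to residuals."""
    pred_train = np.zeros(len(y_train), dtype=float)
    pred_test = np.zeros(len(x_test), dtype=float)
    for _ in range(n_trees):
        residuals = y_train - pred_train           # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(x_train, residuals)
        pred_train += lr * stump.predict(x_train)  # each weak tree nudges the fit
        pred_test += lr * stump.predict(x_test)
    return pred_test

# Threshold the boosted scores at 0.5 to obtain 0/1 class labels.
y_pred = (boosted_predictions(x_train, y_train, x_val) >= 0.5).astype(int)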

While boosting can be advantageous depending on the data one is working with, boosting models do carry an overfitting risk and should not simply be relied upon by default without considering the data in question and whether other models could prove more suitable.

