
20K Stars! The Library That Keeps Claiming Kaggle Competition Crowns

source link: https://zhuanlan.zhihu.com/p/344769117

1. Introduction

XGBoost is short for "eXtreme Gradient Boosting". It was created to be efficient, flexible, and portable, and is also described as an optimized distributed boosting library. It supports Python, Scala, R, Julia, C++, and other languages.

The algorithm behind XGBoost is an improvement on a classic machine learning algorithm, GBDT (gradient boosting decision tree), and it can be used for both classification and regression problems.

XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on today's mainstream distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can scale to problems with billions of examples.

XGBoost handles regression, classification, ranking, and other tasks. Thanks to its strong predictive performance and fast training, it has repeatedly topped the leaderboards of major Kaggle competitions.

2. Project Home

XGBoost has earned 20.4k Stars on GitHub.


https://github.com/dmlc/xgboost

3. Installation and a Usage Example

3.1 Installation (Windows 10 64-bit, Python 3.7.3)

1. Open a cmd window and run pip install xgboost

However, this needed a proxy on my network (perhaps it was just my connection); without one, the install failed with the following network error:


With a proxy it was a bit slow, but the install succeeded.


2. If you don't have a proxy, you can download a prebuilt wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost that matches your Python version.


Then run pip install xgboost-0.90-cp37-cp37m-win_amd64.whl to install it successfully.

3.2 Usage Example

In this example we run a simple test with the Melbourne Housing dataset: predicting house prices.

To get the dataset, follow the WeChat account 开源前哨 and send the message "xgboost".

Before using this data, three common data-quality issues need to be handled:

1. Missing values (commonly handled by dropping the columns that contain them, or imputing them with the column mean);

2. Outliers;

3. Duplicate values.
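For step 1, a small pandas sketch (my illustration; the toy data is hypothetical) shows the two common strategies mentioned above, dropping the affected column and mean imputation:

```python
import numpy as np
import pandas as pd

# A toy frame with one missing entry (hypothetical data, for illustration only)
df = pd.DataFrame({'Rooms': [2.0, 3.0, np.nan, 4.0],
                   'Price': [1.0, 1.5, 1.2, 2.0]})

# Strategy 1: drop every column that contains missing values
dropped = df.dropna(axis=1)

# Strategy 2: fill each missing entry with its column's mean
imputed = df.fillna(df.mean())

print(dropped.columns.tolist())   # only 'Price' survives the drop
print(imputed.loc[2, 'Rooms'])    # the NaN becomes the mean of 2, 3, 4
```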

We won't focus on the data loading and cleaning steps here (please study those on your own). Assume we already have (I have prepared them) X_train (training features), X_valid (validation features), y_train (training labels), and y_valid (validation labels).

Below is the code, run in a Jupyter notebook:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv(r'E:\数据集\melb_data.csv')
# Select the feature columns and the prediction target
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
# This example uses the XGBoost library,
# so import it first
from xgboost import XGBRegressor
xg_model = XGBRegressor()
# Training the model on the training set takes a single line
xg_model.fit(X_train, y_train)
#Out
'''XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=12, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)'''

# Make predictions with the model and evaluate on the validation set to get the mean absolute error
from sklearn.metrics import mean_absolute_error
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#Out
'''Mean Absolute Error: 235552.95046713916'''

# A few XGBoost parameters strongly affect the training results; let's change them and compare the evaluation
# learning_rate (default 0.3) helps control overfitting; let's try 0.15 and 0.05 and compare
xg_model = XGBRegressor(learning_rate=0.05)
xg_model.fit(X_train, y_train)
#out
'''XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=12, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)'''

predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#out
'''Mean Absolute Error: 248468.6911450663'''

xg_model = XGBRegressor(learning_rate=0.15)
xg_model.fit(X_train, y_train)
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#out
'''Mean Absolute Error: 234624.95226435937'''

# n_estimators is the number of weak estimators, i.e. the number of trees. Too low tends to underfit
# (large errors on both the training and validation sets); too high tends to overfit
# (good results on the training set but poor results on the validation set)
xg_model = XGBRegressor(n_estimators=500)
xg_model.fit(X_train, y_train)
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#out
'''Mean Absolute Error: 247669.14656434095'''

xg_model = XGBRegressor(n_estimators=1000)
xg_model.fit(X_train, y_train)
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#Out
'''Mean Absolute Error: 255654.5402660162'''

# The early_stopping_rounds parameter ends training early. As iterations accumulate, the validation error
# starts to fluctuate within a narrow range or even rise, which is a sign of overfitting.
# Setting early_stopping_rounds=N stops training once the validation error has failed to improve for N consecutive rounds.
xg_model = XGBRegressor()
xg_model.fit(X_train, y_train,
             early_stopping_rounds=5,eval_set=[(X_valid, y_valid)])
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#Out
'''[0] validation_0-rmse:941935.18750
[1] validation_0-rmse:732831.68750
[2] validation_0-rmse:599871.62500
[3] validation_0-rmse:516429.90625
[4] validation_0-rmse:467964.43750
[5] validation_0-rmse:441869.50000
[6] validation_0-rmse:426009.87500
[7] validation_0-rmse:416771.93750
[8] validation_0-rmse:411553.68750
[9] validation_0-rmse:408549.15625
[10] validation_0-rmse:406357.93750
[11] validation_0-rmse:403870.65625
[12] validation_0-rmse:402537.96875
[13] validation_0-rmse:402280.25000
[14] validation_0-rmse:400586.65625
[15] validation_0-rmse:399610.59375
[16] validation_0-rmse:398340.71875
[17] validation_0-rmse:397867.90625
[18] validation_0-rmse:397690.25000
[19] validation_0-rmse:397726.62500
[20] validation_0-rmse:396976.93750
[21] validation_0-rmse:396865.03125
[22] validation_0-rmse:395752.12500
[23] validation_0-rmse:392401.65625
[24] validation_0-rmse:393190.93750
[25] validation_0-rmse:392763.21875
[26] validation_0-rmse:392791.09375
[27] validation_0-rmse:391552.53125
[28] validation_0-rmse:391745.37500
[29] validation_0-rmse:391624.40625
[30] validation_0-rmse:391096.18750
[31] validation_0-rmse:391777.71875
[32] validation_0-rmse:392427.34375
[33] validation_0-rmse:391748.43750
[34] validation_0-rmse:391423.40625
[35] validation_0-rmse:391501.90625
Mean Absolute Error: 243436.07564893225'''

xg_model = XGBRegressor()
xg_model.fit(X_train, y_train,
             early_stopping_rounds=30,eval_set=[(X_valid, y_valid)])
predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
#Out
'''[0] validation_0-rmse:941935.18750
[1] validation_0-rmse:732831.68750
[2] validation_0-rmse:599871.62500
[3] validation_0-rmse:516429.90625
[4] validation_0-rmse:467964.43750
[5] validation_0-rmse:441869.50000
[6] validation_0-rmse:426009.87500
[7] validation_0-rmse:416771.93750
[8] validation_0-rmse:411553.68750
[9] validation_0-rmse:408549.15625
[10] validation_0-rmse:406357.93750
[11] validation_0-rmse:403870.65625
[12] validation_0-rmse:402537.96875
[13] validation_0-rmse:402280.25000
[14] validation_0-rmse:400586.65625
[15] validation_0-rmse:399610.59375
[16] validation_0-rmse:398340.71875
[17] validation_0-rmse:397867.90625
[18] validation_0-rmse:397690.25000
[19] validation_0-rmse:397726.62500
[20] validation_0-rmse:396976.93750
[21] validation_0-rmse:396865.03125
[22] validation_0-rmse:395752.12500
[23] validation_0-rmse:392401.65625
[24] validation_0-rmse:393190.93750
[25] validation_0-rmse:392763.21875
[26] validation_0-rmse:392791.09375
[27] validation_0-rmse:391552.53125
[28] validation_0-rmse:391745.37500
[29] validation_0-rmse:391624.40625
[30] validation_0-rmse:391096.18750
[31] validation_0-rmse:391777.71875
[32] validation_0-rmse:392427.34375
[33] validation_0-rmse:391748.43750
[34] validation_0-rmse:391423.40625
[35] validation_0-rmse:391501.90625
[36] validation_0-rmse:390788.90625
[37] validation_0-rmse:390342.50000
[38] validation_0-rmse:388972.56250
[39] validation_0-rmse:388548.37500
[40] validation_0-rmse:389271.59375
[41] validation_0-rmse:388126.15625
[42] validation_0-rmse:387813.06250
[43] validation_0-rmse:387755.25000
[44] validation_0-rmse:388325.31250
[45] validation_0-rmse:388446.96875
[46] validation_0-rmse:388394.09375
[47] validation_0-rmse:388425.00000
[48] validation_0-rmse:388244.31250
[49] validation_0-rmse:387900.65625
[50] validation_0-rmse:387721.56250
[51] validation_0-rmse:387184.31250
[52] validation_0-rmse:386687.31250
[53] validation_0-rmse:386059.87500
[54] validation_0-rmse:386105.56250
[55] validation_0-rmse:386133.31250
[56] validation_0-rmse:385496.62500
[57] validation_0-rmse:385332.71875
[58] validation_0-rmse:385390.18750
[59] validation_0-rmse:385281.46875
[60] validation_0-rmse:385243.71875
[61] validation_0-rmse:385267.50000
[62] validation_0-rmse:385012.65625
[63] validation_0-rmse:385141.46875
[64] validation_0-rmse:384997.34375
[65] validation_0-rmse:385355.09375
[66] validation_0-rmse:385625.09375
[67] validation_0-rmse:385546.90625
[68] validation_0-rmse:385723.62500
[69] validation_0-rmse:385636.68750
[70] validation_0-rmse:385617.34375
[71] validation_0-rmse:385682.62500
[72] validation_0-rmse:385741.12500
[73] validation_0-rmse:385583.09375
[74] validation_0-rmse:385650.28125
[75] validation_0-rmse:385895.87500
[76] validation_0-rmse:385654.53125
[77] validation_0-rmse:385794.40625
[78] validation_0-rmse:385448.96875
[79] validation_0-rmse:385472.40625
[80] validation_0-rmse:385382.06250
[81] validation_0-rmse:385556.25000
[82] validation_0-rmse:385969.62500
[83] validation_0-rmse:385744.06250
[84] validation_0-rmse:385602.81250
[85] validation_0-rmse:385691.06250
[86] validation_0-rmse:385536.59375
[87] validation_0-rmse:385594.53125
[88] validation_0-rmse:385643.12500
[89] validation_0-rmse:385773.93750
[90] validation_0-rmse:385573.40625
[91] validation_0-rmse:385784.81250
[92] validation_0-rmse:385986.00000
[93] validation_0-rmse:386088.37500
Mean Absolute Error: 237037.17186349412'''

That covers installing XGBoost and a simple first use. Depending on the needs of the models you train, you can go on to learn more of its API.

