[sklearn] Model Selection and Hyperparameter Tuning
source link: https://www.guofei.site/2019/09/28/model_selection.html
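September 28, 2019, Author: Guofei
Category: 2-1 Supervised Learning, article no. 201
Copyright notice: this article was written by Guofei (郭飞). Feel free to repost it, but please cite the original link and notify the author.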
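Two years ago I wrote up the theory and code for model selection and hyperparameter tuning in three posts: a roundup of machine learning models, "Model Evaluation: Theory and Implementation", and "Cross-Validation: Introduction and Implementation".
Two years on, the theory has barely changed, but sklearn has gained many handy new features (kudos).
I am now stripping the code out of those posts and keeping only the theory; this post re-summarizes the code. (The old posts will be deleted after a while.)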
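Structure of this post
- First comes GridSearchCV, a grid search method, followed by RandomizedSearchCV and others, which are used in much the same way.
- A search needs a scoring definition so that it can compute scores and pick the best model; the second part covers score.
- A search also relies on a concrete CV (cross validation) scheme, for example KFold, which the third part covers.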
GridSearchCV
from sklearn import model_selection, neural_network

# X, y, gkf and groups are built in the KFold section at the end of this post
mlp = neural_network.MLPClassifier(max_iter=1000)
param_grid = {
    'hidden_layer_sizes': [(10,), (20,), (5, 5)],
    'activation': ['logistic', 'tanh', 'relu'],
    'alpha': [0.001, 0.01, 0.1, 0.4, 1],
}
gscv = model_selection.GridSearchCV(
    estimator=mlp,
    param_grid=param_grid,
    scoring='accuracy',          # metric used to rank the candidates
    cv=gkf.split(X, y, groups),  # CV scheme: a splitter or an iterable of (train, test) indices
    return_train_score=True,     # train scores are not returned by default
    refit=True,                  # default True: retrain the best setting on all the data, see gscv.best_estimator_
    n_jobs=-1)
gscv.fit(X, y)
gscv.cv_results_
About score
https://scikit-learn.org/stable/modules/model_evaluation.html
- The best model requires refit=True:
  gscv.best_score_
  gscv.best_params_
  best_model = gscv.best_estimator_
  best_model.score(test_data, test_target)
- If you want several scores and also want refit to retrain the best model, you have to say which score decides what "best" means. For example:
  from sklearn import tree
  dtc = tree.DecisionTreeClassifier()
  gscv = model_selection.GridSearchCV(estimator=dtc,
                                      param_grid={'min_samples_split': [2, 3, 4]},
                                      scoring=['accuracy', 'f1', 'roc_auc'],
                                      refit='accuracy')
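To make the multi-metric case concrete, here is a minimal sketch that fits such a search and reads the per-metric results. The synthetic dataset, its size, the seeds and cv=3 are assumptions chosen only for illustration.

```python
from sklearn import datasets, model_selection, tree

# Synthetic binary data; sizes and seeds are arbitrary choices for this sketch
X_demo, y_demo = datasets.make_classification(
    n_samples=200, n_features=8, random_state=0)

search = model_selection.GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=0),
    param_grid={'min_samples_split': [2, 3, 4]},
    scoring=['accuracy', 'f1', 'roc_auc'],
    refit='accuracy',  # the metric that decides best_params_ / best_estimator_
    cv=3,
)
search.fit(X_demo, y_demo)

# With several metrics, cv_results_ holds one mean_test_<name> entry per metric
for name in ['accuracy', 'f1', 'roc_auc']:
    print(name, search.cv_results_['mean_test_' + name])
print(search.best_params_, search.best_score_)
```

Here best_score_ is the mean cross-validated accuracy of the selected candidate, because refit names that metric.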
model_selection.RandomizedSearchCV
sklearn.model_selection.RandomizedSearchCV.
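Andrew Ng explains the benefit in his course: when you are not sure which hyperparameters actually matter, random search "effectively" tries more distinct values of each one than grid search does for the same number of fits.
estimator
param_distributions # a dict; each value is either a list or an object with an rvs method (e.g. a scipy.stats distribution)
scoring # same as above
n_jobs
cv
refit
return_train_score
random_state
Usage is similar to GridSearchCV.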
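As a rough sketch of how these parameters fit together, the example below mixes a list with scipy.stats distributions in param_distributions. The decision tree, the chosen ranges, n_iter=10 and the synthetic data are assumptions for illustration and do not come from the original post.

```python
from scipy import stats
from sklearn import datasets, model_selection, tree

X_demo, y_demo = datasets.make_classification(
    n_samples=200, n_features=8, random_state=0)

param_distributions = {
    'max_depth': [2, 4, 8, None],               # list: values are sampled uniformly
    'min_samples_split': stats.randint(2, 20),  # distribution: integers in [2, 20)
    'min_samples_leaf': stats.randint(1, 10),   # distribution: integers in [1, 10)
}
rscv = model_selection.RandomizedSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of parameter settings that are sampled
    scoring='accuracy',
    cv=3,
    random_state=0,   # makes the sampled settings reproducible
)
rscv.fit(X_demo, y_demo)
print(rscv.best_params_, rscv.best_score_)
```

Unlike the grid search, the cost is fixed by n_iter rather than by the size of the parameter space.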
Other CV tools
- [model_selection.learning_curve](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve): how the metric behaves as train_size keeps growing
- validation_curve: the curve of each evaluation metric as a parameter takes different values
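A minimal sketch of both helpers follows. The estimator, the max_depth range, the train-size fractions and the synthetic data are assumptions picked for illustration.

```python
import numpy as np
from sklearn import datasets, model_selection, tree

X_demo, y_demo = datasets.make_classification(
    n_samples=300, n_features=8, random_state=0)
clf = tree.DecisionTreeClassifier(random_state=0)

# learning_curve: refit on growing subsets of the training folds
train_sizes, train_scores, test_scores = model_selection.learning_curve(
    clf, X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=4, scoring='accuracy')
# Rows are train sizes, columns are CV folds
print(train_sizes, test_scores.mean(axis=1))

# validation_curve: refit once per value of a single parameter
depths = [1, 2, 4, 8]
vc_train, vc_test = model_selection.validation_curve(
    clf, X_demo, y_demo,
    param_name='max_depth', param_range=depths, cv=4, scoring='accuracy')
print(vc_train.mean(axis=1), vc_test.mean(axis=1))
```

Averaging over axis 1 gives one point per train size (or per parameter value), which is what usually gets plotted.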
score
https://scikit-learn.org/stable/modules/model_evaluation.html
The official documentation summarizes this well. It also covers writing a custom score function, which I have not copied here.
classification
import sklearn.metrics as metrics
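metrics.confusion_matrix(test_target, test_predict, labels=['CYT','NUC']) # labels controls which classes are shown, and in what order
- accuracy_score: metrics.accuracy_score(test_target, test_predict)
- precision_score: metrics.precision_score(test_target, test_predict, pos_label='CYT')
- recall_score: metrics.recall_score(test_target, test_predict, pos_label='CYT')
- f1_score: metrics.f1_score(test_target, test_predict, pos_label='CYT')
- fbeta_score: metrics.fbeta_score(test_target, test_predict, beta=0.5, pos_label='CYT') # beta is required; beta < 1 favors precision, beta > 1 favors recall
With string labels such as 'CYT' and 'NUC', the binary metrics need pos_label, because the default pos_label=1 is not one of the labels.
A table summarizing precision, recall, f1-score and support:
metrics.classification_report(test_target, test_predict)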
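The calls above can be checked by hand on a tiny example. The eight labels below are made up for illustration; 'CYT' is treated as the positive class.

```python
from sklearn import metrics

# Hypothetical labels: 3 true positives, 1 false positive,
# 1 false negative and 3 true negatives for the class 'CYT'
y_true = ['CYT', 'CYT', 'CYT', 'NUC', 'NUC', 'NUC', 'NUC', 'CYT']
y_pred = ['CYT', 'CYT', 'NUC', 'NUC', 'NUC', 'CYT', 'NUC', 'CYT']

# Rows are true classes, columns are predicted classes, ordered as in labels
cm = metrics.confusion_matrix(y_true, y_pred, labels=['CYT', 'NUC'])
accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred, pos_label='CYT')
recall = metrics.recall_score(y_true, y_pred, pos_label='CYT')
f1 = metrics.f1_score(y_true, y_pred, pos_label='CYT')

print(cm)
print(accuracy, precision, recall, f1)
print(metrics.classification_report(y_true, y_pred))
```

With these counts, precision is 3/(3+1), recall is 3/(3+1) and accuracy is 6/8, so all of them come out at 0.75.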
P-R curve
metrics.precision_recall_curve(y_true, y_predict_proba)
ROC curve
import matplotlib.pyplot as plt
# Column 0 of predict_proba is the probability of model.classes_[0], here 'CYT'
test_proba_predict = model.predict_proba(test_data)[:, 0]
train_proba_predict = model.predict_proba(train_data)[:, 0]
fpr_test, tpr_test, th_test = metrics.roc_curve(test_target == 'CYT', test_proba_predict)
fpr_train, tpr_train, th_train = metrics.roc_curve(train_target == 'CYT', train_proba_predict)
plt.figure(figsize=[6, 6])
plt.plot(fpr_test, tpr_test, color='blue', label='test')
plt.plot(fpr_train, tpr_train, color='red', label='train')
plt.legend()
plt.title('ROC curve')
- AUC: metrics.roc_auc_score(y_true, y_predict_proba) returns the AUC value
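For a quick sanity check of the curve functions without a fitted model, here is a sketch on four hand-made scores. The labels and scores are illustrative values only.

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc = metrics.roc_auc_score(y_true, y_score)

# precision and recall have one more entry than pr_thresholds
precision, recall, pr_thresholds = metrics.precision_recall_curve(y_true, y_score)

print(auc)
print(metrics.auc(fpr, tpr))  # area under the (fpr, tpr) points, same value
```

The AUC is 0.75 here: of the four (positive, negative) pairs, three rank the positive sample higher.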
Evaluation metrics for regression models
- Mean absolute error (MAE): metrics.mean_absolute_error(y_true,y_pred)
- Mean squared error (MSE): metrics.mean_squared_error(y_true,y_pred)
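A small worked example of the two regression metrics, using four made-up values so the arithmetic is easy to follow:

```python
import numpy as np
from sklearn import metrics

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
# Absolute errors: 0.5, 0.5, 0.0, 1.0

mae = metrics.mean_absolute_error(y_true, y_pred)  # (0.5 + 0.5 + 0 + 1) / 4
mse = metrics.mean_squared_error(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4
rmse = np.sqrt(mse)                                # back in the units of y
print(mae, mse, rmse)
```

MAE is 0.5 and MSE is 0.375 for these values; MSE penalizes the single error of 1.0 more heavily than the two errors of 0.5.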
make_scorer
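import numpy as np
from sklearn import metrics

def my_custom_loss_func(y_true, y_pred):
    # log(1 + largest absolute error)
    diff = np.abs(y_true - y_pred).max()
    return np.log1p(diff)

# greater_is_better=False marks this as a loss: the scorer returns its negative
score = metrics.make_scorer(my_custom_loss_func, greater_is_better=False)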
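A scorer built this way can be passed wherever scoring is accepted. The sketch below uses it with cross_val_score; the function name max_abs_log_error, the linear model and the synthetic regression data are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

def max_abs_log_error(y_true, y_pred):
    # Hypothetical loss for this sketch: log(1 + largest absolute error)
    return np.log1p(np.abs(y_true - y_pred).max())

scorer = metrics.make_scorer(max_abs_log_error, greater_is_better=False)

X_reg, y_reg = datasets.make_regression(
    n_samples=100, n_features=5, noise=10.0, random_state=0)

scores = model_selection.cross_val_score(
    linear_model.LinearRegression(), X_reg, y_reg, scoring=scorer, cv=5)
print(scores)  # one value per fold, each the negated loss
```

Because the loss is negated, every search in this post can keep maximizing the score, and values closer to zero are better.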
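KFold
These are the splitters; only the most commonly used ones are listed here.
Points to note:
- By default they do not shuffle and simply split in order (the Repeated* variants are the exception, since they always shuffle).
- The split() method returns a generator that yields indices, not the values themselves.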
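Both points can be seen on a toy array. The sketch below assumes a 6-sample array whose values differ from their row indices, so it is clear that split() yields positions rather than data.

```python
import numpy as np
from sklearn import model_selection

# 6 samples; the values are multiples of 10 so they cannot be mistaken for indices
X_small = np.arange(12).reshape(6, 2) * 10

kf = model_selection.KFold(n_splits=3)  # shuffle=False by default
splits = kf.split(X_small)              # a generator: nothing is computed yet
folds = list(splits)                    # materialize the (train, test) index pairs

for train_index, test_index in folds:
    print("TRAIN:", train_index, "TEST:", test_index)
    # Index into the data to get the actual rows
    X_train, X_test = X_small[train_index], X_small[test_index]
```

Without shuffling, the test folds are the consecutive blocks of rows 0-1, 2-3 and 4-5.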
Imports and sample data
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
X, y = datasets.make_classification(n_samples=10,
                                    n_features=10,
                                    n_informative=2,
                                    n_redundant=3,  # this many features are linear combinations of the informative ones
                                    n_repeated=3,  # this many features are duplicates drawn from the informative and redundant ones
                                    n_classes=2,
                                    n_clusters_per_class=1,
                                    weights=[0.2, 0.8],  # imbalanced classes
                                    scale=[5] + [1] * 8 + [3]  # per-feature scale
                                    )
The main splitters are listed below.
- KFold: Split dataset into k consecutive folds (without shuffling by default).
  kf = model_selection.KFold(n_splits=4, shuffle=True, random_state=0)
  for train_index, test_index in kf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
- StratifiedKFold: stratified sampling, so each fold keeps roughly the same class frequencies in y as the whole dataset; it does not shuffle by default either. (random_state may only be set together with shuffle=True.)
  skf = model_selection.StratifiedKFold(n_splits=3, shuffle=False)
  for train_index, test_index in skf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
- RepeatedKFold: a KFold variant that runs n_repeats times
  rkf = model_selection.RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)
- RepeatedStratifiedKFold: a StratifiedKFold variant that runs n_repeats times
  rskf = model_selection.RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
  for train_index, test_index in rskf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
- LeaveOneOut: leave-one-out
  loo = model_selection.LeaveOneOut()
  for train_index, test_index in loo.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
- GroupKFold: guarantees that each group appears in only one of the two sets
  gkf = model_selection.GroupKFold(n_splits=2)
  groups = [0] * 5 + [1] * 5
  for train_index, test_index in gkf.split(X, y, groups):
      print("TRAIN:", train_index, "TEST:", test_index)
- LeaveOneGroupOut: a GroupKFold variant that holds out one group at a time, again keeping each group in only one of the two sets
  logo = model_selection.LeaveOneGroupOut()
  groups = [0] * 2 + [1] * 8
  for train_index, test_index in logo.split(X, y, groups):
      print("TRAIN:", train_index, "TEST:", test_index)
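A group-aware splitter can also be handed to a search object directly, with the groups passed to fit. The sketch below assumes six hypothetical groups of ten samples each, a decision tree and a tiny grid, all chosen only for illustration.

```python
import numpy as np
from sklearn import datasets, model_selection, tree

X_demo, y_demo = datasets.make_classification(
    n_samples=60, n_features=6, random_state=0)
groups_demo = np.repeat(np.arange(6), 10)  # hypothetical group id per sample

gkf = model_selection.GroupKFold(n_splits=3)

# Count splits in which some group lands on both the train and the test side
overlaps = 0
for train_index, test_index in gkf.split(X_demo, y_demo, groups_demo):
    shared = set(groups_demo[train_index]) & set(groups_demo[test_index])
    overlaps += len(shared)
print("groups shared between train and test:", overlaps)

search = model_selection.GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4]},
    cv=gkf,  # pass the splitter itself, not gkf.split(...)
    scoring='accuracy',
)
search.fit(X_demo, y_demo, groups=groups_demo)  # groups are forwarded to gkf.split
print(search.best_params_)
```

Passing the splitter object instead of a pre-built generator keeps the search reusable, since a generator is exhausted after one pass.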