
【sklearn】Model Selection + Parameter Tuning

2019-09-28

Author: Guofei

Category: 2-1 Supervised Learning. Article No.: 201


Copyright notice: the author of this article is Guo Fei. Feel free to repost, but please cite the original link and notify the author.
Original link: https://www.guofei.site/2019/09/28/model_selection.html


Two years ago I summarized the theory and code of model selection and parameter tuning in three posts: 机器学习模型汇总 (a survey of machine learning models), 【模型评价】理论与实现 (model evaluation: theory and implementation), and 【交叉验证】介绍与实现 (cross-validation: introduction and implementation).
Two years on, the theory hasn't changed much, but sklearn has added many handy new features (well done).
I have now stripped the code from those posts, keeping only the theory, and re-summarize the code here. (The old posts will be deleted after a while.)

Structure of this article

  1. First, GridSearchCV, a grid-search method; then RandomizedSearchCV and friends, whose usage is similar
  2. A search needs a scoring definition in order to report scores and pick the best model; the second part covers scoring
  3. A search also needs a concrete CV (cross-validation) strategy, e.g. KFold

GridSearchCV

from sklearn import datasets, model_selection, neural_network

# toy data so the example runs end to end
X, y = datasets.make_classification(n_samples=100, random_state=0)

mlp = neural_network.MLPClassifier(max_iter=1000)

param_grid = {
    'hidden_layer_sizes': [(10,), (20,), (5, 5)],
    'activation': ['logistic', 'tanh', 'relu'],
    'alpha': [0.001, 0.01, 0.1, 0.4, 1]
}

gscv = model_selection.GridSearchCV(estimator=mlp,
                                    param_grid=param_grid,
                                    scoring='accuracy',       # scoring metric
                                    cv=5,                     # CV strategy; a splitter such as gkf.split(X, y, groups) also works (see the KFold section)
                                    return_train_score=True,  # train scores are not returned by default
                                    refit=True,               # default True: refit the best model on the full data; fetch it via gscv.best_estimator_
                                    n_jobs=-1)

gscv.fit(X, y)
gscv.cv_results_
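
cv_results_ is a dict of parallel arrays; one convenient way to inspect it is as a table. A minimal sketch, assuming pandas is available:

import pandas as pd

# view the search results as a table; the train columns exist because return_train_score=True
results = pd.DataFrame(gscv.cv_results_)
print(results[['params', 'mean_test_score', 'mean_train_score']])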

About scoring

https://scikit-learn.org/stable/modules/model_evaluation.html

  1. To get the best model, refit=True is required:
    gscv.best_score_
    gscv.best_params_
    best_model = gscv.best_estimator_
    best_model.score(test_data, test_target)
    
  2. If you want several scores and still want refit=True to retrain the best model, you must specify which score decides "best". For example:
    from sklearn import tree
    dtc = tree.DecisionTreeClassifier()
    gscv = model_selection.GridSearchCV(estimator=dtc,
                                        param_grid={'min_samples_split': [2, 3, 4]},
                                        scoring=['accuracy', 'f1', 'roc_auc'],
                                        refit='accuracy')
    
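With multiple metrics, the keys of cv_results_ are suffixed per metric, and best_score_ refers to the refit metric. A minimal sketch, assuming the dtc search above has been fitted on binary-classification data (f1 and roc_auc require binary targets by default):

gscv.fit(X, y)
gscv.cv_results_['mean_test_accuracy']  # one set of columns per metric
gscv.cv_results_['mean_test_roc_auc']
gscv.best_score_                        # measured with the refit metric, 'accuracy' here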

model_selection.RandomizedSearchCV

sklearn.model_selection.RandomizedSearchCV.
The advantage, as Andrew Ng notes in his course, is that when you are unsure which hyperparameters actually matter, random search tries more distinct values of each one than grid search does, so it searches more "effectively".

estimator
param_distributions # a dict whose values are either lists or objects with an rvs method (e.g. scipy.stats distributions)
scoring # as above
n_jobs
cv
refit
return_train_score
random_state

Usage is similar to GridSearchCV; see the sketch below.
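
A minimal sketch, reusing the X, y toy data and mlp from the GridSearchCV example; scipy.stats.loguniform is assumed to be available (scipy >= 1.4):

import scipy.stats as stats

param_distributions = {
    'hidden_layer_sizes': [(10,), (20,), (5, 5)],  # lists are sampled uniformly
    'alpha': stats.loguniform(1e-4, 1e0)           # objects with .rvs() are sampled as distributions
}
rscv = model_selection.RandomizedSearchCV(estimator=mlp,
                                          param_distributions=param_distributions,
                                          n_iter=20,  # number of sampled parameter settings
                                          scoring='accuracy',
                                          cv=5,
                                          random_state=0,
                                          n_jobs=-1)
rscv.fit(X, y)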

Other CV utilities

  • [model_selection.learning_curve](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve): how the metrics evolve as train_size grows (see the sketch below)
  • validation_curve: metric curves as a single parameter varies
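
A minimal sketch of both, on hypothetical toy data:

from sklearn import datasets, model_selection, tree

X, y = datasets.make_classification(n_samples=200, random_state=0)

# learning_curve: train/test scores as the training set grows
train_sizes, train_scores, test_scores = model_selection.learning_curve(
    tree.DecisionTreeClassifier(), X, y, cv=5,
    train_sizes=[0.2, 0.5, 0.8, 1.0], scoring='accuracy')

# validation_curve: train/test scores as one hyperparameter varies
train_scores, test_scores = model_selection.validation_curve(
    tree.DecisionTreeClassifier(), X, y, cv=5,
    param_name='max_depth', param_range=[1, 2, 4, 8], scoring='accuracy')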

score

https://scikit-learn.org/stable/modules/model_evaluation.html

The official docs summarize this well; they also describe custom score functions, which I haven't copied in full here.

classification

import sklearn.metrics as metrics
metrics.confusion_matrix(test_target, test_predict, labels=['CYT','NUC']) # labels controls which labels are shown, and their order
  • accuracy_score: metrics.accuracy_score(test_target, test_predict)
  • precision_score: metrics.precision_score(test_target, test_predict)
  • recall_score: metrics.recall_score(test_target, test_predict)
  • metrics.f1_score(test_target, test_predict)
  • metrics.fbeta_score(test_target, test_predict, beta=0.5) # beta is required; beta < 1 weights precision more heavily

A table summarizing precision, recall, f1-score, and support:

metrics.classification_report(test_target, test_predict)

P-R curve

metrics.precision_recall_curve(y_true, y_predict_proba)
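
It returns three arrays that can be plotted directly; a minimal sketch, assuming binary y_true and predicted probabilities for the positive class:

import matplotlib.pyplot as plt

precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_predict_proba)
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.title('P-R curve')
plt.show()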

ROC curve

import matplotlib.pyplot as plt

# column order follows model.classes_; column 0 corresponds to the 'CYT' class here
test_proba_predict = model.predict_proba(test_data)[:, 0]
train_proba_predict = model.predict_proba(train_data)[:, 0]
fpr_test, tpr_test, th_test = metrics.roc_curve(test_target == 'CYT', test_proba_predict)
fpr_train, tpr_train, th_train = metrics.roc_curve(train_target == 'CYT', train_proba_predict)
plt.figure(figsize=[6, 6])
plt.plot(fpr_test, tpr_test, color='blue', label='test')
plt.plot(fpr_train, tpr_train, color='red', label='train')
plt.legend()
plt.title('ROC curve')
plt.show()
  • AUC: metrics.roc_auc_score(y_true, y_predict_proba) returns the AUC value

Regression metrics

  • Mean of the absolute errors (MAE): metrics.mean_absolute_error(y_true, y_pred)
  • Mean of the squared errors (MSE): metrics.mean_squared_error(y_true, y_pred)
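
A quick worked check on hypothetical numbers:

import numpy as np
from sklearn import metrics

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
metrics.mean_absolute_error(y_true, y_pred)  # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
metrics.mean_squared_error(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4 = 0.375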

make_scorer

import numpy as np
from sklearn import metrics

def my_custom_loss_func(y_true, y_pred):
    diff = np.abs(y_true - y_pred).max()
    return np.log1p(diff)

# greater_is_better=False negates the loss, so that a higher score is always better
score = metrics.make_scorer(my_custom_loss_func, greater_is_better=False)
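
The resulting scorer plugs straight into the search and CV utilities above; a minimal sketch on hypothetical toy regression data:

from sklearn import datasets, linear_model, model_selection

X_reg, y_reg = datasets.make_regression(n_samples=50, n_features=5, random_state=0)
reg = linear_model.Ridge()
scores = model_selection.cross_val_score(reg, X_reg, y_reg, cv=3, scoring=score)
print(scores)  # negative values, because greater_is_better=False negates the loss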

KFold

Splitters; the most commonly used ones are excerpted here.

Points to note:

  1. By default there is no shuffling; splits follow the original sample order (except the Repeated* variants, which always shuffle)
  2. The split() method returns a generator that yields indices, not the values themselves

Imports + toy data

from sklearn import datasets
from sklearn import model_selection

X, y = datasets.make_classification(n_samples=10,
                                    n_features=10,
                                    n_informative=2,
                                    n_redundant=3,  # features that are linear combinations of the informative ones
                                    n_repeated=3,  # features duplicated at random from the informative + redundant ones
                                    n_classes=2,
                                    n_clusters_per_class=1,
                                    weights=[0.2, 0.8],  # imbalanced classes
                                    scale=[5] + [1] * 8 + [3]  # per-feature scale
                                    )

Below are the main splitters; a sketch of plugging one into GridSearchCV follows the list.

  • KFold: split the dataset into k consecutive folds (without shuffling by default).
    kf = model_selection.KFold(n_splits=4, shuffle=True, random_state=0)
    for train_index, test_index in kf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
    
  • StratifiedKFold: stratified sampling, so each fold keeps roughly the same class frequencies as the whole dataset; also no shuffling by default (random_state only takes effect when shuffle=True)
    skf = model_selection.StratifiedKFold(n_splits=3, shuffle=False)
    for train_index, test_index in skf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
    
  • RepeatedKFold: a variant of KFold, run n_repeats times with different randomization
    rkf = model_selection.RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)
    
  • RepeatedStratifiedKFold: a variant of StratifiedKFold, run n_repeats times
    rskf = model_selection.RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
    for train_index, test_index in rskf.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
    
  • LeaveOneOut: leave-one-out
    loo = model_selection.LeaveOneOut()
    for train_index, test_index in loo.split(X, y):
      print("TRAIN:", train_index, "TEST:", test_index)
    
  • GroupKFold: guarantees that samples from the same group never appear in both the training and the test set
    gkf = model_selection.GroupKFold(n_splits=2)
    groups = [0] * 5 + [1] * 5
    for train_index, test_index in gkf.split(X, y, groups):
      print("TRAIN:", train_index, "TEST:", test_index)
    
  • LeaveOneGroupOut: a variant of GroupKFold that leaves out exactly one group as the test set per split
    logo = model_selection.LeaveOneGroupOut()
    groups = [0] * 2 + [1] * 8
    for train_index, test_index in logo.split(X, y, groups):
      print("TRAIN:", train_index, "TEST:", test_index)
    
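Any of these splitters can be passed as the cv argument of the search utilities; a minimal sketch with GroupKFold, reusing the X, y, groups above (the DecisionTreeClassifier and its parameter grid are just placeholders):

from sklearn import tree

gkf = model_selection.GroupKFold(n_splits=2)
groups = [0] * 5 + [1] * 5
gscv = model_selection.GridSearchCV(estimator=tree.DecisionTreeClassifier(),
                                    param_grid={'max_depth': [2, 3]},
                                    cv=gkf.split(X, y, groups))
gscv.fit(X, y)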

References

sklearn official documentation: https://scikit-learn.org/stable/

