scikit-learn Notes - Model Selection and Evaluation

Posted on 2018-11-02 | Edited on 2019-03-14

To keep a model from overfitting its training data, part of the dataset should be held out as a test set, and generalization performance evaluated on that independent test set.

scikit-learn provides several ways to evaluate a model, including:

  • The estimator's score method, which returns a default evaluation metric.
  • Cross-validation based model evaluation tools, which apply an appropriate scoring strategy internally.
  • The metrics module, which implements performance metrics for many different purposes.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import metrics
from sklearn import svm

# load the iris dataset
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# train a classifier and evaluate it on the test set
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))
pred = clf.predict(X_test)
print(metrics.classification_report(y_test, pred))
print(metrics.confusion_matrix(y_test, pred))

Cross Validation

When comparing different model parameters (hyperparameters), tuning directly against the test set risks overfitting it: repeatedly adjusting parameters by hand and keeping whatever scores best on the test set effectively lets the model learn the test set to some degree, so the test-set estimate of generalization performance is no longer reliable.

One option is to split the data into training, validation, and test sets and tune the parameters on the validation set, as sketched below. But this shrinks each set, which is a poor fit when the total number of samples is small.
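
As a minimal sketch of that three-way split (the 60/20/20 proportions here are an illustrative choice, not from the original post), one common approach is to call train_test_split twice:

from sklearn.model_selection import train_test_split
from sklearn import datasets

iris = datasets.load_iris()

# First hold out a final test set (20% here, an arbitrary choice)
X_rest, X_test, y_rest, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the full data for validation)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)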

K-Fold Cross Validation (k-fold CV)

K-fold cross validation splits the dataset into k subsets (folds). In each round, one fold is held out as the test set and the remaining k-1 folds form the training set, with a different fold held out each time. This makes fuller use of the data and eases the splitting problem when samples are scarce.

import numpy as np
from sklearn.model_selection import KFold

# Example of 2-fold cross-validation on a dataset with 4 samples
X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])
kf = KFold(n_splits=2)
for train_indices, test_indices in kf.split(X):
    print("%s %s" % (train_indices, test_indices))
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]

Note: if the data has some inherent ordering, it should be shuffled before splitting.
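
As a minimal sketch, KFold itself accepts shuffle and random_state arguments that randomize the samples before the folds are formed:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)

# shuffle=True randomizes sample order before splitting into folds;
# random_state makes the shuffling reproducible
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)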

Hyperparameter Tuning

Hyperparameters are parameters that cannot be learned directly by the model; in scikit-learn they are usually specified when the estimator is constructed. Their choice often has a decisive effect on model performance, so we can search the hyperparameter space, scoring each candidate with cross-validation, to find suitable values.
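
As a sketch of scoring one candidate setting with cross-validation (cross_val_score is a standard scikit-learn helper; the kernel and C values here are just examples):

from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# 5-fold cross-validation accuracy for this hyperparameter setting
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores.mean(), scores.std())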

GridSearchCV

GridSearchCV exhaustively tries every combination of the given parameters and uses cross-validation to select the combination that performs best.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Dictionary with parameter names (string) as keys
# and lists of parameter settings to try as values.
# This enables searching over any sequence of parameter settings.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Run fit with all sets of parameters,
# and retain the best combination.
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
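
As a follow-up sketch (not from the original post): the grid search above fits on all of the iris data, so a more careful workflow holds out a test set first and scores the best model on it. By default GridSearchCV refits the best parameter combination on the whole training set and exposes it as best_estimator_:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters)
clf.fit(X_train, y_train)

# Evaluate the refitted best estimator on the untouched test set
print(clf.best_params_)
print(clf.best_estimator_.score(X_test, y_test))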


