
Spark Model Selection and Hyperparameter Tuning

 3 years ago
source link: https://flashgene.com/archives/149705.html

Spark – ML Tuning

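Official documentation: https://spark.apache.org/docs/2.2.0/ml-tuning.html

This chapter describes how to use MLlib's tooling to tune ML algorithms and Pipelines. Built-in cross-validation and other tools let users optimize the hyperparameters of algorithms and Pipelines.

Contents:

Model selection (hyperparameter tuning)

Cross-validation

Train-validation split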

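Model selection (hyperparameter tuning)

An important task in machine learning is model selection: using data to find the best model and parameters for a given task. This is also called tuning. Tuning can target a single model or every stage of a whole Pipeline, and users can tune an entire Pipeline at once rather than tuning each of its parts separately.

MLlib supports model selection with tools such as CrossValidator and TrainValidationSplit. These tools require the following:

Estimator: the algorithm or Pipeline to tune.

A list of ParamMaps: the parameter space to search over.

Evaluator: the metric used to measure how well a fitted model performs on held-out test data.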
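To make the "list of ParamMaps" concrete, here is a minimal pure-Python sketch of how a parameter grid expands into one map per combination. It only illustrates the idea and is not Spark's ParamGridBuilder; the helper name build_grid and the use of plain dicts as parameter maps are assumptions made for this sketch.

```python
from itertools import product


def build_grid(param_values):
    """Expand {param_name: [candidates]} into a list of parameter maps.

    Hypothetical helper for illustration; Spark's ParamGridBuilder
    plays this role in the real API.
    """
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Candidate values mirror the grid used later in this article.
grid = build_grid({
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
})
for param_map in grid:
    print(param_map)
```

Every candidate value of one parameter is paired with every candidate value of the others, so the grid size is the product of the list lengths.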

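These tools work as follows:

They split the input data into separate training and test datasets.

For each (training, test) pair, they iterate through every parameter combination in the search space:

For each parameter combination, they fit the Estimator with those parameters, obtain the fitted model, and evaluate that model's performance with the Evaluator.

They select the parameter combination that produced the best-performing model.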
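The steps above can be sketched in plain Python. This is a toy under stated assumptions, not Spark's implementation: the "estimator" (fit) predicts a single shrunken mean, the "evaluator" (evaluate) is mean squared error, and the names fit, evaluate, select_best and the shrink parameter are all made up for the illustration.

```python
from statistics import mean


def fit(train, shrink):
    """Toy estimator: always predicts the training mean scaled by (1 - shrink)."""
    constant = mean(train) * (1.0 - shrink)
    return lambda: constant


def evaluate(model, test):
    """Toy evaluator: mean squared error on held-out values (smaller is better)."""
    prediction = model()
    return mean((y - prediction) ** 2 for y in test)


def select_best(splits, param_maps):
    """Score every parameter map on every split and return the best map."""
    totals = [0.0] * len(param_maps)
    for train, test in splits:                    # each (training, test) pair
        for i, params in enumerate(param_maps):   # each parameter combination
            model = fit(train, **params)
            totals[i] += evaluate(model, test)
    avg_metrics = [total / len(splits) for total in totals]
    best_index = min(range(len(param_maps)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics


splits = [
    ([4.0, 5.0, 6.0], [5.0, 5.5]),
    ([5.0, 5.5, 6.0], [4.0, 5.0]),
]
param_maps = [{"shrink": 0.0}, {"shrink": 0.5}, {"shrink": 0.9}]
best_params, avg_metrics = select_best(splits, param_maps)
print(best_params, avg_metrics)
```

With a single split the same loop describes a train-validation split; with k splits it describes k-fold cross-validation.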

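The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. Each evaluator has a default metric, which can be overridden with setMetricName.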
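One detail worth keeping in mind when changing the metric is its direction: an error metric such as rmse should be minimized, while a score such as areaUnderROC should be maximized. PySpark evaluators report this through isLargerBetter(). The sketch below mimics that choice in plain Python; the lookup table and the pick_best helper are illustrative assumptions, not part of Spark.

```python
# Metric names are ones Spark's evaluators accept; the table itself is
# written by hand for this illustration.
LARGER_IS_BETTER = {
    "rmse": False,
    "mse": False,
    "mae": False,
    "r2": True,
    "areaUnderROC": True,
    "areaUnderPR": True,
    "f1": True,
    "accuracy": True,
}


def pick_best(metric_name, scores):
    """Return the index of the best score under the given metric."""
    chooser = max if LARGER_IS_BETTER[metric_name] else min
    return chooser(range(len(scores)), key=scores.__getitem__)


scores = [0.9, 0.4, 0.7]
print(pick_best("rmse", scores))          # smallest error wins
print(pick_best("areaUnderROC", scores))  # largest score wins
```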

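Cross-validation

CrossValidator first splits the dataset into a set of folds, which are then combined into separate training and test datasets. With k=3, for example, CrossValidator generates 3 (training, test) pairs, each of which uses 2 folds for training and the remaining fold for testing. To evaluate a particular parameter combination, CrossValidator computes the average performance of the 3 models obtained by fitting the Estimator on each of those 3 pairs.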
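The fold bookkeeping for k=3 can be sketched as follows. This is a simplification: rows are dealt into folds round-robin so the output is easy to read, whereas a real cross-validator assigns rows to folds randomly. The helper name k_fold_pairs is an assumption for this sketch.

```python
def k_fold_pairs(rows, k):
    """Deal rows into k folds round-robin, then build k (train, test) pairs."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for held_out in range(k):
        test = folds[held_out]
        # Training data is every fold except the one held out for testing.
        train = [row
                 for i, fold in enumerate(folds) if i != held_out
                 for row in fold]
        pairs.append((train, test))
    return pairs


pairs = k_fold_pairs(list(range(9)), k=3)
for train, test in pairs:
    print("train:", train, "test:", test)
```

Each row lands in the test set exactly once and in the training set k-1 times, which is why the averaged metric uses all of the data.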

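After identifying the best parameter combination, CrossValidator refits the Estimator using that combination and the entire dataset.

Example: model selection via cross-validation

Note: cross-validation over a whole parameter grid is expensive. In the example below, the grid has 3 values for numFeatures and 2 values for regParam, and CrossValidator uses 2 folds, so (3 × 2) × 2 = 12 different models are trained. Real workloads usually tune more parameters, try more values per parameter, and use more folds, so the cost grows quickly. Even so, compared with tuning by hand it remains a more principled and automated way to choose parameters.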

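from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession
# Reuse the active SparkSession (e.g. in the pyspark shell) or start a new one.
spark = SparkSession.builder.getOrCreate()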
# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents. cvModel applies the best model found during cross-validation.
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
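The cost described in the note before this example can be tallied with a small helper. This is a back-of-the-envelope sketch; the function name cross_validation_fits and its refit flag are assumptions for the illustration, with refit standing for the final fit on the entire dataset.

```python
from math import prod


def cross_validation_fits(values_per_param, num_folds, refit=True):
    """Count Estimator fits: one per (parameter combination, fold),
    plus an optional final refit on the entire dataset."""
    grid_size = prod(values_per_param)
    return grid_size * num_folds + (1 if refit else 0)


# 3 values for numFeatures, 2 values for regParam, 2 folds.
print(cross_validation_fits([3, 2], num_folds=2, refit=False))
print(cross_validation_fits([3, 2], num_folds=2))
```

Calling the helper with num_folds=1 gives the count for a single train/validation split over the same grid.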

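Train-validation split

For hyperparameter tuning, Spark also offers TrainValidationSplit. It evaluates each parameter combination only once, whereas CrossValidator evaluates each combination k times. It is therefore cheaper, but its results are less reliable when the training dataset is not sufficiently large.

Unlike CrossValidator, TrainValidationSplit creates a single (training, test) pair. It splits the dataset into two parts according to trainRatio. With trainRatio=0.75, for example, 75% of the data is used for training and 25% for validation.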
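A ratio-based split can be sketched in a few lines of plain Python. This is a simplified illustration: it shuffles and cuts at an exact position, while a real randomized split may only approximate the ratio. The helper name train_validation_split and the seed default are assumptions for this sketch.

```python
import random


def train_validation_split(rows, train_ratio, seed=42):
    """Shuffle a copy of rows and cut it once at train_ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train_rows, validation_rows = train_validation_split(
    list(range(100)), train_ratio=0.75)
print(len(train_rows), len(validation_rows))
```

Because there is only one split, every parameter combination is scored on the same validation rows, which is what makes this approach cheaper and also more sensitive to how the split happens to fall.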

Like CrossValidator, TrainValidationSplit finally refits the Estimator using the best parameter combination and the entire dataset.

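from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.sql import SparkSession
# Reuse the active SparkSession (e.g. in the pyspark shell) or start a new one.
spark = SparkSession.builder.getOrCreate()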
# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
