Google BigQuery ML vs. StreamingPro MLSQL

 5 years ago
source link: http://www.jianshu.com/p/ef57277130bd?amp%3Butm_medium=referral

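Preface

Today I came across an article from AI前线 (AI Frontline), "谷歌BigQuery ML正式上岗,只会用SQL也能玩转机器学习!" (roughly: "Google BigQuery ML officially goes live: machine learning with nothing but SQL"). I happen to be actively promoting StreamingPro's MLSQL myself.

So today let's compare the two products.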

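How they run

MLSQL supports both Run as Application and Run as Service. Run as Service is simple, and you can try it directly on your own machine: Five Minute Quick Tutorial

BigQuery ML is a cloud product, so on the surface it is presumably also Run as Service.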

Syntax and usage

BigQuery ML trains a model like this:

CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
 (model_type='linear_reg', labels=['arr_delay']) AS
SELECT
 arr_delay,
 carrier,
 origin,
 dest,
 dep_delay,
 taxi_out,
 distance
FROM
 `cloud-training-demos.flights.tzcorr`
WHERE
 arr_delay IS NOT NULL

BigQuery ML also extends standard SQL with new keywords, but overall it stays close to the original shape of SQL.

To accomplish the same thing in MLSQL:

select arr_delay, carrier, origin, dest, dep_delay,
taxi_out, distance from db.table 
as lrCorpus;

train lrCorpus as LogisticRegressor.`/tmp/linear_regression_model`
where inputCol="features"
and labelCol="label"
;

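Likewise, MLSQL extends and modifies SQL, and for model training its changes are larger. After training finishes, you can load the output to inspect the result, which looks something like this: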

+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
|           modelPath|algIndex|                 alg|              score| status|    startTime|      endTime|         trainParams|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
|/tmp/william/tmp/...|       1|org.apache.spark....|-1.9704115113779945|success|1532659750073|1532659757320|Map(ratingCol -> ...|
|/tmp/william/tmp/...|       0|org.apache.spark....|-1.8446490919033698|success|1532659757327|1532659760394|Map(ratingCol -> ...|
+--------------------+--------+--------------------+-------------------+-------+-------------+-------------+--------------------+
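
To make the columns concrete, here is a small Python sketch that takes the two rows above (values copied from the table) and derives each run's training time. It assumes startTime and endTime are epoch milliseconds. Treating the highest score as the "best" run is also an assumption for illustration; the source does not say how score is defined.

```python
# Rows copied from the result table above (numeric columns only).
rows = [
    {"algIndex": 1, "score": -1.9704115113779945,
     "startTime": 1532659750073, "endTime": 1532659757320},
    {"algIndex": 0, "score": -1.8446490919033698,
     "startTime": 1532659757327, "endTime": 1532659760394},
]

# Assumption: startTime/endTime are epoch milliseconds.
durations_ms = {r["algIndex"]: r["endTime"] - r["startTime"] for r in rows}

# Assumption: a higher score means a better model.
best = max(rows, key=lambda r: r["score"])

print(durations_ms)
print(best["algIndex"])
```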

For prediction, the BigQuery ML syntax is:

SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,
(
SELECT
 carrier,
 origin,
 dest,
 dep_delay,
 taxi_out,
 distance,
 arr_delay AS actual_arr_delay
FROM
 `cloud-training-demos.flights.tzcorr`
WHERE
 arr_delay IS NOT NULL
LIMIT 10))

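With ML.PREDICT, naming the model is enough to call its prediction function. In MLSQL it takes two steps:

First register the model, which gives you a function (pa_lr_predict here); you choose the name yourself.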

register LogisticRegressor.`/tmp/linear_regression_model` as pa_lr_predict options
modelVersion="1" ;

Then you can use it:

select pa_lr_predict(features) from lrCorpus limit 10 as predict_result;

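Integration with the data platform

BigQuery ML also lets you do complex data processing in SQL, so it is well suited to preparing data for models. MLSQL likewise supports very complex data processing.

Beyond algorithms

"Data-processing models" and SQL functions

It is worth noting that MLSQL provides a large number of "data-processing models" and SQL functions. For example, converting text data into TF-IDF vectors takes a single statement: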

-- Convert a text column into TF-IDF vectors; a custom dictionary can be supplied
train orginal_text_corpus as TfIdfInPlace.`/tmp/tfidfinplace`
where inputCol="content"
-- tokenization settings
and ignoreNature="true"
and dicPaths="...."
-- path to the stop-word list
and stopWordPath="/tmp/tfidf/stopwords"
-- path to the high-priority word list
and priorityDicPath="/tmp/tfidf/prioritywords"
-- weight multiplier for high-priority words
and priority="5.0"
-- n-gram settings
and nGram="2,3"
-- split setting: tokenize using the value of split as the delimiter
and split=""
;

-- the content column of lwys_corpus_with_featurize is now a vector
load parquet.`/tmp/tfidf/data` 
as lwys_corpus_with_featurize;
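
For readers unfamiliar with what such a statement computes, here is a minimal Python sketch of TF-IDF over already-tokenized documents. It is not MLSQL's implementation: the function name tf_idf is hypothetical, term frequency is taken as the raw count, and the smoothed IDF log((N + 1) / (df + 1)) is an assumed choice (the same form Spark MLlib documents for its IDF). Tokenization, stop words, priority weighting and n-grams are left out.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothed IDF: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    # Term frequency is the raw count of the term in the document.
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


docs = [["sql", "model", "sql"], ["sql", "data"]]
weights = tf_idf(docs)
print(weights)
```

A term that appears in every document ("sql" here) gets weight zero, which is the point of the IDF factor: it suppresses words that do not help tell documents apart.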

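Custom algorithm implementations

Besides the algorithms already implemented in MLSQL, you can also write custom algorithms as Python scripts. The PythonAlg module currently supports many Python frameworks, including SKlearn, Tensorflow, Xgboost, and Fasttext; Tensorflow additionally supports Cluster mode. For details, see "MLSQL自定义算法" (MLSQL custom algorithms).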

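Deployment

Both BigQuery ML and MLSQL let you use prediction directly in SQL. MLSQL additionally supports deploying a model as an API service. The procedure is very simple:

  1. Run StreamingPro in single-machine mode.
  2. Register the model through the API or through configuration: register NaiveBayes.`/tmp/bayes_model` as bayes_predict;
  3. Call the prediction endpoint:
http://127.0.0.1:9003/model/predict?pipeline=bayes_predict&data=[[1,2,3...]]&dataType=vector

MLSQL supports end-to-end deployment that reuses the entire data-processing pipeline. For more, see "MLSQL部署" (MLSQL deployment).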
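
As a rough illustration of calling the prediction endpoint from step 3, the Python sketch below builds the request URL from the three query parameters shown there (pipeline, data, dataType). The helper names build_predict_url and predict are hypothetical, URL-encoding the JSON payload is my choice, and the response format is not described in the source, so the body is returned as raw text.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_predict_url(base_url, pipeline, vectors, data_type="vector"):
    """Build the query URL for the /model/predict endpoint."""
    query = urlencode({
        "pipeline": pipeline,
        # Compact JSON for the feature vectors, e.g. [[1.0,2.0,3.0]]
        "data": json.dumps(vectors, separators=(",", ":")),
        "dataType": data_type,
    })
    return f"{base_url}/model/predict?{query}"


def predict(base_url, pipeline, vectors, timeout=5.0):
    """Send the request and return the raw response body as text."""
    url = build_predict_url(base_url, pipeline, vectors)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only builds the URL; predict() needs a running StreamingPro instance.
url = build_predict_url("http://127.0.0.1:9003", "bayes_predict", [[1.0, 2.0, 3.0]])
print(url)
```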

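Model version management

Set keepVersion="true" when training, and each run keeps the previous version. For details, see "模型版本管理" (model version management).

Running multiple algorithms / parameter groups in parallel

If the algorithm itself is already distributed, MLSQL runs multiple parameter groups sequentially. For example: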

train data as ALSInPlace.`/tmp/als` where
-- first parameter group
`fitParam.0.maxIter`="5"
and `fitParam.0.regParam` = "0.01"
and `fitParam.0.userCol` = "userId"
and `fitParam.0.itemCol` = "movieId"
and `fitParam.0.ratingCol` = "rating"
-- second parameter group
and `fitParam.1.maxIter`="1"
and `fitParam.1.regParam` = "0.1"
and `fitParam.1.userCol` = "userId"
and `fitParam.1.itemCol` = "movieId"
and `fitParam.1.ratingCol` = "rating"
-- compute RMSE
and evaluateTable="test"
and ratingCol="rating"
-- recommend items to users, 10 recommendations each
and `userRec` = "10"
-- recommend users for items, 10 recommendations each
-- and `itemRec` = "10"
and coldStartStrategy="drop";

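This is a collaborative-filtering recommendation algorithm. The user has configured two parameter groups; because the algorithm itself is distributed, the two groups run serially, one after the other.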
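
The `fitParam.<index>.<name>` naming convention can be pictured with a short Python sketch that splits a flat parameter map into one dict per group. This is only an illustration of the convention, not MLSQL's parser: the function group_fit_params is hypothetical, and treating the non-fitParam keys as shared by every group is an assumption.

```python
import re
from collections import defaultdict

FIT_PARAM = re.compile(r"^fitParam\.(\d+)\.(.+)$")


def group_fit_params(params):
    """Split flat `fitParam.<index>.<name>` keys into one dict per index."""
    groups = defaultdict(dict)
    shared = {}
    for key, value in params.items():
        match = FIT_PARAM.match(key)
        if match:
            groups[int(match.group(1))][match.group(2)] = value
        else:
            shared[key] = value  # assumption: applies to every group
    return [groups[i] for i in sorted(groups)], shared


params = {
    "fitParam.0.maxIter": "5",
    "fitParam.0.regParam": "0.01",
    "fitParam.1.maxIter": "1",
    "fitParam.1.regParam": "0.1",
    "evaluateTable": "test",
}
groups, shared = group_fit_params(params)
for index, group in enumerate(groups):
    print(index, group)  # one training run per group, in index order
```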

-- train sklearn model
train data as PythonAlg.`${modelPath}` 

-- specify the location of the training script 
where pythonScriptPath="${sklearnTrainPath}"

-- kafka params for log
and `kafkaParam.bootstrap.servers`="${kafkaDomain}"
and `kafkaParam.topic`="test"
and `kafkaParam.group_id`="g_test-2"
and `kafkaParam.userName`="pi-algo"
-- distribute training data, so the python training script can read 
and  enableDataLocal="true"
and  dataLocalFormat="json"

-- sklearn params
-- use SVC
and `fitParam.0.moduleName`="sklearn.svm"
and `fitParam.0.className`="SVC"
and `fitParam.0.featureCol`="features"
and `fitParam.0.labelCol`="label"
and `fitParam.0.class_weight`="balanced"
and `fitParam.0.verbose`="true"

and `fitParam.1.moduleName`="sklearn.naive_bayes"
and `fitParam.1.className`="GaussianNB"
and `fitParam.1.featureCol`="features"
and `fitParam.1.labelCol`="label"
and `fitParam.1.class_weight`="balanced"
and `fitParam.1.labelSize`="26"

-- python env
and `systemParam.pythonPath`="python"
and `systemParam.pythonParam`="-u"
and `systemParam.pythonVer`="2.7";

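The example above, by contrast, runs two algorithms, SVC and GaussianNB, in parallel. Because neither algorithm can run in a distributed fashion on its own, MLSQL lets you run the two side by side.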
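
The difference between the two scheduling modes can be sketched in plain Python. This is not how MLSQL schedules work internally; the train function is a hypothetical stand-in for one single-machine training job, and a thread pool is used purely to illustrate "independent jobs at the same time" versus "one group after another".

```python
from concurrent.futures import ThreadPoolExecutor


def train(group):
    """Stand-in for one single-machine training job (hypothetical)."""
    return f"{group['moduleName']}.{group['className']}"


groups = [
    {"moduleName": "sklearn.svm", "className": "SVC"},
    {"moduleName": "sklearn.naive_bayes", "className": "GaussianNB"},
]

# Distributed algorithm: each run already uses the whole cluster,
# so the parameter groups run one after another.
serial_results = [train(g) for g in groups]

# Single-machine algorithms: the groups are independent jobs,
# so they can run concurrently. map() keeps the input order.
with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    parallel_results = list(pool.map(train, groups))

print(parallel_results)
```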

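Summary

BigQuery ML is only one part of the Google BigQuery service, so comparing MLSQL against it alone is somewhat unfair. MLSQL merges the data platform and the algorithm platform into one: on it you can do ETL and streaming as well as machine learning, all with a single SQL syntax. MLSQL also provides a large number of practical "data-processing models" and SQL functions, which help greatly in both training and prediction. Preprocessing logic is reused across training and prediction with essentially no extra development, which enables end-to-end deployment and lowers costs for companies.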

