Hudi in Practice | How to Run Hudi DeltaStreamer on CDH 6.3.0

3 years ago

source link: https://mp.weixin.qq.com/s?__biz=MzU5OTQ1MDEzMA%3D%3D&%3Bmid=2247488398&%3Bidx=1&%3Bsn=e063ff251d670846b812823b5a23c31a

1. First, clone Hudi from https://github.com/apache/hudi.git into your local IDEA workspace, then build it with the following command:

mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true -Dhadoop.version=3.0.0

Note: Hudi currently builds against Hadoop 2.7.3, while the CDH 6.3.0 environment uses Hadoop 3.0.0, so the -Dhadoop.version=3.0.0 parameter must be added when packaging.

2. Configuration required for MR jobs that query Hudi Hive tables

Upload hudi-hadoop-mr-0.6.0.jar to /opt/cloudera/parcels/CDH-6.3.0/jars,
then create a symbolic link to it in /opt/cloudera/parcels/CDH-6.3.0/lib/hive/lib,
and run the action that installs the MR framework JARs.
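
The upload-and-link step can be sketched in shell as follows. This is a minimal sketch rather than the author's exact procedure: the helper name link_hudi_jar and the local jar location in the usage comment are assumptions, while the two target directories are the ones named above. Pass absolute directory paths so the symbolic link resolves correctly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a Hudi jar into the parcel's jars directory
# and symlink it into Hive's lib directory.
link_hudi_jar() {
  local jar="$1"       # path to the locally built jar
  local jars_dir="$2"  # e.g. /opt/cloudera/parcels/CDH-6.3.0/jars
  local hive_lib="$3"  # e.g. /opt/cloudera/parcels/CDH-6.3.0/lib/hive/lib
  local name
  name="$(basename "$jar")"

  [ -f "$jar" ] || { echo "jar not found: $jar" >&2; return 1; }
  cp "$jar" "$jars_dir/$name" || return 1
  # -s: symbolic link, -f: replace an existing link with the same name
  ln -sf "$jars_dir/$name" "$hive_lib/$name"
}

# Usage on a CDH host (the source path is an assumption):
# link_hudi_jar ./hudi-hadoop-mr-0.6.0.jar \
#   /opt/cloudera/parcels/CDH-6.3.0/jars \
#   /opt/cloudera/parcels/CDH-6.3.0/lib/hive/lib
```

Because parcels are unpacked per host, this likely has to be repeated on each host that runs Hive.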


Create a Hive auxiliary JARs directory, /data/hive/jars (name it to suit your needs), and configure it in the CDH management UI.


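Upload the following JARs to the auxiliary path:

hudi-hadoop-mr-bundle-0.6.0.jar (if the data is stored on Aliyun OSS, the following three JARs must also be placed in the Hive auxiliary path)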
aliyun-sdk-oss-3.8.1.jar
hadoop-aliyun-3.2.1.jar
jdom-1.1.jar
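
A quick way to confirm that the auxiliary directory is complete is sketched below. The helper name missing_aux_jars is an assumption introduced for illustration; the directory and jar names in the usage comment are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected jars are absent from the
# Hive auxiliary JARs directory. Returns 1 if any jar is missing.
missing_aux_jars() {
  local aux_dir="$1"
  shift
  local jar missing=0
  for jar in "$@"; do
    if [ ! -f "$aux_dir/$jar" ]; then
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the three OSS-related jars matter only when data lives on Aliyun OSS):
# missing_aux_jars /data/hive/jars \
#   hudi-hadoop-mr-bundle-0.6.0.jar aliyun-sdk-oss-3.8.1.jar \
#   hadoop-aliyun-3.2.1.jar jdom-1.1.jar
```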

3. As the hive user, run the following grant command:

GRANT all on uri 'oss://data-lake/xxxxx' to role xxxx;

Then run a Hudi DeltaStreamer job:

spark-submit --name xxxx \
  --driver-memory 2G \
  --num-executors 4 \
  --executor-memory 4G \
  --executor-cores 1 \
  --deploy-mode cluster \
  --conf spark.executor.userClassPathFirst=true \
  --jars hdfs://nameservice1/data_lake/jars/hive-jdbc-2.1.1.jar,hdfs://nameservice1/data_lake/jars/hive-service-2.1.1.jar,hdfs://nameservice1/data_lake/jars/jdom-1.1.jar,hdfs://nameservice1/data_lake/jars/hadoop-aliyun-3.2.1.jar,hdfs://nameservice1/data_lake/jars/aliyun-sdk-oss-3.8.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer hdfs://nameservice1/data_lake/jars/data_lake_1.jar \
  --op INSERT \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --target-table t3_ts_iov_event_push_detail \
  --table-type COPY_ON_WRITE \
  --source-ordering-field updateTime \
  --continuous \
  --source-limit 100000 \
  --target-base-path oss://data-lake/xxxxxx \
  --enable-hive-sync \
  --transformer-class org.apache.hudi.utilities.transform.AddStringDateColumnTransform \
  --props hdfs://nameservice1/data_lake/xxxxxx/kafka-source.properties
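
The article does not show the contents of kafka-source.properties. A minimal sketch of what such a file might contain is given below, written as a shell heredoc. Every value in angle brackets is a placeholder, and the selection of keys is an assumption based on the configuration keys that Hudi 0.6.0 reads for JsonKafkaSource, FilebasedSchemaProvider, and Hive sync; it is not the author's actual file.

```shell
#!/usr/bin/env bash
# Write a placeholder kafka-source.properties; every <...> value is hypothetical.
props_file="${PROPS_FILE:-kafka-source.properties}"

cat > "$props_file" <<'EOF'
# Record key and partition path columns of the target table
hoodie.datasource.write.recordkey.field=<record_key_field>
hoodie.datasource.write.partitionpath.field=<partition_field>

# Avro schema files read by FilebasedSchemaProvider
hoodie.deltastreamer.schemaprovider.source.schema.file=<path>/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=<path>/target.avsc

# Kafka source settings
hoodie.deltastreamer.source.kafka.topic=<topic>
bootstrap.servers=<broker1:9092,broker2:9092>
group.id=<consumer_group>
auto.offset.reset=earliest

# Hive sync settings (relevant because the job passes --enable-hive-sync)
hoodie.datasource.hive_sync.database=<hive_database>
hoodie.datasource.hive_sync.table=t3_ts_iov_event_push_detail
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://<hiveserver2_host>:10000
hoodie.datasource.hive_sync.partition_fields=<partition_field>
EOF

# Upload to the directory referenced by --props (path elided as in the article):
# hdfs dfs -put -f "$props_file" hdfs://nameservice1/data_lake/xxxxxx/
```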
