
Challenges Faced While Integrating Pyspark With HBase and the Solution


This article explains the challenges and troubleshooting steps involved in writing a Spark DataFrame into an HBase table using PySpark.

Refer to the PySpark example code below:

# Build a sample two-column DataFrame (sc and the SQL context are
# available in the pyspark shell).
df = sc.parallelize([('a', 'def'), ('b', 'abc')]).toDF(schema=['col0', 'col1'])

# SHC catalog: maps col0 to the HBase row key and col1 to column c2 in
# column family t1. The ''.join(...split()) idiom strips the whitespace,
# leaving compact JSON.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"smTable"},
    "rowkey":"c1",
    "columns":{
        "col0":{"cf":"rowkey", "col":"c1", "type":"string"},
        "col1":{"cf":"t1", "col":"c2", "type":"string"}
    }
}""".split())

# Write the DataFrame through the SHC data source.
df.write.options(catalog=catalog)\
    .format('org.apache.spark.sql.execution.datasources.hbase').save()
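Since the catalog must survive the whitespace-stripping idiom above, a quick sanity check (a minimal sketch using only the Python standard library) is to parse it back as JSON:

# Sanity check: the stripped catalog string must still be valid JSON.
import json
print(json.dumps(json.loads(catalog), indent=2))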

First Approach 

Using the default services and libraries provided by CDH (5.16), we encountered the error below while trying to save the Spark DataFrame into the HBase table:

File "/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.save.
: java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select.

Conclusion

As per the Cloudera support team, if you are using the Cloudera distribution of Apache Spark 1.6, there is no official way to write to HBase using PySpark.

https://stackoverflow.com/questions/46924171/error-while-writing-to-hbase-table-using-pyspark?noredirect=1

At this point, we decided to use Apache Spark 2.4 (the latest version at the time) and the Hortonworks connector for connecting Spark to HBase, since CDH does not provide a connector.

Second Approach 

I launched the pyspark shell using the command below and tried to save the DataFrame created with the PySpark code above...

pyspark --master local --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --files /etc/hbase/conf/hbase-site.xml

...and got the error below while saving the Spark DataFrame to the HBase table:

File "/home/demoadmin/spark-2.4.0-bin-hadoop2.6/python/lib/py4j-0.10.7

-src.zip/py4j/protocol.py", line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.

: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods  

Conclusion

The NoSuchMethodError is due to library conflicts: on comparing the json4s-jackson JARs, we found that json4s 3.4.1 added a new parameter to the parse() interface, so code compiled against one version fails at runtime against the other.
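One way to see which jar the driver JVM actually resolves the class from is to probe it through py4j from the PySpark shell (a hedged sketch; the exact output depends on your classpath):

# Hypothetical diagnostic: ask the driver JVM where the json4s JsonMethods
# object is loaded from, to spot a conflicting jar on the classpath.
cls = sc._jvm.java.lang.Class.forName("org.json4s.jackson.JsonMethods$")
print(cls.getProtectionDomain().getCodeSource().getLocation())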

Third Approach

Based on the prior experience, we used Spark 2.1.0 due to the compatibility issues, together with SHC connector 1.1.1-2.1-s_2.11 (built against Spark 2.1 and Scala 2.11), for writing the Spark DataFrame to HBase...

pyspark --master local --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --files /etc/hbase/conf/hbase-site.xml

df = sc.parallelize([('a', 'def'), ('b', 'abc')]).toDF(schema=['col0', 'col1'])
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"smTable"},
    "rowkey":"c1",
    "columns":{
        "col0":{"cf":"rowkey", "col":"c1", "type":"string"},
        "col1":{"cf":"t1", "col":"c2", "type":"string"}
    }
}""".split())

# newtable=5 asks SHC to create the target table if it does not already
# exist (the value sets the number of regions for the new table).
df.write.options(catalog=catalog, newtable=5)\
    .format('org.apache.spark.sql.execution.datasources.hbase').save()

...and it worked.
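To verify the write, the same catalog can be used to read the rows back through SHC (a minimal sketch, assuming the same shell session):

# Read the table back using the identical catalog definition.
df_read = spark.read.options(catalog=catalog)\
    .format('org.apache.spark.sql.execution.datasources.hbase').load()
df_read.show()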

So, if you are trying to save a Spark DataFrame to HBase using PySpark, my suggestion is to use the Hortonworks SHC connector with the versions below to avoid library conflicts.

  • Apache Spark 2.1.0

  • Hortonworks SHC connector 1.1.1-2.1-s_2.11

Happy coding!

