
Challenges Faced While Integrating Pyspark With HBase and the Solution


This article explains the challenges and troubleshooting steps involved in writing a Spark DataFrame into an HBase table using PySpark.

Refer to the PySpark example code below:

# Build a sample two-column DataFrame (sc and the SQL context are
# available in the pyspark shell).
df = sc.parallelize([('a', 'def'), ('b', 'abc')]).toDF(schema=['col0', 'col1'])

# SHC catalog: maps col0 to the HBase row key and col1 to column c2 in
# column family t1. The ''.join(...split()) idiom strips the whitespace,
# leaving compact JSON.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"smTable"},
    "rowkey":"c1",
    "columns":{
        "col0":{"cf":"rowkey", "col":"c1", "type":"string"},
        "col1":{"cf":"t1", "col":"c2", "type":"string"}
    }
}""".split())

# Write the DataFrame through the SHC data source.
df.write.options(catalog=catalog)\
    .format('org.apache.spark.sql.execution.datasources.hbase').save()
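Since the catalog must survive the whitespace-stripping idiom above, a quick sanity check (a minimal sketch using only the Python standard library) is to parse it back as JSON:

# Sanity check: the stripped catalog string must still be valid JSON.
import json
print(json.dumps(json.loads(catalog), indent=2))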

First Approach 

Using the default services and libraries provided by CDH (5.16), we encountered the error below while trying to save the Spark DataFrame into the HBase table:

File "/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o62.save.
: java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select.

Conclusion

As per the Cloudera support team, if you are using the Cloudera distribution of Apache Spark 1.6, there is no official way to write to HBase using PySpark.

https://stackoverflow.com/questions/46924171/error-while-writing-to-hbase-table-using-pyspark?noredirect=1

At this point, we decided to use Apache Spark 2.4 (the latest version at the time) and the Hortonworks connector for connecting Spark to HBase, since CDH does not provide a connector.

Second Approach 

I launched the pyspark shell using the command below and tried to save the DataFrame created with the PySpark code above...

pyspark --master local --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --files /etc/hbase/conf/hbase-site.xml

...and got the error below while saving the Spark DataFrame to the HBase table:

File "/home/demoadmin/spark-2.4.0-bin-hadoop2.6/python/lib/py4j-0.10.7

-src.zip/py4j/protocol.py", line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.

: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods  

Conclusion

The NoSuchMethodError is due to library conflicts: on comparing the json4s-jackson JARs, we found that json4s 3.4.1 added a new parameter to the parse() interface, so code compiled against one version fails at runtime against the other.
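One way to see which jar the driver JVM actually resolves the class from is to probe it through py4j from the PySpark shell (a hedged sketch; the exact output depends on your classpath):

# Hypothetical diagnostic: ask the driver JVM where the json4s JsonMethods
# object is loaded from, to spot a conflicting jar on the classpath.
cls = sc._jvm.java.lang.Class.forName("org.json4s.jackson.JsonMethods$")
print(cls.getProtectionDomain().getCodeSource().getLocation())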

Third Approach

Based on the prior experience, we used Spark 2.1.0 due to the compatibility issues, together with SHC connector 1.1.1-2.1-s_2.11 (built against Spark 2.1 and Scala 2.11), for writing the Spark DataFrame to HBase...

pyspark --master local --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
    --repositories http://repo.hortonworks.com/content/groups/public/ \
    --files /etc/hbase/conf/hbase-site.xml

df = sc.parallelize([('a', 'def'), ('b', 'abc')]).toDF(schema=['col0', 'col1'])
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"smTable"},
    "rowkey":"c1",
    "columns":{
        "col0":{"cf":"rowkey", "col":"c1", "type":"string"},
        "col1":{"cf":"t1", "col":"c2", "type":"string"}
    }
}""".split())

# newtable=5 asks SHC to create the target table if it does not already
# exist (the value sets the number of regions for the new table).
df.write.options(catalog=catalog, newtable=5)\
    .format('org.apache.spark.sql.execution.datasources.hbase').save()

...and it worked.
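To verify the write, the same catalog can be used to read the rows back through SHC (a minimal sketch, assuming the same shell session):

# Read the table back using the identical catalog definition.
df_read = spark.read.options(catalog=catalog)\
    .format('org.apache.spark.sql.execution.datasources.hbase').load()
df_read.show()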

So, if you are trying to save a Spark DataFrame to HBase using PySpark, my suggestion is to use the Hortonworks SHC connector with the versions below to avoid library conflicts.

  • Apache Spark 2.1.0

  • Hortonworks SHC connector 1.1.1-2.1-s_2.11

Happy coding!

