
Using HANA Cloud, data lake Files with a Jupyter Notebook and Pyspark

November 9, 2022 | 7 minute read


  To best follow this post and try things out yourself, you should:

  1. Have some basic knowledge of the Python programming language (PySpark)
  2. Have a data lake instance provisioned and configured, including a configured HANA Data Lake File Container.
  3. Have the Instance ID for your Data Lake Instance.
  4. Have access to a Jupyter notebook.

Overview:

Data Lake Files includes a driver which enables access to the file system directly from Spark. It implements the Hadoop FileSystem interface so that platforms and applications in the Hadoop ecosystem can use data lake Files for data storage. In this blog, we will see how to configure and establish a connection with HDLFS, and how to write, read and delete a file in the Files store.

Step 1:  Download the Data Lake Files Spark Driver from:

  • The data lake client can be installed using the steps outlined in SAP HANA Cloud, Data Lake Client Interfaces. Once the data lake client is installed, the HDLFS Spark driver is in the HDLFS folder.
image1-5.png

Step 2: Set up the Connection From Jupyter to HANA Cloud, data lake Files

As part of configuring access to Data Lake Files, you will create a client certificate and key. To communicate with data lake Files from your Jupyter notebook, the client.crt and client.key must be provided in a keystore package, and this package needs to be uploaded onto your Jupyter notebook instance.

Here is an example of how you can create a .pkcs12 package from your client certificate and key using OpenSSL:

openssl pkcs12 \
  -export \
  -inkey </path/to/client-key-file> \
  -in </path/to/client-certificate-file> \
  -out </path/to/client-keystore.p12> \
  -password pass:<password-p12-file>

This is how it will look in the Command prompt:

image2-4.png

Once this is done, the .pkcs12 file will be created in the given path. It will look something like the screenshot below. Keep a note of the keystore password, as you will need it later.

image3-4.png

Now, we upload the .pkcs12 file and the Spark driver from the HDLFS directory to the Jupyter notebook instance.

Click on the upload arrow, and then upload the two files. They will be uploaded to the notebook home directory.

image4-4.png
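If you want to confirm the upload from within the notebook itself, a quick check like the one below can help. This is a minimal sketch; the path /home/jovyan/work is an assumption based on the default Jupyter Docker images (and matches the paths used later in the Appendix), so adjust it if your home directory differs.

import os

# Quick check that the keystore and driver jar landed in the notebook home directory.
# /home/jovyan/work is an assumption; change it to your notebook's home if needed.
notebook_home = "/home/jovyan/work"
print([f for f in os.listdir(notebook_home) if f.endswith((".p12", ".jar"))])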

Step 3: Understand the Code to Configure and Set Up a Connection with the HANA Data Lake Files Store

The entire code is included in the Appendix at the bottom of the post so that you can copy it.

The code block below shows how to configure and set up a connection with the HANA Data Lake Files store. You can paste it into code cells in your notebook to execute it.

image5-6.png
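For reference, here is the setup code shown in the screenshot, reproduced from the full listing in the Appendix. The jar name and keystore path reflect the files uploaded in Step 2; the endpoint and instance ID placeholders must be replaced with your own values.

import os
# Include the HDLFS Spark driver in the PySpark shell (adjust the jar path/version to your upload).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/work/sap-hdlfs-1.1.9.jar pyspark-shell'

import pyspark
from pyspark.sql.session import SparkSession
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

keystoreLocation = "/home/jovyan/work/mycert.p12"           # location of the .p12 keystore in the notebook home
keystorePwd = "Password1"                                   # password entered while creating the keystore
hdlfsEndpoint = "<your HANA Data Lake Files endpoint>"      # REST API endpoint of the data lake instance
filecontainer = "<your HANA Data Lake Files instance ID>"   # the instance ID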

The following code block shows how to set up the SSL configuration, the operations configuration, the driver configuration, and the format of the URI.

To set a particular configuration property, we call sc._jsc.hadoopConfiguration().set(), which sets Spark's global Hadoop configuration. _jsc is the Java SparkContext, a proxy into the SparkContext in the JVM.

# ----- ssl configuration -----
# Defines the location of the client keystore, the password of the client keystore and the type of the keystore file.
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.location", keystoreLocation)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.password", keystorePwd)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.type", "PKCS12")

# ----- operations configuration -----
# Configures the operation parameters; the CREATE mode is set to DEFAULT, which supports reading, writing and deleting files.
sc._jsc.hadoopConfiguration().set("fs.hdlfs.operation.create.mode", "DEFAULT")

# ----- driver configuration -----
# An implementation of org.apache.hadoop.fs.FileSystem targeting SAP HANA Data Lake Files.
# To allow Spark to load the driver, specify the configuration parameters that make the system aware of the
# new hdlfs:// scheme for referring to files in data lake Files.
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.hdlfs.impl", "com.sap.hana.datalake.files.Hdlfs")
sc._jsc.hadoopConfiguration().set("fs.hdlfs.impl", "com.sap.hana.datalake.files.HdlfsFileSystem")
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# ----- uri configuration -----
# The URI is in the format hdlfs://<filecontainer>.<endpointSuffix>/path/to/file.
# Once the driver is known to Spark, files can be referred to by their URI as hdlfs://<files-rest-api-endpoint>/path/to/file.
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "hdlfs://" + hdlfsEndpoint)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.filecontainer", filecontainer)
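As an optional sanity check (not part of the original walkthrough), you can read properties back from the Hadoop configuration to confirm the settings took effect:

# Read back a couple of properties from Spark's global Hadoop configuration to verify the setup.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))          # e.g. hdlfs://<your endpoint>
print(sc._jsc.hadoopConfiguration().get("fs.hdlfs.filecontainer"))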

Documentation link: Data Lake Files Driver Configurations for Apache Spark

The code block below uses the Hadoop configuration that we set up above to connect to HDLFS and list the files (if any) in the file container.

hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()
path = hadoop.fs.Path('/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]

image6-3.png
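If you also want to see sizes and distinguish files from directories, a small variation on the same FileSystem API can help. This is a sketch that reuses the hadoop, fs and conf handles created above:

# List each entry in the root of the file container with its type and size in bytes.
for status in fs.get(conf).listStatus(hadoop.fs.Path('/')):
    kind = "dir" if status.isDirectory() else "file"
    print(kind, str(status.getPath()), status.getLen())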

The additional code blocks below show how to write, read, and delete a given file in the directory.

Step 4: How to Read, Write and Delete a File in the Data Lake File Container

Let's look at the code block that shows how to read a file with PySpark from the directory path that we mentioned.

Read a File:

We can read all CSV files from a directory into a DataFrame just by passing the directory as a path to the csv() method. The delimiter option is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character such as pipe (|), tab (\t) or space using this option.

df = spark.read.options(delimiter='|').csv("/Ordersdata.csv")

image7-3.png
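If your CSV has a header row, you can also ask Spark to use it for column names and to infer column types. A minimal sketch, assuming the same pipe-delimited Ordersdata.csv:

# Read the pipe-delimited CSV, using the first row as column names and letting Spark infer types.
df = (spark.read
           .options(delimiter='|', header=True, inferSchema=True)
           .csv("/Ordersdata.csv"))
df.printSchema()
df.show(5)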

Write/Create a file:

The code block below shows how to write a file with PySpark inside the directory path that we mentioned.

df.write.csv("TPCH_SF100/ORDERS/File.csv")

image8-4.png
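When re-running the example, the target path may already exist, in which case the write fails by default. One common variation (not in the original walkthrough) is to overwrite the target and include a header row in the output:

# Overwrite the target path if it already exists and include a header row in the output files.
(df.write
    .mode("overwrite")
    .option("header", True)
    .csv("TPCH_SF100/ORDERS/File.csv"))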

To check whether the file was created in the file container, you can switch over to the SAP HANA Database Explorer (DBX). Refer to the screenshots below.

image9-1.png
image10-2.png

Delete a File:

To delete a file or directory from HDLFS, we follow similar steps to the read and write operations.
To delete a file we use fs.delete(path, recursive). The second argument controls recursive deletion: with True, a directory and its contents are deleted recursively; with False, deleting a non-empty directory fails. The method returns True if the deletion succeeded.

# To delete a file from the file container
path = hadoop.fs.Path('/File.csv')
fs.get(conf).delete(path, True)
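Since delete() simply returns False when nothing was deleted, it can be useful to check for the path first. A small sketch using the same FileSystem handle and the example path from above:

# Check that the path exists before deleting; delete() returns True when the deletion succeeds.
target = hadoop.fs.Path('/File.csv')
if fs.get(conf).exists(target):
    print("Deleted:", fs.get(conf).delete(target, True))  # True = delete directories recursively
else:
    print("Path not found:", str(target))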

Before using the delete function, the Ordersdata.csv file is present in the file container.

image11-1.png

After using the delete function, the Ordersdata.csv file is deleted from the file container.

image12-1.png

Appendix:

The entire code:

import os
#include hdlfs spark driver in pyspark shell
os.environ['PYSPARK_SUBMIT_ARGS'] =  '--jars /home/jovyan/work/sap-hdlfs-1.1.9.jar pyspark-shell'

import pyspark 
from pyspark.sql.session import SparkSession
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)


keystoreLocation = "/home/jovyan/work/mycert.p12"           # ----- the location of the keystore .p12 file in the home directory
keystorePwd = "Password1"                                   # ----- the password that you entered while creating the keystore file
hdlfsEndpoint = "<your HANA Data Lake Files endpoint>"      # ----- the REST API endpoint of the data lake instance
filecontainer = "<your HANA Data Lake Files instance ID>"   # ----- this is the instance ID


# ----- ssl configuration ---
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.location", keystoreLocation)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.password", keystorePwd)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.type", "PKCS12")

# ----- operations configuration ----
sc._jsc.hadoopConfiguration().set("fs.hdlfs.operation.create.mode", "DEFAULT")

# ----- driver configuration ----
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.hdlfs.impl", "com.sap.hana.datalake.files.Hdlfs")
sc._jsc.hadoopConfiguration().set("fs.hdlfs.impl", "com.sap.hana.datalake.files.HdlfsFileSystem")
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")

# uri is in format hdlfs://<filecontainer>.<endpointSuffix>/path/to/file
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "hdlfs://" + hdlfsEndpoint)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.filecontainer", filecontainer)


# -- Read the files from the File Container
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()
path = hadoop.fs.Path('/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]

# -- Read a File
df = spark.read.options(delimiter='|').csv("/Ordersdata.csv")

# -- Write a File
df.write.csv("TPCH_SF100/ORDERS/File.csv")
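
# -- Delete a File (from Step 4; reuses the hadoop, fs and conf handles created above)
path = hadoop.fs.Path('/File.csv')
fs.get(conf).delete(path, True)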

Conclusion:

That's how one can use a Jupyter notebook and PySpark to configure and establish a connection with HDLFS, and then write, read and delete a file in the Files store.

Thanks for reading!

I would love to read any suggestions or feedback on the blog post. Please give it a like if you found the information useful, and feel free to follow me for similar content.

I request everyone reading the blog to also go through the following links for any further assistance.

SAP HANA Cloud, data lake: post and answer questions here,

and read other posts on the topic you wish to discover here

