
Using HANA Cloud, data lake Files with a Jupyter Notebook and Pyspark

November 9, 2022 | 7 minute read


  To best follow this post and try things out yourself, you should:

  1. Have some basic knowledge of the Python programming language (PySpark)
  2. Have a data lake instance provisioned and configured, including a configured HANA Data Lake File Container.
  3. Have the Instance ID for your Data Lake Instance.
  4. Have access to a Jupyter notebook.

Overview:

Data Lake Files includes a driver which enables access to the file system directly from Spark. It implements the Hadoop FileSystem interface so that platforms and applications in the Hadoop ecosystem can use data lake Files for data storage. In this blog, we will see how to configure and establish a connection with HDLFS, and how to write, read and delete a file in the Files store.

Step 1:  Download the Data Lake Files Spark Driver from:

  • The data lake client can be installed using the steps outlined in SAP HANA Cloud, Data Lake Client Interfaces. Once the data lake client is installed, the HDLFS Spark driver is in the HDLFS folder.
image1-5.png

Step 2: Set up the Connection From Jupyter to HANA Cloud, data lake Files

As part of configuring access to Data Lake Files, you will create a client certificate and key. To communicate with data lake Files from your Jupyter notebook, the client.crt and client.key must be provided in a keystore package, and this package needs to be uploaded onto your Jupyter notebook instance.

Here is an example of how you can create a .pkcs12 package from your client certificate and key using OpenSSL:

openssl pkcs12 \
  -export \
  -inkey </path/to/client-key-file> \
  -in </path/to/client-certificate-file> \
  -out </path/to/client-keystore.p12> \
  -password pass:<password-p12-file>

This is how it will look in the Command prompt:

image2-4.png

Once this is done, the .pkcs12 file will be created in the given path. It will look something like the screenshot below. Keep a note of the keystore password, as you will need it later.

image3-4.png

Now, we upload the .pkcs12 file and the Spark driver from the HDLFS directory to the Jupyter notebook instance.

Click on the upload arrow, and then upload the two files. They will be uploaded to the notebook home directory.

image4-4.png
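If you want to confirm the upload from within the notebook itself, a quick check like the one below can help. This is a minimal sketch; the path /home/jovyan/work is an assumption based on the default Jupyter Docker images (and matches the paths used later in the Appendix), so adjust it if your home directory differs.

import os

# Quick check that the keystore and driver jar landed in the notebook home directory.
# /home/jovyan/work is an assumption; change it to your notebook's home if needed.
notebook_home = "/home/jovyan/work"
print([f for f in os.listdir(notebook_home) if f.endswith((".p12", ".jar"))])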

Step 3: Understand the Code to Configure and Set Up a Connection with the HANA Data Lake Files Store

The entire code is included in the Appendix at the bottom of the post so that you can copy it.

The code block below shows how to configure and set up a connection with the HANA Data Lake Files store. You can paste it into code cells in your notebook to execute it.

image5-6.png
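For reference, here is the setup code shown in the screenshot, reproduced from the full listing in the Appendix. The jar name and keystore path reflect the files uploaded in Step 2; the endpoint and instance ID placeholders must be replaced with your own values.

import os
# Include the HDLFS Spark driver in the PySpark shell (adjust the jar path/version to your upload).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/work/sap-hdlfs-1.1.9.jar pyspark-shell'

import pyspark
from pyspark.sql.session import SparkSession
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

keystoreLocation = "/home/jovyan/work/mycert.p12"           # location of the .p12 keystore in the notebook home
keystorePwd = "Password1"                                   # password entered while creating the keystore
hdlfsEndpoint = "<your HANA Data Lake Files endpoint>"      # REST API endpoint of the data lake instance
filecontainer = "<your HANA Data Lake Files instance ID>"   # the instance ID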

The following code block shows how to set up the SSL configuration, the operations configuration, the driver configuration, and the format of the URI.

To set a particular configuration property, we call sc._jsc.hadoopConfiguration().set(), which sets Spark's global Hadoop configuration. _jsc is the Java SparkContext, a proxy into the SparkContext in the JVM.

# ----- ssl configuration -----
# Defines the location of the client keystore, the password of the client keystore and the type of the keystore file.
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.location", keystoreLocation)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.password", keystorePwd)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.type", "PKCS12")

# ----- operations configuration -----
# Configures the operation parameters; the CREATE mode is set to DEFAULT, which supports reading, writing and deleting files.
sc._jsc.hadoopConfiguration().set("fs.hdlfs.operation.create.mode", "DEFAULT")

# ----- driver configuration -----
# An implementation of org.apache.hadoop.fs.FileSystem targeting SAP HANA Data Lake Files.
# To allow Spark to load the driver, specify the configuration parameters that make the system aware of the
# new hdlfs:// scheme for referring to files in data lake Files.
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.hdlfs.impl", "com.sap.hana.datalake.files.Hdlfs")
sc._jsc.hadoopConfiguration().set("fs.hdlfs.impl", "com.sap.hana.datalake.files.HdlfsFileSystem")
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# ----- uri configuration -----
# The URI is in the format hdlfs://<filecontainer>.<endpointSuffix>/path/to/file.
# Once the driver is known to Spark, files can be referred to by their URI as hdlfs://<files-rest-api-endpoint>/path/to/file.
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "hdlfs://" + hdlfsEndpoint)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.filecontainer", filecontainer)
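As an optional sanity check (not part of the original walkthrough), you can read properties back from the Hadoop configuration to confirm the settings took effect:

# Read back a couple of properties from Spark's global Hadoop configuration to verify the setup.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))          # e.g. hdlfs://<your endpoint>
print(sc._jsc.hadoopConfiguration().get("fs.hdlfs.filecontainer"))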

Documentation link: Data Lake Files Driver Configurations for Apache Spark

The code block below uses the Hadoop configuration that we set up above to connect to HDLFS and list the files (if any) in the file container.

hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()
path = hadoop.fs.Path('/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]

image6-3.png
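If you also want to see sizes and distinguish files from directories, a small variation on the same FileSystem API can help. This is a sketch that reuses the hadoop, fs and conf handles created above:

# List each entry in the root of the file container with its type and size in bytes.
for status in fs.get(conf).listStatus(hadoop.fs.Path('/')):
    kind = "dir" if status.isDirectory() else "file"
    print(kind, str(status.getPath()), status.getLen())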

The additional code blocks below show how to write, read, and delete a given file in the directory.

Step 4: How to Read, Write and Delete a File in the Data Lake File Container

Let's look at the code block that shows how to read a file with PySpark from the directory path that we mentioned.

Read a File:

We can read all CSV files from a directory into a DataFrame just by passing the directory as a path to the csv() method. The delimiter option is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character such as pipe (|), tab (\t) or space using this option.

df = spark.read.options(delimiter='|').csv("/Ordersdata.csv")

image7-3.png
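If your CSV has a header row, you can also ask Spark to use it for column names and to infer column types. A minimal sketch, assuming the same pipe-delimited Ordersdata.csv:

# Read the pipe-delimited CSV, using the first row as column names and letting Spark infer types.
df = (spark.read
           .options(delimiter='|', header=True, inferSchema=True)
           .csv("/Ordersdata.csv"))
df.printSchema()
df.show(5)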

Write/Create a file:

The code block below shows how to write a file with PySpark inside the directory path that we mentioned.

df.write.csv("TPCH_SF100/ORDERS/File.csv")

image8-4.png
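When re-running the example, the target path may already exist, in which case the write fails by default. One common variation (not in the original walkthrough) is to overwrite the target and include a header row in the output:

# Overwrite the target path if it already exists and include a header row in the output files.
(df.write
    .mode("overwrite")
    .option("header", True)
    .csv("TPCH_SF100/ORDERS/File.csv"))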

To check whether the file was created in the file container, you can switch over to the SAP HANA Database Explorer (DBX). Refer to the screenshots below.

image9-1.png
image10-2.png

Delete a File:

To delete a file or directory from HDLFS, we follow similar steps to the read and write operations.
To delete a file we use fs.delete(path, recursive). The second argument controls recursive deletion: with True, a directory and its contents are deleted recursively; with False, deleting a non-empty directory fails. The method returns True if the deletion succeeded.

# To delete a file from the file container
path = hadoop.fs.Path('/File.csv')
fs.get(conf).delete(path, True)
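Since delete() simply returns False when nothing was deleted, it can be useful to check for the path first. A small sketch using the same FileSystem handle and the example path from above:

# Check that the path exists before deleting; delete() returns True when the deletion succeeds.
target = hadoop.fs.Path('/File.csv')
if fs.get(conf).exists(target):
    print("Deleted:", fs.get(conf).delete(target, True))  # True = delete directories recursively
else:
    print("Path not found:", str(target))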

Before using the delete function, the Ordersdata.csv file is present in the file container.

image11-1.png

After using the delete function, the Ordersdata.csv file is deleted from the file container.

image12-1.png

Appendix:

The entire code:

import os
#include hdlfs spark driver in pyspark shell
os.environ['PYSPARK_SUBMIT_ARGS'] =  '--jars /home/jovyan/work/sap-hdlfs-1.1.9.jar pyspark-shell'

import pyspark 
from pyspark.sql.session import SparkSession
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)


keystoreLocation = "/home/jovyan/work/mycert.p12"           # ----- the location of the keystore .p12 file in the home directory
keystorePwd = "Password1"                                   # ----- the password that you entered while creating the keystore file
hdlfsEndpoint = "<your HANA Data Lake Files endpoint>"      # ----- the REST API endpoint of the data lake instance
filecontainer = "<your HANA Data Lake Files instance ID>"   # ----- this is the instance ID


# ----- ssl configuration ---
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.location", keystoreLocation)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.password", keystorePwd)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.ssl.keystore.type", "PKCS12")

# ----- operations configuration ----
sc._jsc.hadoopConfiguration().set("fs.hdlfs.operation.create.mode", "DEFAULT")

# ----- driver configuration ----
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.hdlfs.impl", "com.sap.hana.datalake.files.Hdlfs")
sc._jsc.hadoopConfiguration().set("fs.hdlfs.impl", "com.sap.hana.datalake.files.HdlfsFileSystem")
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")

# uri is in format hdlfs://<filecontainer>.<endpointSuffix>/path/to/file
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "hdlfs://" + hdlfsEndpoint)
sc._jsc.hadoopConfiguration().set("fs.hdlfs.filecontainer", filecontainer)


# -- Read the files from the File Container
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()
path = hadoop.fs.Path('/')
[str(f.getPath()) for f in fs.get(conf).listStatus(path)]

# -- Read a File
df = spark.read.options(delimiter='|').csv("/Ordersdata.csv")

# -- Write a File
df.write.csv("TPCH_SF100/ORDERS/File.csv")
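
# -- Delete a File (from Step 4; reuses the hadoop, fs and conf handles created above)
path = hadoop.fs.Path('/File.csv')
fs.get(conf).delete(path, True)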

Conclusion:

That's how one can use a Jupyter notebook and PySpark to configure and establish a connection with HDLFS, and then write, read and delete a file in the Files store.

Thanks for reading!

I would love to read any suggestions or feedback on the blog post. Please give it a like if you found the information useful, and feel free to follow me for similar content.

I request everyone reading the blog to also go through the following links for any further assistance.

SAP HANA Cloud, data lake: post and answer questions here,

and read other posts on the topic you wish to discover here

