23

Python and Oracle Cloud: loading data

 5 years ago
source link: https://www.tuicool.com/articles/hit/jma2YbU
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Using OCI Python SDK and Autonomous DWH API you can easily load data for your Data Science and ML work

MJF7z2z.jpg!webRV7R7vJ.jpg!web

Introduction

In a previous article , I have explored how we can use Python and popular Open Source frameworks, like Pandas and scikit-learn, to analyze data stored with Oracle Autonomous Data Warehouse (ADWC) . In this shorter story, I want to show you how you can easily load even big data files using OCI Python SDK .

Automation “on steroids”

One of the big features of the Cloud is Automation. In the Cloud World, it is easier to dynamically provision resources (like VM, or even Databases) and manage them using an API.

Oracle Cloud Infrastructure (OCI)has many tools, within which you can choose your preferred, to automate a large set of tasks (I would say: everything I need):

  • REST API
  • OCI CLI
  • SDK

OCI Command Line Interface (CLI) is really good. But, if you want to develop your highly customized tools, perfectly tailored on your shoulders, I would recommend using SDK .

As you can see in the documentation , we have:

  • Python SDK
  • Java SDK
  • Ruby SDK
  • Go SDK

Since we’re talking about Data Science and ML Engineering, I would without any doubt choose Python .

With OCI Python SDK you have a complete and easy-to-use set of classes, well-documented, enabling you to build Python scripts to automate a large part of the pipeline for your daily work with data. And, moreover, you can even use it inside a Jupyter Notebook (or Jupyter Lab).

So, some days ago, I started with the evening task to build, for one of my customers, some examples to show how to load data inside Oracle ADWC, starting from flat files. Then, as a nice side-effect, I decided to explore the subject a little more and created a script to start/stop VM and even start/stop my ADWC instance.

You can find the set of examples inside my G ithub repo

In this story, I want to explain, with more details, how to load data, starting from flat files.

Installation? Easy!

Well, obviously there is the documentation , but, if you’re using PyPi, installation is only one line of code:

pip install oci

It works also with Anaconda.

If you want to be one of the first to try the new features added, simply check out the OCI Python Github repository .

Configuration

Then, you need to do some configuration tasks, needed to ensure secure communication between your working environment and your OCI Cloud. I’ll spend some time to clarify here since I don’t want you to lose precious time on this part.

First, you need to create an API signing key-pair , in PEM format. You must keep the private safe on your environment (the computer running the Python program) and store the public key in your Cloud profile. When you store it, you can read from Cloud UI the fingerprint.

Some more details can be found here:

Then, with every type of clients (ObjectStorageClient, ComputeClient, ….) you need to supply a set of (key, value) pair, packed as a Dictionary, in order to specify:

  • your OCI userid
  • a key (API signing key) used to secure communication
  • the fingerprint of this key
  • your Cloud tenant identifier
  • OCI Region

This is an example of what you need to prepare (anonymized):

config = {
“user”: “ocid1.user.oc1..XXXXX”,
“key_file”: “/Users/lsaetta/Progetti/xxxx/oci_api_key.pem”,
“fingerprint”: “YYYYY”,
“tenancy”: “ocid1.tenancy.oc1..ZZZZZZ”,
“region”: “eu-frankfurt-1”
}

For user and your Cloud tenant, you need to use not the name, but the identifier OCID, that you can copy from the OCI Cloud Console.

Loading files in the Cloud

At this point, you’re ready to load a set of files as Objects in the ObjectStore.

This is the code to load a single file, I’ll comment sooner:

As you can see, you need:

  • to create a client, in this case an ObjectStorageClient, and pass to the constructor the configuration Dictionary
  • call the method put_object with bucket_name, file_name, file_path as parameters

That’s all.

In the Github repository, you’ll find an enhanced version, handling a set of files.

Store credentials inside the DB

Now, it is time to copy data from ObjectStore to ADWC tables.

First of all: we need to create a set of credentials and store them inside the DB, in order to enable PL/SQL code to access the ObjectStorage Bucket.

Go to the OCI UI and create, inside your profile (top right icon and UserSetting), an AuthToken:

bEBbequ.jpg!webFrUJfiv.jpg!web
Authentication Token

Copy from the UI the Authentication Token. Remember to do that right after created, otherwise, you will not able to re-read it.

Second, store the Cloud Credentials inside the DB using this code; This step needs to be done only once:

import cx_Oracle
connection = cx_Oracle.connect(USER, PWD, 'adwc5_medium')
# need to create a cursor
cur = connection.cursor()
statement = """
BEGIN
DBMS_CLOUD.CREATE_CREDENTIAL(
credential_name => 'DEF_CRED_NAME',
username => '[email protected]',
password => '<AUTH Token>'
);
END;
"""
cur.execute(statement)

Take care: if your user is a federated one, the name is the fully qualified name. It has a prefix like:

oracleidentitycloudservice/

and you must specify, in the CREATE_CREDENTIAL call, the fully qualified name.

Loading data in Oracle ADWC

Since we have now the flat files in the ObjectStore, we can employ the PL/SQL package DBMS_CLOUD to load the data inside ADWC tables.

Ok, it is PL/SQL code, but not very complicated, and you can easily cut&paste. We’ll wrap this code inside Python code.

We’re imaging the simpler scenario: one Oracle table for flat file and the format of the file is the same as the table (one-to-one correspondence between file and table fields)

We have to create the table inside the DB. For example:

CREATE TABLE CHANNELS
   (channel_id char(1),
    channel_desc varchar2(20),
    channel_class varchar2(20)
   );
/

Then we can load the data contained in one file in this table, with this code:

import cx_Oracle
connection = cx_Oracle.connect(USER, PWD, 'adwc5_medium')
# need to create a cursor
cur = connection.cursor()
statement = """
BEGIN
DBMS_CLOUD.COPY_DATA(
table_name =>'CHANNELS1',
credential_name =>'DEF_CRED_NAME',
file_uri_list =>'https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/tenant-name/BUCKET_NAME/channels1.txt',
format => json_object('delimiter' value ',')
);
END;
"""
cur.execute(statement)

One detail: to reference the Object you need to use the SWIFT-type URL. What you need to change in the above example is:

  • the Region Name (here it is eu-frankurt-1)
  • the name of your tenant
  • the name of the bucket in Object Store
  • the name of the Object (here: channels1.txt)

As we can see, the DBMS_CLOUD package has all the required API to:

  • store credentials inside the DB
  • COPY from ObjectStore to DB

It provides also some other nice features. For example, you can use Apache Parquet format for the files, instead of CSV.

Want to check?

If you want to check that data has really been loaded, you can use this snippet of code:

import pandas as pd
import cx_Oracle
connection = cx_Oracle.connect(USER, PWD, 'adwc5_medium')
channel1 = pd.read_sql("select * from CHANNELS1", 
                   con=connection)
# if it is a Notebook
channel1.head()

External table

Another option for Data Loading is to create in the DB an External Table , referring to the file. A soon as you have the external table working, you can use DML instructions to load data inside the DB. For example, you can use CTAS: CREATE TABLE AS SELECT.

The procedure to create the external table is:

DBMS_CLOUD.CREATE_EXTERNAL_TABLE

more details here .

Conclusion

As a Data Scientist, you need to work with data. Sometimes you’ll need to load by yourself. Sometimes, you’ll have a DBA to load for you.

In this article, I have shown how to use OCI Python SDK to:

  • load files in the ObjectStore
  • copy data from ObjectStore to ADWC tables

In addition, I have given a close look at the power of OCI Python SDK. With it, you can create your highly customized set of tools to automate your daily work. You can even create your ADWC instance or scale it up&down (adding cores and storage space).

In a future article, I’ll return to Machine Learning. It is almost time to explore OML4Py .

One more thing

If you want to find more examples using OCI Python SDK, you can have a look here


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK