source link: https://github.com/IBM/data-access-and-governance-using-cpd-on-aws

Data Access and Governance using CPD on AWS

In this Code Pattern, we demonstrate how to create a hybrid cloud data architecture using IBM Cloud Pak for Data (CP4D) and AWS together.

Data fabric is a highly scalable, distributed data architecture comprising both shared data assets and streamlined data integration and governance capabilities that can be used to tackle modern data challenges. A typical data fabric solution comprises components such as a data catalog, data integration, data governance, and data visualization.

In this Data Access & Governance Code Pattern, you will solve challenges faced by different personas in data analytics.

  • Data scientists spend up to 80% of their time discovering, curating, and cleansing data. How can they be provided with quality data for analytics?
  • Data engineers face many challenges when integrating data from multiple data sources. How can they quickly and efficiently collect and integrate data?
  • Data stewards deal with data privacy and protection challenges. How can they ensure that data is governed and that no sensitive information is shared with data consumers?

These gaps can now be addressed by a governed data fabric architecture using IBM Cloud Pak for Data.

After completing this Code Pattern, you will understand how to:

  • Create connections between external data sources and IBM Cloud Pak for Data.
  • Ingest data from multiple data sources.
  • Clean, filter, or reshape data.
  • Query data from multiple data sources without copying or moving the data.
  • Create a data integration pipeline to transform and integrate data from heterogeneous data sources.
  • Protect sensitive data (such as PII) before it is shared with data consumers.
  • Schedule a job to run the data integration pipeline periodically.
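To illustrate the PII-protection objective, masking sensitive fields before they reach data consumers can be sketched in plain Python. The masking rules below are illustrative only, not the rules Watson Knowledge Catalog actually applies, and the record fields are made up:

```python
import re

def mask_email(value: str) -> str:
    """Redact the local part of an email, keeping the domain for analytics."""
    return re.sub(r"^[^@]+", "*****", value)

def mask_all_but_last4(value: str) -> str:
    """Keep only the last four characters (e.g. for account numbers)."""
    return "*" * max(len(value) - 4, 0) + value[-4:]

# A hypothetical customer record with PII columns.
row = {"name": "Jane Doe", "email": "jane.doe@example.com", "account": "123456789012"}
masked = {
    "name": row["name"],
    "email": mask_email(row["email"]),
    "account": mask_all_but_last4(row["account"]),
}
print(masked["email"])    # *****@example.com
print(masked["account"])  # ********9012
```

In CP4D this kind of rule is configured declaratively as a data protection rule rather than written as code; the sketch only shows the effect consumers would see.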


  1. Create connections between external data sources (e.g. Amazon S3 and Amazon Aurora PostgreSQL) and IBM Cloud Pak for Data.
  2. Use IBM Data Virtualization to query data from multiple data sources without creating data replicas.
  3. Use IBM DataStage to create an ETL pipeline.
  4. Use IBM Data Refinery flows to clean and filter the data.
  5. Use IBM Watson Knowledge Catalog to profile and govern the data.
  6. Supply the data to an AI-based predictive system, such as Amazon SageMaker or a Jupyter Notebook, to create machine learning models.
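The clean-and-filter work in step 4 is point-and-click in Data Refinery, but conceptually it reduces to row filters and column transforms. A minimal sketch in plain Python, with invented column names standing in for whatever the source table actually contains:

```python
# Hypothetical raw rows as they might arrive from a source table.
raw_rows = [
    {"region": "Flanders ", "cases": "120"},
    {"region": "WALLONIA", "cases": "95"},
    {"region": "Flanders ", "cases": None},   # missing value -> dropped
]

def refine(rows):
    """Mimic two Data Refinery operations: remove rows with missing
    values, then normalize a text column and cast a numeric one."""
    cleaned = [r for r in rows if r["cases"] is not None]          # filter rows
    return [
        {"region": r["region"].strip().title(), "cases": int(r["cases"])}
        for r in cleaned                                           # reshape columns
    ]

print(refine(raw_rows))
# [{'region': 'Flanders', 'cases': 120}, {'region': 'Wallonia', 'cases': 95}]
```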



Included Components

These are the components and services included in this Code Pattern.

  • IBM Cloud Pak for Data: IBM Cloud Pak for Data is a data and AI platform with a data fabric that makes all data available for AI and analytics, on any cloud.

  • IBM DataStage: Integration tool that helps you design, develop, and run jobs that move and transform data.

  • IBM Data Refinery: IBM Data Refinery is a cloud service that provides a self-service data preparation client to transform raw data into data ready for analytics.

  • IBM Watson Knowledge Catalog: Activate business-ready data for AI and analytics with intelligent cataloging, backed by active metadata and policy management.

  • Amazon S3: Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.

  • Amazon Redshift: Accelerate your time to insights with fast, easy, and secure cloud data warehousing at scale.

  • Amazon Aurora: Designed for unparalleled high performance and availability at global scale, with full MySQL and PostgreSQL compatibility.

Featured Technologies

  • Data Fabric: A data fabric is an architectural approach to simplify data access in an organization to facilitate self-service data consumption.
  • Analytics: Uncover insights with data collection, organization, and analysis.
  • Data Management: Organize and maintain data processes through the information lifecycle.
  • Data Privacy: Using user data responsibly.

Data Access and Governance Use-cases

This code pattern covers the two use cases below.

Select Tables


  1. Data Virtualization

Research shows that up to 68% of data is never analyzed in most organizations, and up to 82% of enterprises are inhibited by data silos. Data silos are one of the most common problems delaying data analytics: enterprise data is spread across multiple data sources, so data engineers struggle to integrate data from many different kinds of sources.

In this lab, you will learn how to solve data silos challenges without copying or moving data using the Data Virtualization service offered by IBM Cloud Pak for Data.

Data Virtualization (DV) refers to the process of abstracting heterogeneous data sources into a single data access layer that delivers integrated data to users and applications. DV allows you to query data from multiple data sources without creating data replicas.
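The single-access-layer idea can be approximated locally with SQLite's ATTACH, which lets one query span two physically separate databases without copying rows between them. This is a rough analogy for DV, not the CP4D mechanism, and the table and column names are invented:

```python
import sqlite3

# Two separate databases stand in for Amazon S3 and Aurora PostgreSQL.
con = sqlite3.connect(":memory:")          # plays the role of the DV layer
con.execute("ATTACH DATABASE ':memory:' AS src_a")
con.execute("ATTACH DATABASE ':memory:' AS src_b")

con.execute("CREATE TABLE src_a.regions (region_id INTEGER, name TEXT)")
con.execute("CREATE TABLE src_b.cases (region_id INTEGER, cases INTEGER)")
con.execute("INSERT INTO src_a.regions VALUES (1, 'Flanders'), (2, 'Wallonia')")
con.execute("INSERT INTO src_b.cases VALUES (1, 120), (2, 95)")

# A view joining both sources -- the rows stay where they are.
# (TEMP, because only temporary views may span attached databases in SQLite.)
con.execute("""
    CREATE TEMP VIEW merged AS
    SELECT r.name, c.cases
    FROM src_a.regions r JOIN src_b.cases c ON r.region_id = c.region_id
""")
print(con.execute("SELECT * FROM merged ORDER BY name").fetchall())
# [('Flanders', 120), ('Wallonia', 95)]
```

Querying `merged` integrates both sources on the fly, which is the behavior the DV service provides at enterprise scale.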

In this Data Virtualization lab, you will create connections between external data sources and the IBM Data Virtualization service, and then create a single virtualized view by joining data from different data sources such as Amazon S3 and Amazon Aurora PostgreSQL.

In addition, you will learn how to govern and protect virtualized views using IBM Watson Knowledge Catalog to avoid data protection and data privacy issues.

Learning Objectives:

In this tutorial you will learn:

  1. How to make a connection between external data sources and the IBM Data Virtualization service.
  2. How to virtualize the data and create a single, joined VIEW.

Learning Outcome:

At the end of the lab, you will be able to create a VIEW to query data from multiple data sources without creating data replicas. This solves the data silos challenge that forces enterprises to copy data into a centralized repository for analytics.

Prerequisites:

  1. IBM Cloud Pak for Data
  2. Data Virtualization on IBM Cloud Pak for Data
  3. External Data Sources (Amazon S3, Amazon Aurora PostgreSQL)
  4. IBM Watson Knowledge Catalog

Estimated time:

It should take you approximately 15-20 minutes to complete this lab.

Lab Steps:

Step 1: Ensure IBM Data Virtualization service is enabled and provisioned

  1. Log in to IBM Cloud Pak for Data with valid credentials.

To perform this lab you need IBM Cloud Pak for Data credentials with the admin role. Credentials include both a username and a password. If you do not have credentials, ask the lab instructor to provide them.

  2. Validate that the IBM Data Virtualization service is installed and provisioned.

Follow the steps below to make sure the IBM Data Virtualization service is installed and provisioned in the provided IBM Cloud Pak for Data instance.


  1. Go to the hamburger navigation menu, click Services, and then click Services catalog to see all available services for this instance.

DV Install Check
  2. In the search box, type data virtualization to find the Data Virtualization service, then click on it.

  3. Inside the Data Virtualization service you should see that the service is enabled, as shown in the picture above. If it is not, ask the lab instructor to have the Data Virtualization service enabled.

  4. Now click Instances and check the status of the Data Virtualization instance to verify that the service is running without any issues.

Step 2: Create new connections

Click the navigation menu and select Data Virtualization, then click 'Add connection' to add data sources as connections in Data Virtualization. In this lab you will create connections to the Amazon S3 and Amazon RDS data sources.

  1. Click the navigation menu, expand 'Data', and then click 'Data Virtualization'.

DV Menu
  2. Click Add connection +, then click New connection.

DV Homepage

Amazon Aurora Connection
  3. Select the 'Amazon RDS for PostgreSQL' connection type, fill in the details provided by the instructor to create the Amazon Aurora PostgreSQL connection in Data Virtualization, and then click Create.

Amazon Aurora Connection

Amazon Aurora Connection Details
  4. Click Skip.

  5. Similarly, add the Amazon S3 data source by selecting the 'Amazon S3' connection type and filling in the connection details provided by the instructor.
  6. Once the data connections are added as data sources for Data Virtualization, you should see the Amazon S3 and Amazon RDS for PostgreSQL connections listed on the Data sources screen.
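Outside of CPD, the same fields the Add connection dialog asks for (host, port, database, user) map onto a standard PostgreSQL connection string. A minimal sketch of assembling one — the endpoint values are entirely made up, so substitute whatever your instructor provides:

```python
def build_postgres_dsn(host: str, port: int, dbname: str, user: str) -> str:
    """Assemble a libpq-style keyword/value DSN from the fields the
    connection dialog asks for (password intentionally omitted here)."""
    return f"host={host} port={port} dbname={dbname} user={user}"

# Hypothetical Aurora PostgreSQL endpoint.
dsn = build_postgres_dsn(
    host="mycluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="labdb",
    user="labuser",
)
print(dsn)
# A client such as psycopg2 would consume this string, supplying the
# password separately, e.g. psycopg2.connect(dsn, password=...).
```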

Connection List

Step 3: Create virtual tables and views using Data Virtualization

Congratulations! You have successfully added data sources to Data Virtualization. Now you can virtualize tables within Data Virtualization. Once tables are virtualized, you can create a VIEW using those virtual tables.

  1. Open the Data Virtualization menu, click Virtualization to expand it, and then click Virtualize.

Virtualize Menu
  2. You might see several tables under the Tables tab. Select the tables (ts_wallonia_region_table and ts_flanders_region_table) to create virtual tables. You can also search for or filter the tables. Once they are selected, click Add to cart, then View cart.

Create Virtual Table
  3. On this screen, choose whether to assign the virtualized data to a data request, a project, or your virtualized data. Select the Virtualize Data option, provide a unique table name, and then click Virtualize to create the virtual tables.

Create Virtual Table Options
  4. Confirm in the dialog to create the virtual tables.

Create Virtual Table Confirmation

Create Virtual Table Confirmation 1
  5. Now it is time to create a joined VIEW from the two virtualized tables. This allows you to query multiple data sources without copying or moving the data. To create the VIEW, we join the two virtual tables created in the previous step. Click the Data Virtualization menu, expand Virtualization, and then click the Virtualized data option.

Virtualized Data Menu
  6. Select the virtual tables which you created in the previous step (e.g. ts_wallonia_virtual_table and ts_flanders_virtual_table).

Select Tables
  7. After selecting the tables, click Join.

Select Tables
  8. Join the two tables by specifying the join key, then click Preview to preview the joined table.

Select Tables

Select Tables
  9. If your preview matches the result above, click Next and edit the column names as shown in the image below. Then click Next.

Edit Column
  10. After verifying all details, provide a unique view name (e.g. FLANDERS_WALLONIA_MERGED_VIEW_V1) and click Create view.

Select Tables
  11. Go back to the 'Virtualized data' menu, then select the joined view you created in the previous step (e.g. FLANDERS_WALLONIA_MERGED_VIEW_V1) and the brussels table to create a combined view.

Select Tables

Select Tables

Select Tables
  12. In the preview above there are multiple duplicate region strings, so let's customize the SQL query to remove them.

Select Tables

Select Tables
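The customization in this step amounts to de-duplicating the region rows in SQL. A local sketch with SQLite — the table and column names are invented stand-ins for the lab's virtual objects, and the real DV service would run the equivalent statement against the federated view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE merged_view (region TEXT, cases INTEGER)")
# Duplicate 'Brussels' rows, as a join of overlapping sources might produce.
con.executemany(
    "INSERT INTO merged_view VALUES (?, ?)",
    [("Flanders", 120), ("Wallonia", 95), ("Brussels", 40), ("Brussels", 40)],
)

# SELECT DISTINCT collapses the duplicated rows.
rows = con.execute(
    "SELECT DISTINCT region, cases FROM merged_view ORDER BY region"
).fetchall()
print(rows)
# [('Brussels', 40), ('Flanders', 120), ('Wallonia', 95)]
```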
  13. Once all the steps are performed, you will see the view below under the 'Virtualized data' option.

Select Tables
  14. Publish the view to a catalog, where you can govern it using IBM Watson Knowledge Catalog. You will learn more about governance and data protection (masking) in the Governance lab.

Select Tables

Select Tables
  15. Click the navigation bar, expand Catalogs, click All catalogs, and select the default catalog to see the view created in the last step.

Select Tables

Select Tables


In this lab, you have learned how to use Data Virtualization on IBM Cloud Pak for Data to virtualize data and create a merged VIEW to query multiple data sources without moving or copying the data.
