
Apache Ranger and AWS EMR Automated Installation 2 - DZone

source link: https://dzone.com/articles/apache-ranger-aws-emr-automated-installation-2

In the first article of this series, we got a full picture of EMR and Ranger integration solutions. From now on, we will introduce the concrete solutions one by one. This article covers “Scenario 1: OpenLDAP + EMR-Native Ranger.” We will introduce the architecture of the solution, describe the installation steps in detail, and verify the installed environment.

1. Solution Overview 

1.1 Architecture

Architecture

In this solution, OpenLDAP acts as the authentication provider and stores all user account data, while Ranger acts as the authorization controller. Because the EMR-native Ranger solution strongly depends on Kerberos, a Kerberos KDC is required. We recommend a cluster-dedicated KDC created by EMR instead of an external KDC, which saves the work of installing Kerberos ourselves. If you have an existing KDC, this solution supports it as well.

To unify user account data, OpenLDAP and Kerberos must be integrated. This involves a series of jobs, e.g., enabling SASL/GSSAPI, mapping accounts between the two systems, and enabling pass-through authentication. Ranger syncs account data from OpenLDAP so that it can grant privileges to those accounts; meanwhile, the EMR cluster needs a series of Ranger plugins installed. These plugins check with the Ranger server to ensure the current user has permission to perform an action. The EMR cluster also syncs account data from OpenLDAP via SSSD, so users can log into the cluster nodes and submit jobs.
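The SSSD piece is easy to sanity-check: once it is configured, directory accounts resolve through the same name-service switch as local ones. A minimal sketch (the root lookup works on any Linux host; example-user-1 assumes the installer's demo users already exist):

```shell
# Name-service lookups go through NSS; SSSD plugs OpenLDAP into the same path,
# so a synced directory account resolves exactly like this local account does.
getent passwd root

# After SSSD is wired to OpenLDAP, the analogous checks for a directory
# account (e.g., one of the installer's example users) would be:
#   getent passwd example-user-1
#   id example-user-1
```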

1.2 Authentication in Detail

Let’s dive deep into the authentication part. OpenLDAP and Kerberos are two mutually independent authentication mechanisms; how to integrate them is the subject of authentication. Another series of articles elaborates on this topic, and this installer follows its operations exactly:

  • “OpenLDAP and Kerberos Based Authentication Solution (1): Integrating Backend Database”
  • “OpenLDAP and Kerberos Based Authentication Solution (2): Synchronizing SSSD”
  • “OpenLDAP and Kerberos Based Authentication Solution (3): Deeply Integrating with SASL/GSSAPI”

Generally, the installer will finish the following jobs:

Administrator-User
  1. Install OpenLDAP.
  2. Install SSSD on all nodes of the EMR cluster.
  3. Migrate the Kerberos backend database to OpenLDAP and save the account data of the two systems into a single record.
  4. Install and configure SASL/GSSAPI to enable the Kerberos accounts to log into OpenLDAP.
  5. Configure OpenLDAP to map Kerberos accounts to OpenLDAP accounts.
  6. Enable saslauthd to unify OpenLDAP and Kerberos account passwords.
  7. Configure SSH and enable users to log in with an OpenLDAP account.
  8. Configure SSH and enable the users to log in with a Kerberos account via GSSAPI.
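Steps 7 and 8 ultimately come down to sshd settings along these lines (a sketch of the relevant /etc/ssh/sshd_config directives, not the installer's exact output):

```
# /etc/ssh/sshd_config fragment (sketch)
# Step 7: let OpenLDAP accounts log in with passwords (resolved via SSSD/saslauthd)
PasswordAuthentication yes
# Step 8: let Kerberos accounts log in with tickets via GSSAPI
GSSAPIAuthentication yes
GSSAPICleanupCredentials yes
```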

1.3 Authorization in Detail

For authorization, Ranger plays the leading role. If we dive deeper, its architecture looks as follows:

Authorization in Detail

The installer will finish the following jobs:

  1. Install MySQL as a Policy DB for Ranger.
  2. Install Solr as an Audit Store for Ranger.
  3. Install Ranger Admin.
  4. Install Ranger UserSync.
  5. Install the EMRFS(S3) Ranger plugin.
  6. Install the Spark Ranger plugin.
  7. Install the Hive Ranger plugin.
  8. Install the Trino Ranger plugin (Not available yet at the time of writing).

2. Installation and Integration

Generally, the installation and integration process can be divided into three stages: 

  1. Prerequisites
  2. All-In-One install
  3. Create the EMR cluster

The following diagram illustrates the progress in detail:

Progress in Detail Diagram

At stage 1, we need to do some preparatory work. At stage 2, we start to install and integrate. Here are two options at this stage: one is an all-in-one installation driven by a command-line-based workflow. The other is a step-by-step installation.

In most cases, the all-in-one installation is the best choice; however, the installation workflow may be interrupted by unforeseen errors. If you want to resume from the last failed step, or re-try a step with different argument values until you find the right ones, the step-by-step installation is the better choice. At stage 3, we create an EMR cluster ourselves with the artifacts output in stage 2, i.e., the IAM roles and the EMR security configuration.

As a design principle, the installer does not include any actions to create an EMR cluster. You should always create your cluster yourself because an EMR cluster, in practice, can have unpredictable settings, e.g., application-specific (HDFS, YARN, etc.) configuration, step scripts, and bootstrap scripts. It is inadvisable to couple Ranger’s installation with EMR cluster creation.

However, there is a little overlap in the execution sequence between stages 2 and 3. Creating an EMR cluster based on EMR-native Ranger requires a security configuration and Ranger-specific IAM roles; they must exist before the cluster is created, and while the cluster is being created, it also needs to interact with the Ranger server (whose address is assigned in the security configuration). On the other hand, some operations of the all-in-one installation need to be performed on all nodes of the cluster or on the KDC, which requires the EMR cluster to be ready. To resolve this circular dependency, the installer first outputs the artifacts the EMR cluster depends on, then prompts users to create their own cluster with these artifacts; meanwhile, the installation pauses and keeps monitoring the target cluster’s status. Once the cluster is ready, the installation resumes and performs the remaining actions.
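The "pause and monitor" behavior can be pictured as a small polling loop like the one below. This is an illustrative sketch, not the installer's actual code; it assumes a configured AWS CLI and a hypothetical EMR_CLUSTER_ID variable:

```shell
# Poll the cluster state via the AWS CLI until it reaches WAITING, then return
# so the remaining installation steps can resume.
cluster_state() {
  aws emr describe-cluster --cluster-id "$EMR_CLUSTER_ID" \
      --query 'Cluster.Status.State' --output text
}

wait_for_cluster() {
  while state=$(cluster_state); do
    echo "cluster $EMR_CLUSTER_ID is $state"
    [ "$state" = "WAITING" ] && return 0
    sleep 30    # re-check every 30 seconds
  done
}

# usage: EMR_CLUSTER_ID='j-XXXXXXXXXXXX' wait_for_cluster
```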

Notes:

  1. The installer treats the local host as the Ranger server and installs everything related to Ranger on it. For non-Ranger operations, e.g., installing OpenLDAP or migrating the Kerberos DB, it initiates remote operations via SSH. So, you can stay on the Ranger server to execute all command lines; there is no need to switch among multiple hosts.
  2. For the sake of Kerberos, all host addresses must be FQDNs. Neither IPs nor hostnames without domain names are accepted.
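The FQDN rule in note 2 can be expressed as a quick check. A minimal sketch (the helper name is made up for illustration): accept names with a domain part, reject bare IPs and short hostnames, both of which are unusable with Kerberos.

```shell
# is_fqdn: succeed only for hostnames that carry a domain part.
is_fqdn() {
  case "$1" in
    *[!0-9.]*)                  # contains letters/hyphens: a hostname
      case "$1" in *.*) return 0 ;; *) return 1 ;; esac ;;
    *) return 1 ;;              # only digits and dots: a bare IP
  esac
}

is_fqdn 'ip-10-0-14-0.cn-north-1.compute.internal' && echo accepted
is_fqdn '10.0.14.0'    || echo 'rejected: IP address'
is_fqdn 'ip-10-0-14-0' || echo 'rejected: no domain part'
```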

2.1 Prerequisites

2.1.1 Create EC2 Instances as Ranger and OpenLDAP Server

First, we need to prepare two EC2 instances: one as the Ranger server and the other as the OpenLDAP server. When creating the instances, please select the Amazon Linux 2 image and make sure the instances and the cluster to be created can reach each other over the network.

As a best practice, it’s recommended to add the Ranger server to the ElasticMapReduce-master security group: Ranger is very close to the EMR cluster and can be regarded as a non-built-in master service. For OpenLDAP, we have to make sure its port 389 is reachable from Ranger and from all nodes of the EMR cluster to be created; or, to keep things simple, you can also add OpenLDAP to the ElasticMapReduce-master security group.

2.1.2 Download Installer

After EC2 instances are ready, pick the Ranger server, log in via SSH, and run the following commands to download the installer package:

Shell
sudo yum -y install git
git clone https://github.com/bluishglc/ranger-emr-cli-installer.git

2.1.3 Upload SSH Key File

As mentioned before, the installer runs from the local host (the Ranger server). To perform remote installation actions on OpenLDAP or the EMR cluster, an SSH private key is required, so we should upload it to the Ranger server and make a note of the file path; it will be the value of the variable SSH_KEY.
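For example, assuming your key pair file is key.pem and you log into the Ranger server as ec2-user (both paths are illustrative, not prescribed by the installer):

```shell
# From your workstation, push the key pair file to the Ranger server:
#   scp -i <your-login-key>.pem key.pem ec2-user@<ranger-host-fqdn>:/home/ec2-user/key.pem

# Then, on the Ranger server, note the path and tighten permissions
# (ssh refuses private keys that are readable by others):
export SSH_KEY='/home/ec2-user/key.pem'
chmod 600 "$SSH_KEY" 2>/dev/null || true
echo "SSH_KEY=$SSH_KEY"
```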

2.1.4 Export Environment-Specific Variables

During the installation, the following environment-specific arguments will be passed more than once, so it’s recommended to export them first; then all command lines can refer to these variables instead of literals.

Shell
export REGION='TO_BE_REPLACED'
export ACCESS_KEY_ID='TO_BE_REPLACED'
export SECRET_ACCESS_KEY='TO_BE_REPLACED'
export SSH_KEY='TO_BE_REPLACED'
export OPENLDAP_HOST='TO_BE_REPLACED'

The variables are explained as follows:

  • REGION: AWS Region, i.e., cn-north-1, us-east-1, and so on.
  • ACCESS_KEY_ID: AWS access key id of your IAM account. Be sure your account has sufficient privileges, ideally admin permissions.
  • SECRET_ACCESS_KEY: AWS secret access key of your IAM account.
  • SSH_KEY: SSH private key file path on the local host you just uploaded.
  • OPENLDAP_HOST: FQDN of the OpenLDAP server.

Please carefully replace the above variables’ values according to your environment, and remember to use FQDNs as hostnames, i.e., for OPENLDAP_HOST. The following is an example:

Shell
export REGION='cn-north-1'
export ACCESS_KEY_ID='<change-to-your-aws-access-key-id>'
export SECRET_ACCESS_KEY='<change-to-your-aws-secret-access-key>'
export SSH_KEY='/home/ec2-user/key.pem'
export OPENLDAP_HOST='ip-10-0-14-0.cn-north-1.compute.internal'

2.2 All-In-One Installation

2.2.1 Quick Start

Now, let’s start an all-in-one installation. Execute this command line:

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh install \
    --region "$REGION" \
    --access-key-id "$ACCESS_KEY_ID" \
    --secret-access-key "$SECRET_ACCESS_KEY" \
    --ssh-key "$SSH_KEY" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!' \
    --openldap-user-dn-pattern 'uid={0},ou=users,dc=example,dc=com' \
    --openldap-group-search-filter '(member=uid={0},ou=users,dc=example,dc=com)' \
    --openldap-user-object-class 'inetOrgPerson' \
    --example-users 'example-user-1,example-user-2' \
    --ranger-plugins 'emr-native-emrfs,emr-native-spark,emr-native-hive'

For the parameter specification of the above command line, please refer to the appendix. If everything goes well, the command line will execute steps 2.1 to 2.7 in the workflow diagram. This may take ten minutes or more, depending on your network bandwidth. Next, it will suspend and prompt the user to create an EMR cluster with two artifacts:

  1. An EC2 instance profile named EMR_EC2_RangerRole.
  2. An EMR security configuration named Ranger@<YOUR-RANGER-HOST-FQDN>.

These were just created by the command line in steps 2.2 and 2.4. You can find them in the EMR web console when creating your cluster. The following is a snapshot of the command line at this moment:

Create EMR Cluster

Next, we should switch to the EMR web console to create a cluster. Be sure to select the EC2 instance profile and security configuration prompted in the command line console. For the Kerberos KDC, also fill in and make a note of the “realm” and “KDC admin password”; they will be needed in the command line soon. The following is a snapshot of the EMR web console at this moment:

Permissions

Once the EMR cluster starts to create, five cluster-related information items become fixed:

  1. Cluster id: get it from the summary tab on the web console.
  2. Kerberos realm: entered by you in the “authentication and encryption” section; see the above snapshot. Note that for region us-east-1, the default realm is EC2.INTERNAL; for other regions, the default realm is COMPUTE.INTERNAL.
  3. Kerberos KDC admin password: entered by you in the “authentication and encryption” section; see the above snapshot.
  4. Kerberos KDC host: get it from the hardware tab on the web console; it is usually the master node.
  5. Whether to let Hue integrate with LDAP. If yes, after the cluster is ready, the installer will update the EMR configuration with a Hue-specific setting. Be careful: this action overwrites the existing EMR configuration.

Now, go back to the command line terminal and enter “y” at the prompt “Have you created the cluster? [y/n]:” (you don’t need to wait for the cluster to become completely ready). The command line will then ask you to enter the above information items one by one, because they are required for the next phase of the installation; confirm by entering “y” again. The installation process will resume, and if the assigned EMR cluster is not ready yet, the command line will keep monitoring it until it enters the “WAITING” status. The following is a snapshot of the command line at this moment:

Command Line Snapshot

When the cluster is ready (status is “WAITING”), the command line will continue to execute from steps 2.9 to 2.13 of the workflow and finally end with an “ALL DONE” message.

2.2.2 Customization

Now that an all-in-one installation is done, let’s look at customization. Generally, this installer follows the principle of “convention over configuration”: most parameters are preset with default values. An equivalent version of the above command line with the full parameter list is as follows:

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh install \
    --region "$REGION" \
    --access-key-id "$ACCESS_KEY_ID" \
    --secret-access-key "$SECRET_ACCESS_KEY" \
    --ssh-key "$SSH_KEY" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!' \
    --openldap-user-dn-pattern 'uid={0},ou=users,dc=example,dc=com' \
    --openldap-group-search-filter '(member=uid={0},ou=users,dc=example,dc=com)' \
    --openldap-user-object-class 'inetOrgPerson' \
    --example-users 'example-user-1,example-user-2' \
    --ranger-plugins 'emr-native-emrfs,emr-native-spark,emr-native-hive' \
    --java-home '/usr/lib/jvm/java' \
    --skip-install-mysql 'false' \
    --skip-migrate-kerberos-db 'false' \
    --skip-install-solr 'false' \
    --skip-install-openldap 'false' \
    --skip-configure-hue 'false' \
    --ranger-host $(hostname -f) \
    --ranger-version '2.1.0' \
    --mysql-host $(hostname -f) \
    --mysql-root-password 'Admin1234!' \
    --mysql-ranger-db-user-password 'Admin1234!' \
    --solr-host $(hostname -f) \
    --ranger-bind-dn 'cn=ranger,ou=services,dc=example,dc=com' \
    --ranger-bind-password 'Admin1234!' \
    --hue-bind-dn 'cn=hue,ou=services,dc=example,dc=com' \
    --hue-bind-password 'Admin1234!' \
    --sssd-bind-dn 'cn=sssd,ou=services,dc=example,dc=com' \
    --sssd-bind-password 'Admin1234!' \
    --restart-interval 30

The full-parameters version gives us a complete perspective of all custom options. In the following scenarios, you may change some options’ values:

  1. If you want to change the default organization name dc=example,dc=com or the default password Admin1234!, please run the full-parameters version and replace them with your own values.
  2. If you need to integrate with external facilities, e.g., a centralized OpenLDAP or an existing MySQL or Solr, please add the corresponding --skip-xxx-xxx option and set it to true.
  3. If you have other pre-defined Bind DNs for Hue, Ranger, and SSSD, please add the corresponding --xxx-bind-dn and --xxx-bind-password options to set them. Note that the Bind DNs for Hue, Ranger, and SSSD are created automatically when installing OpenLDAP, but they are fixed to the naming pattern cn=hue|ranger|sssd,ou=services,<your-base-dn>, not the value given to the “--xxx-bind-dn” option. So if you assign a different DN with “--xxx-bind-dn,” you must create that DN yourself in advance. The reason the installer does not create the DN assigned by “--xxx-bind-dn” is that a DN is actually a tree path: to create it, all nodes in the path must be created, and it is not cost-effective to implement such a small but complicated function.
  4. By default, an all-in-one installation migrates the cluster’s Kerberos database to OpenLDAP for better account management; but if you run an external Kerberos KDC, please consider carefully whether you really want to migrate its database to OpenLDAP. If not, please add --skip-migrate-kerberos-db 'true' to the command line to skip it.
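For scenario 3 above, a custom bind DN can be pre-created with an LDIF along these lines (a sketch: the objectClass choices are a common convention, not something the installer prescribes; adjust the base DN and password to your environment, and skip the ou entry if it already exists):

```
# ranger-bind-dn.ldif (sketch); apply with, e.g.:
#   ldapadd -x -H ldap://<openldap-host> -D 'cn=admin,dc=example,dc=com' -W -f ranger-bind-dn.ldif
dn: ou=services,dc=example,dc=com
objectClass: organizationalUnit
ou: services

dn: cn=ranger,ou=services,dc=example,dc=com
objectClass: organizationalRole
objectClass: simpleSecurityObject
cn: ranger
userPassword: Admin1234!
```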

2.3 Step-By-Step Installation

As an alternative, you can select a step-by-step installation instead of the all-in-one installation. The command line for each step is given below; for comments on each parameter, please refer to the appendix.

2.3.1 Init EC2

This step finishes some fundamental jobs, e.g., installing the AWS CLI, the JDK, and so on.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh init-ec2 \
    --region "$REGION" \
    --access-key-id "$ACCESS_KEY_ID" \
    --secret-access-key "$SECRET_ACCESS_KEY"

2.3.2 Create IAM Roles

This step will create three IAM roles, which are required for EMR.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh create-iam-roles \
    --region "$REGION"

2.3.3 Create Ranger Secrets

This step creates SSL/TLS-related keys, certificates, and keystores for Ranger, because EMR-native Ranger requires SSL/TLS connections to the server. These artifacts are uploaded to AWS Secrets Manager and referenced by the EMR security configuration.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh create-ranger-secrets \
    --region "$REGION"

2.3.4 Create EMR Security Configuration

This step creates a copy of the EMR security configuration. The configuration includes Kerberos- and Ranger-related information. When creating a cluster, EMR reads it to get the corresponding resources, e.g., secrets, and to interact with the Ranger server, whose address is assigned in the security configuration.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh create-emr-security-configuration \
    --region "$REGION" \
    --solution 'emr-native' \
    --auth-provider 'openldap'

2.3.5 Install OpenLDAP

This step will install OpenLDAP on the given OpenLDAP host, as mentioned above. Although this action is performed on an OpenLDAP server, you don’t need to log into the OpenLDAP server. You just need to run the command line on the local host (the Ranger server).

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh install-openldap \
    --region "$REGION" \
    --access-key-id "$ACCESS_KEY_ID" \
    --secret-access-key "$SECRET_ACCESS_KEY" \
    --ssh-key "$SSH_KEY" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!'

2.3.6 Install Ranger

This step will install all server-side components of Ranger, including MySQL, Solr, Ranger Admin, and Ranger UserSync.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh install-ranger \
    --region "$REGION" \
    --access-key-id "$ACCESS_KEY_ID" \
    --secret-access-key "$SECRET_ACCESS_KEY" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --ranger-bind-dn 'cn=ranger,ou=services,dc=example,dc=com' \
    --ranger-bind-password 'Admin1234!' \
    --openldap-user-dn-pattern 'uid={0},ou=users,dc=example,dc=com' \
    --openldap-group-search-filter '(member=uid={0},ou=users,dc=example,dc=com)' \
    --openldap-user-object-class 'inetOrgPerson'

2.3.7 Install Ranger Plugins

This step installs the EMRFS, Spark, and Hive plugins on the Ranger server side. The other half of the job, installing the agent-side components (EMR Secret Agent, EMR Record Server, and so on), is done automatically by EMR when creating the cluster.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh install-ranger-plugins \
    --region "$REGION" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --ranger-plugins 'emr-native-emrfs,emr-native-spark,emr-native-hive'

2.3.8 Create EMR Cluster

For step-by-step installation, there is no interactive process for creating an EMR cluster, so feel free to create a cluster on the EMR web console, but we have to wait until the cluster is completely ready (in “WAITING” status), then export the following environment-specific variables:

Shell
export EMR_CLUSTER_ID='TO_BE_REPLACED'
export KERBEROS_REALM='TO_BE_REPLACED'
export KERBEROS_KDC_HOST='TO_BE_REPLACED'

The following is an example:

Shell
export EMR_CLUSTER_ID='j-8SRQM6X4ZVT8'
export KERBEROS_REALM='COMPUTE.INTERNAL'
export KERBEROS_KDC_HOST='ip-10-0-3-104.cn-north-1.compute.internal'

2.3.9 Migrate Kerberos DB

The default Kerberos database is file-based and stored on the KDC. This step migrates all principal data to OpenLDAP. Pay extra attention to this step if you run an external KDC that is not dedicated to your EMR cluster: skip this step unless you are sure you need to migrate the external KDC’s database to your OpenLDAP.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh migrate-kerberos-db \
    --region $REGION \
    --ssh-key "$SSH_KEY" \
    --kerberos-realm "$KERBEROS_REALM" \
    --kerberos-kdc-host "$KERBEROS_KDC_HOST" \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!'

2.3.10 Enable SASL/GSSAPI

This step enables SASL/GSSAPI. This is a key action of the OpenLDAP and Kerberos integration; it performs remote actions on OpenLDAP, the Kerberos KDC, and each node of the EMR cluster. As before, you run it on the local host.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh enable-sasl-gssapi \
    --region "$REGION" \
    --ssh-key "$SSH_KEY" \
    --kerberos-realm "$KERBEROS_REALM" \
    --kerberos-kdc-host "$KERBEROS_KDC_HOST" \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!' \
    --emr-cluster-id "$EMR_CLUSTER_ID"

2.3.11 Install SSSD

This step installs and configures SSSD on each node of the EMR cluster. As with installing OpenLDAP, run the command line on the local host; it performs the remote actions via SSH.

Shell
sudo ./ranger-emr-cli-installer/bin/setup.sh install-sssd \
    --region "$REGION" \
    --ssh-key "$SSH_KEY" \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --sssd-bind-dn 'cn=sssd,ou=services,dc=example,dc=com' \
    --sssd-bind-password 'Admin1234!' \
    --emr-cluster-id "$EMR_CLUSTER_ID"

2.3.12 Configure Hue

This step updates the Hue configuration of EMR, as highlighted in the all-in-one installation. If you have other customized EMR configurations, please skip this step; you can still manually merge the Hue configuration JSON generated by the command line into your own JSON.

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh configure-hue \
    --region "$REGION" \
    --auth-provider 'openldap' \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-user-object-class 'inetOrgPerson' \
    --hue-bind-dn 'cn=hue,ou=services,dc=example,dc=com' \
    --hue-bind-password 'Admin1234!' \
    --emr-cluster-id "$EMR_CLUSTER_ID"

2.3.13 Create Example Users

This step will create two example users to facilitate the following verification:

Shell
sudo sh ./ranger-emr-cli-installer/bin/setup.sh add-example-users \
    --region "$REGION" \
    --ssh-key "$SSH_KEY" \
    --solution 'emr-native' \
    --auth-provider 'openldap' \
    --kerberos-kdc-host "$KERBEROS_KDC_HOST" \
    --openldap-host "$OPENLDAP_HOST" \
    --openldap-base-dn 'dc=example,dc=com' \
    --openldap-root-cn 'admin' \
    --openldap-root-password 'Admin1234!' \
    --example-users 'example-user-1,example-user-2'

3. Verification

After the installation and integration are completed, it’s time to check whether Ranger works. The verification jobs are divided into three parts, covering Hive, EMRFS (S3), and Spark. First, let us log into OpenLDAP via a client, e.g., LdapAdmin or Apache Directory Studio, then check all the DNs; they should look as follows:

Verification

Next, open the Ranger web console at https://<YOUR-RANGER-HOST>:6182; the default admin account/password is admin/admin. After logging in, open the “Users/Groups/Roles” page first to see whether the example users on OpenLDAP have already been synchronized to Ranger, as follows:

Ranger Web Console

3.1 Hive Access Control Verification

Usually, there are a set of pre-defined policies for the Hive plugin after installation. To eliminate interference and keep verification simple, let’s remove them first:

Hive Access

Any policy changes on the Ranger web console sync to the agent side (the EMR cluster nodes) within 30 seconds. We can run the following commands on the master node to see if the local policy file is updated:

Shell
# run on master node of emr cluster
for i in {1..10}; do
    printf "\n%100s\n\n"|tr ' ' '='
    sudo stat /etc/hive/ranger_policy_cache/hiveServer2_hive.json
    sleep 3
done

Once the local policy file is up to date, the remove-all-policies action takes effect. Next, log into Hue with the OpenLDAP account “example-user-1” created by the installer, open the Hive editor, and enter the following SQL to create a test table (remember to replace “ranger-test” with your own bucket name):

SQL
-- run in hue hive editor
create table ranger_test (
  id bigint
)
row format delimited
stored as textfile location 's3://ranger-test/';

Next, run it and an error occurs:

Error

It shows that example-user-1 is blocked by database-related permissions; this proves the Hive plugin is working. Next, we go back to Ranger and add a Hive policy named “all - database, table, column” as follows:

Working Hive Plug In

It grants example-user-1 all privileges on all databases, tables, and columns. Then check the policy file again on the master node with the previous command line. Once it is updated, go back to Hue and re-run that SQL; this time we will get a different error:

Hive Privileges

As shown, the SQL is blocked when reading “s3://ranger-test.” In fact, example-user-1 has no permission to access any URL, including “s3://.” We need to grant URL-related permissions to this user, so go back to Ranger again and add a Hive policy named “all - url” as follows:

No Permissions

It grants example-user-1 all privileges on any URL, including “s3://.” Check the policy file again and switch to Hue. Run that SQL a third time, and it will succeed as follows:

3rd SQL

Finally, to prepare for the upcoming EMRFS/Spark verification, we need to insert some example data into the table and double-check that example-user-1 has full read and write permissions on it:

SQL
insert into ranger_test(id) values(1);
insert into ranger_test(id) values(2);
insert into ranger_test(id) values(3);
select * from ranger_test;

The execution result is:

Execution Result

By now, the Hive access control verification has passed.

3.2 EMRFS (S3) Access Control Verification

Log into Hue with the account: “example-user-1,” open Scala editor, and enter the following Spark codes:

Scala
// run in the scala editor of hue
spark.read.csv("s3://ranger-test/").show;

This line of code tries to read files on S3, but it will run into the following error:

Errors

It shows example-user-1 has no permission on the S3 bucket “ranger-test.” This proves the EMRFS plugin is working and has successfully blocked unauthorized S3 access. Let’s log into Ranger and add an EMRFS policy named “all - ranger-test” as follows:

EMRFS Policy

It will grant example-user-1 all privileges on the “ranger-test” bucket. Similar to checking the Hive policy file, we can also run the following command to check if the EMRFS policy file is updated:

Shell
# run on master node of emr cluster
for i in {1..10}; do
    printf "\n%100s\n\n"|tr ' ' '='
    sudo stat /emr/secretagent/ranger_policy_cache/emrS3RangerPlugin_emrfs.json
    sleep 3
done

After updating, go back to Hue, re-run the previous Spark codes, and it will succeed as follows:

Successful Spark Code

By now, the EMRFS access control verification has passed.

3.3 Spark Access Control Verification

Log into Hue with the account “example-user-1,” open Scala editor, and enter the following Spark codes:

Scala
// run in the scala editor of hue
spark.sql("select * from ranger_test").show

This line of code tries to read the ranger_test table via Spark SQL, but it will run into the following error:

Spark SQL

It shows the current user has no permission on the default database. This proves the Spark plugin is working and has successfully blocked unauthorized database/table access.

Let’s log into Ranger and add a Spark policy named “all - database, table, column,” as follows:

Log into Ranger

It will grant example-user-1 all privileges on all databases/tables/columns. Similar to checking the Hive policy file, we can also run the following command to check if the Spark policy file is updated:

Shell
# run on master node of emr cluster
for i in {1..10}; do
    printf "\n%100s\n\n"|tr ' ' '='
    sudo stat /etc/emr-record-server/ranger_policy_cache/emrSparkRangerPlugin_spark.json 
    sleep 3
done

After updating, go back to Hue, re-run the previous Spark codes, and it will succeed as follows:

Scala

By now, the Spark access control verification has passed.

4. FAQ

4.1 How To Integrate an External KDC?

Keep everything as usual, but when creating an EMR cluster, do NOT select the security configuration generated by the CLI. Instead, create another one manually, copying all values from the generated security configuration except the “Authentication” part. For “Authentication,” select “External KDC” and fill in your own values. When entering the Kerberos KDC host in the command line console or exporting KERBEROS_KDC_HOST, also use your external KDC host name. Lastly, consider whether you still need to migrate the external Kerberos database to OpenLDAP; if not, skip it with --skip-migrate-kerberos-db 'true'.

4.2 Can I Rerun the All-In-One Installation Command Line?

Yes, and you don’t need to take any cleanup actions.

5. Appendix

The following is the parameter specification:

  • --region: The AWS region, e.g., cn-north-1, us-east-1.
  • --access-key-id: The AWS access key id of your IAM account.
  • --secret-access-key: The AWS secret access key of your IAM account.
  • --ssh-key: The SSH private key file path.
  • --solution: The solution name; accepted values are ‘open-source’ or ‘emr-native.’
  • --auth-provider: The authentication provider; accepted values are ‘ad’ or ‘openldap.’
  • --openldap-host: The FQDN of the OpenLDAP host.
  • --openldap-base-dn: The base DN of OpenLDAP, for example ‘dc=example,dc=com’; change it according to your env.
  • --openldap-root-cn: The CN of the root account, for example ‘admin’; change it according to your env.
  • --openldap-root-password: The password of the root account, for example ‘Admin1234!’; change it according to your env.
  • --ranger-bind-dn: The Bind DN for Ranger, for example ‘cn=ranger,ou=services,dc=example,dc=com’; this should be an existing DN on Windows AD/OpenLDAP; change it according to your env.
  • --ranger-bind-password: The password of the Ranger Bind DN, for example ‘Admin1234!’; change it according to your env.
  • --openldap-user-dn-pattern: The DN pattern for Ranger to search users on OpenLDAP, for example ‘uid={0},ou=users,dc=example,dc=com’; change it according to your env.
  • --openldap-group-search-filter: The filter for Ranger to search groups on OpenLDAP, for example ‘(member=uid={0},ou=users,dc=example,dc=com)’; change it according to your env.
  • --openldap-user-object-class: The user object class for Ranger to search users, for example ‘inetOrgPerson’; change it according to your env.
  • --hue-bind-dn: The Bind DN for Hue, for example ‘cn=hue,ou=services,dc=example,dc=com’; this should be an existing DN on Windows AD/OpenLDAP; change it according to your env.
  • --hue-bind-password: The password of the Hue Bind DN, for example ‘Admin1234!’; change it according to your env.
  • --sssd-bind-dn: The Bind DN for SSSD, for example ‘cn=sssd,ou=services,dc=example,dc=com’; this should be an existing DN on Windows AD/OpenLDAP; change it according to your env.
  • --sssd-bind-password: The password of the SSSD Bind DN, for example ‘Admin1234!’; change it according to your env.
  • --example-users: The example users to be created on OpenLDAP and Kerberos to demo Ranger’s features; this parameter is optional, and if omitted, no example users are created.
  • --ranger-plugins: The Ranger plugins to be installed, comma-separated for multiple values, for example ‘emr-native-emrfs,emr-native-spark,emr-native-hive’; change it according to your env.
  • --skip-configure-hue: Skip configuring Hue; accepted values are ‘true’ or ‘false,’ and the default value is ‘false.’
  • --skip-migrate-kerberos-db: Skip migrating the Kerberos database; accepted values are ‘true’ or ‘false,’ and the default value is ‘false.’
