
Data Processing Scaled Up and Out with Dask and RAPIDS: Installing a Data Science App (Dask Client)

Source: https://www.inovex.de/blog/data-processing-dask-rapids-installing-data-science-app-dask-client/

Rafal Lokuciejewski

This blog post tutorial shows how a scalable and high-performance environment for machine learning can be set up using GPUs, a Kubernetes cluster, Dask and Jupyter. In the first article of our blog series we set up a Kubernetes cluster with access to GPUs. In this part we will add containerized applications to the cluster to be able to run data processing workloads in it. More precisely: we will prepare a notebook image that has CUDA installed, which is required if we want to use GPU-based frameworks. Furthermore, the image should contain Dask, RAPIDS and Dask-RAPIDS. As soon as the image is ready, we will deploy JupyterHub, which spawns said notebook image as a container for each user.

We will use JupyterLab notebooks as an interactive environment to start our data processing algorithms. In other words, JupyterLab will act as a Dask client. As we want to provide an environment not only for one data scientist but for a group of users, we decided to install JupyterHub on our Kubernetes cluster. JupyterHub makes it possible to serve a pre-configured data science environment to a group of users.

Permissions for Dask-Clients

First, we have to take care of the permissions of our JupyterLab instances. When used as a Dask client, an instance needs sufficient permissions to start new pods acting as Dask workers. Since we decided to install JupyterHub, no extra configuration is required: JupyterHub uses a Service Account with sufficient permissions by default. If you want to use Dask from a different environment, you have to make sure that your client is allowed to create, delete, list etc. your Dask worker pods via a Service Account.
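For such a custom client, a minimal RBAC sketch could look like the following. The namespace kubeyard comes from the first part of this series; the names dask-client-sa and dask-worker-manager are hypothetical, and depending on your Dask setup additional resources (e.g. services or pod logs) may have to be granted:

# Service Account under which the Dask client will run (name is hypothetical)
kubectl create serviceaccount dask-client-sa -n kubeyard
# Role that allows managing the Dask worker pods
kubectl create role dask-worker-manager -n kubeyard \
    --verb=get,list,watch,create,delete --resource=pods
# Bind the Role to the Service Account
kubectl create rolebinding dask-worker-manager-binding -n kubeyard \
    --role=dask-worker-manager --serviceaccount=kubeyard:dask-client-sa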

Docker Image for Jupyter

JupyterHub is a multi-user version of JupyterLab. The hub creates a pod in the cluster for each user and pulls the notebook image that runs on that pod. There are official Jupyter notebook images like the minimal-notebook or the datascience-notebook that are ready to use. However, to use the RAPIDS library, the CUDA Toolkit is required. So we cannot take these base images and simply add RAPIDS and Dask to them.

It seems to be a good idea to create a base image which contains Jupyter and CUDA and use it to build an image with RAPIDS and Dask on top. Since RAPIDS and Dask are still in development and new versions are released frequently, keeping Jupyter and CUDA in a separate base image will make our final image easier to maintain.

Fortunately, there are not only official notebook images but also official images from NVIDIA with CUDA, and we can simply combine both. We will use the base-notebook Dockerfile from the jupyter/docker-stacks repository and the CUDA 10.2 Dockerfile for Ubuntu 18.04 (10.2-base-ubuntu18.04) from NVIDIA's CUDA container images repository, and combine them into a single image. Keep in mind that for the base-notebook you need to have the following files together with your Dockerfile:

  1.  fix-permissions
  2.  jupyter_notebook_config.py
  3.  start.sh
  4.  start-notebook.sh
  5.  start-singleuser.sh

All these files can be found in the base-notebook folder of the jupyter/docker-stacks repository mentioned above; one way to fetch them is shown below.
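A small sketch for downloading the helper files, assuming base-notebook still lives at this path on the master branch of jupyter/docker-stacks (the repository has been restructured over time, so adjust the path if necessary):

# fetch the helper files next to your Dockerfile (path is an assumption)
BASE=https://raw.githubusercontent.com/jupyter/docker-stacks/master/base-notebook
for f in fix-permissions jupyter_notebook_config.py start.sh start-notebook.sh start-singleuser.sh; do
    wget -q "$BASE/$f"
done

The resulting Dockerfile is listed below: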

ARG ROOT_CONTAINER=ubuntu:bionic-20200311@sha256:e5dd9dbb37df5b731a6688fa49f4003359f6f126958c9c928f937bec69836320
ARG BASE_CONTAINER=$ROOT_CONTAINER
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <[email protected]>"
ARG NB_USER="jovyan"
ARG NB_UID="1000"
ARG NB_GID="100"
USER root
# Install all OS dependencies for notebook server that starts but lacks all
# features (e.g., download as all possible file formats)
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install -yq --no-install-recommends \
    wget \
    bzip2 \
    ca-certificates \
    sudo \
    locales \
    fonts-liberation \
    run-one \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
    locale-gen
# Configure environment
ENV CONDA_DIR=/opt/conda \
    SHELL=/bin/bash \
    NB_USER=$NB_USER \
    NB_UID=$NB_UID \
    NB_GID=$NB_GID \
    LC_ALL=en_US.UTF-8 \
    LANG=en_US.UTF-8 \
    LANGUAGE=en_US.UTF-8
ENV PATH=$CONDA_DIR/bin:$PATH \
    HOME=/home/$NB_USER
# Copy a script that we will use to correct permissions after running certain commands
COPY fix-permissions /usr/local/bin/fix-permissions
RUN chmod a+rx /usr/local/bin/fix-permissions
# Enable prompt color in the skeleton .bashrc before creating the default NB_USER
RUN sed -i 's/^#force_color_prompt=yes/force_color_prompt=yes/' /etc/skel/.bashrc
# Create NB_USER with name jovyan, UID=1000, in the 'users' group
# and make sure these dirs are writable by the `users` group.
RUN echo "auth requisite pam_deny.so" >> /etc/pam.d/su && \
    sed -i.bak -e 's/^%admin/#%admin/' /etc/sudoers && \
    sed -i.bak -e 's/^%sudo/#%sudo/' /etc/sudoers && \
    useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
    mkdir -p $CONDA_DIR && \
    chown $NB_USER:$NB_GID $CONDA_DIR && \
    chmod g+w /etc/passwd && \
    fix-permissions $HOME && \
    fix-permissions $CONDA_DIR
USER $NB_UID
WORKDIR $HOME
ARG PYTHON_VERSION=default
# Setup work directory for backward-compatibility
RUN mkdir /home/$NB_USER/work && \
    fix-permissions /home/$NB_USER
ENV MINICONDA_VERSION=4.6.14 \
    CONDA_VERSION=4.7.10
RUN cd /tmp && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "718259965f234088d785cad1fbd7de03 *Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh" | md5sum -c - && \
    /bin/bash Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
    rm Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "conda ${CONDA_VERSION}" >> $CONDA_DIR/conda-meta/pinned && \
    $CONDA_DIR/bin/conda config --system --prepend channels conda-forge && \
    $CONDA_DIR/bin/conda config --system --set auto_update_conda false && \
    $CONDA_DIR/bin/conda config --system --set show_channel_urls true && \
    $CONDA_DIR/bin/conda install --quiet --yes conda && \
    $CONDA_DIR/bin/conda update --all --quiet --yes && \
    conda list python | grep '^python ' | tr -s ' ' | cut -d '.' -f 1,2 | sed 's/$/.*/' >> $CONDA_DIR/conda-meta/pinned && \
    conda clean --all -f -y && \
    rm -rf /home/$NB_USER/.cache/yarn && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Tini
RUN conda install --quiet --yes 'tini=0.18.0' && \
    conda list tini | grep tini | tr -s ' ' | cut -d ' ' -f 1,2 >> $CONDA_DIR/conda-meta/pinned && \
    conda clean --all -f -y && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Jupyter Notebook, Lab, and Hub
# Generate a notebook server config
# Cleanup temporary files
# Correct permissions
# Do all this in a single RUN command to avoid duplicating all of the
# files across image layers when the permissions change
RUN conda install --quiet --yes \
    'notebook=6.0.3' \
    'jupyterhub=1.1.0' \
    'jupyterlab=2.0.1' && \
    conda clean --all -f -y && \
    npm cache clean --force && \
    jupyter notebook --generate-config && \
    rm -rf $CONDA_DIR/share/jupyter/lab/staging && \
    rm -rf /home/$NB_USER/.cache/yarn && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
EXPOSE 8888
# Configure container startup
ENTRYPOINT ["tini", "-g", "--"]
CMD ["start-notebook.sh"]
# Copy local files as late as possible to avoid cache busting
COPY start.sh start-notebook.sh start-singleuser.sh /usr/local/bin/
COPY jupyter_notebook_config.py /etc/jupyter/
# Fix permissions on /etc/jupyter as root
USER root
RUN fix-permissions /etc/jupyter/
# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID
##################CUDA
USER root
#FROM ubuntu:18.04
LABEL maintainer "NVIDIA CORPORATION <[email protected]>"
RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
rm -rf /var/lib/apt/lists/*
ENV CUDA_VERSION 10.2.89
ENV CUDA_PKG_VERSION 10-2=$CUDA_VERSION-1
# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
cuda-compat-10-2 && \
ln -s cuda-10.2 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*
# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.2 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
USER $NB_UID

It is important to switch to the root user for the CUDA part and back to the normal notebook user afterwards.

We have to build this image and push it to a registry of our choice, for example (the registry path is a placeholder):
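# build and push the Jupyter + CUDA base image (registry path is a placeholder)
docker build -t <your_registry> .
docker push <your_registry>

Now we have a base image with Jupyter and CUDA. To create the final image on top, we need to install the RAPIDS libraries (cuDF and cuML), Dask, Dask-cuDF and Dask-cuML; the plain (non-Dask) RAPIDS libraries are required by their Dask counterparts. This can be done in just a few steps and the Dockerfile looks like this: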

FROM <your_registry>
###############################################cudf
RUN conda install --yes -c rapidsai -c nvidia -c conda-forge \
    -c defaults cudf=0.13 cuml=0.13 python=3.7
##############################################DASK
RUN conda install --yes \
    -c conda-forge -c rapidsai -c nvidia -c defaults \
    python-blosc \
    cytoolz \
    dask==2.15.0 \
    nomkl \
    numpy==1.18.1 \
    pandas==0.25.3 \
    tini==0.18.0 \
    zstd==1.4.3 \
    && conda clean --all -f -y \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs
RUN python3 -m pip install pip --upgrade
COPY requirements.txt /home/files/requirements.txt
RUN pip install --default-timeout=300 -r /home/files/requirements.txt
#USER $NB_UID

The first RUN command installs cuDF and cuML. The second installs Dask together with a few required libraries like NumPy and pandas; its cleanup steps were copied from the daskdev/dask:latest Dockerfile. We will discuss later why copying them was a good idea.

Finally, the last RUN command installs the libraries specified in requirements.txt (which needs to be accessible while building the image) via pip. These libraries are dask-kubernetes, dask_cuda, dask_cudf, dask_cuml and gcsfs (needed to read from Google Cloud Storage buckets).
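A requirements.txt consistent with this list could be created as follows; the exact package names and compatible versions depend on the RAPIDS release you chose, so treat this file as a sketch:

# write a requirements.txt next to the Dockerfile; pin versions as needed
cat > requirements.txt <<'EOF'
dask-kubernetes
dask_cuda
dask_cudf
dask_cuml
gcsfs
EOF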

Again, we build the image and push it to a repository.

Deploying JupyterHub

Now we are ready to deploy JupyterHub into our Kubernetes cluster. The Zero to JupyterHub guide provides a lot of information about deploying it on Kubernetes, including many details on how to customize and personalize your deployment. We will come straight to the point: create a file config.yaml according to your configuration preferences. My config looks like this:

proxy:
  secretToken: "<YOUR 32 BYTE SECURITY TOKEN>"
  # Do not assign a public IP
  service:
    type: NodePort
singleuser:
  defaultUrl: "/lab"
  # The service account we created for Jupyter
  serviceAccountName: jupyter-service-account
  # The final image we built
  image:
    name: <REGISTRY PATH HERE>
    tag: <TAG>
  storage:
    # Customize storage for the Jupyter client (default 10Gi)
    capacity: 20Gi
    # Mounts for the NVIDIA drivers
    extraVolumes:
      - name: nvidia-debug-tools
        hostPath:
          path: /home/kubernetes/bin/nvidia/bin
      - name: nvidia-libraries
        hostPath:
          path: /home/kubernetes/bin/nvidia/lib64
      # The NFS PVC
      - name: my-pvc-nfs
        persistentVolumeClaim:
          claimName: nfs
    extraVolumeMounts:
      # Mount the NVIDIA driver paths
      - name: nvidia-debug-tools
        mountPath: /usr/local/bin/nvidia
      - name: nvidia-libraries
        mountPath: /usr/local/nvidia/lib64
      # Mount the NFS
      - name: my-pvc-nfs
        mountPath: "/home/jovyan/mnt"
  # Create two profiles: a notebook server with and without a GPU
  profileList:
    - display_name: "GPU Server"
      description: "Spawns a notebook server with access to a GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
    - display_name: "CPU Server"
      description: "Spawns a notebook server without access to a GPU"
hub:
  extraConfig:
    # Use JupyterLab by default
    1_jupyterlab: |
      c.Spawner.cmd = ['jupyter-labhub']
# Create a simple authentication
auth:
  type: dummy
  dummy:
    password: '<YOUR PASSWORD>'
  whitelist:
    users:
      - <USER>

To create your 32-byte security token, simply run:

openssl rand -hex 32

… in the terminal and paste the result into the secretToken field of your config. Then specify your image, mount the ConfigMap for accessing the bucket, and set the paths to the NVIDIA drivers (this might or might not be necessary, depending on your cluster). You can create different profiles with different resource requests: in the above example, a profile with access to a GPU and one without are available. A simple password-based authentication is provided as well.

Now we can add the JupyterHub Helm chart repository:

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update

After a while, an “Update Complete. Happy Helming!” message should appear. We are ready to deploy the hub. From the directory containing config.yaml, run:

helm upgrade --install jupyterhub jupyterhub/jupyterhub --namespace kubeyard --version=0.8.2 --values config.yaml

Since the image is quite big, the deployment sometimes runs into timeout errors, so you might want to add a --timeout flag with a higher value, like 1000. With Helm 2 the value is interpreted as seconds (Helm 3 expects a duration such as 1000s), so the call could look like this:
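helm upgrade --install jupyterhub jupyterhub/jupyterhub --namespace kubeyard --version=0.8.2 --values config.yaml --timeout 1000

The deployment should create a hub pod and a proxy pod. As soon as both are running, we can port-forward the proxy to port 8000: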

kubectl port-forward <PROXY-POD NAME> 8000 -n kubeyard

Outlook on Part 3

Finally, Jupyter is up and running and port-forwarding is enabled. Now we can access JupyterHub from the browser at http://localhost:8000, log in (if authentication is on) and see the workspace of our JupyterLab. In the next part of our series we will finally use the prepared infrastructure for data science and compare the efficiency of four different approaches, including the usage of multiple GPUs!

