
Data Processing Scaled Up and Out with Dask and RAPIDS: Installing a Data Science App (Dask Client)

Source: https://www.inovex.de/blog/data-processing-dask-rapids-installing-data-science-app-dask-client/

Rafal Lokuciejewski

This blog post tutorial shows how a scalable and high-performance environment for machine learning can be set up using GPUs, a Kubernetes cluster, Dask and Jupyter. In the first article of our blog series we set up a Kubernetes cluster with access to GPUs. In this part we will add containerized applications to the cluster to be able to run data processing workloads in it. More precisely: we will prepare a notebook image that has CUDA installed, which is required if we want to use GPU-based frameworks. Furthermore, the image should contain Dask, RAPIDS and Dask-RAPIDS. As soon as the image is ready, we will deploy JupyterHub, which spawns said notebook image as a container for each user.

We will use JupyterLab notebooks as an interactive environment to start our data processing algorithms. In other words, JupyterLab will act as a Dask client. As we want to provide an environment not only for one data scientist but for a group of users, we decided to install JupyterHub on our Kubernetes cluster. JupyterHub makes it possible to serve a pre-configured data science environment to a group of users.

Permissions for Dask-Clients

First, we have to take care of the permissions of our JupyterLab instances. When used as a Dask client, an instance needs sufficient permissions to start new pods acting as Dask workers. Since we decided to install JupyterHub, no extra configuration is required: JupyterHub uses a Service Account with sufficient permissions by default. If you want to use Dask from a different environment, you have to make sure that your client is allowed to create, delete, list etc. your Dask worker pods via a Service Account.
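For such a custom client, a minimal RBAC sketch could look like the following. The namespace kubeyard comes from the first part of this series; the names dask-client-sa and dask-worker-manager are hypothetical, and depending on your Dask setup additional resources (e.g. services or pod logs) may have to be granted:

# Service Account under which the Dask client will run (name is hypothetical)
kubectl create serviceaccount dask-client-sa -n kubeyard
# Role that allows managing the Dask worker pods
kubectl create role dask-worker-manager -n kubeyard \
    --verb=get,list,watch,create,delete --resource=pods
# Bind the Role to the Service Account
kubectl create rolebinding dask-worker-manager-binding -n kubeyard \
    --role=dask-worker-manager --serviceaccount=kubeyard:dask-client-sa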

Docker Image for Jupyter

JupyterHub is a multi-user version of JupyterLab. The hub creates a pod in the cluster for each user and pulls the notebook image that runs on that pod. There are official Jupyter notebook images like the minimal-notebook or the datascience-notebook that are ready to use. However, to use the RAPIDS library, the CUDA Toolkit is required. So we cannot take these base images and simply add RAPIDS and Dask to them.

It seems to be a good idea to create a base image which contains Jupyter and CUDA and use it to build an image with RAPIDS and Dask on top. Since RAPIDS and Dask are still in development and new versions are released frequently, keeping Jupyter and CUDA in a separate base image will make our final image easier to maintain.

Fortunately, there are not only official notebook images but also official images from NVIDIA with CUDA, and we can simply combine both. We will use the base-notebook Dockerfile from the jupyter/docker-stacks repository and the CUDA 10.2 Dockerfile for Ubuntu 18.04 (10.2-base-ubuntu18.04) from NVIDIA's CUDA container images repository, and combine them into a single image. Keep in mind that for the base-notebook you need to have the following files together with your Dockerfile:

  1.  fix-permissions
  2.  jupyter_notebook_config.py
  3.  start.sh
  4.  start-notebook.sh
  5.  start-singleuser.sh

All these files can be found in the base-notebook folder of the jupyter/docker-stacks repository mentioned above; one way to fetch them is shown below.
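A small sketch for downloading the helper files, assuming base-notebook still lives at this path on the master branch of jupyter/docker-stacks (the repository has been restructured over time, so adjust the path if necessary):

# fetch the helper files next to your Dockerfile (path is an assumption)
BASE=https://raw.githubusercontent.com/jupyter/docker-stacks/master/base-notebook
for f in fix-permissions jupyter_notebook_config.py start.sh start-notebook.sh start-singleuser.sh; do
    wget -q "$BASE/$f"
done

The resulting Dockerfile is listed below: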

ARG ROOT_CONTAINER=ubuntu:bionic-20200311@sha256:e5dd9dbb37df5b731a6688fa49f4003359f6f126958c9c928f937bec69836320
ARG BASE_CONTAINER=$ROOT_CONTAINER
FROM $BASE_CONTAINER
LABEL maintainer="Jupyter Project <[email protected]>"
ARG NB_USER="jovyan"
ARG NB_UID="1000"
ARG NB_GID="100"
USER root
# Install all OS dependencies for notebook server that starts but lacks all
# features (e.g., download as all possible file formats)
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install -yq --no-install-recommends \
    wget \
    bzip2 \
    ca-certificates \
    sudo \
    locales \
    fonts-liberation \
    run-one \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
    locale-gen
# Configure environment
ENV CONDA_DIR=/opt/conda \
    SHELL=/bin/bash \
    NB_USER=$NB_USER \
    NB_UID=$NB_UID \
    NB_GID=$NB_GID \
    LC_ALL=en_US.UTF-8 \
    LANG=en_US.UTF-8 \
    LANGUAGE=en_US.UTF-8
ENV PATH=$CONDA_DIR/bin:$PATH \
    HOME=/home/$NB_USER
# Copy a script that we will use to correct permissions after running certain commands
COPY fix-permissions /usr/local/bin/fix-permissions
RUN chmod a+rx /usr/local/bin/fix-permissions
# Enable prompt color in the skeleton .bashrc before creating the default NB_USER
RUN sed -i 's/^#force_color_prompt=yes/force_color_prompt=yes/' /etc/skel/.bashrc
# Create NB_USER with name jovyan, UID=1000, in the 'users' group
# and make sure these dirs are writable by the `users` group.
RUN echo "auth requisite pam_deny.so" >> /etc/pam.d/su && \
    sed -i.bak -e 's/^%admin/#%admin/' /etc/sudoers && \
    sed -i.bak -e 's/^%sudo/#%sudo/' /etc/sudoers && \
    useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
    mkdir -p $CONDA_DIR && \
    chown $NB_USER:$NB_GID $CONDA_DIR && \
    chmod g+w /etc/passwd && \
    fix-permissions $HOME && \
    fix-permissions $CONDA_DIR
USER $NB_UID
WORKDIR $HOME
ARG PYTHON_VERSION=default
# Setup work directory for backward-compatibility
RUN mkdir /home/$NB_USER/work && \
    fix-permissions /home/$NB_USER
ENV MINICONDA_VERSION=4.6.14 \
    CONDA_VERSION=4.7.10
RUN cd /tmp && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "718259965f234088d785cad1fbd7de03 *Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh" | md5sum -c - && \
    /bin/bash Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
    rm Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "conda ${CONDA_VERSION}" >> $CONDA_DIR/conda-meta/pinned && \
    $CONDA_DIR/bin/conda config --system --prepend channels conda-forge && \
    $CONDA_DIR/bin/conda config --system --set auto_update_conda false && \
    $CONDA_DIR/bin/conda config --system --set show_channel_urls true && \
    $CONDA_DIR/bin/conda install --quiet --yes conda && \
    $CONDA_DIR/bin/conda update --all --quiet --yes && \
    conda list python | grep '^python ' | tr -s ' ' | cut -d '.' -f 1,2 | sed 's/$/.*/' >> $CONDA_DIR/conda-meta/pinned && \
    conda clean --all -f -y && \
    rm -rf /home/$NB_USER/.cache/yarn && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Tini
RUN conda install --quiet --yes 'tini=0.18.0' && \
    conda list tini | grep tini | tr -s ' ' | cut -d ' ' -f 1,2 >> $CONDA_DIR/conda-meta/pinned && \
    conda clean --all -f -y && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Install Jupyter Notebook, Lab, and Hub
# Generate a notebook server config
# Cleanup temporary files
# Correct permissions
# Do all this in a single RUN command to avoid duplicating all of the
# files across image layers when the permissions change
RUN conda install --quiet --yes \
    'notebook=6.0.3' \
    'jupyterhub=1.1.0' \
    'jupyterlab=2.0.1' && \
    conda clean --all -f -y && \
    npm cache clean --force && \
    jupyter notebook --generate-config && \
    rm -rf $CONDA_DIR/share/jupyter/lab/staging && \
    rm -rf /home/$NB_USER/.cache/yarn && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
EXPOSE 8888
# Configure container startup
ENTRYPOINT ["tini", "-g", "--"]
CMD ["start-notebook.sh"]
# Copy local files as late as possible to avoid cache busting
COPY start.sh start-notebook.sh start-singleuser.sh /usr/local/bin/
COPY jupyter_notebook_config.py /etc/jupyter/
# Fix permissions on /etc/jupyter as root
USER root
RUN fix-permissions /etc/jupyter/
# Switch back to jovyan to avoid accidental container runs as root
USER $NB_UID
##################CUDA
USER root
#FROM ubuntu:18.04
LABEL maintainer "NVIDIA CORPORATION <[email protected]>"
RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
rm -rf /var/lib/apt/lists/*
ENV CUDA_VERSION 10.2.89
ENV CUDA_PKG_VERSION 10-2=$CUDA_VERSION-1
# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
cuda-compat-10-2 && \
ln -s cuda-10.2 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*
# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.2 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
USER $NB_UID

It is important to switch to the root user for the CUDA part and back to the normal notebook user afterwards.

We have to build this image and push it to a registry of our choice, for example (the registry path is a placeholder):
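# build and push the Jupyter + CUDA base image (registry path is a placeholder)
docker build -t <your_registry> .
docker push <your_registry>

Now we have a base image with Jupyter and CUDA. To create the final image on top, we need to install the RAPIDS libraries (cuDF and cuML), Dask, Dask-cuDF and Dask-cuML; the plain (non-Dask) RAPIDS libraries are required by their Dask counterparts. This can be done in just a few steps and the Dockerfile looks like this: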

FROM <your_registry>
###############################################cudf
RUN conda install --yes -c rapidsai -c nvidia -c conda-forge \
    -c defaults cudf=0.13 cuml=0.13 python=3.7
##############################################DASK
RUN conda install --yes \
    -c conda-forge -c rapidsai -c nvidia -c defaults \
    python-blosc \
    cytoolz \
    dask==2.15.0 \
    nomkl \
    numpy==1.18.1 \
    pandas==0.25.3 \
    tini==0.18.0 \
    zstd==1.4.3 \
    && conda clean --all -f -y \
    && find /opt/conda/ -type f,l -name '*.a' -delete \
    && find /opt/conda/ -type f,l -name '*.pyc' -delete \
    && find /opt/conda/ -type f,l -name '*.js.map' -delete \
    && find /opt/conda/lib/python*/site-packages/bokeh/server/static -type f,l -name '*.js' -not -name '*.min.js' -delete \
    && rm -rf /opt/conda/pkgs
RUN python3 -m pip install pip --upgrade
COPY requirements.txt /home/files/requirements.txt
RUN pip install --default-timeout=300 -r /home/files/requirements.txt
#USER $NB_UID

The first RUN command installs cuDF and cuML. The second installs Dask together with a few required libraries like NumPy and pandas; its cleanup steps were copied from the daskdev/dask:latest Dockerfile. We will discuss later why copying them was a good idea.

Finally, the last RUN command installs the libraries specified in requirements.txt (which needs to be accessible while building the image) via pip. These libraries are dask-kubernetes, dask_cuda, dask_cudf, dask_cuml and gcsfs (needed to read from Google Cloud Storage buckets).
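A requirements.txt consistent with this list could be created as follows; the exact package names and compatible versions depend on the RAPIDS release you chose, so treat this file as a sketch:

# write a requirements.txt next to the Dockerfile; pin versions as needed
cat > requirements.txt <<'EOF'
dask-kubernetes
dask_cuda
dask_cudf
dask_cuml
gcsfs
EOF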

Again, we build the image and push it to a repository.

Deploying JupyterHub

Now we are ready to deploy JupyterHub into our Kubernetes cluster. The Zero to JupyterHub guide provides a lot of information about deploying it on Kubernetes, including many details on how to customize and personalize your deployment. We will come straight to the point: create a file config.yaml according to your configuration preferences. My config looks like this:

proxy:
  secretToken: "<YOUR 32 BYTE SECURITY TOKEN>"
  # Do not assign a public IP
  service:
    type: NodePort
singleuser:
  defaultUrl: "/lab"
  # The service account we created for Jupyter
  serviceAccountName: jupyter-service-account
  # The final image we built
  image:
    name: <REGISTRY PATH HERE>
    tag: <TAG>
  storage:
    # Customize storage for the Jupyter client (default 10Gi)
    capacity: 20Gi
    # Mounts for the NVIDIA drivers
    extraVolumes:
      - name: nvidia-debug-tools
        hostPath:
          path: /home/kubernetes/bin/nvidia/bin
      - name: nvidia-libraries
        hostPath:
          path: /home/kubernetes/bin/nvidia/lib64
      # The NFS PVC
      - name: my-pvc-nfs
        persistentVolumeClaim:
          claimName: nfs
    extraVolumeMounts:
      # Mount the NVIDIA driver paths
      - name: nvidia-debug-tools
        mountPath: /usr/local/bin/nvidia
      - name: nvidia-libraries
        mountPath: /usr/local/nvidia/lib64
      # Mount the NFS
      - name: my-pvc-nfs
        mountPath: "/home/jovyan/mnt"
  # Create two profiles: a notebook server with and without a GPU
  profileList:
    - display_name: "GPU Server"
      description: "Spawns a notebook server with access to a GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
    - display_name: "CPU Server"
      description: "Spawns a notebook server without access to a GPU"
hub:
  extraConfig:
    # Use JupyterLab by default
    1_jupyterlab: |
      c.Spawner.cmd = ['jupyter-labhub']
# Create a simple authentication
auth:
  type: dummy
  dummy:
    password: '<YOUR PASSWORD>'
  whitelist:
    users:
      - <USER>

To create your 32-byte security token, simply run:

openssl rand -hex 32

… in the terminal and paste the result into the secretToken field of your config. Then specify your image, mount the ConfigMap for accessing the bucket, and set the paths to the NVIDIA drivers (this might or might not be necessary, depending on your cluster). You can create different profiles with different resource requests: in the above example, a profile with access to a GPU and one without are available. A simple password-based authentication is provided as well.

Now we can add the JupyterHub Helm chart repository:

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update

After a while, an “Update Complete. Happy Helming!” message should appear. We are ready to deploy the hub. From the directory containing config.yaml, run:

helm upgrade --install jupyterhub jupyterhub/jupyterhub --namespace kubeyard --version=0.8.2 --values config.yaml

Since the image is quite big, the deployment sometimes runs into timeout errors, so you might want to add a --timeout flag with a higher value, like 1000. With Helm 2 the value is interpreted as seconds (Helm 3 expects a duration such as 1000s), so the call could look like this:
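helm upgrade --install jupyterhub jupyterhub/jupyterhub --namespace kubeyard --version=0.8.2 --values config.yaml --timeout 1000

The deployment should create a hub pod and a proxy pod. As soon as both are running, we can port-forward the proxy to port 8000: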

kubectl port-forward <PROXY-POD NAME> 8000 -n kubeyard

Outlook on Part 3

Finally, Jupyter is up and running and port-forwarding is enabled. Now we can access JupyterHub from the browser at http://localhost:8000, log in (if authentication is on) and see the workspace of our JupyterLab. In the next part of our series we will finally use the prepared infrastructure for data science and compare the efficiency of four different approaches, including the usage of multiple GPUs!

