


How I Built an On-Premises AI Training Testbed with Kubernetes and Kubeflow
9 Apr 2021 10:14am, by Janakiram MSV
This post is the fourth in a series of articles exploring the Kubeflow machine learning platform. Check back each Friday for future installments. (Part 1, Part 2, and Part 3).
I recently built a five-node bare metal Kubernetes cluster comprising CPU and GPU hosts for all my AI experiments. Though it makes economic sense to provision this kind of infrastructure in the public cloud, I invested a fortune in an AI testbed that sits within my line of sight, for a few reasons.
First, GPUs in the public cloud are scarce resources; there is no guarantee that you will be able to provision a GPU host when you need it. Second, running GPU infrastructure in the cloud is expensive: AI-accelerated VMs cost about 5x as much as their CPU-only counterparts. Finally, I wanted a bare metal cluster with nodes that I can treat as cattle rather than pets.
I plan to experiment with multiple configurations through rapid provisioning and de-provisioning. For example, I want to understand how the Nvidia Container Toolkit works with Docker Engine and containerd by swapping the underlying container runtime. Similarly, I am curious to see the tradeoffs between the Nvidia GPU Device Plugin and the Nvidia GPU Operator. I also want to run diverse tools that serve and optimize models for the edge, including the Nvidia Triton Inference Server, the Intel OpenVINO Toolkit, and ONNX Runtime.
After experimenting with various configurations, I finally managed to find the ideal set of tools and software to build and manage the AI testbed.
Through this article, I want to give you an insight into the choices I made while building my dream setup.
The Hardware: Hybrid Cluster with CPU and GPU Hosts
Since this is a lab rather than a production setup, I wanted to optimize for cost. The cluster runs four CPU nodes and one GPU node powered by an Nvidia GeForce RTX 3090, a state-of-the-art AI accelerator.
Each CPU node is an Intel NUC powered by a Core i7 CPU with four cores and eight threads. I added 32 GB of DDR4 RAM and 1 TB of NVMe storage to each host, which is a decent configuration for development.
The GPU node is a custom-built PC based on the latest AMD Ryzen Threadripper 3990X CPU with 64 cores, an Nvidia GeForce RTX 3090 GPU with 24 GB of memory and 10,496 CUDA cores, 128 GB of RAM, and 3 TB of NVMe storage.
All the hosts are connected to a managed gigabit switch on a dedicated subnet, and each host has a static IP address.
Collectively, the cluster gives me 80 CPU cores, 256 GB of RAM, and 7 TB of NVMe storage, along with a GPU that has 10,496 CUDA cores and 24 GB of memory.
The Software: Kubernetes 1.18 running on Ubuntu 18.04 LTS
Given Nvidia's GPU driver support and compatibility, I chose Ubuntu 18.04.5 LTS Server as the base operating system.
Though I want to experiment with multiple versions of Kubernetes and container runtimes, the base version of Kubernetes is 1.18.9, which is stable and compatible with Kubeflow 1.2.
Since I want to manage this infrastructure like a set of VMs running in the cloud, I configured the hosts to network boot through PXE. I use Clonezilla Live Server to manage the imaging and cloning of the cluster.
The PXE boot process allows me to move from one configuration to another rapidly. The NAS server maintains a library of pre-configured bootable images of the hosts. Moving from 1.18.9 to the latest version of Kubernetes, for example, takes just 20 minutes to roll out a ready cluster.

For creating the baseline image of the cluster, I use either Kubespray or Nvidia DeepOps. To learn how to use DeepOps to configure GPU hosts, refer to this tutorial.
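As a reference for the DeepOps route, here is a minimal sketch of the deployment flow; the repository layout, playbook names, and inventory path reflect the DeepOps project at the time of writing and may change between releases:

```bash
# Clone Nvidia DeepOps and install its Ansible prerequisites
git clone https://github.com/NVIDIA/deepops.git
cd deepops
./scripts/setup.sh

# List the CPU and GPU hosts in config/inventory, then deploy
# Kubernetes (DeepOps wraps Kubespray under the hood)
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

# Confirm that all the nodes joined the cluster
kubectl get nodes -o wide
```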
The process of configuring remote boot, cloning, imaging, and managing various versions deserves a separate article, which I plan to cover in the future.
Configuring the GPU Node
Nvidia provides two mechanisms for exposing the GPU to Kubernetes workloads. The first approach is to prepare the host by installing the Nvidia driver and CUDA runtime, followed by the cuDNN libraries; essentially, you configure the host as if it were a standalone deep learning machine. Within Kubernetes, you then install the Nvidia Device Plugin, which talks to the software stack you configured to make the GPU accessible to containers and pods.
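If you take this first route, the driver and CUDA stack live on the host and the device plugin runs as a DaemonSet inside the cluster. A minimal sketch, assuming the v0.9.0 manifest published in the NVIDIA/k8s-device-plugin repository (pick the release that matches your driver):

```bash
# Deploy the Nvidia device plugin DaemonSet so the kubelet can
# advertise nvidia.com/gpu as a schedulable resource
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

# Verify that the GPU node now reports the GPU resource
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```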
The more elegant approach uses the Nvidia GPU Operator, which has no prerequisites other than a GPU attached to the machine. You don't even need to install the driver: Nvidia has containerized everything, including the driver, which runs as a DaemonSet on each node.
The Nvidia GPU Operator dramatically simplifies the configuration and preparation of a GPU host for Kubernetes. Even though tools such as nvidia-smi may not be installed on the host, you can access everything through containers. The operator configures the Nvidia Container Toolkit to bridge the gap between the physical GPU and the container runtime.

You can, however, access the GPU and its tools through a Kubernetes pod. The screenshot below shows a pod accessing the GPU.
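For context, a pod along these lines is enough for a quick smoke test; this is only a sketch, and the CUDA base image tag is an example rather than a requirement:

```bash
# Launch a throwaway pod that requests one GPU and runs nvidia-smi
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Inspect the nvidia-smi output once the pod completes
kubectl logs gpu-smoke-test
```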

Nvidia GPU Operator spares you from manually installing and configuring the drivers, runtime, and other tools required to access the GPU.
If you are setting up your cluster with the latest build of Nvidia DeepOps, the GPU Operator is the default option. When using Kubespray or kubeadm to set up your cluster, install the operator from its Helm chart.
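A minimal sketch of the Helm route, assuming the gpu-operator chart from Nvidia's NGC Helm repository; the chart name, repository URL, and flags may differ slightly across operator releases:

```bash
# Add Nvidia's Helm repository and install the GPU Operator, which deploys
# the driver, container toolkit, and device plugin as containers
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Watch the operator pods come up on the GPU node
kubectl get pods -n gpu-operator -w
```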
Container-Native Network and Storage Choices
My choice of CNI networking stack for Kubernetes is Calico. It is mature, secure, and reliable software for configuring and securing cloud native workloads running on Kubernetes.
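If you bootstrap the cluster with kubeadm rather than Kubespray or DeepOps (both of which can install the CNI for you), Calico can be applied straight from its manifest. A sketch, assuming the manifest URL that was current for this Kubernetes release:

```bash
# Install Calico as the CNI plugin
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Wait for the calico-node agents to become Ready on every host
kubectl get pods -n kube-system -l k8s-app=calico-node -w
```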
When it comes to storage, you need an overlay storage layer that supports shared volumes with ReadWriteMany (RWX) access. This is because multiple workloads running as part of Kubeflow need to share artifacts such as datasets, configuration files, and models.
I prefer Portworx by Pure Storage as the de facto storage engine for all my workloads. You can try the free version, PX-Essentials, to configure your cluster. In the upcoming parts of this series, I will do a deep dive into installing and configuring Portworx for running ML workloads.
If you want a storage backend that supports RWX volumes and claims, configure NFS on the master node and then use the NFS client provisioner for dynamic provisioning.
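A minimal sketch of that NFS path, assuming the nfs-subdir-external-provisioner Helm chart; the NFS server address and export path below are placeholders for the share exported from the master:

```bash
# Install the NFS client provisioner and point it at the master's NFS export
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.10 \
  --set nfs.path=/mnt/nfs/share \
  --set storageClass.defaultClass=true
```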

The screenshot below shows an NFS share exported from the master, made available through the NFS client provisioner.

The Helm Chart for the NFS client provisioner will also create the default storage class for dynamic provisioning.
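To illustrate how a Kubeflow workload would request a shared volume from that class, here is a sketch of an RWX claim; the storage class name nfs-client is the chart's default and may differ in your setup:

```bash
# Request a shared ReadWriteMany volume backed by the NFS provisioner
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-artifacts
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 10Gi
EOF
```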
We will take a closer look at the storage configuration and the need for RWX volumes when we explore Kubeflow Notebook Servers.
Kubeflow: The Cloud Native ML Platform
With all the prerequisites in place, I finally installed Kubeflow, which is the platform for all things ML on Kubernetes.
You can install it directly through the kfctl tool or rely on the Nvidia DeepOps installer. I chose the latter because of its simplicity and its integration with the other components of the stack.
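For reference, a sketch of the kfctl route, assuming the Kubeflow 1.2 kfctl binary is already on your PATH and using the generic Kubernetes-with-Istio manifest; the exact manifest URI may have moved since publication:

```bash
# Point kfctl at a working directory and the Kubeflow 1.2 kfdef manifest
export KF_NAME=kubeflow
export KF_DIR=${HOME}/kf/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml"

# Deploy Kubeflow into the cluster
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}
```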

For a quick overview of Kubeflow components, refer to the previous part of this series.
In the next part of this series, we will put this infrastructure to work by configuring Kubeflow Notebook Servers that perform data preparation, training, and inference. Stay tuned.