
GPU-as-a-Service on KubeFlow: Fast, Scalable and Efficient ML


Machine Learning (ML) and Deep Learning (DL) involve compute and data intensive tasks. In order to maximize our model accuracy, we want to train on larger datasets, evaluate a variety of algorithms, and try out different parameters for each algorithm (hyper-parameter tuning).

As our datasets and model complexity grow, so does the time we need to wait for our jobs to complete, leading to inefficient use of our time. We end up running fewer iterations and tests or working on smaller datasets as a result.

NVIDIA GPUs are a great tool to accelerate our data science work. They are the obvious choice for Deep Learning workloads and deliver better ROI than CPUs. With new developments like RAPIDS, NVIDIA is tackling data analytics and machine learning workloads (like XGBoost) efficiently (read the details in my previous post: Python Pandas at Extreme Performance). For example, an analytic task that reads JSON data, aggregates its metrics, and writes the result back to a compressed (Parquet) file runs in 1.4 sec on GPU versus 43.4 sec on CPU (that's 30x faster!).
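To give a sense of what such a task looks like in code, here is a minimal RAPIDS (cuDF) sketch; the file paths and column names are illustrative, not the benchmark's actual code:

import cudf  # RAPIDS GPU DataFrame library, with a pandas-like API

# read newline-delimited JSON straight into GPU memory (path is illustrative)
df = cudf.read_json('events.json', lines=True)

# aggregate a couple of metrics per key (column names are illustrative)
agg = df.groupby('user_id').agg({'latency_ms': 'mean', 'bytes': 'sum'})

# write the result back as a compressed Parquet file
agg.to_parquet('metrics.parquet', compression='snappy')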

The Challenge: Sharing GPUs

CPUs have long supported technologies such as virtualization (hypervisors), virtual memory, and IO management. We can run many different workloads, virtual machines and containers on the same CPUs while ensuring maximum isolation. We can use a variety of clustering technologies to scale computation across multiple CPUs and systems and use schedulers to dynamically allocate computation to tasks.

GPUs, on the other hand, must be assigned to specific VMs or containers. This leads to inefficiency: when a GPU-intensive task finishes, the GPU stays idle. If we associate a notebook server with a GPU, for example, we waste GPU resources whenever no task is running, such as while we are writing or debugging code, or when we go to lunch. This is a problem, especially since GPU-equipped servers are more expensive.

Some solutions partition a single GPU into smaller virtual GPUs, but this doesn't solve the problem: the (mostly idle) fragment we get is often too small or has too little memory to run our task, and because it's hard to isolate memory between tasks on the same GPU, we can run into many potential glitches.

Solution: Dynamic GPU Allocation and Scale-Out

The solution users are looking for is one that can harness multiple GPUs for a single task (so it can complete faster) and allocate GPUs just for the duration of the task. This can be made possible by combining containers with orchestration, clustering and a shared data layer.

Let’s assume we write some code in our Jupyter notebook or IDE (e.g. PyCharm). We can execute it locally, but when we need scale, we turn on a knob and it runs 10–100x faster on a distributed cluster. Wouldn’t that be nice? Can we implement such a dream? Yes, we can, and I will show you a demo a little further into this article.

To achieve this, we need to be able to package and clone our code and libraries to multiple dynamically scheduled containers at run time. We need all those containers to share the same data and to implement a task distribution/parallelism mechanism, as illustrated in the following diagram.


Dynamic GPU/CPU Allocation (image by author)

A new open-source ML orchestration framework called MLRun allows us to define “serverless” ML functions which consist of code, configuration, packages and infrastructure dependencies (such as CPUs, GPUs, memory, storage, etc.). Those “serverless” functions can run locally in our notebook or over one or more containers which are created dynamically for the duration of the task (or stay longer if needed). The client/notebook and the containers can share the same code and data through a low-latency shared data plane (i.e. virtualized as one logical system).
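As a rough illustration of the idea (the function name, file name and image below are assumptions, and exact arguments may vary between MLRun versions), a function can be defined once and then run either locally or on the cluster:

import mlrun

# wrap a Python file as an MLRun "serverless" function (file name and image are illustrative)
fn = mlrun.code_to_function(name='my-trainer', filename='trainer.py',
                            kind='job', image='mlrun/mlrun')
fn.with_requests(mem='2G', cpu=2)   # declare the resources the container should request

# run locally inside the notebook ...
run = fn.run(handler='train', params={'lr': 0.01}, local=True)
# ... or drop local=True to run the same function as a container on the cluster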

MLRun builds on top of Kubernetes and KubeFlow and uses the Kubernetes API to create and manage resources. It leverages KubeFlow custom resources (CRDs) to seamlessly run horizontally scaling workloads (such as Nuclio functions, Spark, Dask, Horovod …), the KubeFlow SDK to attach tasks to resources like storage volumes and secrets, and KubeFlow Pipelines to create multi-step execution graphs (DAGs).

Every local or remote task executed through MLRun is tracked by the MLRun service controller; all inputs, outputs, logs, and artifacts are stored in a versioned database and can be browsed using a simple UI, the SDK, or REST API calls: in other words, built-in job and artifact management. MLRun functions can be chained to form a pipeline, and they support hyper-parameters and AutoML tasks, Git integration, and project packaging, but those are topics for other posts; read more here.
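As a small, hedged example of that SDK access (the exact method arguments may differ between MLRun versions), tracked runs can be queried programmatically:

import mlrun

db = mlrun.get_run_db()                                # connect to the MLRun service/database
runs = db.list_runs(project='default', name='train')   # browse tracked runs and their artifacts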


GPU-as-a-Service Stack (image by author)

Example: Distributed Image Classification Using Keras and TensorFlow

In our example we have a 4-step pipeline based on the famous Cats & Dogs TensorFlow use-case:

  1. Data ingestion function — loading an image archive from AWS S3
  2. Data labeling function — labeling images as Dogs or Cats
  3. Distributed training function — Use TensorFlow/Keras with Horovod to train our model
  4. Deploy an interactive model serving function

You can see the full MLRun project and notebook sources here; we will focus on the 3rd and 4th steps. For this to work, you need Kubernetes with a few open-source services running over it (Jupyter, KubeFlow Pipelines, KubeFlow MpiJob, Nuclio, MLRun) and shared file system access, or you can ask for an Iguazio cloud trial with those pre-integrated.

Our code can run locally (see the 1st and 2nd steps in the notebook). To run the distributed training on a cluster with GPUs, we simply define our function with the MPIJob kind (one which uses MPI and Horovod to distribute our TensorFlow logic): we specify a link to the code, a container image (alternatively MLRun can build the image for us), the required number of containers, the number of GPUs per container, and attach it to a file mount (we apply the Iguazio low-latency v3io fabric mount, but other K8s shared file volume drivers or object storage solutions work just as well). A sketch of such a definition follows below.
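Here is a hedged sketch of that definition (the file name, image, replica and GPU counts are illustrative assumptions; see the linked notebook for the actual code, and note that SDK details may differ between MLRun versions):

import mlrun
from mlrun.platforms import mount_v3io   # Iguazio v3io shared-file mount helper

# create an MPIJob (MPI + Horovod) function from our training code
trainer = mlrun.code_to_function(name='horovod-trainer', filename='horovod_training.py',
                                 kind='mpijob', image='mlrun/ml-models-gpu')
trainer.spec.replicas = 4        # number of worker containers
trainer.with_limits(gpus=1)      # GPUs per container
trainer.apply(mount_v3io())      # attach the shared low-latency file mount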

Once we have defined the function object, all we need to do is run it with a set of parameters and inputs (datasets or files), and specify the default location for output artifacts (e.g. trained model files).

# run the distributed training task; watch=True streams progress and logs back to the notebook
mprun = trainer.run(name='train', params=params, artifact_path='/User/mlrun/data', inputs=inputs, watch=True)

Note that in this example we didn't need to move code or data. Because we use the same low-latency shared file system mounts across our notebook and worker containers, we can just modify the code in Jupyter and re-run our job (all the job containers will see the new .py file changes), and all the job results can be viewed instantly in the Jupyter file browser or the MLRun UI.

In addition to viewing the job progress interactively (watch=True), the run object (mprun) holds all the information on the run, including pointers to the output artifacts, logs, status, etc. We can use the MLRun web-based UI to track job progress, compare experiment results or access versioned artifacts.
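For example (a small sketch; the “model” output name comes from the training step, and the exact accessors may vary by MLRun version):

print(mprun.state())                 # run status, e.g. 'completed'
print(mprun.outputs)                 # dict of result metrics and artifact paths
model_path = mprun.outputs['model']  # pointer to the trained model artifact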

We use “.save()” to serialize and version our function object into a database; we can retrieve this function object later in a different notebook or in CI/CD pipelines (no need to copy code and config between notebooks and teams).
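A sketch of that flow (the db:// address below is an illustrative assumption):

trainer.save()    # store the versioned function object in the MLRun database
# later, from another notebook or a CI/CD job:
trainer = mlrun.import_function('db://default/horovod-trainer')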

If we want to deploy the generated model as an interactive serverless function, all we need to do is feed the “model” and “category_map” outputs into a “serving” function and deploy it to our test cluster.

MLRun orchestrates auto-scaling Nuclio functions, which are super-fast and can be stateful (they support GPU attachment, shared file mounts, state caching, streaming, etc.). The functions auto-scale to fit the workload and scale to zero if no requests arrive for a few minutes (consuming zero resources). In this example we use “nuclio-serving” functions (Nuclio functions which host standard KFServing model classes); as the sketch below shows, it only takes one command (deploy) to make it run as a live serverless function.
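Here is a hedged sketch of that deployment (the function name, file and environment variables are illustrative assumptions, not the demo's exact code):

# wrap the model-serving code as a Nuclio function (names and paths are illustrative)
serving = mlrun.code_to_function(name='cat-vs-dog-serving', filename='nuclio_serving.py',
                                 kind='nuclio', image='mlrun/mlrun')
serving.set_env('MODEL_PATH', mprun.outputs['model'])
serving.set_env('CATEGORY_MAP', mprun.outputs['category_map'])
addr = serving.deploy()   # build and deploy an auto-scaling serverless endpoint, returns its URL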


Now we have a running inference function, and we can test the endpoint using a simple HTTP request with a URL, or even a binary image, in the payload.
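For example (a sketch; 'addr' is the endpoint URL returned by deploy(), and the request schema and image URL are illustrative):

import requests

event = {'data_url': 'https://s3.amazonaws.com/my-bucket/cat.jpg'}   # illustrative payload
resp = requests.post(addr, json=event)
print(resp.json())   # predicted class and probability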


End to End Workflow with KubeFlow Pipelines

Now that we tested each step of our pipeline manually, we may want to automate the process and potentially run it on a given schedule or be triggered by an event (e.g. a Git push). The next step is to define a KubeFlow Pipeline graph (DAG) which chains the 4 steps into a sequence and run the graph.

MLRun functions can be converted into KubeFlow Pipeline steps using a simple method (.as_step()) and by specifying how step outputs are fed into other step inputs. Check the full notebook example here; the following sketch illustrates the graph DSL.
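A hedged sketch of such a pipeline (the step names, handlers, parameters and the 'utils' helper function are illustrative assumptions that follow the 4 steps above; 'trainer' and 'serving' are the MLRun function objects defined earlier, and the real code lives in the linked notebook):

from kfp import dsl

@dsl.pipeline(name='Image classification demo', description='Cats & Dogs training pipeline')
def image_pipeline(source_url):
    # steps 1 + 2: ingest the image archive and label the images (handlers are illustrative)
    ingest = utils.as_step(name='ingest', handler='open_archive',
                           inputs={'archive_url': source_url}, outputs=['content'])
    label = utils.as_step(name='label', handler='label_images',
                          inputs={'source_dir': ingest.outputs['content']},
                          outputs=['file_categories'])
    # step 3: distributed Horovod training, consuming the labeled data
    train = trainer.as_step(name='train', params={'epochs': 2},
                            inputs={'data_path': label.outputs['file_categories']},
                            outputs=['model', 'category_map'])
    # step 4: deploy the serving function with the newly trained model
    deploy = serving.deploy_step(models={'cat_vs_dog_v1': train.outputs['model']})

The resulting pipeline can then be compiled and submitted with the standard KubeFlow Pipelines client, or launched through an MLRun project workflow as described next.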

MLRun projects can have multiple workflows, and they can be launched with a single command or can be triggered by various events such as a Git push or HTTP REST call.

Once we run our pipeline, we can track its progress using KubeFlow; MLRun will automatically register metrics, inputs, outputs and artifacts in the KubeFlow UI without our writing a single extra line of code (I suggest you try doing it once without MLRun to appreciate that 😊).


KubeFlow Output (image by author)

For a more basic project example, see the MLRun Iris XGBoost Project; other demos can be found in the MLRun Demos repository, and you can check the MLRun readme and examples for tutorials and simple examples.

Summary

This article demonstrates how computational resources can be used efficiently to run data science jobs at scale but, more importantly, how data science development can be simplified and automated, allowing far greater productivity and faster time to market. Ping me on the KubeFlow Slack if you have additional questions.

Check out my GPU As A Service presentation and live demo from KubeCon NA in November.

Yaron

