
Distributed Deep Learning with Ansible, AWS and PyTorch Lightning. Part 1


How to automate and scale your deep learning experiments with Ansible, AWS cloud infrastructure and the PyTorch Lightning library.

Let’s say you are a deep learning practitioner, but you don’t have an in-house GPU cluster or a machine learning platform at your disposal. Nobody has trained their models on a CPU for almost a decade now. Even worse, with models and datasets getting bigger, you have to deal with distributed deep learning and scale your training in a model-parallel and/or data-parallel regime. What can we do about it?

We can follow the modern cloud paradigm and use GPU-as-a-service: allocate the necessary infrastructure dynamically on demand and release it once you have finished. This works well, but here is where the main complexity lies. Modern deep learning frameworks like PyTorch Lightning or Horovod make data-parallel distributed training easy nowadays; the most annoying and time-consuming part is creating a proper environment, because we often have to do it manually. Even with services that hide a lot of infrastructure complexity from you, like Google Colab or Paperspace, some manual work still needs to be done.

I strongly believe that manual work is your enemy. Why? Here is my list of personal concerns:

  1. Reproducibility of results. Have you ever heard of the so-called human factor? We are very error-prone creatures, and we are not good at memorizing things in great detail. The more human work a process involves, the harder it will be to reproduce in the future.
  2. Mental distractions. Deep learning is an empirical endeavor, and your progress in it relies on your ability to iterate quickly and test as many hypotheses as you can. Anything that distracts you from your main task (training and evaluating your models, or analyzing the data) negatively affects the success of the overall process.
  3. Efficiency. Computers do many things a lot faster than we humans do. When you have to repeat the same slow procedure over and over, it all adds up.
Manual work is your enemy

In this article, I’ll describe how you can automate the way you conduct your deep learning experiments.

Automate your Deep Learning experiments

The following are three main ideas of this article:

  1. Utilize cloud-based infrastructure to dynamically allocate resources for your training purposes;
  2. Use a DevOps automation toolset to handle all the manual work of experiment environment setup;
  3. Write your training procedure with a modern deep learning framework that makes data-parallel distributed training effortless.

AWS EC2, Ansible and PyTorch Lightning. Image by author

To actually implement these ideas, we will use AWS cloud infrastructure, the Ansible automation tool, and the PyTorch Lightning deep learning library.

Our work will be divided into two parts. In this article, we will build a minimal working example which:

  • Automatically creates and destroys EC2 instances for our deep learning cluster;
  • Establishes the connectivity between them required for PyTorch and PyTorch Lightning distributed training;
  • Creates a local SSH config file to enable connection to the cluster;
  • Creates a Python virtual environment and installs all library dependencies for the experiment;
  • Provides a submit script to run distributed data-parallel workloads on the created cluster.

In the next article, we will add additional features and build a fully automated environment for distributed learning experiments.

Now, let’s take a brief look at the chosen technology stack.

What is AWS EC2?


Amazon Elastic Compute Cloud (EC2) is a core AWS service that allows you to manage virtual machines in Amazon data centers. With this service, you can dynamically create and destroy your machines either manually via the AWS Console or programmatically via the API provided by the AWS SDK.

As of today, AWS provides a range of GPU-enabled instances for our purposes, with one or multiple GPUs per instance and different choices of NVIDIA GPUs: GRID K520, Tesla M60, K80, T4, and V100. See the official site for the full list.
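
To make the “dynamically create and destroy” part concrete, here is a minimal sketch of allocating and releasing a GPU instance with the boto3 AWS SDK. The region, AMI ID, key pair name and instance type are placeholder assumptions, not values from this article:

```python
# A sketch using the boto3 AWS SDK (pip install boto3).
# The region, AMI ID and key pair name below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single GPU instance (a p2.xlarge carries one K80 GPU).
instance = ec2.create_instances(
    ImageId="ami-xxxxxxxx",    # placeholder: e.g. a Deep Learning AMI
    InstanceType="p2.xlarge",
    KeyName="my-key-pair",     # placeholder: an existing EC2 key pair
    MinCount=1,
    MaxCount=1,
)[0]

instance.wait_until_running()
instance.reload()              # refresh attributes such as the public IP
print("Node is up at", instance.public_ip_address)

# ...run your experiment, then release the resources:
instance.terminate()
```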

What is Ansible?


Ansible is a tool for software and infrastructure provisioning and configuration management. With Ansible you can provision a whole cluster of remote servers, install software on them, and monitor them.

It is an open-source project written in Python. It uses a declarative approach: you define the desired system state in ordinary YAML files, and Ansible executes the actions necessary to reach it. The declarative nature of Ansible also means that most of the instructions you define are idempotent: running them more than once will not cause any undesirable side effects.

One of the distinctive features of Ansible is that it is agentless, i.e. it doesn’t require any agent software to be installed on the managed nodes. It operates solely via the SSH protocol. So the only thing you need to ensure is SSH connectivity between the control host on which you run Ansible commands and the inventory hosts you want to manage.

Ansible core concepts

Let’s dive a bit into the core concepts of Ansible. There are not many of those, so you can quickly get your head around them and start playing with this brilliant tool.

  • Inventory

Inventory is simply the list of hosts you want to manage with Ansible, organized into named groups. You can define the inventory in an INI-formatted file if you have a static, predefined infrastructure. Alternatively, you can use inventory plugins that tell Ansible which hosts to operate on when your infrastructure is not known in advance or may change dynamically (as in our case here).
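
As a sketch, a small static inventory in YAML form could look like this; the group and host names are made up for illustration:

```yaml
# inventory.yml: hosts organized into two named groups (illustrative)
all:
  children:
    master:
      hosts:
        master.example.com:
    workers:
      hosts:
        worker1.example.com:
        worker2.example.com:
```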

  • Modules

A module is the unit of work that you can perform in Ansible. There is a massive library of modules you can use, and it makes for an extremely extensible architecture. See the module index.

  • Variables

Nothing fancy here. You can define variables like in any programming language, either to separate your logic from the data or to pass information between parts of your system. Ansible also collects a lot of system information and stores it in predefined variables called facts. You can read more about variables in the official documentation.
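
For instance, you can print every fact Ansible gathers about your hosts by invoking the built-in setup module as an ad-hoc command (inventory.yml here is the inventory sketch from above):

```
ansible all -i inventory.yml -m setup
```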

  • Tasks

A task is a module invocation with some parameters. You can also define a name, a variable to store the result, and conditional and loop expressions for the task. Here is a sketch of a task (the file paths are illustrative) that copies a local file into a remote computer’s file system when the some_variable variable is defined:

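```yaml
- name: Copy the application config to the remote host
  copy:
    src: files/app.conf      # path on the control host (illustrative)
    dest: /etc/app.conf      # path on the managed node (illustrative)
  register: copy_result      # store the module's result in a variable
  when: some_variable is defined
```
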
  • Plays

A play in Ansible is a way to apply a list of tasks to a group of hosts from the inventory. You define a play as a dictionary in YAML: the hosts parameter specifies an inventory group, and the tasks parameter contains the list of tasks.

  • Playbooks

A playbook is just a YAML file that contains a list of plays to run. The way to run a playbook is to pass it to the ansible-playbook CLI tool that comes with the Ansible installation.
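
Putting the concepts together, a minimal playbook with a single play targeting the workers group from the inventory sketch above might look like this (the task itself is illustrative):

```yaml
# playbook.yml: one play applied to the "workers" inventory group
- hosts: workers
  tasks:
    - name: Ensure git is installed
      apt:
        name: git
        state: present
      become: yes            # the apt module needs root privileges
```

You would then run it with something like ansible-playbook -i inventory.yml playbook.yml.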

Here’s a diagram illustrating how these concepts fit together:

Ansible core concepts

There are also more advanced concepts in Ansible that allow you to write more modular code for complex scenarios. We’ll use some of them in Part 2 of the article.

What is Pytorch Lightning?


PyTorch Lightning is a high-level library on top of PyTorch. You can think of it as Keras for PyTorch. There are a couple of features that make it stand out from the crowd of other PyTorch-based deep learning libraries:

  • It is transparent. As the authors put it in the documentation, it is more a convention for writing PyTorch code than a separate framework. You don’t need to learn another library, and you don’t need to make a huge effort to convert your ordinary PyTorch code to use it with PyTorch Lightning. Your PyTorch Lightning code is actually your PyTorch code.
  • It hides a lot of boilerplate engineering code. PyTorch is a brilliant framework, but when it comes to conducting full-featured experiments with it, you quickly end up with a lot of code that is not particularly related to the actual research you are doing, and you have to repeat this work every time. PyTorch Lightning provides this functionality for you. Specifically, it adds distributed data-parallel training capability to your model with no modifications to the code required from you at all!
  • It is simple. The whole PyTorch Lightning code base revolves around a small number of abstractions (a minimal sketch follows the list):
  1. LightningModule is a class that organizes your PyTorch code. The way you use PyTorch Lightning is by creating a custom class that inherits from LightningModule and implementing its virtual methods. LightningModule itself inherits from the PyTorch Module.
  2. Trainer automates your training procedure. Once you’ve organized your PyTorch code into a LightningModule, you pass its instance to a Trainer, and it does the actual heavy lifting of training.
  3. Callbacks, Loggers and Hooks are the means to customize the Trainer’s behavior.
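
Here is a minimal sketch of these two abstractions in action; the model and the random dataset are toy stand-ins, not code from this article:

```python
# A toy LightningModule plus a Trainer run (illustrative only).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    """Organizes the model, the loss and the optimizer in one class."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# A random toy dataset: 64 samples, 4 features, 1 target.
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))

trainer = pl.Trainer(max_epochs=1)
trainer.fit(LitRegressor(), DataLoader(dataset, batch_size=8))
```

The key point for us is that switching this code to multi-GPU, data-parallel training is a matter of changing the Trainer’s arguments (the exact flags depend on the library version), not the model code.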

PyTorch Lightning Architecture

For more information, read the official documentation.

Okay, enough talking, let’s start building.
