
How to train Deep Learning models on AWS Spot Instances using Spotty?

source link: https://towardsdatascience.com/how-to-train-deep-learning-models-on-aws-spot-instances-using-spotty-8d9e0543d365


Spotty is a tool that drastically simplifies training deep learning models on AWS.

Why will you ❤️ this tool?

  • it makes training on AWS GPU instances as simple as training on your local computer
  • it automatically manages all necessary AWS resources including AMIs, volumes, snapshots and SSH keys
  • it makes your model trainable on AWS by anyone with just a couple of commands
  • it uses tmux to easily detach remote processes from SSH sessions
  • it saves you up to 70% on costs by using AWS Spot Instances

To show how it works, let's take a non-trivial model and train it. I chose one of the implementations of Tacotron 2, Google's speech synthesis system.

Clone the repository of Tacotron 2 to your computer:

git clone https://github.com/Rayhane-mamah/Tacotron-2.git

Docker Image

Spotty trains models inside a Docker container. So we need to either find a publicly available Docker image that satisfies the model’s requirements or create a new Dockerfile with a proper environment.

This implementation of Tacotron uses Python 3 and TensorFlow, so we could use the official TensorFlow image from Docker Hub. But this image doesn't satisfy all the requirements from the "requirements.txt" file, so we need to extend the image and install all necessary libraries on top of it.

Copy the "requirements.txt" file to "docker/requirements-spotty.txt" and create a docker/Dockerfile.spotty file.
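
A minimal sketch of such a Dockerfile, assuming the official TensorFlow GPU image as the base (the exact tag is an assumption; use one that matches the TensorFlow version the model requires):

# Base image: official TensorFlow GPU build (the tag is illustrative).
FROM tensorflow/tensorflow:1.15.2-gpu-py3

# Install the project's dependencies on top of the base image;
# the requirements file is expected next to this Dockerfile in docker/.
COPY requirements-spotty.txt /tmp/requirements-spotty.txt
RUN pip install -r /tmp/requirements-spotty.txt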

Here we’re extending the original TensorFlow image and installing all other requirements. This image will be built automatically when you start an instance.

Spotty Configuration File

Once we have the Dockerfile, we're ready to write a Spotty configuration file. Create a spotty.yaml file in the root directory of your project.

The file consists of 4 sections: project, container, instances, and scripts. Let's look at them one by one.
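
At the top level it looks like this:

project:
  # ... project name and sync filters ...
container:
  # ... Docker container settings ...
instances:
  # ... EC2 instance and EBS volume settings ...
scripts:
  # ... custom scripts to run on the instance ...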

Section 1: Project

This section contains the following parameters:

  1. name: name of the project. This name will be used in the names of all AWS resources created by Spotty for this project. For example, it will be used as a prefix for EBS volumes, or in the name of the S3 bucket that helps to synchronize the project’s code with the instance.
  2. syncFilters: synchronization filters. These filters are used to skip some directories or files when synchronizing the project's code with a running instance. In the example below we're ignoring PyCharm configuration, Git files, Python cache files, and training data. Under the hood, Spotty passes these filters to the "aws s3 sync" command, so you can get more information about them here: Use of Exclude and Include Filter.
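
A sketch of this section (the project name and the exact filter patterns are assumptions for this repository):

project:
  # Used as a prefix in the names of AWS resources created by Spotty.
  name: tacotron2
  # Skip IDE, Git, Python cache and dataset files during synchronization.
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - training_data/*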

Section 2: Container

This section describes a Docker container for your project:

  1. projectDir: a directory inside the container where the local project will be synchronized once an instance is started. Make sure that it's either a subdirectory of a volume mount path (see below) or exactly matches a volume mount path; otherwise, all remote changes to the project's code will be lost once the instance is stopped.
  2. volumeMounts: defines directories inside a container where EBS volumes should be mounted. EBS volumes themselves will be described in the instances section of the configuration file. Each element of this list describes one mount point, where the name parameter should match the corresponding EBS volume from the instance section (see below), and the mountPath parameter specifies a volume’s directory inside a container.
  3. file: a path to the Dockerfile that we created before. The Docker image will be built automatically once the instance is started. As an alternative approach, you could build the image locally and push it to Docker Hub, then specify the image directly by its name using the image parameter instead of the file parameter.
  4. ports: ports that should be exposed by the instance. In the example below we open 2 ports: 6006 for TensorBoard and 8888 for Jupyter Notebook.
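
A sketch of this section (the volume name, mount path, and project directory are illustrative):

container:
  # The project's code will be synchronized into this directory.
  projectDir: /workspace/project
  # Build the Docker image from the Dockerfile we created above.
  file: docker/Dockerfile.spotty
  # Mount the "workspace" EBS volume from the instances section;
  # note that projectDir is a subdirectory of its mount path.
  volumeMounts:
    - name: workspace
      mountPath: /workspace
  # Expose the TensorBoard and Jupyter Notebook ports.
  ports: [6006, 8888]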

Read more about other container parameters in the documentation.

Section 3: Instances

This section describes a list of instances with their parameters. Each instance contains the following parameters:

  1. name: name of the instance. This name will be used in the names of AWS resources that were created specifically for this instance. For example, EBS volumes and an EC2 instance itself. Also, this name can be used in the Spotty commands if you have more than one instance in the configuration file. For example, spotty start i1.
  2. provider: a cloud provider for the instance. At the moment Spotty supports only the "aws" provider (Amazon Web Services), but Google Cloud Platform will be supported in the near future as well.
  3. parameters: parameters of the instance. They are specific to a cloud provider. See parameters for an AWS instance below.

AWS instance parameters:

  1. region: AWS region where a Spot Instance should be launched.
  2. instanceType: type of an EC2 instance. Read more about AWS GPU instances here.
  3. volumes: a list of EBS volumes that should be attached to the instance. To have a volume attached to the container's filesystem, its name parameter should match one of the volumeMounts names from the container section. See the description of the EBS volume parameters below.
  4. dockerDataRoot: using this parameter we can change the directory where Docker stores all images, including the one we build. In the example below we make sure that it's a directory on an attached EBS volume, so next time the image is not rebuilt, but just loaded from the Docker cache.

EBS volume parameters:

  1. size: size of the volume in GB.
  2. deletionPolicy: what to do with the volume once the instance is stopped using the spotty stop command. Possible values include: "create_snapshot" (default), "update_snapshot", "retain" and "delete". Read more in the documentation: Volumes and Deletion Policies.
  3. mountDir: a directory where the volume will be mounted on the instance. By default, it will be mounted to the "/mnt/<ebs_volume_name>" directory. In the example below, we explicitly specify this directory for the "docker" volume, because we reuse this value in the dockerDataRoot parameter.
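
A sketch of this section (the region, instance type, and volume sizes are illustrative):

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: p2.xlarge
      # Store Docker images on the "docker" volume, so the built image
      # survives instance restarts instead of being rebuilt every time.
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 50
        - name: docker
          parameters:
            size: 10
            # Reused by dockerDataRoot above.
            mountDir: /docker
            deletionPolicy: retain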

Read more about other AWS instance parameters in the documentation.

Section 4: Scripts

Scripts are optional but very useful. They can be run on the instance using the following command:

spotty run <SCRIPT_NAME>

For this project we’ve created 4 scripts:

  • preprocess: downloads the dataset and prepares it for training,
  • train: starts training,
  • tensorboard: runs TensorBoard on port 6006,
  • jupyter: starts a Jupyter Notebook server on port 8888.
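
In the configuration file these scripts might look as follows (the exact commands are assumptions based on the Tacotron-2 repository; the dataset URL points to LJ Speech, which the repository's README uses):

scripts:
  # Download the LJ Speech dataset and prepare it for training.
  preprocess: |
    cd /workspace/project
    curl https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | tar xj
    python preprocess.py
  # Start (or resume) training.
  train: |
    cd /workspace/project
    python train.py
  # Serve training logs on port 6006.
  tensorboard: |
    tensorboard --logdir /workspace/project/logs-Tacotron-2
  # Start a Jupyter Notebook server on port 8888.
  jupyter: |
    jupyter notebook --allow-root --ip 0.0.0.0 --notebook-dir=/workspace/project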

That’s it! The model is ready to be trained on AWS.

Spotty Installation

Requirements

  • Python 3
  • AWS CLI installed and configured (it's used under the hood, for example, to synchronize the project's code with the instance)

Installation

Install Spotty using pip:

pip install -U spotty

Model Training

1. Start a Spot Instance with the Docker container:

spotty start

Once the instance is up and running, you will see its IP address. Use it to open TensorBoard and Jupyter Notebook later.
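
For example, once the corresponding scripts are running, TensorBoard will be available at http://<INSTANCE_IP>:6006 and Jupyter Notebook at http://<INSTANCE_IP>:8888.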

2. Download and preprocess the data for the Tacotron model. We already have a custom script in the configuration file to do that; just run:

spotty run preprocess

Once the data is processed, use the Ctrl+b, then x key combination to close the tmux pane.

3. Once the preprocessing is done, train the model. Run the “train” script:

spotty run train

You can detach this SSH session using the Ctrl+b, then d key combination; the training process won't be interrupted. To reattach the session, just run the spotty run train command again.

TensorBoard

Start TensorBoard using the “tensorboard” script:

spotty run tensorboard

TensorBoard will be running on port 6006. You can detach the SSH session using the Ctrl+b, then d key combination; TensorBoard will keep running.

Jupyter Notebook

You can also start Jupyter Notebook using the "jupyter" script:

spotty run jupyter

Jupyter Notebook will be running on port 8888. Open it using the instance IP address and the token shown in the command output.

Download Checkpoints

If you need to download checkpoints or any other files from the running instance to your local machine, just use the download command:

spotty download -f 'logs-Tacotron-2/taco_pretrained/*'

SSH Connection

To connect to the running Docker container via SSH, use the following command:

spotty ssh

It uses a tmux session, so you can always detach it using the Ctrl+b, then d key combination and reattach it later by running the spotty ssh command again.

