How to train Deep Learning models on AWS Spot Instances using Spotty?
source link: https://towardsdatascience.com/how-to-train-deep-learning-models-on-aws-spot-instances-using-spotty-8d9e0543d365
Spotty is a tool that drastically simplifies training deep learning models on AWS.
Why will you ❤️ this tool?
- it makes training on AWS GPU instances as simple as training on your local computer
- it automatically manages all necessary AWS resources including AMIs, volumes, snapshots and SSH keys
- it makes your model trainable on AWS by everyone with a couple of commands
- it uses tmux to easily detach remote processes from SSH sessions
- it saves you up to 70% of the costs by using AWS Spot Instances
To show how it works, let’s take a non-trivial model and train it. I chose one of the implementations of Tacotron 2, Google’s speech synthesis system.
Clone the repository of Tacotron 2 to your computer:
git clone https://github.com/Rayhane-mamah/Tacotron-2.git
Docker Image
Spotty trains models inside a Docker container. So we need to either find a publicly available Docker image that satisfies the model’s requirements or create a new Dockerfile with a proper environment.
This implementation of Tacotron uses Python 3 and TensorFlow, so we could use the official TensorFlow image from Docker Hub. But that image doesn’t satisfy all the requirements listed in the “requirements.txt” file, so we need to extend it and install the remaining libraries on top.
Copy the “requirements.txt” file to “docker/requirements-spotty.txt” and create a `docker/Dockerfile.spotty` file with the following content:
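A minimal `Dockerfile.spotty` along these lines would work; the base image tag below is illustrative, so pick one that matches the TensorFlow version pinned in “requirements.txt”:

```dockerfile
# Extend the official TensorFlow GPU image (tag is illustrative)
FROM tensorflow/tensorflow:1.15.2-gpu-py3

# Install the project's remaining dependencies on top of the base image
COPY docker/requirements-spotty.txt /tmp/requirements-spotty.txt
RUN pip install -r /tmp/requirements-spotty.txt
```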
Here we’re extending the original TensorFlow image and installing all other requirements. This image will be built automatically when you start an instance.
Spotty Configuration File
Once we have the Dockerfile, we’re ready to write a Spotty configuration file. Create a `spotty.yaml` file in the root directory of your project.
Here you can find the full content of this file. It consists of 4 sections: project, container, instances, and scripts. Let’s look at them one by one.
Section 1: Project
This section contains the following parameters:
- `name`: the name of the project. This name will be used in the names of all AWS resources created by Spotty for this project. For example, it will be used as a prefix for EBS volumes, or in the name of the S3 bucket that synchronizes the project’s code with the instance.
- `syncFilters`: synchronization filters. These filters are used to skip some directories or files when synchronizing the project’s code with a running instance. In the example above we’re ignoring PyCharm configuration, Git files, Python cache files, and training data. Under the hood, Spotty passes these filters to the “aws s3 sync” command, so you can get more information about them here: Use of Exclude and Include Filter.
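Put together, the project section might look like this (the project name and filter patterns are illustrative):

```yaml
project:
  name: tacotron  # prefix for EBS volumes and the sync S3 bucket
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - training_data/*
```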
Section 2: Container
This section describes a Docker container for your project:
- `projectDir`: a directory inside the container where the local project will be synchronized once an instance is started. Make sure that it’s either a subdirectory of a volume mount path (see below) or exactly matches a volume mount path; otherwise, all remote changes to the project’s code will be lost once the instance is stopped.
- `volumeMounts`: defines the directories inside the container where EBS volumes should be mounted. The EBS volumes themselves are described in the `instances` section of the configuration file. Each element of this list describes one mount point: the `name` parameter should match the corresponding EBS volume from the `instances` section (see below), and the `mountPath` parameter specifies the volume’s directory inside the container.
- `file`: a path to the Dockerfile that we created before. The Docker image will be built automatically once the instance is started. Alternatively, you could build the image locally, push it to Docker Hub, and then specify the image by its name using the `image` parameter instead of the `file` parameter.
- `ports`: ports that should be exposed by the instance. In the example above we opened 2 ports: 6006 for TensorBoard and 8888 for Jupyter Notebook.
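A container section matching these parameters could look like this (the paths and volume name are illustrative):

```yaml
container:
  projectDir: /workspace/project        # must sit inside a volume mount path
  file: docker/Dockerfile.spotty        # built automatically on instance start
  volumeMounts:
    - name: workspace                   # matches a volume in the instances section
      mountPath: /workspace
  ports: [6006, 8888]                   # TensorBoard and Jupyter Notebook
```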
Read more about other container parameters in the documentation.
Section 3: Instances
This section describes a list of instances with their parameters. Each instance contains the following parameters:
- `name`: the name of the instance. This name will be used in the names of AWS resources created specifically for this instance, such as EBS volumes and the EC2 instance itself. It can also be used in Spotty commands if you have more than one instance in the configuration file, for example: `spotty start i1`.
- `provider`: the cloud provider for the instance. At the moment Spotty supports only the “aws” provider (Amazon Web Services), but Google Cloud Platform will be supported in the near future as well.
- `parameters`: parameters of the instance. They are specific to the cloud provider; see the parameters for an AWS instance below.
AWS instance parameters:
- `region`: the AWS region where a Spot Instance should be launched.
- `instanceType`: the type of the EC2 instance. Read more about AWS GPU instances here.
- `volumes`: a list of EBS volumes that should be attached to the instance. To have a volume attached to the container’s filesystem, its `name` parameter should match one of the `volumeMounts` names from the `container` section. See the description of the EBS volume parameters below.
- `dockerDataRoot`: this parameter changes the directory where Docker stores all images, including the one we built. In the example above we make sure it’s a directory on an attached EBS volume, so next time the image will not be rebuilt but simply loaded from the Docker cache.
EBS volume parameters:
- `size`: the size of the volume in GB.
- `deletionPolicy`: what to do with the volume once the instance is stopped using the `spotty stop` command. Possible values include: “create_snapshot” (default), “update_snapshot”, “retain” and “delete”. Read more in the documentation: Volumes and Deletion Policies.
- `mountDir`: the directory where the volume will be mounted on the instance. By default, it will be mounted to the “/mnt/<ebs_volume_name>” directory. In the example above, we need to explicitly specify this directory for the “docker” volume because we reuse its value in the `dockerDataRoot` parameter.
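An instances section combining the parameters above might look like this (the region, instance type, and volume sizes are illustrative):

```yaml
instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: p2.xlarge
      dockerDataRoot: /docker           # keep Docker images on an EBS volume
      volumes:
        - name: workspace               # matches the container's volumeMounts
          parameters:
            size: 50
        - name: docker
          parameters:
            size: 10
            mountDir: /docker           # reused by dockerDataRoot above
            deletionPolicy: retain
```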
Read more about other AWS instance parameters in the documentation.
Section 4: Scripts
Scripts are optional but very useful. They can be run on the instance using the following command:
spotty run <SCRIPT_NAME>
For this project we’ve created 4 scripts:
- preprocess: downloads the dataset and prepares it for training,
- train: starts training,
- tensorboard: runs TensorBoard on port 6006,
- jupyter: starts a Jupyter Notebook server on port 8888.
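The scripts section could be sketched as follows; the script bodies are illustrative and should match the entry points of the Tacotron-2 repository and your project paths:

```yaml
scripts:
  preprocess: |
    python preprocess.py
  train: |
    python train.py --model Tacotron-2
  tensorboard: |
    tensorboard --logdir /workspace/project/logs-Tacotron-2
  jupyter: |
    jupyter notebook --allow-root --ip 0.0.0.0 --notebook-dir /workspace/project
```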
That’s it! The model is ready to be trained on AWS.
Spotty Installation
Requirements
- Python ≥3.5
- Installed and configured AWS CLI (see Installing the AWS Command Line Interface)
Installation
Install Spotty using pip:
pip install -U spotty
Model Training
1. Start a Spot Instance with the Docker container:
spotty start
Once the instance is up and running, you will see its IP address. You will need it later to open TensorBoard and Jupyter Notebook.
2. Download and preprocess the data for the Tacotron model. We already have a custom script in the configuration file to do that, just run:
spotty run preprocess
Once the data is processed, use the `Ctrl + b`, then `x` key combination to close the tmux pane.
3. Once the preprocessing is done, train the model. Run the “train” script:
spotty run train
You can detach this SSH session using the `Ctrl + b`, then `d` key combination; the training process won’t be interrupted. To reattach the session, just run the `spotty run train` command again.
TensorBoard
Start TensorBoard using the “tensorboard” script:
spotty run tensorboard
TensorBoard will be running on port 6006. You can detach the SSH session using the `Ctrl + b`, then `d` key combination; TensorBoard will keep running.
Jupyter Notebook
You can also start Jupyter Notebook using the “jupyter” script:
spotty run jupyter
Jupyter Notebook will be running on port 8888. Open it using the instance’s IP address and the token that you will see in the command output.
Download Checkpoints
If you need to download checkpoints or any other files from the running instance to your local machine, just use the `download` command:
spotty download -f 'logs-Tacotron-2/taco_pretrained/*'
SSH Connection
To connect to the running Docker container via SSH, use the following command:
spotty ssh
It uses a tmux session, so you can always detach it using the `Ctrl + b`, then `d` key combination and reattach the session later by running the `spotty ssh` command again.