
A quick guide to using Spot instances with Amazon SageMaker

source link: https://towardsdatascience.com/a-quick-guide-to-using-spot-instances-with-amazon-sagemaker-b9cfb3a44a68?gi=20b1dfc15063

One of the simplest ways to lower your machine learning training costs is to use Amazon EC2 Spot instances. Spot instances let you access spare Amazon EC2 compute capacity at a steep discount of up to 90% compared to on-demand rates. So why not always use Spot instances? Well, you can, as long as your workload is tolerant of sudden interruptions. Since Spot instances come from spare capacity, they may be reclaimed with just 2 minutes’ notice!

Deep learning training is a good example of a workload that can be made tolerant of interruptions, and I’ve written about using Amazon EC2 Spot instances for deep learning training before. However, as a machine learning developer or data scientist, you may not want to manage Spot Fleet requests, poll for capacity, poll for termination status, manually back up your checkpoints, manually sync checkpoints when resuming training, and set everything up every time you want to run a training job.

Amazon SageMaker offers Managed Spot Training, a convenient way to lower training costs by using Amazon EC2 Spot instances for Amazon SageMaker training jobs. This means you can now save up to 90% on training workloads without having to set up and manage Spot instances yourself. Amazon SageMaker will automatically provision Spot instances for you, and if a Spot instance is reclaimed, Amazon SageMaker will automatically resume training once capacity becomes available!

In this blog post, I’ll provide a step-by-step guide to using Spot instances with Amazon SageMaker for deep learning training. I’ll cover the code changes you need to make to take advantage of Amazon SageMaker’s automatic checkpoint backup and sync to Amazon S3. I’ll use Keras with a TensorFlow backend to illustrate how you can take advantage of Amazon SageMaker Managed Spot Training, but you can implement the same steps with another framework such as PyTorch or MXNet.

A complete example with Jupyter notebook is available on GitHub: https://github.com/shashankprasanna/sagemaker-spot-training

What workloads can take advantage of Spot Instances?

To take advantage of Spot instance savings, your workload must be tolerant of interruptions. In machine learning, two types of workloads broadly fall into this category:

  1. Stateless microservices serving inference requests, such as model servers (TF Serving, TorchServe; see my earlier blog post)
  2. Stateful jobs, such as deep learning training, that can save their full state through frequent checkpointing

In the first case, if a Spot instance is reclaimed, traffic can be routed to another instance, assuming you’ve set up your service with redundancy for high-availability. In the second case, if a Spot instance is interrupted, your application must immediately save its current state, and resume training when capacity has been restored.

In this guide, I’ll cover the second use-case, i.e. Spot instances for deep learning training jobs using open-source deep learning frameworks such as TensorFlow, PyTorch, MXNet and others.

Quick recap on how Amazon SageMaker runs deep learning training

Let me start with how Amazon SageMaker runs deep learning training. This background is important for understanding how SageMaker manages Spot training, backs up your checkpoint data, and resumes training. If you’re an Amazon SageMaker user, this should serve as a quick reminder.

What you are responsible for:

Develop your training scripts and provide them to the SageMaker SDK Estimator, and Amazon SageMaker will take care of the rest:
  1. Writing your training scripts in TensorFlow, PyTorch, MXNet, or another supported framework.
  2. Writing a SageMaker Python SDK Estimator call specifying where to find your training script, what type of CPU or GPU instance to train on, how many instances to use (for distributed training), where to find your training dataset, and where to save the trained models in Amazon S3 (a minimal sketch follows this list).
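As a rough illustration, here is a minimal sketch of such an Estimator call using the SageMaker Python SDK (v1-style parameter names, as used in this post). The bucket paths, framework version, and instance type are placeholders, not necessarily what the linked notebook uses:

    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    # IAM role with permissions to run SageMaker training jobs
    role = sagemaker.get_execution_role()

    # Point SageMaker at the training script and pick the training hardware
    estimator = TensorFlow(entry_point='cifar10-training-sagemaker.py',
                           source_dir='code',
                           role=role,
                           framework_version='1.15',
                           py_version='py3',
                           script_mode=True,
                           train_instance_count=1,
                           train_instance_type='ml.p3.2xlarge',
                           output_path='s3://my-bucket/model-artifacts')  # where the trained model is saved

    # 'training' is the input channel name; the S3 path below is a placeholder for your dataset
    estimator.fit({'training': 's3://my-bucket/datasets/cifar10'})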

What Amazon SageMaker is responsible for:

Amazon SageMaker training workflow

Amazon SageMaker will manage infrastructure details, so you don’t have to. Amazon SageMaker will:

  1. Upload your training script and dependencies to Amazon S3
  2. Provision the specified number of instances in a fully managed cluster
  3. Pull the specified TensorFlow container image and instantiate containers on every instance.
  4. Download the training code from Amazon S3 into the instance and make it available in the container
  5. Download training dataset from Amazon S3 and make it available in the container
  6. Run training
  7. Copy trained models to a specified location in Amazon S3

Running Amazon SageMaker Managed Spot Training

Spot instances can be preempted and terminated with just 2 minutes’ notice, so it’s critical that you frequently checkpoint training progress. Thankfully, Amazon SageMaker manages everything else. It will automatically back up your training checkpoints to Amazon S3, and if the training instance is terminated due to lack of capacity, it will keep polling for capacity and automatically restart training once capacity becomes available.

Amazon SageMaker will automatically copy your dataset and the checkpoint files into the new instance and make it available to your training script in a docker container so that you can resume training from the latest checkpoint.

Amazon SageMaker will automatically back up and sync checkpoint files to Amazon S3

Let’s take a look at an example to see how you can prepare your training scripts to make them Spot-training ready.

Amazon SageMaker Managed Spot Training with TensorFlow and Keras

To make sure your training scripts can take advantage of SageMaker Managed Spot instances, you’ll need to implement:

  1. frequent saving of checkpoints and
  2. ability to resume training from checkpoints.

I’ll show how to make these changes in Keras, but you can follow the same steps on another framework.

Step 1: Saving checkpoints

Amazon SageMaker will automatically back up and sync checkpoint files generated by your training script to Amazon S3. Therefore, your training script needs to save checkpoints to a local checkpoint directory in the Docker container that’s running the training. The default location for checkpoint files is /opt/ml/checkpoints, and Amazon SageMaker syncs these files to the specified Amazon S3 bucket. Both the local and the Amazon S3 checkpoint locations are customizable.

Save your checkpoints locally to /opt/ml/checkpoints (the path is customizable), and Amazon SageMaker will back them up to Amazon S3

If you’re using Keras, this is very easy. Create an instance of the ModelCheckpoint callback class and register it with the model by passing it to the fit() function.

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/code/cifar10-training-sagemaker.py

Here is the relevant excerpt:
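Since the embedded code may not render here, below is a minimal sketch of the same pattern. It assumes a compiled Keras model named model and training data (x_train, y_train, x_val, y_val) defined earlier in the script; the checkpoint file naming and hyperparameters are illustrative and just need to match whatever the loading code expects:

    import os
    from tensorflow.keras.callbacks import ModelCheckpoint

    checkpoint_dir = '/opt/ml/checkpoints'   # local path that SageMaker syncs to Amazon S3
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save a checkpoint at the end of every epoch, with the epoch number in the file name
    checkpoint_callback = ModelCheckpoint(
        filepath=os.path.join(checkpoint_dir, 'checkpoint-{epoch}.h5'))

    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=epochs,
              initial_epoch=epoch_number,   # 0 for a fresh run, > 0 when resuming from a checkpoint
              callbacks=[checkpoint_callback])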

Notice that I’m passing initial_epoch, which you normally wouldn’t bother with. It lets you resume training from a specific epoch number and comes in handy when you already have checkpoint files.

Step 2: Resuming from checkpoint files

When spot capacity becomes available again after an interruption, Amazon SageMaker will:

  1. Launch a new spot instance
  2. Instantiate a Docker container with your training script
  3. Copy your dataset and checkpoint files from Amazon S3 to the container
  4. Run your training scripts

Your script needs to implement resuming from checkpoint files; otherwise it will restart training from scratch. You can implement a load_checkpoint_model function as shown below. It takes the local checkpoint path (/opt/ml/checkpoints being the default) and returns a model loaded from the latest checkpoint, along with the associated epoch number.

There are many ways to query a list of files in a directory, extract the epoch number from the file names and load the file name with the latest epoch number. I use os.listdir() and regular expressions. I’m sure you can come up with more clever and elegant ways to do the same thing.

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/code/cifar10-training-sagemaker.py

Here is the relevant excerpt:
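A minimal sketch of such a function is shown below. It assumes the checkpoint files are named checkpoint-<epoch>.h5, as in the saving sketch above; the exact implementation in the linked script may differ:

    import os
    import re
    from tensorflow.keras.models import load_model

    def load_checkpoint_model(checkpoint_path):
        # List checkpoint files and pull the epoch number out of each file name
        files = [f for f in os.listdir(checkpoint_path) if f.endswith('.h5')]
        if not files:
            return None, 0   # no checkpoints yet: the caller should build a fresh model

        epoch_numbers = [int(re.search(r'(\d+)', f).group(1)) for f in files]

        # Pick the checkpoint with the highest epoch number
        max_epoch = max(epoch_numbers)
        latest_file = files[epoch_numbers.index(max_epoch)]

        # Load the model from the latest checkpoint and return it with its epoch number
        model = load_model(os.path.join(checkpoint_path, latest_file))
        return model, max_epoch

In the training script, you would check whether /opt/ml/checkpoints contains any files: if it does, call this function and pass the returned epoch number as initial_epoch to fit(); if not, build a fresh model and start from epoch 0.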

Step 3: Instructing Amazon SageMaker to run Managed Spot training

You can launch Amazon SageMaker training jobs from your laptop, desktop, an Amazon EC2 instance, or an Amazon SageMaker notebook instance, as long as you have the Amazon SageMaker Python SDK installed and the right user permissions to run SageMaker training jobs.

To run a Managed Spot training job, you’ll need to specify a few additional options in your standard Amazon SageMaker Estimator function call.

The full implementation is available in this file: https://github.com/shashankprasanna/sagemaker-spot-training/blob/master/tf-keras-cifar10-spot-training.ipynb

Here is the relevant excerpt:
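A minimal sketch of the Estimator call with the Managed Spot options added is shown below (parameter names follow SageMaker Python SDK v1, which this post uses; in SDK v2 they were renamed to use_spot_instances, max_run, and max_wait). The bucket paths, timeouts, and instance type are placeholders:

    import sagemaker
    from sagemaker.tensorflow import TensorFlow

    role = sagemaker.get_execution_role()   # IAM role with SageMaker permissions

    tf_estimator = TensorFlow(entry_point='cifar10-training-sagemaker.py',
                              source_dir='code',
                              role=role,
                              framework_version='1.15',
                              py_version='py3',
                              script_mode=True,
                              train_instance_count=1,
                              train_instance_type='ml.p3.2xlarge',
                              train_use_spot_instances=True,           # run on Managed Spot capacity
                              train_max_run=3600,                      # max training time, in seconds
                              train_max_wait=7200,                     # max wait for Spot capacity; must be >= train_max_run
                              checkpoint_s3_uri='s3://my-bucket/checkpoints/job-1')   # where checkpoints are synced

    tf_estimator.fit({'training': 's3://my-bucket/datasets/cifar10'})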

  • train_use_spot_instances: Instructs Amazon SageMaker to run Managed Spot training
  • checkpoint_s3_uri: Instructs Amazon SageMaker to sync your checkpoint files to this Amazon S3 location
  • train_max_wait: Instructs Amazon SageMaker to terminate the job if this much time passes and Spot capacity hasn’t become available

That’s it. Those are all the changes you need to make to dramatically lower your cost to train.

To monitor your training job and view your savings, you can look at the logs in your Jupyter notebook or navigate to Amazon SageMaker Console > Training Jobs and click on your training job name. Once training is complete, you should see how much you saved. As an example, for a 30-epoch training run on a p3.2xlarge GPU instance, I was able to save 70% on training costs!

Screenshot of a training job in the Amazon SageMaker Console showing cost savings and other useful information

Simulating Spot interruptions on Amazon SageMaker

How do you know if your training will resume properly if a spot interruption occurs?

If you’re familiar with running Amazon EC2 Spot instances, you know that you can simulate your application’s behavior during a Spot interruption by terminating the Amazon EC2 Spot instance yourself. If there is capacity, the Spot Fleet will launch a new instance to replace the one you terminated, and you can monitor your application to check that it handles the interruption and resumes gracefully. Unfortunately, you can’t terminate an Amazon SageMaker training instance manually; your only option is to stop the entire training job.

Fortunately, you can still test how your code behaves when resuming training. To do that, first run an Amazon SageMaker Managed Spot training job for a set number of epochs, as described in the previous section. Let’s say you train for 10 epochs; Amazon SageMaker will have backed up your checkpoint files for those 10 epochs to the specified Amazon S3 location. Head over to Amazon S3 to verify that the checkpoints are there:

Checkpoints in Amazon S3, automatically backed up for you by Amazon SageMaker

Now run a second training job, but this time point checkpoint_s3_uri at the first job’s checkpoint location:

    checkpoint_s3_uri = tf_estimator.checkpoint_s3_uri

Here is the relevant excerpt from the Jupyter notebook:
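A minimal sketch of what that second run looks like is shown below; the configuration matches the first Spot job, except that checkpoint_s3_uri now points at the first job’s checkpoint location (tf_estimator is the Estimator object from that first job, and the other values remain placeholders):

    from sagemaker.tensorflow import TensorFlow

    # Same configuration as the first Spot job, but reuse its checkpoint location so the new
    # job copies those checkpoints into its container and training resumes from the latest one
    resume_estimator = TensorFlow(entry_point='cifar10-training-sagemaker.py',
                                  source_dir='code',
                                  role=role,
                                  framework_version='1.15',
                                  py_version='py3',
                                  script_mode=True,
                                  train_instance_count=1,
                                  train_instance_type='ml.p3.2xlarge',
                                  train_use_spot_instances=True,
                                  train_max_run=3600,
                                  train_max_wait=7200,
                                  checkpoint_s3_uri=tf_estimator.checkpoint_s3_uri)   # first job's checkpoints

    resume_estimator.fit({'training': 's3://my-bucket/datasets/cifar10'})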

By pointing checkpoint_s3_uri at your previous job’s checkpoints, you’re telling Amazon SageMaker to copy those checkpoints into your new job’s container. Your training script will then load the latest checkpoint and resume training, and you’ll see training pick up from the 10th epoch rather than starting from scratch.

