
Choosing the right GPU for deep learning on AWS


Amazon Elastic Inference

With Amazon Elastic Inference (EI), rather than selecting a GPU instance for hosting your models, you can attach low-cost, GPU-powered acceleration to a CPU-only instance over the network.

EI comes in multiple sizes, so you can add just the amount of GPU processing power your model needs. Because the GPU acceleration is accessed over the network, EI adds some latency compared to a native GPU in, say, a G4 instance, but it will still be much faster than a CPU-only instance if you have a demanding model.

It’s therefore important to define your target latency SLA for your application and work backwards to choose the right accelerator option. Let’s consider a hypothetical scenario below.

Please note that all the numbers in this hypothetical scenario are made up for the purpose of illustration. Every use case is different; use the scenario only as general guidance.

Let’s say your application can deliver a good customer experience if your total latency (app + network + model predictions) is under 200 ms. And let’s say, with a G4 instance type, you can get total latency down to 40 ms, which is well within your target latency. Also, let’s say with a C5 instance type you can only get total latency down to 400 ms, which does not meet your SLA requirements and results in a poor customer experience.

With Elastic Inference, you can network-attach a “slice” or a “fraction” of a GPU to a CPU instance such as C5, and get your total latency down to, say, 180 ms, which is under the desired 200 ms mark. Since EI is significantly cheaper than provisioning a dedicated GPU instance, you save on your total deployment costs. A GPU instance like G4 will still deliver the best inference performance, but if the extra performance doesn’t improve your customer experience, you can use EI to stay under the target latency SLA, deliver a good customer experience, and save on overall deployment costs.
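To make this concrete, here is a minimal sketch of hosting a model on a CPU instance with an EI accelerator attached, using the SageMaker Python SDK. The S3 path, IAM role and framework version are placeholders; check the EI documentation for the accelerator sizes and framework versions that ship EI-enabled images.

```python
# Hypothetical sketch: hosting a TensorFlow model on a CPU instance with an
# Elastic Inference accelerator attached, via the SageMaker Python SDK.
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(
    model_data="s3://my-bucket/my-model.tar.gz",              # placeholder S3 path
    role="arn:aws:iam::123456789012:role/MySageMakerRole",    # placeholder IAM role
    framework_version="2.3",                                  # illustrative version
)

# Host on an inexpensive CPU instance and network-attach a fraction of a GPU.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",        # CPU-only host
    accelerator_type="ml.eia2.medium",  # Elastic Inference accelerator size
)
```

Resizing the accelerator (for example, ml.eia2.large) is how you dial in just the amount of GPU processing power your latency target requires.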

AWS Inferentia and Amazon EC2 Inf1 Instances

Amazon EC2 Inf1 is the new kid on the block. Inf1 instances give you access to AWS Inferentia, a high-performance inference chip custom designed by AWS. AWS Inferentia chips support FP16, BF16 and INT8 for reduced-precision inference. To target AWS Inferentia, you use the AWS Neuron software development kit (SDK) to compile your TensorFlow, PyTorch or MXNet model. The SDK also comes with a runtime library to run the compiled models in production.

Amazon EC2 Inf1 instances deliver better performance per dollar than GPU-based EC2 instances. You’ll just need to make sure that the AWS Neuron SDK supports all the layers in your model. See here for a list of supported ops for each framework.
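As an illustration, here is a minimal sketch of compiling a PyTorch model with the Neuron SDK’s torch-neuron package. The model, input shape and file names are placeholders, and the exact API may vary across Neuron releases.

```python
# Hypothetical sketch: compiling a PyTorch model for AWS Inferentia with the
# AWS Neuron SDK (torch-neuron). Model, input shape and paths are placeholders.
import torch
import torch_neuron  # registers the Neuron compiler with torch.neuron
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
example = torch.zeros(1, 3, 224, 224)   # example input used for tracing

# Compile; operators the Neuron compiler doesn't support fall back to CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save("resnet50_neuron.pt")

# On an Inf1 instance, load the compiled artifact and run inference as usual.
compiled = torch.jit.load("resnet50_neuron.pt")
output = compiled(example)
```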

Optimizing for cost

You have a few different options to optimize the cost of your training and inference workloads.

Spot Instances: Spot-Instance pricing makes high-performance GPUs much more affordable and allows you to access spare Amazon EC2 compute capacity at a steep discount compared to On-Demand rates. For an up-to-date list of prices by instance and Region, visit the Spot Instance Advisor. In some cases you can save over 90% on your training costs, but your instances can be preempted and terminated with just two minutes’ notice. Your training scripts must therefore implement frequent checkpointing and the ability to resume training once Spot capacity is restored.
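A minimal sketch of that checkpoint-and-resume pattern in PyTorch is below; the tiny model, random data and checkpoint path are placeholders for your real training loop and storage location.

```python
# Minimal sketch of checkpoint/resume logic for Spot training (PyTorch).
# The model, data and checkpoint path are placeholders.
import os
import torch
import torch.nn as nn

CKPT_PATH = "/opt/ml/checkpoints/latest.pt"  # persist somewhere durable (e.g. synced to S3)

model = nn.Linear(10, 2)                                   # stand-in for your real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_epochs = 20

# Resume from the last checkpoint if a previous instance was interrupted.
start_epoch = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, num_epochs):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # placeholder batch
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Checkpoint every epoch so at most one epoch of work is lost on preemption.
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)
```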

Amazon SageMaker managed training: During the development phase, much of your time is spent prototyping, tweaking code and trying different options in your favorite editor or IDE (which is obviously Vim), none of which needs a GPU. You can save costs by simply decoupling your development and training resources, and Amazon SageMaker lets you do this easily. Using the Amazon SageMaker Python SDK, you can test your scripts locally on your laptop, desktop, EC2 instance or SageMaker notebook instance.

When you’re ready to train, specify the GPU instance type you want to train on, and SageMaker will provision the instances, copy the dataset to them, train your model, copy the results back to Amazon S3, and tear the instances down. You are only billed for the exact duration of training. Amazon SageMaker also supports Managed Spot Training for additional convenience and cost savings.
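Here is a minimal sketch of launching such a managed training job (with Spot) via the SageMaker Python SDK; the entry point, S3 paths, IAM role and framework version are placeholders.

```python
# Hypothetical sketch: a managed (Spot) training job via the SageMaker Python SDK.
# Script name, S3 paths, IAM role and framework version are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                   # your training script
    role="arn:aws:iam::123456789012:role/MySageMakerRole",    # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",          # GPU is provisioned only for the job
    framework_version="1.8",                # illustrative framework version
    py_version="py36",
    use_spot_instances=True,                # Managed Spot Training
    max_run=3600,                           # max training time in seconds
    max_wait=7200,                          # max time to wait for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # checkpoints survive interruptions
)

estimator.fit({"training": "s3://my-bucket/datasets/my-dataset/"})
```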

I’ve written a guide on how to use it here: A quick guide to using Spot instances with Amazon SageMaker

Amazon Elastic Inference and Amazon EC2 Inf1 instances: Save costs on inference workloads by leveraging EI to add just the right amount of GPU acceleration to your CPU instances, or by leveraging cost-effective Amazon EC2 Inf1 instances.

Optimize for cost by improving utilization:

  1. Optimize your training code to take full advantage of P3 and G4 instances’ Tensor Cores by enabling mixed-precision training (see the sketch after this list). Every deep learning framework does this differently and you’ll have to refer to the specific framework’s documentation.
  2. Use reduced-precision (INT8) inference on G4 instance types to improve performance. NVIDIA’s TensorRT library provides APIs to convert single-precision models to INT8, and provides examples in its documentation.
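As one example, here is a minimal sketch of mixed-precision training with PyTorch’s automatic mixed precision (AMP), which routes eligible ops to the Tensor Cores on V100 (P3) and T4 (G4) GPUs. The tiny model, random data and hyperparameters are placeholders.

```python
# Minimal sketch: mixed-precision training with PyTorch AMP on a Tensor Core GPU.
# The model, data and hyperparameters are placeholders.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

for step in range(100):
    x = torch.randn(64, 1024, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # eligible ops run in FP16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```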

What software should I use on Amazon EC2 GPU instances?

Without optimized software, there is a risk that you’ll under-utilize the hardware resources you provision. You may be tempted to “pip install tensorflow/pytorch”, but I highly recommend using AWS Deep Learning AMIs or AWS Deep Learning Containers (DLC) instead.

AWS qualifies and tests them on all Amazon EC2 GPU instances, and they include AWS optimizations for networking, storage access and the latest NVIDIA and Intel drivers and libraries. Deep learning frameworks have upstream and downstream dependencies on higher level schedulers and orchestrators and lower-level infrastructure services. By using AWS AMIs and AWS DLCs you know it’s been tested end-to-end and is guaranteed to give you the best performance.

TL;DR

Congratulations! You made it to the end (even if you didn’t read all of the post). I intended this to be a quick guide, but also provide enough context, references and links to learn more. In this final section, I’m going to give you the quick recommendation list. Please take time to read about the specific instance type and GPU type you’re considering to make an informed decision.

The recommendation below is very much my personal opinion based on my experience working with GPUs and deep learning. Caveat emptor.

And (drum roll) here’s the list:

  • Highest performing GPU instance on AWS. Period : p3dn.24xlarge (8 V100 GPUs, 32 GB per GPU)
  • Best single GPU training performance : p3.2xlarge (V100, 16 GB GPU)
  • Best single-GPU instance for developing, testing and prototyping : g4dn.xlarge (T4, 16 GB GPU). Consider g4dn.(2/4/8/16)xlarge for more vCPUs and higher system memory.
  • Best multi-GPU instance for single node training and running parallel experiments : p3.8xlarge (4 V100 GPUs, 16 GB per GPU), p3.16xlarge (8 V100 GPUs, 16 GB per GPU)
  • Best multi-GPU, multi-node distributed training performance : p3dn.24xlarge (8 V100 GPUs, 32 GB per GPU, 100 Gbps aggregate network bandwidth)
  • Best single-GPU instance for inference deployments : G4 instance type. Choose instance size g4dn.(2/4/8/16)xlarge based on pre- and post-processing steps in your deployed application.
  • I need the most GPU memory I can get for large models : p3dn.24xlarge (8 V100, 32 GB per GPU)
  • I need access to Tensor Cores for mixed-precision training : P3 and G4 instance types. Choose the instance size based on your model size and application.
  • I need access to double precision (FP64) for HPC and deep learning : P3, P2 instance types. Choose the instance size based on your application.
  • I need 8 bit integer precision (INT8) for inference : G4 instance type. Choose instance size based on pre- and post-processing steps in your deployed application.
  • I need access to half precision (FP16) for inference : P3, G4 instance type. Choose the instance size based on your application.
  • I want GPU acceleration for inference but don’t need a full GPU : Use Amazon Elastic Inference and attach just the right amount of GPU acceleration you need.
  • I want the best performance on any GPU instance : Use AWS Deep Learning AMI and AWS Deep Learning Containers
  • I want to save money : Use Spot Instances and Managed Spot Training on Amazon SageMaker. Choose Amazon Elastic Inference for models that don’t take advantage of a full GPU.

Thank you for reading. If you found this article interesting, please check out my other blog posts on Medium or follow me on Twitter (@shshnkp), LinkedIn, or leave a comment below. Want me to write on a specific machine learning topic? I’d love to hear from you!

