
How much difference do GPUs make in model serving?

source link: https://towardsdatascience.com/how-much-difference-do-gpus-make-in-model-serving-c40b885ac096?gi=59a87f5acf78

Why you should use GPUs for model serving—not just training


Dec 20 · 4 min read


GPUs are unquestionably the standard for model training. Particularly when it comes to large models with rich training sets, the gap between CPU and GPU performance is substantial.

But what about model serving? Intuitively, GPUs probably offer more raw power than CPUs for inference, but does that translate to performance—and if so, is the performance difference meaningful?

The short answer is yes. In many real-time inference situations, GPUs are necessary for functional model serving. To understand why this is true, let’s run an experiment.

Benchmarking GPT-2 on GPUs and CPUs

We’ll be running a test of CPU vs GPU model serving performance using OpenAI’s 1.5 billion parameter GPT-2, the model behind projects like AI Dungeon 2.

For each architecture, we’ll deploy the model to AWS using Cortex. Once deployed, we will make 100 single-word predictions and measure the average latency.

We’ll use the same incomplete sentence for each query:

There are only two hard things in Computer Science: cache invalidation and

(The original quote from Phil Karlton is “There are only two hard things in Computer Science: cache invalidation and naming things.”)

Note that our request lifecycle also includes an encoding and decoding step, which adds some overhead, but not enough to meaningfully muddle our latency results.
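As a point of reference, here is a minimal sketch of what the 100-query benchmark could look like in Python. The endpoint is the placeholder URL from the article, the JSON payload key is an assumption about the predictor’s interface, and this measures client-side round-trip time rather than the server-side average that cortex get reports:

import time
import requests

# Placeholder endpoint from the article; `cortex get generator` prints the real one
ENDPOINT = "http://abc123.us-west-2.elb.amazonaws.com/text/generator"
PROMPT = "There are only two hard things in Computer Science: cache invalidation and"

latencies = []
for _ in range(100):
    start = time.time()
    response = requests.post(ENDPOINT, json={"text": PROMPT})  # payload key is assumed
    latencies.append(time.time() - start)
    print(response.text)

print("average round-trip: %.0f ms" % (1000 * sum(latencies) / len(latencies)))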

Test #1: CPUs

We’ll start by deploying the 1.5B GPT-2 model with Cortex. Our model will be running on AWS using 2 vCPUs.
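For context, the CPU allocation lives in the compute section of the API’s cortex.yaml. This is an illustrative sketch rather than the exact file used here; surrounding fields are omitted and key names can differ slightly between Cortex versions:

# compute section of the generator API in cortex.yaml (illustrative;
# other fields omitted, names may vary by Cortex version)
compute:
  cpu: 2

Deploying is then a matter of running cortex deploy from the project directory, and the gpu: 1 line added in Test #2 would sit alongside cpu in this same block.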

After querying our endpoint 100 times, our results look like this:

"there are only two hard things in computer science: cache invalidation and naming""there are only two hard things in computer science: cache invalidation and naming""there are only two hard things in computer science: cache invalidation and naming"...

So far, so good. Now, let’s check out our latency:

$ cortex get generator

status   up-to-date   requested   last update   avg inference
live     1            1           5m            925 ms

endpoint: http://abc123.us-west-2.elb.amazonaws.com/text/generator

generator is the name of our API, and cortex get is the Cortex CLI command for retrieving information about our deployments/APIs.

Looking at the above, our average latency is 925 ms per request. For real-time inference, this is a very long time.

Let’s see how much using GPUs speeds things up.

Test #2: GPUs

In our last test, we used 2 vCPUs to serve predictions. In this test, we’re going to swap one vCPU for a GPU. To do this, we simply add one line in Cortex’s config file:

gpu: 1

After running 100 queries against our API, our outputs are mostly identical to our CPU test:

"there are only two hard things in computer science: cache invalidation and naming""there are only two hard things in computer science: cache invalidation and naming""there are only two hard things in computer science: cache invalidation and naming"...

Side note: One response did say “there are only two hard things in computer science: cache invalidation and documentation,” which we both enjoyed and agreed with.

Let’s take a look at our latency:

$ cortex get generator

status   up-to-date   requested   last update   avg inference
live     1            1           5m            199 ms

endpoint: http://abc123.us-west-2.elb.amazonaws.com/text/generator

Our average latency with a GPU is 199 ms per request, over 4.6 times faster than it was with just CPUs.

GPUs are clearly much faster at serving predictions, but the question is, how much does it matter?

Why a 4.6x performance increase matters

The gap between 199 ms and 925 ms is 726 ms—a handful of blinks of an eye. On the surface, that might not seem like a massive delay, but think of the applications of real-time inference we see in our daily lives.

Let’s use something ubiquitous as an example, like Gmail’s Smart Compose feature. In order for the feature to be useful, it needs to serve predictions before you type more characters—not after.

The average person types 40 words per minute. The average English word has roughly 5 characters. An average person, as a result, types 200 characters per minute, or 3.33 characters per second. Taken one step further, this means there is roughly 300 ms between each character an average person types.

If you’re running on CPUs, taking 925 ms per request, you’re way too slow for Gmail’s Smart Compose. By the time you process one of a user’s characters, they’re roughly 3 characters ahead, and even more if they’re a fast typist.

With GPUs, however, you’re well ahead of them. At 199 ms per request, you’ll be able to predict the rest of their message with about 100 ms to spare—which is useful when you consider their browser still needs to render your prediction.
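To make the budget explicit, here is the same arithmetic as a short Python sketch, using only the averages quoted above:

# Per-keystroke time budget vs. measured serving latency
words_per_minute = 40
chars_per_word = 5

chars_per_second = words_per_minute * chars_per_word / 60.0  # ~3.33 chars/sec
ms_per_keystroke = 1000 / chars_per_second                   # ~300 ms budget

cpu_latency_ms = 925
gpu_latency_ms = 199

print(round(cpu_latency_ms / ms_per_keystroke, 1))  # ~3.1 keystrokes behind on CPUs
print(round(ms_per_keystroke - gpu_latency_ms))     # ~101 ms of headroom on a GPU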

The results are clear. To run real-time inference, particularly with larger models, GPUs are absolutely necessary for model serving.

One final note about cost

It would feel disingenuous to write about GPUs vs CPUs without acknowledging their cost difference.

CPUs are cheaper to run by a decent margin. In our tests, we ran our two vCPUs on an m5.large instance, which cost about $0.096 per hour to run. For our GPU, we used a g4dn.xlarge instance, which costs $0.526 per hour—a ~5.5x increase.
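For a back-of-the-envelope view of what those prices mean per request, here is a small sketch; it assumes requests are served strictly one at a time with no batching or concurrency, which is a simplification of any real workload:

# Rough cost comparison using the on-demand prices and latencies above.
# Assumes one request at a time, no batching or concurrency.
cpu_price_per_hour, cpu_latency_s = 0.096, 0.925  # m5.large
gpu_price_per_hour, gpu_latency_s = 0.526, 0.199  # g4dn.xlarge

print(round(gpu_price_per_hour / cpu_price_per_hour, 1))           # ~5.5x per hour
print(round(cpu_price_per_hour / 3600 * cpu_latency_s * 1000, 4))  # ~$0.0247 per 1,000 requests
print(round(gpu_price_per_hour / 3600 * gpu_latency_s * 1000, 4))  # ~$0.0291 per 1,000 requests

Under that simplified model the per-request gap is much smaller than the per-hour gap, though real traffic patterns and utilization will shift these numbers.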

Because of this cost increase, there is a legitimate argument to be made for using CPUs on smaller models. In fact, if your performance requirements can be satisfied by CPUs, you should use them rather than GPUs.

However, if your performance demands are larger than what CPUs are capable of, the cost difference is a moot point. GPUs simply give you the horsepower you need to build production-level deep learning APIs.

