How GPUs accelerate deep learning

Source: https://towardsdatascience.com/how-gpus-accelerate-deep-learning-3d9dec44667a?gi=a3582e073f8d

The embarrassingly parallel nature of neural networks

Neural networks and deep learning are not recent methods. In fact, they are quite old. Perceptrons, the first neural networks, were created in 1958 by Frank Rosenblatt. Even the invention of the ubiquitous building blocks of deep learning architectures happened mostly near the end of the 20th century. For example, convolutional networks were introduced in 1989 in the landmark paper Backpropagation Applied to Handwritten Zip Code Recognition by Yann LeCun et al.

Why did the deep learning revolution have to wait decades?

One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during training is computationally expensive. On large enough datasets, training used to take days or even weeks. Nowadays, you can train a state-of-the-art model on your notebook in a few hours.

There were three major advances which brought deep learning from a research tool to a method present in almost all areas of our life. These are backpropagation, stochastic gradient descent, and GPU computing. In this post, we are going to dive into the latter and see that neural networks are actually embarrassingly parallel algorithms, which can be leveraged to reduce computational costs by orders of magnitude.

A big pile of linear algebra

Deep neural networks may seem complicated at first glance. However, if we zoom in, we can see that their components are pretty simple in most cases. As the always brilliant xkcd puts it, a network is (mostly) a pile of linear algebra.

[Comic: a network as a big pile of linear algebra. Source: xkcd]

During training, the most commonly used functions are basic linear algebra operations such as matrix multiplication and addition. The situation is simple: if you call a function a bazillion times, shaving even the tiniest amount of time off each call compounds into serious savings.

Using GPUs does not merely provide a small improvement here; it supercharges the entire process. To see how, let's consider activations as an example.

Suppose that φ is an activation function such as ReLU or sigmoid. Applied to the output of the previous layer

x = (x₁, x₂, …, xₙ),

the result is

φ(x) = (φ(x₁), φ(x₂), …, φ(xₙ)).

(The same goes for multidimensional input such as images.)

This requires looping over the vector and calculating the value for each element. There are two ways to make this computation faster. First, we can calculate each φ(xᵢ) faster. Second, we can calculate the values φ(x₁), φ(x₂), …, φ(xₙ) simultaneously, in parallel. In fact, this computation is embarrassingly parallel, which means that it can be parallelized without any significant additional effort.
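To make this concrete, here is a minimal sketch in Python with NumPy (not from the original article; the array size and function names are illustrative) contrasting an explicit element-by-element loop with a single vectorized call that the library is free to parallelize:

```python
import numpy as np

def relu_loop(x):
    # Sequential version: visit every element one at a time.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def relu_vectorized(x):
    # One call applies the same operation to every element,
    # which the library can execute in parallel.
    return np.maximum(x, 0.0)

x = np.random.randn(1_000_000)
assert np.allclose(relu_loop(x), relu_vectorized(x))
```

Both functions compute the same values; the difference is purely in how the work is scheduled.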

Over the years, doing things faster has become much more difficult. Processor clock speeds rose steadily for decades, but that growth has largely plateaued. Modern processor design has reached the point where packing ever more transistors onto a chip runs into physical limits, including quantum-mechanical effects.

However, calculating the values in parallel does not require faster processors, just more of them. This is how GPUs work, as we are going to see.

The principles of GPU computing

Graphics Processing Units, or GPUs for short, were developed to create and process images. Since the value of every pixel can be calculated independently of the others, it is better to have many weaker processors than a single very strong one doing the calculations sequentially.

This is the same situation we have for deep learning models. Most operations can easily be decomposed into parts which can be completed independently.
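As a small illustration (assumed, not from the original article), a matrix product decomposes into independent dot products: every entry of the result can be computed without knowing any other entry, which is exactly the kind of work a GPU spreads across its many cores.

```python
import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)

# Each entry C[i, j] is an independent dot product of a row of A
# and a column of B, so all entries could be computed at the same
# time on separate processing units.
C = np.empty((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        C[i, j] = A[i, :] @ B[:, j]

assert np.allclose(C, A @ B)
```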

[Figure: NVIDIA Fermi architecture. There have been many improvements since, but it illustrates the point well. Source: NVIDIA Fermi architecture whitepaper]

To give you an analogy, consider a restaurant which has to produce French fries on a massive scale. To do this, workers must peel, slice, and fry the potatoes. Hiring people to peel the potatoes costs much more than purchasing kitchen robots capable of performing this task. Even if each robot is slower, the same budget buys many more of them, so overall the process will be faster.

Modes of parallelism

When talking about parallel programming, computing architectures can be classified into four different classes. This classification, known as Flynn's taxonomy, was introduced by Michael J. Flynn in 1966 and has been in use ever since.

  1. Single Instruction, Single Data (SISD)
  2. Single Instruction, Multiple Data (SIMD)
  3. Multiple Instruction, Single Data (MISD)
  4. Multiple Instruction, Multiple Data (MIMD)

A multi-core processor is MIMD, while GPUs are SIMD. Deep learning is a problem for which SIMD is very well suited. When you calculate the activations, the exact same operation needs to be performed, just on different data for each element.

Latency vs throughput

To give a more detailed picture of what makes a GPU better than a CPU, we need to take a look at latency and throughput. Latency is the time required to complete a single task, while throughput is the number of tasks completed per unit of time.

Simply put, a GPU can provide much better throughput, at the cost of latency. For embarrassingly parallel tasks such as matrix computations, this can offer an order of magnitude improvement in performance. However, it is not well suited for complex tasks, such as running an operating system.

CPUs, on the other hand, are optimized for latency, not throughput. They can do much more than floating-point calculations.
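A simple way to see the throughput difference yourself is to time a large matrix multiplication on both devices. The sketch below uses PyTorch and assumes a CUDA-capable GPU is available; the matrix size is arbitrary and the exact numbers will vary with hardware.

```python
import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

start = time.time()
a @ b
print(f"CPU: {time.time() - start:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()      # make sure the transfer has finished
    start = time.time()
    a_gpu @ b_gpu
    torch.cuda.synchronize()      # wait for the kernel to complete
    print(f"GPU: {time.time() - start:.3f} s")
```

Note that copying the data to the GPU has its own cost; that is the latency side of the trade-off. The win comes from keeping data on the device and running many large operations on it.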

General purpose GPU programming

In practice, general purpose GPU programming was not available for a long time. GPUs were restricted to graphics, and if you wanted to leverage their processing power, you had to learn graphics APIs such as OpenGL. This was not very practical, and the barrier to entry was high.

This was the case until 2007, when NVIDIA launched the CUDA framework, an extension of C that provides an API for GPU computing. This significantly flattened the learning curve. Fast forward a few years, and modern deep learning frameworks use GPUs without us having to think about it explicitly.

GPU computing for deep learning

So, we have talked about how GPU computing can be used for deep learning, but we haven't seen the effect. The following benchmark was made in 2017. Although it is a few years old, it still demonstrates the order-of-magnitude improvement in speed.

[Table: CPU vs GPU benchmarks for various deep learning frameworks. The benchmark is from 2017, so it reflects the state of the art at that time, but the point still stands: GPUs outperform CPUs for deep learning. Source: Benchmarking State-of-the-Art Deep Learning Software Tools]

How modern deep learning frameworks use GPUs

Programming directly in CUDA and writing kernels yourself is not the easiest thing to do. Thankfully, modern deep learning frameworks such as TensorFlow and PyTorch don't require you to do that. Behind the scenes, the computationally intensive parts are written in CUDA and use NVIDIA's deep learning library cuDNN. These are called from Python, so you don't need to touch them directly at all. Python is really strong in this respect: it can be combined with C easily, which gives you both the power and the ease of use.

This is similar to how NumPy works behind the scenes: it is blazing fast because its functions are written directly in C.
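In practice, "using the GPU" from a framework like PyTorch amounts to little more than placing tensors and models on the CUDA device; the CUDA/cuDNN-backed kernels are then called for you. A minimal sketch (the layer sizes and batch shape are made up for illustration):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)

batch = torch.randn(64, 784, device=device)

# The forward pass dispatches to GPU kernels when the device is
# "cuda"; the Python code is identical either way.
logits = model(batch)
```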

Do you need to build a deep learning rig?

If you want to train deep learning models on your own, you have several options. First, you can build a GPU machine yourself; however, this can be a significant investment. Thankfully, you don't need to: cloud providers such as Amazon and Google offer remote GPU instances to work on. If you want to access resources for free, check out Google Colab, which offers free access to GPU instances.
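On a rented instance or in Colab, it is worth verifying that a GPU was actually assigned before you start training. A quick check, assuming PyTorch is installed:

```python
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found, running on CPU.")
```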

Conclusion

Deep learning is computationally very intensive. For decades, training neural networks was limited by hardware. Even relatively small models had to be trained for days, and training large architectures on huge datasets was practically impossible.

However, with the appearance of general purpose GPU programming, deep learning exploded. GPUs excel at parallel computation, and since these algorithms can be parallelized very efficiently, they can accelerate training and inference by several orders of magnitude.

This has opened the way for rapid growth. Now, even relatively cheap commercially available computers can train state of the art models. Combined with the amazing open source tools such as TensorFlow and PyTorch, people are building awesome things every day. This is truly a great time to be in the field.

