
Introduction to compute shaders


A couple of months ago I went to the Munich GameCamp -- a bar camp where anyone can propose a talk, and a quick vote decides which talks get accepted. I had been asking around for ideas, and one developer mentioned I might want to cover compute shaders. So I went in without much hope of attracting many people, but I ended up with an overcrowded room -- roughly one quarter of all the game camp participants -- and rambled about compute shaders for roughly an hour. Afterwards, the main question I got was: "Where can I read about this?", and I couldn't point to a good introduction to compute shaders (there's Ryg's "A trip through the Graphics Pipeline", but that's already well past an introduction.)

The hardware

To understand how compute shaders happened, we have to take a look at the hardware evolution. Back in the old days, before shaders, geometry processing and texturing were separate. For instance, a Voodoo² card had one rasterizer and two texturing units, splitting the workload between them. This theme continued for a long time, even after shaders were introduced: up until the GeForce 7 and the Radeon X1950, GPUs had separate vertex and pixel shader units. Those units usually had similar capabilities in terms of what they could compute (after all, additions and multiplications are the bulk of the work on a GPU), but memory access differed a lot. For instance, accessing textures was something vertex shaders couldn't do for a long time. At that time, the split made sense, as scenes consisted of relatively few polygons covering many pixels, so having less vertex shading power typically didn't result in a bottleneck. By removing functionality from the vertex shaders, they could be better optimized and thus run faster.

Figure: A basic GPU pipeline with separate vertex and pixel units. Vertex units write shaded vertices into an attribute buffer, the pixel units consume it and also access memory through texture units.

However, a fixed distribution also makes it impossible to load-balance resources. As games evolved, sometimes more vertex power was required -- for instance, when rendering dense geometry like trees -- while other games were exploiting the new shader programming models to write more complex pixel shaders. This was also the time at which GPGPU programming started, that is, programmers were trying to solve numerical problems on GPUs. Eventually, it became clear that the fixed split was not good enough.

In late 2006 -- early 2007, the age of "unified shaders" began with the release of the GeForce 8800 GTX and the Radeon HD 2900 (technically speaking, the Xbox 360 was first, with a customized ATI chip.) Gone were the days of separate units; instead, the shader core could process any kind of workload.

Figure: A unified pixel & vertex pipeline. The units can all talk to memory through the texture units, and pass data between them through an attribute buffer.

So what happened back then? The ALUs -- the units executing the math instructions -- were already similar. What changed was that the distinction between the units was removed completely. Vertex shaders now executed on the same units as pixel shaders, making it possible to balance the workload between vertex and pixel shader work. One thing you'll notice here is that vertex and pixel shaders need to communicate somehow -- a decent chunk of memory is needed to store attributes for interpolation, like vertex color. A vertex shader will compute those per vertex and move on to the next vertex, but a pixel shader might have to hang on to this information for a while until all pixels are processed. We'll come back to this memory later; let's just remember it for now.

Figure: The unified pipeline, shown as a compute pipeline. The attribute buffer becomes the local or shared memory, the texture units become the gateway to global memory, and the pixel and vertex units are now general compute units.

This model -- some ALUs bundled together, with some extra memory to allow communication between sets of them -- was exposed at around the same time as a "compute shader". In reality there are a few more details to this; for instance, a compute unit usually does not process a single element but many of them, and there are quite a few more caches involved to make all of this efficient. Below is the diagram for a single compute unit in the AMD GCN architecture -- note that what I previously called a "compute unit" is now a "SIMD"; more on this in a moment.

Figure: An AMD GCN-class compute unit, consisting of four 16-wide SIMD units, local memory, and various caches. Blocks with a dashed outline are shared between more than one compute unit.

Note that other GPUs look very similar; I'm simply using GCN here as I'm most familiar with it. On NVIDIA hardware, a GCN compute unit maps to a "streaming multiprocessor", or SM. The SIMDs have different widths, and the caches look a bit different, but the basic compute model is still exactly the same.

Back to the SIMD unit -- GPUs have always been optimized to process many things concurrently. Where you have one pixel, you have many, and all of them run the same shader, so this is how the hardware was designed. In the case of GCN, the designers built four 16-wide SIMDs into each compute unit. A SIMD unit (SIMD is short for single instruction, multiple data) can execute one operation across 16 elements at once -- though not in one clock cycle, as there's some latency. For GCN, that latency is four cycles, so by having four SIMDs and pretending a SIMD is not 16 but 64 wide, the machine behaves as if it executes 64 operations per cycle. Confused? Don't worry, this is just an implementation detail; the point is that every GPU executes some set of instructions together, be it 64 in the GCN case or 32 on current NVIDIA architectures.
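For the curious, the bookkeeping from the paragraph above works out as follows (these are just the GCN numbers mentioned here, not universal constants):

```latex
4~\text{SIMDs} \times 16~\text{lanes} = 64~\text{lanes per compute unit per cycle}
\frac{64~\text{(wave width)}}{16~\text{(SIMD width)}} = 4~\text{cycles per wave instruction}
```

So each SIMD chews through a 64-wide wave over four cycles, and with four SIMDs taking turns, the compute unit as a whole retires 64 lanes' worth of work every cycle.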

What we ended up with is a GPU consisting of many compute units, each of which contains:

  • A bunch of SIMD units, each of which executes one instruction across many work items at once.
  • Some local memory which can be used to communicate between shader stages.

With those prerequisites in place, let's have a look at how we might expose this hardware when not doing pixel and vertex shader work. Let's get going!

The kernel

The goal is to write a programming system which allows us to take advantage of the hardware at hand. Obviously, the first thing we notice is that cross-compute-unit communication is something we want to avoid. All of the compute units are clients of the L2 cache, but clearly the L1 caches can go out of sync, so if we distribute work, we should pretend we can't talk across compute units. This means that we somehow have to split our work into smaller units. We could just pretend there are no compute units and dispatch individual items, but then we're missing out on the local memory, so it seems that we want one level below "here's all the work".

We could go directly to the SIMD level, but that's not optimal, as there is a capability within a compute unit we haven't discussed yet. As we have some kind of local memory, it's natural to assume we can use it to synchronize the SIMD units (we could just write something to local memory and wait for it to show up). This also means we can only synchronize within a single compute unit, so let's use that as the next grouping level -- we'll call the work that gets put onto a single compute unit a work group. We thus start with a work domain, which is then split into work groups. Each work group runs independently of the others. Given the developer might not know how many groups can run concurrently, we enforce that work groups can't rely on:

  • Another work group from the same domain running at the same time
  • Work groups inside a domain executing in any particular order

That allows us to run one work group after the other on a single unit, or all of them at once. Now, inside a work group, we still need many independent items of work so we can fill all of our SIMD units. Let's call the individual item of work a work item; our programming model will launch work groups containing many work items -- at least enough to fill up one compute unit (we probably want more than just enough to fill it, but more on this later.)

Let's put all of this together: Our domain consists of work groups, each work group gets mapped onto one compute unit, and the SIMD units process the work items.

Figure: Here is how our programming model maps onto the hardware so far. A domain consists of work groups, and those get assigned to compute units. Each work group is comprised of multiple work items.
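To make that mapping concrete, here is a minimal sketch in CUDA terms -- the work domain is the grid, a work group corresponds to a thread block, and a work item to a thread. The kernel name, the group size of 256, and the scale-by-two operation are made up purely for illustration:

```cuda
#include <cuda_runtime.h>

// One work item: scale a single element of the input.
// blockIdx identifies the work group, threadIdx the work item within it.
__global__ void scaleKernel(const float* in, float* out, int count)
{
    // Global index of this work item inside the whole domain.
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalId < count) {
        out[globalId] = in[globalId] * 2.0f;
    }
}

int main()
{
    const int count = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, count * sizeof(float));
    cudaMalloc(&out, count * sizeof(float));
    // Input is left uninitialized; this only demonstrates the dispatch.

    // The domain (grid) is split into work groups (blocks) of 256 work items.
    const int workGroupSize = 256;
    const int workGroupCount = (count + workGroupSize - 1) / workGroupSize;
    scaleKernel<<<workGroupCount, workGroupSize>>>(in, out, count);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The hardware is free to schedule those work groups onto compute units in any order, which is exactly the freedom we just demanded.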

There's one more level we haven't covered yet. I mentioned that a SIMD unit executes multiple work items together, yet we haven't exposed this in our programming model; there, we only have "independent" work items. Let's give the work items that execute together a name -- we'll call them a subgroup (on AMD, they're typically referred to as a "wavefront" or "wave", while NVIDIA calls them a "warp".) Now we can expose another set of instructions which perform operations across all items in a SIMD. For instance, we could compute the minimum of all values without having to go through local memory.
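As an example of such a subgroup operation, here is a small sketch in CUDA, where the subgroup is a 32-wide warp and the cross-lane exchange is done with shuffle intrinsics (the helper name warpMin is an invention for this example):

```cuda
// Each lane starts with its own value; after the loop, lane 0 holds the
// minimum of all 32 values in the warp -- no local (shared) memory involved.
__device__ float warpMin(float value)
{
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset /= 2) {
        value = fminf(value, __shfl_down_sync(0xffffffffu, value, offset));
    }
    return value;
}
```

Called from a kernel, every lane passes in its own value, and the reduction happens entirely in registers inside the subgroup.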

I mentioned before that there are only very few constraints on ordering. For instance, a GPU may want to execute a whole domain on a single compute unit. One of the reasons the hardware needs this freedom is memory latency. Those wide SIMD units are great when it comes to number crunching, but their width also means that memory access is, relatively speaking, super slow. In fact, it's horribly slow: a GCN compute unit can process 64 float multiply-add instructions per cycle, which corresponds to 64×3×4 bytes of input data and 64×4 bytes of output. Across a large chip like a Vega10, that's 48 KiB worth of data read in a single cycle -- at 1.5 GHz, that's 67 TiB of data you'd have to read in every second. GPU memory systems have thus been optimized for throughput, at the expense of latency, and that impacts how we program them.
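Spelled out, the back-of-the-envelope calculation looks like this (assuming Vega10's 64 compute units and 4-byte floats):

```latex
\text{input per CU per cycle} = 64 \times 3 \times 4~\text{B} = 768~\text{B}
\text{input per chip per cycle} = 64~\text{CUs} \times 768~\text{B} = 48~\text{KiB}
\text{input per second at } 1.5~\text{GHz} = 48~\text{KiB} \times 1.5 \times 10^9 \approx 67~\text{TiB}
```

No memory system delivers that, so keeping the ALUs busy has to come from somewhere else -- which is where latency hiding enters the picture.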

Before we head into the effects of this on the programming model, let's summarize what compute shaders look like:

  • Work is defined through a 3-level hierarchy:
    • A domain specifies all of the work
    • The domain is subdivided into work groups, which execute independently, but allow for communication inside a group
    • Work items are the individual elements of work
  • Some APIs also expose an intermediate level -- the subgroup -- which allows some optimizations below the work-group level, but above the work-item level.
  • Work groups can synchronize internally and exchange data through local memory (see the sketch after this list).
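As a concrete illustration of that last point, here is a minimal sketch in CUDA, where local memory is __shared__ memory and in-group synchronization is __syncthreads(); the kernel name and the work group size of 256 (a power of two, matching the launch configuration) are assumptions for the example:

```cuda
// Each work group of 256 work items sums its part of the input and writes
// one partial sum. Communication happens purely through local (shared) memory.
__global__ void partialSums(const float* in, float* out, int count)
{
    __shared__ float scratch[256];          // local memory of the work group

    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (globalId < count) ? in[globalId] : 0.0f;
    __syncthreads();                        // everyone has written its value

    // Tree reduction within the work group.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();                    // wait before the next round
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = scratch[0];       // one result per work group
    }
}
```

Note that all synchronization stays within one work group; combining the per-group results would require a second dispatch or some other mechanism, exactly because groups can't rely on each other.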

This programming model is universal across all GPU compute shader APIs. Differences occur in terms of what guarantees are provided; for instance, an API might limit the number of work groups in flight to allow you to synchronize between all of them, or some hardware might guarantee the execution order so you can pass information from one work group to the next.

Latency hiding & how to write code

Now that we understand what the hardware and the programming model look like, how do we actually program this efficiently? Until now, it sounds like we're writing normal code, and all that happens is that we process a work item in it, potentially communicating with our neighbors on the same SIMD, with others through local memory, and finally reading and writing through global memory. The problem is that "normal" code is not good enough, as we need to hide latency.

The way a GPU hides latency is by having more work in flight. A lot more work -- taking GCN again as the example, every compute unit can have up to 40 subgroups in flight. Every time a subgroup accesses memory, another one gets scheduled in. Given only four can execute at the same time (one per SIMD), this means we can switch up to 10 times before we return to the subgroup which started the original request.

Figure: The first subgroup hits a memory instruction and stalls. The compute unit immediately schedules the next subgroup, which also stalls. After the fourth subgroup has issued an instruction, the memory subsystem has returned, and the first subgroup can resume processing. This way, the compute unit was executing work at every cycle.

There's a problem here though -- subgroup switching needs to be instantaneous to make this possible, which means you can't write the program state to memory and read it back. Instead, all state of all programs in flight is kept permanently resident in registers. This requires a huge amount of registers -- a single compute unit on GCN has 256 KiB of registers. With that, we can use up to 256 KiB / 40 subgroups / 64 lanes / 4 bytes per register ≈ 25 registers for a single work item before we need to reduce occupancy. For our programming style, this means we should try to minimize the amount of state we need to keep around as much as possible, at least as long as we're accessing memory. If we don't access memory, a single wave can keep a SIMD 100% occupied. We should also make sure to use local memory and the L1 cache as much as we can, as they have more bandwidth and lower latency than external memory.
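Written out, the register budget at full occupancy is:

```latex
\frac{256~\text{KiB}}{40~\text{subgroups} \times 64~\text{lanes} \times 4~\text{B per register}}
  = \frac{262144~\text{B}}{10240~\text{B}} \approx 25~\text{registers per work item}
```

Turn the calculation around and the trade-off becomes clear: a shader that needs, say, 50 registers per work item leaves room for only about 20 subgroups per compute unit -- 5 per SIMD -- which halves the number of candidates the scheduler can switch to while waiting on memory.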

Figure: One SIMD can process multiple work items in a single clock cycle; in this figure, it's 8 wide. If a branch is short, we don't waste much computation. Long branches which disable most of the SIMD lanes, however, can drastically reduce our throughput.

The other problem we're going to face is how branches are handled. SIMDs execute everything together, so they can't perform jumps. What happens instead is that individual lanes get masked off, i.e. they no longer participate. This also implies that unless all work items in a subgroup take the same branch, a GPU will typically execute both sides of the branch.

This means that for a SIMD of width N, we can get a worst-case utilization of 1/N. On CPUs, N is usually 1..8, so it's not catastrophic, but on a GPU N can be as high as 64, at which point this really matters. How can we make sure this doesn't happen? First, we can take advantage of subgroup execution: if we have a branch that selects between a cheap and an expensive version, and some work items take the expensive version, we might as well all take it. This reduces the cost of the branch from expensive + cheap to just expensive. The other part is to simply avoid excessive branching. As CPUs get wider, this is becoming more important there as well, and techniques like sorting all data first and then processing it become more interesting than heavy branching on individual work items. For instance, if you're writing a particle simulation engine, it's much faster to sort the particles by type and then run a specialized program for each type, compared to running all possible simulation programs for each particle.
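Here is a sketch of the first trick in CUDA: instead of letting individual lanes diverge, we vote across the warp (the subgroup) and let every lane take the expensive path if any lane needs it. The shade helper and the two path functions are placeholders invented for this example:

```cuda
__device__ float cheapPath(float x)     { return x; }             // placeholder
__device__ float expensivePath(float x) { return x * x + 1.0f; }  // placeholder

__device__ float shade(float x, bool needsExpensive)
{
    // If any active lane in the warp needs the expensive version, all lanes
    // take it: the cost drops from (expensive + cheap) to just expensive.
    if (__any_sync(__activemask(), needsExpensive)) {
        return expensivePath(x);
    }
    return cheapPath(x);
}
```

Whether this pays off depends on how expensive the expensive path really is and how often any lane in a subgroup actually needs it; if it is rare, the vote avoids running both sides almost every time.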

What have we learned? We need:

  • A large problem domain -- the more independent work items, the better.
  • Memory-intensive code should minimize its state, to allow for high occupancy.
  • We should avoid branches which disable most of the subgroup.

Those guidelines will apply universally to all GPUs. After all, all of them were designed to solve massively parallel graphics problems, and those have little branching, not too much state in flight, and tons and tons of independent work items.

Summary

I hope this blog post gives you a good idea of how we ended up with compute shaders. The really interesting bit is that the concepts of minimal communication, local memory access, and divergence costs are universally applicable. Modern multi-core CPUs are also introducing new levels of communication costs, NUMA has long been used to model costs for memory access, and so on. Understanding that not all memory is equal, and that your code is executed in some particular way on the underlying hardware, will enable you to squeeze out more performance everywhere!

