Modern garbage collection

A look at the Go GC strategy

You can find discussions on Hacker News and Reddit

I’ve seen a bunch of articles lately which promote the Go language’s latest garbage collector in ways that trouble me. Some of these articles come from the Go project itself. They make claims that imply a radical breakthrough in GC technology has occurred.

Here is the initial announcement of a new collector in August 2015:

Go is building a garbage collector (GC) not only for 2015 but for 2025 and beyond … Go 1.5’s GC ushers in a future where stop-the-world pauses are no longer a barrier to moving to a safe and secure language. It is a future where applications scale effortlessly along with hardware and as hardware becomes more powerful the GC will not be an impediment to better, more scalable software. It’s a good place to be for the next decade and beyond.

The Go team claim not only to have solved the problem of GC pauses, but also to have made the entire thing brainless:

At a higher level, one approach to solving performance problems is to add GC knobs, one for each performance issue. The programmer can then turn the knobs in search of appropriate settings for their application. The downside is that after a decade with one or two new knobs each year you end up with the GC Knobs Turner Employment Act. Go is not going down that path. Instead we provide a single knob, called GOGC.
Furthermore, unencumbered by ongoing support for dozens of knobs, the runtime team can focus on improving the runtime based on feedback from real customer applications.

I have no doubt that many Go users are very happy with the new runtime. But I have a bone to pick with these claims: to me they come across as a misleading piece of marketing. As these claims are getting repeated across the blogosphere, it’s time to take a deeper look at them.

The reality is that Go’s GC does not really implement any new ideas or research. As their announcement admits, it is a straightforward concurrent mark/sweep collector based on ideas from the 1970s. It is notable only because it has been designed to optimise for pause times at the cost of absolutely every other desirable characteristic in a GC. Go’s tech talks and marketing materials don’t seem to mention any of these tradeoffs, leaving developers unfamiliar with garbage collection technologies to assume that no such tradeoffs exist, and by implication, that Go’s competitors are just badly engineered piles of junk. And Go encourages this perception:

To create a garbage collector for the next decade, we turned to an algorithm from decades ago. Go’s new garbage collector is a concurrent, tri-color, mark-sweep collector, an idea first proposed by Dijkstra in 1978. This is a deliberate divergence from most “enterprise” grade garbage collectors of today, and one that we believe is well suited to the properties of modern hardware and the latency requirements of modern software

Reading this announcement, you could be forgiven for thinking that the last 40 years of “enterprise” GC research had achieved nothing at all.

A primer on GC theory

Here are the different factors you will want to think about when designing a garbage collection algorithm:

Program throughput: how much does your algorithm slow the program down? This is sometimes expressed as a percentage of CPU time spent doing collection vs useful work.
GC throughput: how much garbage can the collector clear given a fixed amount of CPU time?
Heap overhead: how much additional memory over the theoretical minimum does your collector require? If your algorithm allocates temporary structures whilst collecting, does that make memory usage of your program very spiky?
Pause times: how long does your collector stop the world for?
Pause frequency: how often does your collector stop the world?
Pause distribution: do you typically have very short pauses but sometimes have very long pauses? Or do you prefer pauses to be a bit longer but consistent?
Allocation performance: is allocation of new memory fast, slow, or unpredictable?
Compaction: does your collector ever report an out-of-memory (OOM) error even if there’s sufficient free space to satisfy a request, because that space has become scattered over the heap in small chunks? If your collector doesn’t compact the heap, you may find your program slows down and eventually dies, even if it actually had enough memory to continue.
Concurrency: how well does your collector use multi-core machines?
Scaling: how well does your collector work as heaps get larger?
Tuning: how complicated is the configuration of your collector, out of the box and to obtain optimal performance?
Warmup time: does your algorithm self-adjust based on measured behaviour and if so, how long does it take to become optimal?
Page release: does your algorithm ever release unused memory back to the OS? If so, when?
Portability: does your GC work on CPU architectures that provide weaker memory consistency guarantees than x86?
Compatibility: what languages and compilers does your collector work with? Can it be run with languages that weren’t designed for GC, like C++? Does it require compiler modifications? And if so, does changing GC algorithm require recompiling all your program and dependencies?

As you can see, there are a lot of different factors that go into designing a garbage collector and some of them impact the design of the wider ecosystem around your platform. I’m not even sure I got them all.

Because the design space is so complex, garbage collection is a subfield of computer science rich in research papers. New algorithms are proposed and implemented at a steady rate, by both academia and industry. Unfortunately, nobody has yet found a single algorithm that is ideal for all situations.

Tradeoffs, tradeoffs everywhere

Let’s make that a bit more concrete.

The first garbage collection algorithms were designed for uniprocessor machines and programs that had small heaps. CPU and RAM were expensive and users were not very demanding, so visible pauses were OK. Algorithms designed for this world prioritised minimising the CPU and heap overhead of the collector. This meant a GC that did nothing at all until you failed to allocate memory. Then the program would be paused and a full mark/sweep of the heap would be done to mark parts as free as quickly as possible.
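To make that concrete, here is a minimal sketch of one stop-the-world mark/sweep cycle in Go. The Object and Heap types and the Collect method are invented for illustration and are not taken from any real runtime; a real collector works on raw memory rather than on a slice of structs.

```go
package main

import "fmt"

// Object is a hypothetical heap cell: it holds references to other
// objects plus a mark bit used by the collector.
type Object struct {
	Name   string
	Refs   []*Object
	marked bool
}

// Heap is a toy heap that tracks every allocated object.
type Heap struct {
	objects []*Object
}

// Alloc creates a new object and registers it with the heap.
func (h *Heap) Alloc(name string) *Object {
	o := &Object{Name: name}
	h.objects = append(h.objects, o)
	return o
}

// mark sets the mark bit on everything reachable from o.
func mark(o *Object) {
	if o == nil || o.marked {
		return
	}
	o.marked = true
	for _, r := range o.Refs {
		mark(r)
	}
}

// Collect runs one full cycle while the "program" is stopped: mark from
// the roots, then sweep every unmarked object. It returns the number of
// objects freed.
func (h *Heap) Collect(roots ...*Object) int {
	for _, r := range roots {
		mark(r)
	}
	live := h.objects[:0]
	freed := 0
	for _, o := range h.objects {
		if o.marked {
			o.marked = false // reset the mark bit for the next cycle
			live = append(live, o)
		} else {
			freed++
		}
	}
	h.objects = live
	return freed
}

func main() {
	h := &Heap{}
	a, b := h.Alloc("a"), h.Alloc("b")
	h.Alloc("c") // never referenced from a root, so it is garbage
	a.Refs = append(a.Refs, b)
	fmt.Println("freed:", h.Collect(a), "live:", len(h.objects)) // freed: 1 live: 2
}
```

Note that both phases walk the whole heap, which is why the cost of this approach grows with heap size.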

These types of collectors are old but still have some advantages — they are simple, don’t slow down your program when not collecting and don’t add any memory overhead. In the case of conservative collectors like the Boehm GC they don’t even need changes to your compiler or programming language! This can make them appropriate for desktop apps that typically have small heaps, including AAA video games where the bulk of RAM is taken by data files which don’t need to be scanned.

Stop-the-world (STW) mark/sweep is the GC algorithm most commonly taught in undergrad computer science classes. When doing job interviews I sometimes ask candidates to talk a bit about GC and almost always, they either see GC as a black box and know nothing about it at all, or think it still uses this by now very old technique.

The problem is that simple STW mark/sweep scales very badly. As you add cores and grow your heaps/allocation rates ever larger, this algorithm stops working well. But sometimes you actually do have small heaps, and the pause times from even simple approaches are good enough! In that case, maybe you still want to use this approach and keep your overheads low.

At the other end of the spectrum, perhaps you are using heaps hundreds of gigabytes in size on a machine with dozens of cores. Perhaps your server is doing trading in financial markets, or running a search engine, and thus low pause times are very important to you. In these cases you are probably willing to use an algorithm that actually slows down your program whilst it runs in order to do collection in the background and with low pause times.

It’s not a simple spectrum! At the high end you can also have large batch jobs. As they are non-interactive, pause times don’t matter at all, only total runtime. In such situations you are better off with an algorithm that maximises throughput above all else, i.e. the ratio of useful work done to time spent doing collection.

The problem is that there’s no single algorithm that is perfect in all aspects. Nor can a language runtime know whether your program is a batch job or an interactive latency-sensitive program. That’s the start of why “GC tuning” exists — it’s not because runtime engineers are dumb. It reflects fundamental limits in our capabilities in computer science.

The generational hypothesis

It has been known since 1984 that most allocations “die young” i.e. become garbage very soon after being allocated. This observation is called the generational hypothesis and is one of the strongest empirical observations in the entire PL engineering space. It has been consistently true across very different kinds of programming languages and across decades of change in the software industry: it is true for functional languages, imperative languages, languages that don’t have value types and languages that do.

Discovering this fact about programs was useful because it meant GC algorithms could be designed to take advantage of it. These new generational collectors had lots of improvements over the old stop-mark-sweep style:

GC throughput: they could collect a lot more garbage a lot faster.
Allocation performance: allocating new memory no longer required searching through the heap looking for a free slot, so allocation became effectively free.
Program throughput: allocations became neatly laid out in space next to each other, which improved cache utilisation significantly. Generational collectors do require the program to do some extra work as it runs, but that hit seems empirically to be outweighed by the improved cache effects.
Pause times: most (but not all) pause times became much lower.
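The allocation performance point is easiest to see in code. Below is a rough sketch of bump-pointer allocation in a young generation, written in Go. The Nursery type and its methods are hypothetical names made up for this example; the point is only that allocating is a bounds check plus an addition, with no search for a free slot.

```go
package main

import "fmt"

// Nursery is a hypothetical young-generation region that hands out
// memory by bumping an offset.
type Nursery struct {
	buf  []byte
	next int // offset of the first free byte
}

// NewNursery reserves a region of the given size.
func NewNursery(size int) *Nursery {
	return &Nursery{buf: make([]byte, size)}
}

// Alloc returns size bytes from the nursery, or nil when the nursery is
// full. A real collector would run a minor collection at that point.
func (n *Nursery) Alloc(size int) []byte {
	if size < 0 || n.next+size > len(n.buf) {
		return nil
	}
	p := n.buf[n.next : n.next+size : n.next+size]
	n.next += size
	return p
}

// Reset discards everything in one step. This models the state of the
// nursery after any survivors have been copied elsewhere.
func (n *Nursery) Reset() { n.next = 0 }

func main() {
	n := NewNursery(64)
	a := n.Alloc(16)
	b := n.Alloc(16)
	fmt.Println(len(a), len(b), n.next) // 16 16 32
	fmt.Println(n.Alloc(64) == nil)     // true: not enough room left
	n.Reset()
	fmt.Println(n.Alloc(64) != nil) // true: the whole region is free again
}
```

Consecutive allocations also end up next to each other in memory, which is where the cache benefit comes from.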

They also introduced some downsides:

Compatibility: implementing a generational collector requires the ability to move things around in memory, and do extra work when the program writes to a pointer in some cases. This means the GC must be tightly integrated with the compiler. There are no generational collectors for C++.
Heap overhead: these collectors work by copying allocations back and forth between various ‘spaces’. Because there must be space to copy to, these collectors impose some heap overhead. Also, they require various pointer maps to be maintained (the remembered sets), further increasing overhead.
Pause distribution: whilst many GC pauses were now very fast, some still required doing a full mark/sweep over the entire heap.
Tuning: generational collectors introduce the notion of a “young generation” or “eden space”, and program performance becomes quite sensitive to the sizing of this space.
Warmup time: in response to the tuning issue, some collectors dynamically adapt the young generation size by observing how the program runs in practice, but now pause times depend on how long the program has been running as well. In practice this rarely matters outside of benchmarking.
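The "extra work when the program writes to a pointer" is usually called a write barrier, and the pointer maps are the remembered sets. Here is a hedged sketch in Go of how the two fit together. All names (Gen, Object, RememberedSet, WriteRef) are assumptions for illustration; real collectors typically use card tables or similar compact structures rather than a map.

```go
package main

import "fmt"

// Gen identifies which generation an object lives in.
type Gen int

const (
	Young Gen = iota
	Old
)

// Object is a hypothetical heap object with outgoing references.
type Object struct {
	Gen  Gen
	Refs []*Object
}

// RememberedSet records old objects that point into the young
// generation, so a young collection can treat them as extra roots
// without scanning the whole old generation.
type RememberedSet map[*Object]struct{}

// WriteRef performs a pointer store through a write barrier: the store
// itself, plus bookkeeping when an old object starts pointing at a young one.
func WriteRef(rs RememberedSet, src, target *Object) {
	src.Refs = append(src.Refs, target)
	if src.Gen == Old && target != nil && target.Gen == Young {
		rs[src] = struct{}{}
	}
}

func main() {
	rs := RememberedSet{}
	old := &Object{Gen: Old}
	young := &Object{Gen: Young}

	WriteRef(rs, young, old) // young -> old: nothing to remember
	fmt.Println(len(rs))     // 0
	WriteRef(rs, old, young) // old -> young: must be remembered
	fmt.Println(len(rs))     // 1
}
```

The compiler has to emit something like WriteRef for every pointer store, which is why this kind of collector cannot be bolted onto a language after the fact.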

Still, the benefits are so huge that basically all modern GC algorithms are generational. If you can afford it — and you probably can — then you want it. Generational collectors can be enhanced with all sorts of other features, and a typical modern GC will be concurrent, parallel, compacting and generational all together.

The Go concurrent collector

As Go is a relatively ordinary imperative language with value types, its memory access patterns are probably comparable to C# where the generational hypothesis certainly holds and thus .NET uses a generational collector.

In fact Go programs are usually request/response processors like HTTP servers, meaning that Go programs exhibit strongly generational behaviour, and the Go team are exploring potentially exploiting that in future with something they call the “request oriented collector”. It has been observed that this is essentially a renamed generational GC with a tweaked tenuring policy. Such a GC can be simulated in other runtimes for request/response processors by ensuring the young generation is large enough that all garbage generated by handling a request fits within it.

Despite that, Go’s current GC is not generational. It just runs a plain old mark/sweep in the background.
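For readers who have not seen it, the tri-color scheme named in the announcement can be sketched in a few lines of Go. This is a toy model under my own assumptions, not the Go runtime's implementation: the Marker type, the Step method and the simple insertion barrier in Write are illustrative names, and a real concurrent collector has to handle threads, stacks and memory ordering as well.

```go
package main

import "fmt"

// Color is the marking state of an object.
type Color int

const (
	White Color = iota // not seen yet; garbage if still white at the end
	Grey               // seen, but its references have not been scanned
	Black              // seen and fully scanned
)

// Object is a hypothetical heap object.
type Object struct {
	Color Color
	Refs  []*Object
}

// Marker holds the worklist of grey objects.
type Marker struct {
	grey []*Object
}

// Shade turns a white object grey and queues it for scanning.
func (m *Marker) Shade(o *Object) {
	if o != nil && o.Color == White {
		o.Color = Grey
		m.grey = append(m.grey, o)
	}
}

// Step scans a single grey object and reports whether more work remains.
// Because marking proceeds in small steps, the program can run in between.
func (m *Marker) Step() bool {
	if len(m.grey) == 0 {
		return false
	}
	o := m.grey[len(m.grey)-1]
	m.grey = m.grey[:len(m.grey)-1]
	for _, r := range o.Refs {
		m.Shade(r)
	}
	o.Color = Black
	return len(m.grey) > 0
}

// Write is the program's pointer store with an insertion barrier: the new
// target is shaded so that an already-black object cannot hide a white one.
func (m *Marker) Write(src, target *Object) {
	src.Refs = append(src.Refs, target)
	m.Shade(target)
}

func main() {
	root, a, b, c := &Object{}, &Object{}, &Object{}, &Object{}
	root.Refs = []*Object{a}

	m := &Marker{}
	m.Shade(root)
	m.Step()         // root is now black, a is grey
	m.Write(root, c) // the program stores a new pointer mid-cycle
	for m.Step() {
	}
	fmt.Println(a.Color == Black, c.Color == Black, b.Color == White) // true true true
}
```

Anything still white once the worklist is empty (b here) is unreachable and can be swept.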

Doing it this way has one advantage — you can get very very low pause times — but makes almost everything else worse. Like what? Well, from our basic theory above we can see:

GC throughput: The time needed to clear the heap of garbage scales with the size of a heap. Put simply, the more memory your program uses the more slowly memory gets freed up, and the more time your computer spends doing collection vs useful work. The only way this isn’t true is if your program doesn’t parallelise at all but you can keep throwing cores at the GC without limit.
Compaction: as there’s no compaction, your program can eventually fragment its heap. I’ll talk about heap fragmentation more below. You also don’t benefit from having things laid out neatly in the cache.
Program throughput: as the GC has to do a lot of work for every cycle, that steals CPU time from the program itself, slowing it down.
Pause distribution: any garbage collector that runs concurrently with your program can encounter what the Java world calls a “concurrent mode failure”: your program creates garbage faster than the GC threads can clean it up. In this case the runtime has no choice but to stop your program entirely and wait for the GC cycle to complete. Thus when Go claims GC pauses are very low, this claim can only be true for the case where the GC has sufficient CPU time and headroom to outrun the main program. Additionally the Go compiler lacks features needed to ensure threads can be reliably paused quickly, meaning that whether pause times are actually low or not depends heavily on what kind of code you’re running (e.g. base64 decoding a large blob in a single goroutine can cause pause times to go up).
Heap overhead: because collecting the heap via mark/sweep is very slow, you need lots of spare space to ensure you don’t suffer a “concurrent mode failure”. Go defaults to a heap overhead of 100% … it doubles the amount of memory your program needs.
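The heap overhead figure follows directly from the GOGC knob. As a back-of-the-envelope sketch in Go: with a given amount of live data, the next cycle is triggered once the heap has grown by roughly GOGC percent on top of it. The nextHeapTarget function below is my own simplified model and not a runtime API; the real pacer is more involved. debug.SetGCPercent is the standard library call that changes the setting from inside a program.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// nextHeapTarget approximates the heap size at which the next GC cycle
// is triggered: the live heap plus gogc percent of it. This is a
// simplified model for illustration, not the runtime's actual pacer.
func nextHeapTarget(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	const live = 512 << 20 // assume 512 MiB of live data
	for _, gogc := range []int{50, 100, 200} {
		fmt.Printf("GOGC=%d -> heap target ~%d MiB\n", gogc, nextHeapTarget(live, gogc)>>20)
	}
	// Prints targets of roughly 768, 1024 and 1536 MiB.

	// SetGCPercent changes the knob at runtime and returns the old value.
	prev := debug.SetGCPercent(200)
	fmt.Println("previous setting:", prev)
	debug.SetGCPercent(prev) // restore
}
```

Lowering the value saves memory but makes the collector run more often, so the single knob is really a direct trade between heap overhead and CPU time.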

We can see these tradeoffs at work in posts to golang-dev like this one:

The Service 1 allocates more than the Service 2, so STW pauses are higher there. But STW pause duration dropped by an order of magnitude on both services. We see ~20% increase in CPU usage spent in GC after the switch on both services.

So in this specific case Go bought an order of magnitude drop in pause times, but at a cost of an even slower collector. Was that a good tradeoff or were pause times low enough already? The poster does not say.

There comes a point though, where paying for more hardware to get lower pause times no longer makes sense. If your server pause times go from 10msec to 1msec, will your users really notice that? What if you had to double your machine count to get it?

Go optimises for pause times at the expense of throughput to such an extent that it seems willing to slow down your program by almost any amount in order to get even just slightly faster pauses.

Comparison with Java

The HotSpot JVM has several GC algorithms you can choose on the command line. None aim for pause times as low as Go’s because they balance them against other factors. It’s worth comparing them to get a feel for the tradeoffs. It’s possible to switch between GCs just by restarting the program because compilation is done whilst the program runs, so the different barriers the different algorithms need can be compiled and optimised into the code as needed.

The default algorithm on any modern computer is the throughput collector. This is designed for batch jobs and by default does not have any pause time goal (one can be given on the command line). This choice of defaults is one reason people tend to think Java GC must kind of suck: out of the box, Java tries to make your app run as fast as possible, with as little memory overhead as possible, and pause times be damned.

If pause times matter more to you, then you might switch to the concurrent mark/sweep collector (CMS). This is the closest comparable algorithm to the one Go uses. But it’s also generational, and that’s why it has pause times longer than Go’s: the young generation is compacted whilst the app is paused because it involves moving objects around. There are two types of pauses in CMS. The first, faster kind, might last around 2–5 milliseconds. The second might be more like 20 milliseconds. CMS is adaptive: because it’s concurrent it has to guess when to start running (just like Go). Whereas Go asks you to configure the heap overhead to tune that, CMS will adapt itself at runtime to try and avoid concurrent mode failures. Because the bulk of the heap is ordinary mark/sweep, it’s possible to hit problems and slowdowns because of heap fragmentation.
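Heap fragmentation is easy to demonstrate with a toy model. The Go sketch below uses a hypothetical Heap of fixed-size slots with first-fit allocation; none of these names come from a real allocator. After freeing every other slot, half the heap is free yet a two-slot request fails, and it only succeeds once the live data has been compacted.

```go
package main

import "fmt"

// Heap is a toy non-moving heap of fixed-size slots; used[i] reports
// whether slot i is occupied.
type Heap struct{ used []bool }

// NewHeap creates a heap with the given number of free slots.
func NewHeap(slots int) *Heap { return &Heap{used: make([]bool, slots)} }

// Alloc claims the first run of n contiguous free slots (first fit) and
// returns its start index, or -1 if no such run exists.
func (h *Heap) Alloc(n int) int {
	if n <= 0 {
		return -1
	}
	run := 0
	for i, u := range h.used {
		if u {
			run = 0
			continue
		}
		run++
		if run == n {
			start := i - n + 1
			for j := start; j <= i; j++ {
				h.used[j] = true
			}
			return start
		}
	}
	return -1
}

// Free releases n slots beginning at start.
func (h *Heap) Free(start, n int) {
	for j := start; j < start+n; j++ {
		h.used[j] = false
	}
}

// FreeSlots counts the free slots, contiguous or not.
func (h *Heap) FreeSlots() int {
	free := 0
	for _, u := range h.used {
		if !u {
			free++
		}
	}
	return free
}

// Compact slides all occupied slots to the front, as a moving collector would.
func (h *Heap) Compact() {
	live := len(h.used) - h.FreeSlots()
	for i := range h.used {
		h.used[i] = i < live
	}
}

func main() {
	h := NewHeap(8)
	for i := 0; i < 8; i++ {
		h.Alloc(1)
	}
	for i := 0; i < 8; i += 2 {
		h.Free(i, 1) // free every other slot
	}
	fmt.Println(h.FreeSlots(), h.Alloc(2)) // 4 -1: half free, request fails
	h.Compact()
	fmt.Println(h.Alloc(2)) // 4: succeeds once free space is contiguous
}
```

A non-moving mark/sweep heap has no equivalent of Compact, so it has to rely on clever free-list management to keep this from happening.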

The latest generation Java GC is called “G1” for “garbage first”. It’s not on by default in Java 8, but will be in Java 9. It is intended to be a general purpose one-size-fits-all algorithm, or as close as you can get right now. It is mostly concurrent, generational and compacting for the entire heap. It is also largely self tuning, but because (like all GC algorithms) it can’t know what you really want, it allows you to specify your preferred tradeoffs: just tell it the maximum amount of RAM you will let it use and a pause time goal in milliseconds, and it’ll adjust everything else as the app runs to try and meet the pause time goal. The default pause time goal is 200msec, so you shouldn’t expect to see better than that unless you specify a different goal: G1 would rather make your app run faster than pause for less time than the goal. Pauses aren’t entirely consistent: most are extremely fast (less than a millisecond), and some will be slower (more like 50 milliseconds) when the heap is being compacted. G1 scales very well. There are reports of people using it with terabyte sized heaps. It also has some neat features, like deduplicating the strings in the heap.

Finally, a new GC algorithm has been developed called Shenandoah. It is being contributed to OpenJDK but won’t be in Java 9 unless you use special Java builds from Red Hat (who sponsor the project). This is designed to give very low pause times regardless of heap size whilst still being compacting. The cost is extra heap overhead and more barriers: to move objects around whilst the app is still running requires both pointer reads and writes to interact with the GC. In this sense it is similar to Azul’s “pauseless” collector.

Conclusion

The point of this article is not to convince you to use a different programming language or tool. But if you take one thing away, let it be this: garbage collection is a hard problem, really hard, one that has been studied by an army of computer scientists for decades. So be very suspicious of supposed breakthroughs that everyone else missed. They are more likely to just be strange or unusual tradeoffs in disguise, avoided by others for reasons that may only become apparent later.

But if you do wish to minimize pause times at the expense of everything else, then by all means, check out the Go GC.
