
Tuning Haskell RTS for Kubernetes, Part 1


We’re running Haskell in production. We’ve told that story before.

We are also running Haskell in production on Kubernetes, but we never talked about that. It was a long journey and it wasn’t all roses, so we’re going to share what we went through.

TL;DR

Configure Haskell RTS settings:

  • Match -N⟨x⟩ to your limits.cpu
  • Match -M⟨size⟩ to your limits.memory

You can set them between +RTS ... -RTS arguments to your program, like +RTS -M5.7g -N3 -RTS.
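For example, a container spec along these lines keeps the flags and the resource limits in sync. This is a minimal sketch: the image name and exact numbers are placeholders, and whether +RTS flags are accepted at run time depends on how the binary was linked (GHC's -rtsopts flag; -with-rtsopts can bake defaults into the executable instead).

    containers:
      - name: haskell-service                  # placeholder name
        image: example.com/haskell-service:latest
        # RTS flags matched to the limits below
        args: ["+RTS", "-N3", "-M5.7g", "-RTS"]
        resources:
          requests:
            cpu: "3"
            memory: 6Gi
          limits:
            cpu: "3"
            memory: 6Gi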

Our scenario

We had been running Haskell in production before Kubernetes. Each application was the single inhabitant of its own EC2 instance. Things were smooth. We launched the executable, provisioned what looked like fast enough instances, and things just worked.

We could have kept our conditions pretty much the same when moving to Kubernetes by giving each Haskell Pod as much requests.memory and requests.cpu as its worker node had, so that each machine would run a single Pod.

We had two main incentives to run small Pods, all packed together into beefier worker nodes:

  • Our traffic is very seasonal, and even within a single day we go from 1,000 requests per minute at night to close to 500,000 requests per minute when both the east and west coasts are at school. If we can scale down to the smallest footprint at idle, we save money.
  • We use Datadog for infrastructure monitoring, and Datadog charges customers on a per-host basis. If we used small worker nodes, at peak traffic we’d be needing so many of them that our Datadog bill would become prohibitive.

We wanted effective resource utilization at idle and at peak while keeping costs under control.

We googled for tips, war stories, or even fanfiction about Haskell on Kubernetes. The two ⁽¹⁾⁽²⁾ results we found were pretty old and didn't get into any specifics of how Haskell itself behaves in a containerized environment, so it seemed like there'd be no dragons here.

With this in mind we launched our highest traffic Haskell service in prod with:

  • 2 cores
  • 200MB memory
  • 70% target CPU usage on our Horizontal Pod Autoscaler

And called it a day.
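For reference, the autoscaler part of that setup would look roughly like this (a sketch in the autoscaling/v2 schema; the names and replica bounds are placeholders):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: haskell-service              # placeholder name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: haskell-service
      minReplicas: 2                     # placeholder bounds
      maxReplicas: 50
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70     # the 70% target CPU usage above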

Fires

After we went live we saw:

  • 😨 Terrible performance: everything was slow
  • 😳 Frequent container restarts: it looked like the GC wasn’t working at all and the processes were getting OOMKilled frequently
  • 🤕 Horrendous performance at scale-up: When we got bursts of traffic, response times would shoot up and cause request queueing in our upstream service

This last one was kind of obvious. At a 70% target CPU usage, even if our app could saturate the machine's CPU to 99.99% without slowing down, and even if CPU usage scaled linearly with request rate, we'd only have 30 percentage points of CPU headroom to absorb traffic growth while waiting for a scale-up. That was not enough slack, due to two main factors:

  • AWS EKS takes close to 3 minutes to scale up Pods when worker node scaling is also necessary. 3 minutes is a lot of time when we’re ramping up 500x in a few hours. At peak season, we more than double our traffic every 3 minutes during ramp-up, when the East Coast is starting school.

  • Kubernetes has no concept of create-before-destroy. Shifting Pods around for the Cluster Autoscaler’s bin-packing and for kubectl drain works by first terminating one or more Pods, then letting the scheduler recreate them on another Node. Say we have 3 Pods alive: between one Pod terminating and its substitute going Ready, our compute capacity is reduced by 33%.

    • It might be fixable by writing our own create-before-destroy operation, forking cluster-autoscaler to use it, and also using it in our own drain script. Things like Pod eviction due to taints would be out of reach, but that might be acceptable. Regardless, we chose not to go down that path.

So we lowered our target CPU usage to 50%, and scale-ups were safe.

While fighting the frequent container restarts, we kept loosening our memory limits, going all the way up to 2GB per core. Our app had consistently used ~100MB of RAM before moving to Kubernetes, so we were surprised. It may be that we introduced space or memory leaks around the same time we moved to Kubernetes, but the Haskell garbage collector also didn’t seem to be aware that it was approaching a memory limit. So we started looking at the Haskell RTS GC settings.

While diagnosing the terrible performance, we noticed Haskell was spinning up dozens of threads for each tiny Pod, and we knew from working with the Elm compiler (also written in Haskell) that Haskell doesn’t care about virtualized environments when figuring out the capabilities of the machine it’s running on. We figured something similar was at play and we might have to tune the RTS.

Tuning the RTS

The two settings that helped us get over terrible performance and frequent container restarts were:

-M⟨size⟩

This setting tells the Haskell RTS the maximum heap size, which also informs other garbage collector parameters.

So we set the maximum heap size a bit below our Pods’ limits.memory (the GHC heap isn’t the process’s entire footprint: OS thread stacks, memory allocated by foreign code, and the executable itself also count toward the Pod’s memory, so some headroom is needed), and the GC started acting more aggressively to keep us from going over limits.memory. We managed to stop getting OOMKilled.
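If rebuilding images or changing container args is inconvenient, the same flag can also be supplied through the GHCRTS environment variable (subject to the same -rtsopts linking caveat), for instance straight from the Pod spec. A sketch:

    env:
      - name: GHCRTS
        value: "-M5.7g"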

Eventually, as sneakily as they appeared, our space or memory leaks went away, and we went down to a stable 200MB of memory usage per process.

-N⟨x⟩

The docs are a bit misleading here:

Use ⟨x⟩ simultaneous threads when running the program

Without reading further, we thought setting -N2 would get us 2 threads for our 2-core Pods, but we were still seeing more than 10 threads per process.

⟨x⟩ here is what the RTS calls capabilities, which the docs clarify further on:

A capability is animated by one or more OS threads; the runtime manages a pool of OS threads for each capability, so that if a Haskell thread makes a foreign call (see Multi-threading and the FFI) another OS thread can take over that capability.

Normally ⟨x⟩ should be chosen to match the number of CPU cores on the machine

Ok, that’s expected then, albeit a bit weird that it’s such a big pool for only two capabilities.

Regardless, performance was actually good again with -N matching our CPU count.

In the end, we landed on 3 cores per Pod and -N3. Kubernetes reserves a few hundred millicores of each worker node for its manager process (the kubelet), so with 2-core Pods we’d only be able to use 14 cores on a 16-core node; 2 cores would go to waste unless we had enough small “pebble” Pods in our cluster to fill the gaps, which we didn’t.
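As an aside, one cheap way to catch this kind of mismatch is to log what the RTS sees at startup. Here’s a minimal sketch (not part of our actual setup) using getNumCapabilities and getNumProcessors from base; note that getNumProcessors typically reports the host’s logical CPUs rather than the container’s CPU limit:

    import Control.Concurrent (getNumCapabilities)
    import GHC.Conc (getNumProcessors)

    -- Log what the RTS thinks it is working with, so a mismatch between
    -- -N and the Pod's CPU limit is visible at startup rather than in latency graphs.
    main :: IO ()
    main = do
      caps  <- getNumCapabilities   -- what -N<x> gave us
      procs <- getNumProcessors     -- logical CPUs the RTS detected on the host
      putStrLn ("RTS capabilities: " ++ show caps ++ ", host processors: " ++ show procs)
      -- ... hand off to the real service here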

Obligatory detour through CFS Throttling

At the same time we also learned about CFS throttling, and learned to keep an eye on how much we were getting throttled. For -N2 and 2 cores, it was infrequent.

In the hopes of escaping CFS throttling completely, like Zalando did, we trialed running our Nodes with --cpu-manager-policy=static, which pins Pods to specific cores and gives them exclusive access.

Our idea was to constrain high throughput Pods to their own cores, in order to spare processes from noisy neighbours and prevent worker nodes from overloading.
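For context, --cpu-manager-policy=static is a kubelet-level setting, and exclusive cores are only handed to Pods in the Guaranteed QoS class whose CPU request is a whole number. A sketch of what that typically requires (numbers are illustrative):

    # kubelet flag on the worker nodes (illustrative):
    #   --cpu-manager-policy=static
    # Pods only get exclusive cores if requests == limits and cpu is an integer:
    resources:
      requests:
        cpu: "3"
        memory: 6Gi
      limits:
        cpu: "3"
        memory: 6Gi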

We saw a steep drop in performance, so we backed away. We ended up figuring out why, but that’s the subject of another blog post. (hint: it’s the parallel GC)

Production-ready enough

  • Performance was good
  • Containers weren’t restarting anymore
  • We were churning out close to 500,000 requests per minute on 7 Pods, each with 3 capabilities and eating less than 200MB of RAM
  • Autoscaling was smooth

It wasn’t the end of our ramblings through the Haskell RTS options page: we still had daily incidents where Haskell would slow down for a few seconds, cause upstream request queueing, and trigger our fire alerts. But that’s a story for another day.


Juliano Solanho @julianobs Engineer at NoRedInk

Thank you, Brian Hicks, Ju Liu and Richard Feldman for draft reviews and feedback! ❤️

