

Extreme HTTP Performance Tuning
source link: https://news.ycombinator.com/item?id=27226382

This experiment feels similar to people who buy old cars and remove everything from the inside except the engine, which they tune up so that the car runs faster :).

More importantly it is about the idea of using tools like Flamegraphs (or other profiling tools) to identify and eliminate your bottlenecks. It is also just fun to experiment and share the results (and the CloudFormation template). Plus it establishes a high water mark for what is possible, which also makes it useful for future experiments. At some point I would like to do a modified version of this that includes DB queries.




YouTube or Google Search suggestions are good, and I think they could be replicated with that amount of data. What is insane is the speed; I can't think how they do it. I am doing something similar for the company I work for, and it takes seconds (and the amount of data isn't that much), so I can't wrap my head around it.
The point is that achieving speed alone is not _that_ complicated, and implementing the algorithms alone is not _that_ complicated. What is really hard is doing both.

Sufficient caching and a lot of parallelism make this possible. That costs money, though. Caching means storing data twice. Parallelism means more servers (since you'll probably be aiming to saturate the network bandwidth of each host).
Pre-aggregating data is another part of the strategy, as that avoids using CPU cycles in the fast-path, but it means storing even more copies of the data!
My personal anecdotal experience with this is with SQL on object storage. Query engines that use object storage can still perform well with the above techniques, even though querying large amounts of data from object storage is slow. You can bypass the slowness of object storage if you pre-cache recent data somewhere else that's closer/faster. You can have materialized views/tables for rollups of data over longer periods of time, which reduces the data needed to be fetched and cached. It also requires less CPU due to working with a smaller amount of pre-calculated data.
Apply this to every layer, every system, etc., and you can get good performance even with tons of data. It's why doing machine learning in real time is way harder than pre-computing models. Streaming platforms make this all much easier, as you can constantly be pre-computing as much as you can, pre-filling caches, etc.
Of course, having engineers work on 1% performance improvements in the OS kernel, or memory allocators, etc will add up and help a lot too.


Latency is hard to measure, so nobody does.
Throughput is easy to measure, so everybody does.
Latency is hard to buy, so few people try.
Throughput is easy to buy, so everybody does.
Latency is what matters to every user.
Throughput matters only to a few people.
Turn on SR-IOV. Disable ACPI C-states. Stop tunnelling internal traffic through virtual firewalls. Use binary protocols instead of JSON over HTTPS.
I've seen just those alone improve end-user experience tenfold.



BTW, those disconnected kernel stacks can probably be reconnected with the user stacks by switching out the libc for one with frame pointers; e.g., the new libc6-prof package.

I actually did the same thing and hacked up the perl code to generate my custom palette.map
Thanks for the tip re: the disconnected kernel stacks. They actually kinda started to grow on me for this experiment, especially since most of the work was on the kernel side.


When you talk about playing whack-a-mole with the optimizations, this is what you are missing:
> What's the best the hardware can do?
You don't say in the article. The article only says that you start at 250k req/s and end at 1.2M req/s.
Is that good? Is your optimization work done? Can you open a beer and celebrate?
The article doesn't say.
If the best the hardware can technically do is 1.3M req/s, then you probably can call it a day.
But if the best the hardware can do is technically 100M req/s, then you just went from very very bad (0.25% of hardware peak) to just very bad (1.2% of hardware peak).
Knowing how many requests per second the hardware should be able to do is the only way to put things in perspective here.

I originally started running these tests using the c5.xlarge (not c5n.xlarge) instance type, which is capable of a maximum 1M packets per second. That is an artificial limit set by AWS at the network hardware level. Now mind you, it is not an arbitrary limit, I am sure they used several factors to decide what limits make the most sense based on the instance size, customer use cases, and overall network health. If I had to hazard a guess I would say that 99% of AWS customers don't even begin to approach that limit, and those that do are probably doing high speed routing and/or using UDP.
Virtually no-one would have been hitting 1M req/s with 4 vCPUs doing synchronous HTTP request/response over TCP. Those that did would have been using a kernel bypass solution like DPDK. So this blog post is actually about trying to find "the limit", which is in quotes because it is qualified with multiple conditions: (1) TCP (2) request/response (3) Standard kernel TCP/IP stack.
While working on the post, I actively tried to find a network performance testing tool that would let me determine the upper limit for this TCP request/response use case. I looked at netperf, sockperf and uperf (iPerf doesn't do req/resp). For the TCP request/response case they were *all slower* than wrk+libreactor. So it was up to me to find the limit.
When I realized that I might hit the 1M req/s limit I switched to the c5n.xlarge whose hardware limit is 1.8M pps. Again, this is just a limit set by AWS.
Future tests using a Graviton2 instance + io_uring + recompiling the kernel using profile-guided optimizations might allow us to push past the 1.8M pps limit. Future instances from AWS may just raise the pps limit again...
Either way, it should be fun to find out.

Maybe you wanna write a dedicated OS for it? Interesting project but I can’t blame them for not doing it.

The network card also has hardware limits on the bandwidth it can handle, and its own latency. It is usually connected to the CPU via PCI-e, which also has its own latency and bandwidth, etc.
All of this goes to the CPU, which has its own latencies and bandwidths across the different caches and DRAM, etc.
So you should be able to model the theoretical maximum number of requests that the network can handle, then the network interface, the PCI-e bus, etc., up to DRAM.
The amount that they can handle differs, so the bottleneck is going to be the slowest part of the chain.
For example, as an extremely simplified example, say you have a 100 GB/s network, connected to a network adapter that can handle 200 GB/s, connected with PCI-e 3 to the CPU at 12 GB/s, which is connected to DRAM at 200 GB/s.
If each request has to receive or send 1 GB, then you can at most handle 12 req/s because that's all what your PCI-e bus can support.
If you are then delivering 1 req/s, then either your "model" is wrong, or your app is poorly implemented.
If you are then delivering 11 req/s, then either your "model" is wrong, or your app is well implemented.
But if you are far away from your model, e.g., at 1 req/s, you can still validate your model, e.g., by using two PCI-e buses, which you would then expect to be 2x as fast. Maybe your data about your PCI-e bandwidth is incorrect, or you are not understanding something about how the packets get transferred, but the model guides you through the hardware bottlenecks.
The blog post lacks a "model", and focuses on "what the software does" without ever putting it into the context of "what the hardware can do".
That is enough to let you compare whether software A is faster than software B, but if you are the fastest, it doesn't tell you how far you can go.
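To make that kind of model concrete, here is a minimal C sketch using the made-up numbers from the example above (100 GB/s network, 200 GB/s NIC, 12 GB/s PCI-e, 200 GB/s DRAM, 1 GB per request); the slowest link sets the ceiling:

```c
#include <stdio.h>

/* Toy bottleneck model: the sustainable request rate is bounded by the
 * slowest link in the chain. All figures are the made-up numbers from
 * the example above, not real hardware specs. */
int main(void) {
    const double link_gb_per_s[] = {
        100.0, /* network link       */
        200.0, /* NIC                */
         12.0, /* PCI-e 3 to the CPU */
        200.0  /* DRAM               */
    };
    const double gb_per_request = 1.0; /* 1 GB moved per request */

    double bottleneck = link_gb_per_s[0];
    for (int i = 1; i < 4; i++)
        if (link_gb_per_s[i] < bottleneck)
            bottleneck = link_gb_per_s[i];

    /* 12 GB/s divided by 1 GB per request = 12 req/s ceiling */
    printf("theoretical ceiling: %.0f req/s\n", bottleneck / gb_per_request);
    return 0;
}
```

A real model would use measured figures and add per-packet costs (interrupts, syscalls, copies), but even this toy version tells you whether 1 req/s or 11 req/s is the number to be ashamed of.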

I aim for 9 Gbps per NIC, but I still see people settling for 3 Gbps total as if that's "normal".

y'know - it might be enough




But hey, doing science[0] is hard, better not be scientific instead /s
[0] science as in the scientific method: model -> hypothesis -> test, improve model -> iterate. In contrast to the "shotgun" approach, or as the blog author called it, the "whack-a-mole" method: try many things, be grateful if one sticks, no ragrets. /s

OP has defined the problem as speeding up an HTTP server (libreactor based) on Linux. So that's a context we assume as a base, questions like "what can the hardware do without libreactor and without Linux" are not posed here.

If you don't know, find out, because maybe X is already as fast as it can be, and there is nothing to speed up.
Sure, the OP just looks around and sees that others are faster, and they want to be as fast as they are.
That's one way to go. But if all others are only 1% as fast as _they should be_, then...
- either you have fundamentally misunderstood the problem and the answer to "how fast can X be?" (maybe it's not as fast as you thought, for reasons worth learning)
- or what everyone else is doing is not the right way to make X as fast as X can be
The value in having a model of your problem is not the model, but rather what you can learn from it.
You can optimize "what an application does", but if what it does is the wrong thing to do, that's not going to get you close to what the performance of that application should be.


Googling stuff like "Amazon AWS hardware TCP TOE" doesn't reveal anything. So we can't assume that either.

I'm not sure about AWS, but in Azure it is called "Accelerated Networking" and it is available in most recent VM sizes that have 4 CPUs or more.
It enables direct hardware connectivity and all offload options. In my testing it reduces latency dramatically, with typical applications seeing 5x faster small transactions. Similarly, you can get "wire speed" for single TCP streams without any special coding.

Every couple of months these last several years there always seems to be some bug where the fix only costs us 3% performance. Since those tiny performance hits add up over time, security is sort of like inflation in the compute economy. What I want to know is: how high can we make that 28% go? The author could likely build a custom kernel that turns off stuff like PIE, ASLR, retpoline, etc., which would likely yield another 10%. Can anyone think of anything else?



The interesting data will probably be whatever secrets the app handles, say database credentials, so the attacker is off to the races. They probably don't care about having root in particular.

On the same host there could be SSL certificates, credentials in a local MTA, credentials used to run backups and so on.
Or the application itself could be made of multiple components where the vulnerable one is sandboxed.

My take on this question is rather that there shouldn't be any dogma around this, such as considering disabling mitigations absolutely, 100% harmful and something that should never, ever be done.
In the context of the OP, where the application is running on AWS: backups, email, etc. are all likely to be handled either externally (say, EBS snapshots), in which case there's no issue, or by "trusting the machine", i.e. getting credentials via the instance role, which every process on the VM can do, so there is no need for privilege escalation.
So I guess if you trust EC2 or Task roles or similar (not familiar with EKS) to access sensitive data and only run a "single" application, there's likely little to no reason to use the mitigations.
But, yeah, if you're running an application with multiple components, each in their own processes and don't use instance roles for sensitive access, maybe leave them on. Also, maybe, this means you're not running a single app per vm?
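To illustrate the "every process on the VM can do it" point: assuming the (legacy) IMDSv1 metadata endpoint is enabled, any unprivileged process can fetch the instance role's temporary credentials with a plain HTTP GET, no privilege escalation needed. A rough sketch using libcurl, where "my-role" is a placeholder role name (IMDSv2 would additionally require a session token):

```c
#include <stdio.h>
#include <curl/curl.h>

/* Sketch: fetch the temporary credentials for the instance's IAM role via
 * the EC2 instance metadata service. Works for any local process; no root
 * required. "my-role" is a placeholder, not a real role name. */
int main(void) {
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL,
        "http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role");
    /* With no write callback set, libcurl prints the JSON credential blob
     * (AccessKeyId, SecretAccessKey, Token, ...) to stdout. */
    CURLcode rc = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```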

> there shouldn't be any dogma around this
Like everything in security, it's about tradeoffs.
> Also, maybe, this means you're not running a single app per vm?
This is an argument for unikernels.
Instead, on 99.9% of your services you want to run multiple independent processes, especially in a datacenter environment: your service, web server, sshd, logging forwarder, monitoring daemon, dhcp client, NTP client, backup service.
Often some additional "bigcorp" services like HIDS, credential provider, asset management, power management, deployment tools.

Yes, but I was using my initial post's parent's terminology. But I agree, in my mind, the subject was one single "service", as in process (or a process hierarchy, like say with gunicorn for python deployments).
> This is an argument for unikernels.
It is. And I'm also very interested in the developments around Firecracker and similar technologies. If we could get the kind of isolation AWS promises between EC2 instances on a single physical machine, while at the same time being able to launch a process in an isolated container as easily as with docker right now, I'd consider that really great. And all the other "infrastructure" services you talk about could just live their lives in their dedicated containers.
Not sure how all this would compare, performance-wise, with just enabling the mitigations.


Can disabling these mitigations bring any risks assuming the server is sending static content to the Internet over port 80/443 and it is practically stateless with read-only file system?


The last time I looked I found a lot of waffle but no simple way I can just turn that stuff off...

Many security changes also help you find memory corruption bugs, which is good for developer productivity.
For example, having dhclient (a very popular DHCP client) leave open an AF_PACKET socket, causing a 3% slowdown in incoming packet processing for all network packets, seems... suboptimal!
Surely it can be patched to not cause a systemwide 3% slowdown (or at least to only do it very briefly while actively refreshing the DHCP lease)?
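A rough sketch of what that "open it only while renewing" idea could look like (this is not dhclient's actual code, just an illustration): while no AF_PACKET socket exists on the host, the kernel's per-packet tap path has nothing to do, so the systemwide cost disappears.

```c
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Sketch of the suggested fix (not dhclient's real implementation):
 * open the packet socket only for the duration of a lease renewal
 * instead of holding it open for the daemon's whole lifetime.
 * Requires CAP_NET_RAW. */
static int renew_lease(void) {
    int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
    if (fd < 0)
        return -1;

    /* ... attach a BPF filter for UDP port 68, bind to the interface,
     *     and perform the DHCPREQUEST/DHCPACK exchange here ... */

    close(fd);  /* once closed, the per-packet tap overhead is gone again */
    return 0;
}

int main(void) { return renew_lease(); }
```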

Some of these things really only show up when you push things to their extremes, so it probably just wasn't on the developer's radar before.

This has piqued my interest.

I wonder if the slow path here could be avoided by using separate network namespaces, in a way that these sockets don't even get to see the packets...


Thanks again to my reviewers!
As for this article: there are so many knobs that you tweaked to get this to run faster that it's incredibly informative. Thank you for sharing.

That's a useful piece of info to know when performance tuning a real world app with auth / data / etc.

It is modeled off of the code used to generate the Vue blog[5], but I made a ton of little modifications, including some changes directly to vitepress.
Keep in mind that vitepress is very much an early work in progress and the blog functionality is just kinda tacked on; the default use case is documentation. It also definitely has bugs and is under heavy development, so I wouldn't recommend it quite yet unless you are actually interested in getting your hands dirty with Vue 3. I am glad I used it because it gave me an excuse to start learning Vue, but unless you are just using the default theme to create a documentation site, it will require some work.
1. https://vitepress.vuejs.org/
2. https://pages.cloudflare.com/
3. https://workers.cloudflare.com/


> Even after taking all the steps above, I still regularly saw a 5-10% variance in performance across two seemingly identical EC2 server instances
> To work around this variance, I tried to use the same instance consistently across all benchmark runs. If I had to redo a test, I painstakingly stopped/started my server instance until I got an instance that matched the established performance of previous runs.
We notice similar performance variance when running benchmarks on GCP and Azure. In the worst case, there can be a 20% variance on GCP. On Azure, the variance between identical instances is not as bad, perhaps about 10%, but there is an extra 5% variance between normal hours and off-peak hours, which further complicates things.
It can be very frustrating to stop/start hundreds of times for hours to get back an instance with the same performance characteristic. For now, I use a simple bash for-loop that checks the "CPU MHz" value from lscpu output, and that seems to be reliable enough.
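For reference, a rough C equivalent of that check (the real thing is just a bash loop over lscpu): it prints the first "cpu MHz" value from /proc/cpuinfo so a freshly started instance can be compared against a known-fast baseline.

```c
#include <stdio.h>

/* Rough equivalent of the "CPU MHz" check described above: report the
 * first "cpu MHz" entry from /proc/cpuinfo so freshly started instances
 * can be compared against a known-fast baseline before running benchmarks. */
int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }

    char line[256];
    double mhz;
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like: "cpu MHz\t\t: 3000.000" */
        if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) {
            printf("cpu MHz: %.3f\n", mhz);
            break;
        }
    }
    fclose(f);
    return 0;
}
```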


Still, there is no guarantee that after stopping the instance on Friday evening, you would get back the same physical host on Monday morning. So, while using dedicated hardware does avoid the noisy neighbor problem, the "silicon lottery" problem remains. And so far, the data that I gathered indicates that the latter is the more likely cause, i.e. a "fast" virtual machine would remain fast indefinitely, while a "slow" virtual machine would remain slow indefinitely, despite both relying on a bunch of shared resources.

I would expect that just the cache usage characteristics of "neighbouring" workloads alone would account for at least a 10% variance! Not to mention system bus usage, page table entry churn, etc, etc...
If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts. Even then, just the temperature of the room would have an effect if you leave Turbo Boost enabled! Not to mention the "silicon lottery" that all overclockers are familiar with...
This feels like those engineering classes where we had to calculate stresses in every truss of a bridge to seven figures, and then multiply by ten for safety.

The more problematic scenario, as mentioned in the article, is when you need to do some sort of performance tuning that can take weeks/months to complete. On the cloud, you either have to keep the virtual machine running all the time (and hope that a live migration doesn't happen behind the scenes to move it to a different physical host), or do the painful stop/start until you get back the "right" virtual machine before proceeding to do the actual work.
We discovered this variance a couple of months ago, and this article from talawah.io is actually the first time I have seen anyone else mention it. It still remains a mystery, because we too can't figure out what contributes to the variance using tools like stress-ng, but the variance is real when looking at the MySQL commits/s metric.
> If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts.
After this ordeal, I am arriving at that conclusion as well. Just the perfect excuse to build a couple of ryzen boxes.

There are traffic lights on the way! Other cars! Weather! Etc...
I've heard that Google's internal servers (not GCP!) use special features of the Intel Xeon processors to logically partition the CPU caches. This enables non-prod workloads to coexist with prod workloads with a minimal risk of cache thrashing of the prod workload. IBM mainframes go further, splitting at the hardware level, with dedicated expansion slots and the like.
You can't reasonably expect 4-core virtual machines to behave identically to within 5% on a shared platform! That tiny little VM is probably shoulder-to-shoulder with 6 or 7 other tenants on a 28 or 32 core processor. The host itself is likely dual-socket, and some other VMs sizes may be present, so up to 60 other VMs running on the same host. All sharing memory, network, disk, etc...
The original article was also a network test. Shared fabrics aren't going to return 100% consistent results either. For that, you'd need a crossover cable.

In the HN discussion about cockroachdb cloud report 2021 (https://news.ycombinator.com/item?id=25811532), there was only 1 comment thread that talks about "cloud weather".
In https://engineering.mongodb.com/post/reducing-variability-in..., high profile engineers still claimed that it is perfectly fine to use cloud for performance testing, and "EC2 instances are neither good nor bad".
Of course, both the cockroachdb and mongodb cases could be related, as any performance variance at the instance level could be masked when the instances form a cluster, and the workload can be served by any node within the cluster.

Any such benchmark I do is averaged over a few instances in several availability zones. I also benchmark specifically in the local region that I will be deploying production to. They're not all the same!
Where the cloud is useful for benchmarking is that it's possible to spin up a wide range of "scenarios" at low cost. Want to run a series of tests ranging from 1 to 100 cores in a single box? You can! That's very useful for many kinds of multi-threaded development.
But! Be careful applying tunables from the article "as-is"[1]: some of them would destroy TCP performance:
net.ipv4.tcp_sack=0
net.ipv4.tcp_dsack=0
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_moderate_rcvbuf=0
net.ipv4.tcp_congestion_control=reno
net.core.default_qdisc=noqueue
Not to mention that `gro off` will bump CPU usage by ~10-20% on most real-world workloads, the Security Team would be really against turning off mitigations, and usage of `-march=native` will cause a lot of core dumps in heterogeneous production environments.
[1] This is usually the case with single-purpose micro-benchmarks: most of the tunables have side effects that may not be captured by a single workflow. Always verify how the "tunings" you found on the internet behave in your environment.

And once I switch to HTTPS, I see a dramatic drop in throughput, like 10x.
HTTP at 15k req/sec drops down to 400 req/sec once I start serving it over HTTPS.
I see no solution to it, as everything has to be HTTPS now.

It might need different tuning or you might be negotiating a slow cipher.


I'm curious whether disabling the slow kernel network features competes with a TCP bypass stack. I did my own wrk benchmark [0], but I did not try to optimize the kernel stack beyond pinning CPUs and busy-polling, because the bypass was about 6 times as fast. I assumed that there was no way the kernel stack could compete with that. This article shows that I may be wrong. I will definitely check out SO_ATTACH_REUSEPORT_CBPF in the future.
[0] https://github.com/raitechnology/raids/#using-wrk-httpd-load...
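For anyone else curious: the SO_ATTACH_REUSEPORT_CBPF trick boils down to a two-instruction classic BPF program that returns the current CPU number as the index of the socket to pick from the SO_REUSEPORT group. A minimal sketch, assuming listener N in the group belongs to the worker pinned to CPU N (requires Linux 4.5+):

```c
#include <linux/filter.h>
#include <sys/socket.h>

/* Steer each incoming connection to the reuseport listener whose index
 * matches the CPU that handled the packet. Assumes the listening sockets
 * in the SO_REUSEPORT group were created in CPU order, one per worker. */
static int attach_cpu_steering(int listen_fd) {
    struct sock_filter code[] = {
        /* A = number of the CPU that processed the packet */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_CPU),
        /* return A: used as the index into the reuseport group */
        BPF_STMT(BPF_RET | BPF_A, 0),
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                      &prog, sizeof(prog));
}
```

That way a connection is handled by the same CPU that received it, which is the "perfect locality" the article is after.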

Even if it isn't quite as fast as DPDK and co, it might be close enough for some people to opt to stick with the tried and true kernel stack instead of the more exotic alternatives.


Wonder where the next optimization path leads? Using huge memory pages. io_uring, which was briefly mentioned. Or kernel bypass, which is supported on c5n instances as of late...
Would also be interesting to discuss the impacts of turning off the xmit queue discipline. fq is designed to reduce frame drops at the switch level. Transmitting as fast as possible can cause frame drops which will totally erase all your other tuning work.

> I always disable C-states deeper than C1E
AWS doesn't let you mess with c-states for instances smaller than a c5.9xlarge[1]. I did actually test it out on a 9xlarge just for kicks, but it didn't make a difference. Once this test starts, all CPUs are 99+% Busy for the duration of the test. I think it would factor in more if there were lots of CPUs, and some were idle during the test.
> Try receive flow steering for a possible boost
I think the stuff I do in the "perfect locality" section[2] (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive flow steering would be trying to do, but more efficiently.
> Would also be interesting to discuss the impacts of turning off the xmit queue discipline
Yea, noqueue would definitely be a no-go on a constrained network, but when running the (t)wrk benchmark in the cluster placement group I didn't see any evidence of packet drops or retransmits. Drops only happened with the iperf test.
1. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
2. https://talawah.io/blog/extreme-http-performance-tuning-one-...


https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...


EDIT: also, RSS happens on the NIC. RFS happens in the kernel, so it might not be as effective. For a uniform request workload like the one in the article, statically binding flows to a NIC queue should be sufficient. :)
This shows you can make a regular Linux program, using the Linux network stack, approach something hand-coded with DPDK.


edit: Oh, perhaps wrk2 still relies on the timer even when not specifying a fixed rate RPS.


I am no expert where coordinated omission is concerned, but my understanding is that it is most problematic in scenarios where your p90+ latency is high. Looking at the results for the 1.2M req/s test you have the following latencies:
p50 203.00us
p90 236.00us
p99 265.00us
p99.99 317.00us
pMAX 626.00us
If you were to apply wrk's coordinated omission hack[1] to these results, the backfilling only starts for requests that took longer than p50 x 2 (roughly) = 406us, which is probably somewhere between p99.999 and pMAX; a very, very small percentage. I am not claiming that wrk's hack is "correct", just that I don't think coordinated omission is a major concern for *this specific workload/environment*.
1. https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513...

Reminds me a lot of this classic CS paper: Improving IPC by Kernel Design, by Jochen Liedtke (1993)
https://www.cse.unsw.edu.au/~cs9242/19/papers/Liedtke_93.pdf



- I have a nodejs server for the APIs and it's running on an m5.xlarge instance. I haven't done much research on what instance type I should go for. I looked it up, and it seems like the c5n.xlarge (mentioned in the article) is meant to be compute optimized. The cost difference between m5.xlarge and c5n.xlarge isn't much. So, I'm assuming that switching to a c5 instance would be better, right?
- Is having nginx handle the requests a better option here, with a reverse proxy set up in front of NodeJS? I'm thinking of taking small steps toward scaling an existing framework.

The c5 instance type is about 10-15% faster than the m5, but the m5 has twice as much memory. So if memory is not a concern then switching to c5 is both a little cheaper and a little faster.
You shouldn't need the c5n, the regular c5 should be fine for most use cases, and it is cheaper.
Nginx in front of nodejs sounds like a solid starting point, but I can't claim to have a ton of experience with that combo.

As in all things, check the results on your own workload!

I'd recommend just using a standard AWS application load balancer in front of your Node.js app. Terminate SSL at the ALB as well using certificate manager (free). Will run you around $18 a month more.

Regarding core pinning, the usual advice is to pin to the CPU socket physically closest to the NIC. Is there any point doing this on cloud instances? Your actual cores could be anywhere. So just isolate one and hope for the best?
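The pinning itself is the easy part, wherever the core physically lives; for example, a worker process can pin itself to one vCPU with sched_setaffinity (a minimal sketch; CPU 1 is an arbitrary choice):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single vCPU (here: CPU 1, arbitrarily).
 * On a cloud instance this only controls which *virtual* CPU we run on;
 * where that lands physically is up to the hypervisor. */
int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);

    if (sched_setaffinity(0 /* self */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to vCPU 1\n");
    return 0;
}
```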






This case, where it's all connection handling and serving a small static piece of data is a clear example; there's almost no userland work to be done before it goes to another syscall so any additional cost for the user/kernel barrier is going to hurt.
Then the question becomes who can run code on your server; also considering that maybe there's a remote code execution vulnerability in your code, or in library code you use. Is there a meaningful barrier that spectre/meltdown mitigations would help enforce? Or would getting RCE give control over everything of substance anyway?


Reminds me how complicated it was to generate 40 Gbit/sec of HTTP traffic (with default MTU) to test F5 BIG-IP appliances; luckily TCL iRules had `HTTP::retry`.

This test is more about packets/s than bytes/s.
I see that this is clearly not the case here, but in general how can one be sure?


How long did you spend researching this subject to produce such an in depth report?

As a ballpark I would say I invested hundreds of hours in this experiment. Lots of sidetracks and dead ends along the way, but also an amazing learning experience.

There is also a quota system in place, so even though that is the hard limit, you can only operate at those speeds for a short time before you start getting rate-limited.

Presented this way, it may help noobs like me with capacity planning.
Fantastic work! Keep it up.
This can't be serious. Can someone flag this article? Highly inappropriate.