

Extreme HTTP Performance Tuning
source link: https://news.ycombinator.com/item?id=27226382

This experiment feels similar to people who buy old cars and remove everything from the inside except the engine, which they tune up so that the car runs faster :).

More importantly it is about the idea of using tools like Flamegraphs (or other profiling tools) to identify and eliminate your bottlenecks. It is also just fun to experiment and share the results (and the CloudFormation template). Plus it establishes a high water mark for what is possible, which also makes it useful for future experiments. At some point I would like to do a modified version of this that includes DB queries.




YouTube or Google Search suggestions are good, and I think they could be replicated with that amount of data. What is insane is the speed; I can't think how they do it. I am doing something similar for the company I work for, and it takes seconds (and the amount of data isn't that much), so I can't wrap my head around it.
The point is that achieving speed alone is not _that_ complicated, and implementing the algorithms alone is not _that_ complicated. What is really hard is doing both.

Sufficient caching and a lot of parallelism make this possible. That costs money, though. Caching means storing data twice. Parallelism means more servers (since you'll probably be aiming to saturate the network bandwidth of each host).
Pre-aggregating data is another part of the strategy, as that avoids using CPU cycles in the fast-path, but it means storing even more copies of the data!
My personal anecdotal experience with this is with SQL on object storage. Query engines that use object storage can still perform well with the above techniques, even though querying large amounts of data from object storage is slow. You can bypass the slowness of object storage if you pre-cache recent data somewhere else that's closer/faster. You can have materialized views/tables for rollups of data over longer periods of time, which reduces the data needed to be fetched and cached. It also requires less CPU due to working with a smaller amount of pre-calculated data.
Apply this to every layer, every system, etc., and you can get good performance even with tons of data. It's why doing machine learning in real time is way harder than pre-computing models. Streaming platforms make this all much easier, as you can constantly be pre-computing as much as you can, pre-filling caches, etc.
Of course, having engineers work on 1% performance improvements in the OS kernel, or memory allocators, etc will add up and help a lot too.


Latency is hard to measure, so nobody does.
Throughput is easy to measure, so everybody does.
Latency is hard to buy, so few people try.
Throughput is easy to buy, so everybody does.
Latency is what matters to every user.
Throughput matters only to a few people.
Turn on SR-IOV. Disable ACPI C-states. Stop tunnelling internal traffic through virtual firewalls. Use binary protocols instead of JSON over HTTPS.
I've seen just those alone improve end-user experience tenfold.



BTW, those disconnected kernel stacks can probably be reconnected with the user stacks by switching out the libc for one with frame pointers; e.g., the new libc6-prof package.

I actually did the same thing and hacked up the perl code to generate my custom palette.map
Thanks for the tip re: the disconnected kernel stacks. They actually kinda started to grow on me for this experiment, especially since most of the work was on the kernel side.


When you talk about playing whack-a-mole with the optimizations, this is what you are missing:
> What's the best the hardware can do?
You don't say in the article. The article only says that you start at 250k req/s and end at 1.2M req/s.
Is that good? Is your optimization work done? Can you open a beer and celebrate?
The article doesn't say.
If the best the hardware can technically do is 1.3M req/s, then you probably can call it a day.
But if the best the hardware can do is technically 100M req/s, then you just went from very very bad (0.25% of hardware peak) to just very bad (1.2% of hardware peak).
Knowing how many requests per second the hardware should be able to do is the only way to put things in perspective here.

I originally started running these tests using the c5.xlarge (not c5n.xlarge) instance type, which is capable of a maximum 1M packets per second. That is an artificial limit set by AWS at the network hardware level. Now mind you, it is not an arbitrary limit, I am sure they used several factors to decide what limits make the most sense based on the instance size, customer use cases, and overall network health. If I had to hazard a guess I would say that 99% of AWS customers don't even begin to approach that limit, and those that do are probably doing high speed routing and/or using UDP.
Virtually no-one would have been hitting 1M req/s with 4 vCPUs doing synchronous HTTP request/response over TCP. Those that did would have been using a kernel bypass solution like DPDK. So this blog post is actually about trying to find "the limit", which is in quotes because it is qualified with multiple conditions: (1) TCP (2) request/response (3) Standard kernel TCP/IP stack.
While working on the post, I actively tried to find a network performance testing tool that would let me determine the upper limit for this TCP request/response use case. I looked at netperf, sockperf and uperf (iPerf doesn't do req/resp). For the TCP request/response case they were *all slower* than wrk+libreactor. So it was up to me to find the limit.
When I realized that I might hit the 1M req/s limit I switched to the c5n.xlarge whose hardware limit is 1.8M pps. Again, this is just a limit set by AWS.
Future tests using a Graviton2 instance + io_uring + recompiling the kernel using profile-guided optimizations might allow us to push past the 1.8M pps limit. Future instances from AWS may just raise the pps limit again...
Either way, it should be fun to find out.

Maybe you wanna write a dedicated OS for it? Interesting project but I can’t blame them for not doing it.

The network card also has hardware limits on the bandwidth it can handle, and its own latency. It is usually connected to the CPU via PCI-e, which also has its own latency and bandwidth, etc.
All of this goes to the CPU, which has its own latencies and bandwidths across the different caches and DRAM, etc.
So you should be able to model the theoretical maximum number of requests that the network can handle, then the network interface, the PCI-e bus, etc., up to DRAM.
The amount that they can handle differs, so the bottleneck is going to be the slowest part of the chain.
For example, as an extremely simplified example, say you have a 100 GB/s network, connected to a network adapter that can handle 200 GB/s, connected with PCI-e 3 to the CPU at 12 GB/s, which is connected to DRAM at 200 GB/s.
If each request has to receive or send 1 GB, then you can at most handle 12 req/s because that's all what your PCI-e bus can support.
If you are then delivering 1 req/s, then either your "model" is wrong, or your app is poorly implemented.
If you are then delivering 11 req/s, then either your "model" is wrong, or your app is well implemented.
But if you are far away from your model, e.g., at 1 req/s, you can still validate your model, e.g., by using two PCI-e buses, which you would then expect to be 2x as fast. Maybe your data about your PCI-e bandwidth is incorrect, or you are not understanding something about how the packets get transferred, but the model guides you through the hardware bottlenecks.
The blog post lacks a "model", and focuses on "what the software does" without ever putting it into the context of "what the hardware can do".
That is enough to let you compare whether software A is faster than software B, but if you are the fastest, it doesn't tell you how far you can go.
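To make that kind of model concrete, here is a minimal C sketch using the made-up numbers from the example above (100 GB/s network, 200 GB/s NIC, 12 GB/s PCI-e, 200 GB/s DRAM, 1 GB per request); the slowest link sets the ceiling:

```c
#include <stdio.h>

/* Toy bottleneck model: the sustainable request rate is bounded by the
 * slowest link in the chain. All figures are the made-up numbers from
 * the example above, not real hardware specs. */
int main(void) {
    const double link_gb_per_s[] = {
        100.0, /* network link       */
        200.0, /* NIC                */
         12.0, /* PCI-e 3 to the CPU */
        200.0  /* DRAM               */
    };
    const double gb_per_request = 1.0; /* 1 GB moved per request */

    double bottleneck = link_gb_per_s[0];
    for (int i = 1; i < 4; i++)
        if (link_gb_per_s[i] < bottleneck)
            bottleneck = link_gb_per_s[i];

    /* 12 GB/s divided by 1 GB per request = 12 req/s ceiling */
    printf("theoretical ceiling: %.0f req/s\n", bottleneck / gb_per_request);
    return 0;
}
```

A real model would use measured figures and add per-packet costs (interrupts, syscalls, copies), but even this toy version tells you whether 1 req/s or 11 req/s is the number to be ashamed of.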

I aim for 9 Gbps per NIC, but I still see people settling for 3 Gbps total as if that's "normal".

y'know - it might be enough




But hey, doing science[0] is hard, better not be scientific instead /s
[0] science as in the scientific method: model -> hypothesis -> test, improve model -> iterate. In contrast to the "shotgun" approach, or as the blog author called it, the "whack-a-mole" method: try many things, be grateful if one sticks, no ragrets. /s

OP has defined the problem as speeding up an HTTP server (libreactor based) on Linux. So that's a context we assume as a base, questions like "what can the hardware do without libreactor and without Linux" are not posed here.

If you don't know, find out, because maybe X is already as fast as it can be, and there is nothing to speed up.
Sure, the OP just looks around and sees that others are faster, and they want to be as fast as they are.
That's one way to go. But if all others are only 1% as fast as _they should be_, then...
- either you have fundamentally misunderstood the problem and the answer to "how fast can X be?" (maybe it's not as fast as you thought, for reasons worth learning)
- or what everyone else is doing is not the right way to make X as fast as X can be
The value in having a model of your problem is not the model, but rather what you can learn from it.
You can optimize "what an application does", but if what it does is the wrong thing to do, that's not going to get you close to what the performance of that application should be.


Googling stuff like "Amazon AWS hardware TCP TOE" doesn't reveal anything. So we can't assume that either.

I'm not sure about AWS, but in Azure it is called "Accelerated Networking" and it is available in most recent VM sizes that have 4 CPUs or more.
It enables direct hardware connectivity and all offload options. In my testing it reduces latency dramatically, with typical applications seeing 5x faster small transactions. Similarly, you can get "wire speed" for single TCP streams without any special coding.

Every couple of months these last several years there always seems to be some bug where the fix only costs us 3% performance. Since those tiny performance hits add up over time, security is sort of like inflation in the compute economy. What I want to know is: how high can we make that 28% go? The author could likely build a custom kernel that turns off stuff like PIE, ASLR, retpoline, etc., which would likely yield another 10%. Can anyone think of anything else?



The interesting data will probably be whatever secrets the app handles, say database credentials, so the attacker is off to the races. They probably don't care about having root in particular.

On the same host there could be SSL certificates, credentials in a local MTA, credentials used to run backups and so on.
Or the application itself could be made of multiple components where the vulnerable one is sandboxed.

My take on this question is rather that there shouldn't be any dogma around this, such as considering disabling mitigations absolutely, 100% harmful and something that should never, ever be done.
In the context of the OP, where the application is running on AWS: backups, email, etc. are all likely to be handled either externally (say, EBS snapshots), in which case there's no issue, or by "trusting the machine", i.e. getting credentials via the instance role, which every process on the VM can do, so there is no need for privilege escalation.
So I guess if you trust EC2 or Task roles or similar (not familiar with EKS) to access sensitive data and only run a "single" application, there's likely little to no reason to use the mitigations.
But, yeah, if you're running an application with multiple components, each in their own processes and don't use instance roles for sensitive access, maybe leave them on. Also, maybe, this means you're not running a single app per vm?
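To illustrate the "every process on the VM can do it" point: assuming the (legacy) IMDSv1 metadata endpoint is enabled, any unprivileged process can fetch the instance role's temporary credentials with a plain HTTP GET, no privilege escalation needed. A rough sketch using libcurl, where "my-role" is a placeholder role name (IMDSv2 would additionally require a session token):

```c
#include <stdio.h>
#include <curl/curl.h>

/* Sketch: fetch the temporary credentials for the instance's IAM role via
 * the EC2 instance metadata service. Works for any local process; no root
 * required. "my-role" is a placeholder, not a real role name. */
int main(void) {
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL,
        "http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role");
    /* With no write callback set, libcurl prints the JSON credential blob
     * (AccessKeyId, SecretAccessKey, Token, ...) to stdout. */
    CURLcode rc = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```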

> there shouldn't be any dogma around this
Like everything in security, it's about tradeoffs.
> Also, maybe, this means you're not running a single app per vm?
This is an argument for unikernels.
Instead, on 99.9% of your services you want to run multiple independent processes, especially in a datacenter environment: your service, web server, sshd, logging forwarder, monitoring daemon, dhcp client, NTP client, backup service.
Often some additional "bigcorp" services like HIDS, credential provider, asset management, power management, deployment tools.

Yes, but I was using my initial post's parent's terminology. But I agree, in my mind, the subject was one single "service", as in process (or a process hierarchy, like say with gunicorn for python deployments).
> This is an argument for unikernels.
It is. And I'm also very interested in the developments around Firecracker and similar technologies. If we could get the kind of isolation AWS promises between EC2 instances on a single physical machine, while at the same time being able to launch a process in an isolated container as easily as with docker right now, I'd consider that really great. And all the other "infrastructure" services you talk about could just live their lives in their dedicated containers.
Not sure how all this would compare, performance-wise, with just enabling the mitigations.


Can disabling these mitigations bring any risks assuming the server is sending static content to the Internet over port 80/443 and it is practically stateless with read-only file system?


The last time I looked I found a lot of waffle but no simple way I can just turn that stuff off...

Many security changes also help you find memory corruption bugs, which is good for developer productivity.
For example, having dhclient (a very popular DHCP client) leave open an AF_PACKET socket, causing a 3% slowdown in incoming packet processing for all network packets, seems... suboptimal!
Surely it can be patched to not cause a systemwide 3% slowdown (or at least to only do it very briefly while actively refreshing the DHCP lease)?
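A rough sketch of what that "open it only while renewing" idea could look like (this is not dhclient's actual code, just an illustration): while no AF_PACKET socket exists on the host, the kernel's per-packet tap path has nothing to do, so the systemwide cost disappears.

```c
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Sketch of the suggested fix (not dhclient's real implementation):
 * open the packet socket only for the duration of a lease renewal
 * instead of holding it open for the daemon's whole lifetime.
 * Requires CAP_NET_RAW. */
static int renew_lease(void) {
    int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
    if (fd < 0)
        return -1;

    /* ... attach a BPF filter for UDP port 68, bind to the interface,
     *     and perform the DHCPREQUEST/DHCPACK exchange here ... */

    close(fd);  /* once closed, the per-packet tap overhead is gone again */
    return 0;
}

int main(void) { return renew_lease(); }
```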

Some of these things really only show up when you push things to their extremes, so it probably just wasn't on the developer's radar before.

This has piqued my interest.

I wonder if the slow path here could be avoided by using separate network namespaces, in a way that these sockets don't even get to see the packets...


Thanks again to my reviewers!
As for this article: there are so many knobs that you tweaked to get this to run faster that it's incredibly informative. Thank you for sharing.

That's a useful piece of info to know when performance tuning a real world app with auth / data / etc.

It is modeled off of the code used to generate the Vue blog[5], but I made a ton of little modifications, including some changes directly to vitepress.
Keep in mind that vitepress is very much an early work in progress and the blog functionality is just kinda tacked on; the default use case is documentation. It also definitely has bugs and is under heavy development, so I wouldn't recommend it quite yet unless you are actually interested in getting your hands dirty with Vue 3. I am glad I used it because it gave me an excuse to start learning Vue, but unless you are just using the default theme to create a documentation site, it will require some work.
1. https://vitepress.vuejs.org/
2. https://pages.cloudflare.com/
3. https://workers.cloudflare.com/


> Even after taking all the steps above, I still regularly saw a 5-10% variance in performance across two seemingly identical EC2 server instances
> To work around this variance, I tried to use the same instance consistently across all benchmark runs. If I had to redo a test, I painstakingly stopped/started my server instance until I got an instance that matched the established performance of previous runs.
We notice similar performance variance when running benchmarks on GCP and Azure. In the worst case, there can be a 20% variance on GCP. On Azure, the variance between identical instances is not as bad, perhaps about 10%, but there is an extra 5% variance between normal hours and off-peak hours, which further complicates things.
It can be very frustrating to stop/start hundreds of times for hours to get back an instance with the same performance characteristic. For now, I use a simple bash for-loop that checks the "CPU MHz" value from lscpu output, and that seems to be reliable enough.
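For reference, a rough C equivalent of that check (the real thing is just a bash loop over lscpu): it prints the first "cpu MHz" value from /proc/cpuinfo so a freshly started instance can be compared against a known-fast baseline.

```c
#include <stdio.h>

/* Rough equivalent of the "CPU MHz" check described above: report the
 * first "cpu MHz" entry from /proc/cpuinfo so freshly started instances
 * can be compared against a known-fast baseline before running benchmarks. */
int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }

    char line[256];
    double mhz;
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like: "cpu MHz\t\t: 3000.000" */
        if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) {
            printf("cpu MHz: %.3f\n", mhz);
            break;
        }
    }
    fclose(f);
    return 0;
}
```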


Still, there is no guarantee that after stopping the instance on Friday evening, you would get back the same physical host on Monday morning. So, while using dedicated hardware does avoid the noisy neighbor problem, the "silicon lottery" problem remains. And so far, the data that I gathered indicates that the latter is the more likely cause, i.e. a "fast" virtual machine would remain fast indefinitely, while a "slow" virtual machine would remain slow indefinitely, despite both relying on a bunch of shared resources.

I would expect that just the cache usage characteristics of "neighbouring" workloads alone would account for at least a 10% variance! Not to mention system bus usage, page table entry churn, etc, etc...
If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts. Even then, just the temperature of the room would have an effect if you leave Turbo Boost enabled! Not to mention the "silicon lottery" that all overclockers are familiar with...
This feels like those engineering classes where we had to calculate stresses in every truss of a bridge to seven figures, and then multiply by ten for safety.

The more problematic scenario, as mentioned in the article, is when you need to do some sort of performance tuning that can take weeks/months to complete. On the cloud, you either have to keep the virtual machine running all the time (and hope that a live migration doesn't happen behind the scenes to move it to a different physical host), or do the painful stop/start until you get back the "right" virtual machine before proceeding to do the actual work.
We discovered this variance a couple of months ago, and this article from talawah.io is actually the first time I have seen anyone else mention it. It still remains a mystery, because we too can't figure out what contributes to the variance using tools like stress-ng, but the variance is real when looking at the MySQL commits/s metric.
> If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts.
After this ordeal, I am arriving at that conclusion as well. Just the perfect excuse to build a couple of ryzen boxes.

There are traffic lights on the way! Other cars! Weather! Etc...
I've heard that Google's internal servers (not GCP!) use special features of the Intel Xeon processors to logically partition the CPU caches. This enables non-prod workloads to coexist with prod workloads with a minimal risk of cache thrashing of the prod workload. IBM mainframes go further, splitting at the hardware level, with dedicated expansion slots and the like.
You can't reasonably expect 4-core virtual machines to behave identically to within 5% on a shared platform! That tiny little VM is probably shoulder-to-shoulder with 6 or 7 other tenants on a 28 or 32 core processor. The host itself is likely dual-socket, and some other VMs sizes may be present, so up to 60 other VMs running on the same host. All sharing memory, network, disk, etc...
The original article was also a network test. Shared fabrics aren't going to return 100% consistent results either. For that, you'd need a crossover cable.

In the HN discussion about cockroachdb cloud report 2021 (https://news.ycombinator.com/item?id=25811532), there was only 1 comment thread that talks about "cloud weather".
In https://engineering.mongodb.com/post/reducing-variability-in..., high profile engineers still claimed that it is perfectly fine to use cloud for performance testing, and "EC2 instances are neither good nor bad".
Of course, both the cockroachdb and mongodb cases could be related, as any performance variance at the instance level could be masked when the instances form a cluster, and the workload can be served by any node within the cluster.

Any such benchmark I do is averaged over a few instances in several availability zones. I also benchmark specifically in the local region that I will be deploying production to. They're not all the same!
Where the cloud is useful for benchmarking is that it's possible to spin up a wide range of "scenarios" at low cost. Want to run a series of tests ranging from 1 to 100 cores in a single box? You can! That's very useful for many kinds of multi-threaded development.
But! Be careful applying tunables from the article "as-is"[1]: some of them would destroy TCP performance:
net.ipv4.tcp_sack=0
net.ipv4.tcp_dsack=0
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_moderate_rcvbuf=0
net.ipv4.tcp_congestion_control=reno
net.core.default_qdisc=noqueue
Not to mention that `gro off` will bump CPU usage by ~10-20% on most real-world workloads, the Security Team would be really against turning off mitigations, and usage of `-march=native` will cause a lot of core dumps in heterogeneous production environments.
[1] This is usually the case with single-purpose micro-benchmarks: most of the tunables have side effects that may not be captured by a single workflow. Always verify how the "tunings" you found on the internet behave in your environment.

And once I switch to HTTPS, I see a dramatic drop in throughput, like 10x.
HTTP at 15k req/sec drops down to 400 req/sec once I start serving it over HTTPS.
I see no solution to it, as everything has to be HTTPS now.

It might need different tuning or you might be negotiating a slow cipher.


I'm curious whether disabling the slow kernel network features competes with a TCP bypass stack. I did my own wrk benchmark [0], but I did not try to optimize the kernel stack beyond pinning CPUs and busy-polling, because the bypass was about 6 times as fast. I assumed that there was no way the kernel stack could compete with that. This article shows that I may be wrong. I will definitely check out SO_ATTACH_REUSEPORT_CBPF in the future.
[0] https://github.com/raitechnology/raids/#using-wrk-httpd-load...
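For anyone else curious: the SO_ATTACH_REUSEPORT_CBPF trick boils down to a two-instruction classic BPF program that returns the current CPU number as the index of the socket to pick from the SO_REUSEPORT group. A minimal sketch, assuming listener N in the group belongs to the worker pinned to CPU N (requires Linux 4.5+):

```c
#include <linux/filter.h>
#include <sys/socket.h>

/* Steer each incoming connection to the reuseport listener whose index
 * matches the CPU that handled the packet. Assumes the listening sockets
 * in the SO_REUSEPORT group were created in CPU order, one per worker. */
static int attach_cpu_steering(int listen_fd) {
    struct sock_filter code[] = {
        /* A = number of the CPU that processed the packet */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_CPU),
        /* return A: used as the index into the reuseport group */
        BPF_STMT(BPF_RET | BPF_A, 0),
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                      &prog, sizeof(prog));
}
```

That way a connection is handled by the same CPU that received it, which is the "perfect locality" the article is after.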

Even if it isn't quite as fast as DPDK and co, it might be close enough for some people to opt to stick with the tried and true kernel stack instead of the more exotic alternatives.


Wonder where the next optimization path leads? Using huge memory pages. io_uring, which was briefly mentioned. Or kernel bypass, which is supported on c5n instances as of late...
Would also be interesting to discuss the impacts of turning off the xmit queue discipline. fq is designed to reduce frame drops at the switch level. Transmitting as fast as possible can cause frame drops which will totally erase all your other tuning work.

> I always disable C-states deeper than C1E
AWS doesn't let you mess with c-states for instances smaller than a c5.9xlarge[1]. I did actually test it out on a 9xlarge just for kicks, but it didn't make a difference. Once this test starts, all CPUs are 99+% Busy for the duration of the test. I think it would factor in more if there were lots of CPUs, and some were idle during the test.
> Try receive flow steering for a possible boost
I think the stuff I do in the "perfect locality" section[2] (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive flow steering would be trying to do, but more efficiently.
> Would also be interesting to discuss the impacts of turning off the xmit queue discipline
Yea, noqueue would definitely be a no-go on a constrained network, but when running the (t)wrk benchmark in the cluster placement group I didn't see any evidence of packet drops or retransmits. Drops only happened with the iperf test.
1. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
2. https://talawah.io/blog/extreme-http-performance-tuning-one-...


https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...


EDIT: also, RSS happens on the NIC. RFS happens in the kernel, so it might not be as effective. For a uniform request workload like the one in the article, statically binding flows to a NIC queue should be sufficient. :)
This shows you can make a regular Linux program, using the Linux network stack, approach something hand-coded with DPDK.


edit: Oh, perhaps wrk2 still relies on the timer even when not specifying a fixed rate RPS.


I am no expert where coordinated omission is concerned, but my understanding is that it is most problematic in scenarios where your p90+ latency is high. Looking at the results for the 1.2M req/s test you have the following latencies:
p50 203.00us
p90 236.00us
p99 265.00us
p99.99 317.00us
pMAX 626.00us
If you were to apply wrk's coordinated omission hack[1] to these results, the backfilling only starts for requests that took longer than p50 x 2 (roughly) = 406us, which is probably somewhere between p99.999 and pMAX; a very, very small percentage. I am not claiming that wrk's hack is "correct", just that I don't think coordinated omission is a major concern for *this specific workload/environment*.
1. https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513...

Reminds me a lot of this classic CS paper: Improving IPC by Kernel Design, by Jochen Liedtke (1993)
https://www.cse.unsw.edu.au/~cs9242/19/papers/Liedtke_93.pdf



- I have a nodejs server for the APIs and it's running on an m5.xlarge instance. I haven't done much research on what instance type I should go for. I looked it up, and it seems like the c5n.xlarge (mentioned in the article) is meant to be compute optimized. The cost difference between m5.xlarge and c5n.xlarge isn't much. So, I'm assuming that switching to a c5 instance would be better, right?
- Is having nginx handle the requests a better option here, with a reverse proxy set up in front of NodeJS? I'm thinking of taking small steps toward scaling an existing framework.

The c5 instance type is about 10-15% faster than the m5, but the m5 has twice as much memory. So if memory is not a concern then switching to c5 is both a little cheaper and a little faster.
You shouldn't need the c5n, the regular c5 should be fine for most use cases, and it is cheaper.
Nginx in front of nodejs sounds like a solid starting point, but I can't claim to have a ton of experience with that combo.

As in all things, check the results on your own workload!

I'd recommend just using a standard AWS application load balancer in front of your Node.js app. Terminate SSL at the ALB as well using certificate manager (free). Will run you around $18 a month more.

Regarding core pinning, the usual advice is to pin to the CPU socket physically closest to the NIC. Is there any point doing this on cloud instances? Your actual cores could be anywhere. So just isolate one and hope for the best?
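The pinning itself is the easy part, wherever the core physically lives; for example, a worker process can pin itself to one vCPU with sched_setaffinity (a minimal sketch; CPU 1 is an arbitrary choice):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single vCPU (here: CPU 1, arbitrarily).
 * On a cloud instance this only controls which *virtual* CPU we run on;
 * where that lands physically is up to the hypervisor. */
int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);

    if (sched_setaffinity(0 /* self */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to vCPU 1\n");
    return 0;
}
```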






This case, where it's all connection handling and serving a small static piece of data is a clear example; there's almost no userland work to be done before it goes to another syscall so any additional cost for the user/kernel barrier is going to hurt.
Then the question becomes who can run code on your server; also considering that maybe there's a remote code execution vulnerability in your code, or in library code you use. Is there a meaningful barrier that spectre/meltdown mitigations would help enforce? Or would getting RCE give control over everything of substance anyway?


Reminds me how complicated it was to generate 40 Gbit/sec of HTTP traffic (with default MTU) to test F5 BIG-IP appliances; luckily TCL iRules had `HTTP::retry`.

This test is more about packets/s than bytes/s.
I see that this is clearly not the case here, but in general how can one be sure?


How long did you spend researching this subject to produce such an in depth report?

As a ballpark I would say I invested hundreds of hours in this experiment. Lots of sidetracks and dead ends along the way, but also an amazing learning experience.

There is also a quota system in place, so even though that is the hard limit, you can only operate at those speeds for a short time before you start getting rate-limited.

Presented this way, it may help noobs like me with capacity planning.
Fantastic work! Keep it up.
This can't be serious. Can someone flag this article? Highly inappropriate.