


The performance regression odyssey
We care a lot about the performance of RavenDB.
Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place that alerts us if it observes a performance degradation. Usually those are fairly subtle issues, but this time we got an alert showing a big degradation in one of our tests.
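The post does not show the monitoring system itself, so as a rough illustration only, here is a minimal sketch in Python of the kind of moving-baseline check such an alerting setup might run. The function name, the 1.25x threshold, and the idea of a per-run duration history are all assumptions for the sake of the example.

```python
# Hypothetical regression check, not RavenDB's actual monitoring code:
# flag a test run whose duration is well above its recent baseline.
from statistics import mean

def is_regression(history_secs, latest_secs, threshold=1.25):
    """Return True if the latest run is `threshold` times slower than the
    average of the previous runs (a simple moving-baseline check)."""
    if not history_secs:
        return False
    baseline = mean(history_secs)
    return latest_secs > baseline * threshold

# Example: a test that usually takes ~40s suddenly takes 70s.
print(is_regression([39.8, 41.2, 40.5, 40.1], 70.3))  # True
```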
The actual test in question does low-level manipulation of Voron (RavenDB’s storage engine), and as such, it stands at the core of the performance hotspots we watch for.
Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:
A really important observation here, however, is that this method is not called in the test at all. So we were looking at whether this change caused a regression in code generation. I couldn’t believe that this was the case, to be honest.
Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough for this test to raise all sorts of alarm bells.
We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.
We struck gold on this command: sudo smartctl --all /dev/nvme0n1
Take a look at the results:
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:            0x04
Temperature:                 35 Celsius
Available Spare:             100%
Available Spare Threshold:   10%
Percentage Used:             115%
Data Units Read:             462,613,897 [236 TB]
Data Units Written:          2,100,668,468 [1.07 PB]
Host Read Commands:          10,355,495,581
Host Write Commands:         9,695,954,131
Controller Busy Time:        70,777
In other words, the disk is literally crying at us. This tells us that the drive has seen roughly 50 days of actual activity (Controller Busy Time is reported in minutes, and 70,777 minutes is about 49 days) and that it has gone beyond its design specs.
In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times (1.07 PB ÷ 500 GB ≈ 2,100 full writes) before we hit this issue.
Once we went past the drive’s rated endurance (Percentage Used > 100%), it has to do a lot more work for each write, which is why we are seeing longer test times.
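To catch this kind of wear-out earlier, a health check along the following lines could run periodically on the benchmark machines. This is a minimal sketch, assuming the human-readable `smartctl --all` output format shown above; the `check_nvme_wear` helper and the 90% alert threshold are illustrative choices, not something from the post (newer smartctl versions can also emit JSON via `smartctl --json`, which would be sturdier to parse).

```python
# Sketch of an automated NVMe wear check built on the same `smartctl --all`
# output shown above. Field names (Critical Warning, Percentage Used) come
# from the NVMe SMART/Health log; the threshold is an arbitrary example.
import re
import subprocess

def check_nvme_wear(device="/dev/nvme0n1", max_used_pct=90):
    out = subprocess.run(
        ["sudo", "smartctl", "--all", device],
        capture_output=True, text=True, check=False,
    ).stdout

    warning = re.search(r"Critical Warning:\s+(0x[0-9a-fA-F]+)", out)
    used = re.search(r"Percentage Used:\s+(\d+)%", out)

    problems = []
    if warning and int(warning.group(1), 16) != 0:
        problems.append(f"critical warning flags set: {warning.group(1)}")
    if used and int(used.group(1)) >= max_used_pct:
        # 100% means the drive has consumed its rated write endurance;
        # the drive in the post reported 115%.
        problems.append(f"endurance used: {used.group(1)}%")
    return problems

if __name__ == "__main__":
    for p in check_nvme_wear():
        print("NVMe health issue:", p)
```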
I have to admit, this is the first time I’ve closed a bug by sending out a purchase order for new hardware.