


The performance regression odyssey
We care a lot about the performance of RavenDB.
Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place that alerts us if it observes a performance degradation. Usually those are fairly subtle issues, but this time we got an alert showing a big degradation in one of our tests.
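The post does not show the monitoring system itself, so as a rough illustration only, here is a minimal sketch in Python of the kind of moving-baseline check such an alerting setup might run. The function name, the 1.25x threshold, and the idea of a per-run duration history are all assumptions for the sake of the example.

```python
# Hypothetical regression check, not RavenDB's actual monitoring code:
# flag a test run whose duration is well above its recent baseline.
from statistics import mean

def is_regression(history_secs, latest_secs, threshold=1.25):
    """Return True if the latest run is `threshold` times slower than the
    average of the previous runs (a simple moving-baseline check)."""
    if not history_secs:
        return False
    baseline = mean(history_secs)
    return latest_secs > baseline * threshold

# Example: a test that usually takes ~40s suddenly takes 70s.
print(is_regression([39.8, 41.2, 40.5, 40.1], 70.3))  # True
```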
The actual test in question does low-level manipulation of Voron (RavenDB’s storage engine), and as such, it stands at the core of the performance hotspots we watch for.
Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:
A really important observation here, however, is that this method is not called in the test at all. So we were looking at whether this change caused a regression in code generation. I couldn’t believe that this was the case, to be honest.
Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough for this test to raise all sorts of alarm bells.
We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.
We struck gold on this command: sudo smartctl --all /dev/nvme0n1
Take a look at the results:
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:            0x04
Temperature:                 35 Celsius
Available Spare:             100%
Available Spare Threshold:   10%
Percentage Used:             115%
Data Units Read:             462,613,897 [236 TB]
Data Units Written:          2,100,668,468 [1.07 PB]
Host Read Commands:          10,355,495,581
Host Write Commands:         9,695,954,131
Controller Busy Time:        70,777
In other words, the disk is literally crying at us. This tells us that the drive has seen roughly 50 days of actual activity (Controller Busy Time is reported in minutes, and 70,777 minutes is about 49 days) and that it has gone beyond its design specs.
In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times (1.07 PB ÷ 500 GB ≈ 2,100 full writes) before we hit this issue.
Once we went past the drive’s rated endurance (Percentage Used > 100%), it has to do a lot more work for each write, which is why we are seeing longer test times.
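To catch this kind of wear-out earlier, a health check along the following lines could run periodically on the benchmark machines. This is a minimal sketch, assuming the human-readable `smartctl --all` output format shown above; the `check_nvme_wear` helper and the 90% alert threshold are illustrative choices, not something from the post (newer smartctl versions can also emit JSON via `smartctl --json`, which would be sturdier to parse).

```python
# Sketch of an automated NVMe wear check built on the same `smartctl --all`
# output shown above. Field names (Critical Warning, Percentage Used) come
# from the NVMe SMART/Health log; the threshold is an arbitrary example.
import re
import subprocess

def check_nvme_wear(device="/dev/nvme0n1", max_used_pct=90):
    out = subprocess.run(
        ["sudo", "smartctl", "--all", device],
        capture_output=True, text=True, check=False,
    ).stdout

    warning = re.search(r"Critical Warning:\s+(0x[0-9a-fA-F]+)", out)
    used = re.search(r"Percentage Used:\s+(\d+)%", out)

    problems = []
    if warning and int(warning.group(1), 16) != 0:
        problems.append(f"critical warning flags set: {warning.group(1)}")
    if used and int(used.group(1)) >= max_used_pct:
        # 100% means the drive has consumed its rated write endurance;
        # the drive in the post reported 115%.
        problems.append(f"endurance used: {used.group(1)}%")
    return problems

if __name__ == "__main__":
    for p in check_nvme_wear():
        print("NVMe health issue:", p)
```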
I have to admit, this is the first time I’ve closed a bug by sending out a purchase order for new hardware.