
Lies, Damned Lies and Statistics

source link: https://treatwell.engineering/lies-damned-lies-and-statistics-740414396183



Here at Treatwell, we don't so much love relational databases as respect and admire them for consistently (and atomically, isolatedly and durably) dealing with the unending barrage of difficult questions we ask them. Although we favour PostgreSQL for new services and are in the process of migrating older ones to it, for historical reasons we still have a lot of MSSQL and probably will for some time. This is not to say that we are strangers to non-relational databases (in CPU and RAM terms, we run far more Elasticsearch and Redis than relational databases), but we strongly believe that, as time- and battle-tested solutions to persistence problems, relational databases deserve an important place in our architecture.

By and large neither MSSQL nor PostgreSQL gives us much trouble. With a fairly basic maintenance program and an understanding of our workloads and the trade-offs of having (or not having) indices, we can usually just rely on the query planner and its colleagues to do the right thing.

Usually, but not always. Very occasionally an apparently simple, inexpensive-looking query (or class of query) starts running slowly. This happened recently, and we managed to produce a minimal-ish query that reproduced the problem. The query, which still had not returned after tens of seconds, looked something like this:
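Schematically, it had roughly this shape; treat it as an illustration rather than the real query, since apart from appointment_event the table and column names (and the appointment id) are stand-ins:

```sql
-- Illustrative sketch only: apart from appointment_event, the names are invented.
SELECT o.*
FROM some_outer_table o
WHERE o.appointment_event_id IN (
    SELECT ae.id
    FROM appointment_event ae
    WHERE ae.appointment_id = 123456789  -- stand-in appointment id
);
```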

appointment_event is a real table with over half a billion rows in it. The subquery returns 11 rows, which sounds about right for the “happy path” through an appointment’s lifecycle. It also returns quickly — < 1ms. Why, then, was the whole query taking so long?

In such situations, the first thing to do is usually to look at the execution plan. Ours looked quite complicated, so we decided to focus on the subquery, whose root plan element looked like this:

Estimated rows: 4,814,410. WTF!? That's off the actual count of 11 by some margin. Drilling further into the execution plan, we could see that the estimates were quite far off for a number of operations. Focusing on appointment_event, we could see that the estimate for queries like this one was off too:
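A query "like this one" would be a simple lookup of all events for a single appointment; as an illustrative sketch (the appointment id is made up, and SET STATISTICS XML is just one way to compare estimated and actual row counts):

```sql
-- Capture the actual execution plan, with estimated vs actual row counts,
-- for a simple per-appointment lookup (stand-in appointment id).
SET STATISTICS XML ON;

SELECT *
FROM appointment_event
WHERE appointment_id = 123456789;

SET STATISTICS XML OFF;
```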

Estimated rows: 776, actual rows: 11.

We were aware that SQL Server (like the majority of RDBMSes) uses statistics to determine the most efficient way to execute a query, so it seemed to follow that the statistics were "off" in some (major) way.
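The tool for inspecting them is DBCC SHOW_STATISTICS; as a sketch, with a hypothetical index name:

```sql
-- Show the statistics object for the index on appointment_event(appointment_id);
-- the index name ix_appointment_event_appointment_id is a guess.
DBCC SHOW_STATISTICS ('dbo.appointment_event', 'ix_appointment_event_appointment_id');

-- Or just the histogram portion:
DBCC SHOW_STATISTICS ('dbo.appointment_event', 'ix_appointment_event_appointment_id')
    WITH HISTOGRAM;
```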

This command showed us the statistics for the index that was being used, specifically the histogram, which gives the query planner an idea of how values are distributed among buckets defined by ranges of key values (MSSQL will create up to 200 such buckets in a given statistics object).

AVG_RANGE_ROWS should be the mean number of rows per key value, in this case the mean number of appointment_event rows per appointment. Clearly this is where the query planner was getting its estimate of 776 rows from, but surely it couldn't be right. We ran some queries to see what the actual distribution of values in this range was, and the answer was about 12 rows per appointment. So where was 776 coming from?
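As an illustrative sketch of that kind of check (the key range standing in for one histogram bucket is made up):

```sql
-- Actual average number of appointment_event rows per appointment
-- over a stand-in key range covered by one histogram step.
SELECT AVG(events_per_appointment * 1.0) AS avg_events_per_appointment
FROM (
    SELECT appointment_id, COUNT(*) AS events_per_appointment
    FROM appointment_event
    WHERE appointment_id BETWEEN 123400000 AND 123500000  -- stand-in bucket range
    GROUP BY appointment_id
) AS per_appointment;
```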

For very large tables like appointment_event, it’s not feasible to look at all rows when computing statistics — it would take many hours to do this — so we do what many DBAs (or people who, from time to time, begrudgingly have the DBA hat thrust upon them) do and try to infer the distribution from a sample of rows. In general, letting MSSQL pick the sample size works fine, but in this case it was clearly not working.
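For reference, sys.dm_db_stats_properties (available on reasonably recent SQL Server versions) shows how many rows the last statistics update actually sampled:

```sql
-- How many rows were sampled when each statistics object on appointment_event was last updated.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       100.0 * sp.rows_sampled / sp.rows AS sample_percent
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.appointment_event');
```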

It turns out that sampling is random but works at the page level. Since we often generate a flurry of appointment events around the time an appointment is created, a given 8K page will often contain several events for the same appointment, which causes the sampler to overestimate the total number of events for an appointment. As an analogy, given several random pages from a telephone directory (the book that people used to use in the olden days to find someone's "telephone number"), you will very often see many entries for the same surname. Without knowing anything about the nature of the data in a telephone directory, you might be forgiven for thinking that there are not as many surnames as there really are, and that there are more people with each surname.

Continuing with the phone book analogy, by looking at more pages you'd get an increasingly accurate picture of the real distribution of names. Similarly with our appointment event statistics: it turns out that a sampling factor of 0.17%, which is what MSSQL thought reasonable, is simply not enough in this case. The (somewhat anticlimactic) solution to our problem was to increase the sampling factor to 1% for this index. That still produces inaccurate statistics (an AVG_RANGE_ROWS estimate of ~45), but it is good enough for our purposes and not too computationally expensive.
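On a SQL Server version that supports persisting the sample rate, that fix amounts to something like the following (the index name is again hypothetical):

```sql
-- Rebuild the statistics for the (hypothetically named) index with a fixed 1% sample,
-- and keep using that sample rate for future automatic updates.
UPDATE STATISTICS dbo.appointment_event ix_appointment_event_appointment_id
    WITH SAMPLE 1 PERCENT, PERSIST_SAMPLE_PERCENT = ON;
```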

