
Reader/Reader blocking in reader/writer locks


tl;dr

In writer-priority reader/writer locks, as soon as a single writer enters the acquisition queue, all future accesses block behind any in-flight reads. If any readers hold the lock for extended periods of time, this can lead to extreme pauses and loss of throughput, even with only a very small number of writers.

Abstract

This post describes a phenomenon that can occur in systems built on reader/writer locks, where slow readers and a small number of writers (e.g. even a single writer) can lead to substantial latency spikes for the entire system. This phenomenon is well-known in certain systems engineering communities (e.g. among some kernel or database developers), but is often surprising when first encountered, and has important implications for the design of such systems.

Background

Reader/writer locks (sometimes styled rwlocks) are a common concurrency primitive that aims to provide higher concurrency than a traditional mutex by allowing multiple readers to proceed in parallel.

A reader/writer lock can be acquired, as the name suggests, either for read or write access. Any number of threads may hold it for read access concurrently, but if a writer holds the lock, no other thread may concurrently hold it for either read or write access.

To avoid starving writers, most reader/writer lock implementations are writer-priority. This means that once a writer starts waiting on the lock, no future reader may acquire the lock until after the writer has acquired and released it. Without this behavior, under contention, readers could continually overlap their sessions and never leave a moment at which a writer can acquire an exclusive lock.
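To make this policy concrete, here is a minimal sketch in Go of how a writer-priority lock might admit readers and writers. This is an illustrative teaching implementation with invented names, not code from any real lock; production implementations use atomics, futexes, and more careful fairness handling. The important detail is in RLock: an incoming reader waits whenever any writer is waiting, not only when a writer holds the lock.

package rwlock

import "sync"

// RWLock is a deliberately simplified writer-priority reader/writer lock.
type RWLock struct {
	mu             sync.Mutex
	cond           *sync.Cond
	readers        int  // readers currently holding the lock
	writerActive   bool // true while a writer holds the lock
	writersWaiting int  // writers queued waiting for the lock
}

func New() *RWLock {
	l := &RWLock{}
	l.cond = sync.NewCond(&l.mu)
	return l
}

func (l *RWLock) RLock() {
	l.mu.Lock()
	// Writer priority: a new reader must wait while a writer holds the lock
	// OR while any writer is merely waiting for it. This is the rule that
	// lets readers end up blocked behind other readers.
	for l.writerActive || l.writersWaiting > 0 {
		l.cond.Wait()
	}
	l.readers++
	l.mu.Unlock()
}

func (l *RWLock) RUnlock() {
	l.mu.Lock()
	l.readers--
	if l.readers == 0 {
		l.cond.Broadcast() // a waiting writer may now proceed
	}
	l.mu.Unlock()
}

func (l *RWLock) Lock() {
	l.mu.Lock()
	l.writersWaiting++
	for l.writerActive || l.readers > 0 {
		l.cond.Wait()
	}
	l.writersWaiting--
	l.writerActive = true
	l.mu.Unlock()
}

func (l *RWLock) Unlock() {
	l.mu.Lock()
	l.writerActive = false
	l.cond.Broadcast() // wake queued readers and writers to re-check
	l.mu.Unlock()
}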

The problem

This writer-priority behavior, however, essentially creates the possibility for readers to block on other readers, something that a reader/writer lock is supposed to avoid. Consider a situation where, in this order:

  • At time T₁, a long-running reader R₁ acquires a read lock
  • At time T₂, a writer W attempts to acquire a writer lock
  • At time T₃, a reader R₂ attempts to acquire a read lock
  • At time T₄, R₁ drops the read lock

Because the writer W has priority, R₂ will be blocked until W can acquire and then release the lock. However, W is blocked until R₁ releases its read lock at T₄. Thus, between times T₃ and T₄, R₂ is blocked, effectively waiting on R₁ to complete (and then W after it).
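This scenario is easy to reproduce with a real lock. Go's sync.RWMutex has this writer-priority behavior (a blocked Lock call excludes new readers), so the following small program, a demonstration of my own rather than code from the original post, plays out the T₁ through T₄ timeline above: R₂'s read, which should be instantaneous, completes only after R₁ finishes and W has come and gone.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup
	start := time.Now()
	logf := func(format string, args ...interface{}) {
		fmt.Printf("t=%4.1fs ", time.Since(start).Seconds())
		fmt.Printf(format+"\n", args...)
	}

	wg.Add(3)

	// R1: a long-running reader acquires the read lock at T1.
	go func() {
		defer wg.Done()
		mu.RLock()
		logf("R1 acquired read lock")
		time.Sleep(3 * time.Second) // simulate a slow read
		mu.RUnlock()
		logf("R1 released read lock (T4)")
	}()

	// W: a writer arrives at T2 and blocks behind R1.
	go func() {
		defer wg.Done()
		time.Sleep(1 * time.Second)
		logf("W waiting for write lock")
		mu.Lock()
		logf("W acquired write lock")
		mu.Unlock()
	}()

	// R2: a "fast" reader arrives at T3, but blocks until R1 and W are done.
	go func() {
		defer wg.Done()
		time.Sleep(2 * time.Second)
		logf("R2 waiting for read lock")
		mu.RLock()
		logf("R2 acquired read lock") // prints at ~3s, not ~2s
		mu.RUnlock()
	}()

	wg.Wait()
}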

If readers ever hold locks for extended periods of time, such that T₄ is much later than T₁, this can lead to disastrous system pauses: a very small amount of write load may be sufficient to allow long-running readers to halt all other threads until they complete.

Exacerbating factors

This may seem like a niche problem: it requires a confluence of a few different factors in order to occur. However, the nature of reader/writer locks and the situations in which they're used actually makes it much more common than one might expect.

First, note that any system designed around a reader/writer lock is, almost by hypothesis, likely to see a high rate of reads and much less frequent writes, because that's exactly the situation in which a reader/writer lock provides value over a vanilla mutex. Such systems therefore tend to rely on the additional concurrency, and on readers not blocking other readers.

Furthermore, absent contention with writers, long-lived readers have no impact on the throughput or latency of lock acquisitions, because other readers can peacefully coexist with them. As a general matter, software has a constant tendency to get slower, absent deliberate and careful efforts to hold the line, as developers add features and complexity. And if slower processes don’t immediately impact performance (because they only hold read locks), they’re more likely to go unnoticed. Thus, it’s not unlikely that over time read lock durations will creep upwards, mostly without effect, until one happens to coincide with an attempt to grab a write lock.

Examples

Here are a few examples of real systems in which I've observed this problem:

Linux kernel mmap_sem

The Linux kernel uses a reader/writer lock to protect the structures controlling address space layout within a process (and shared between threads). It is in general a well-known source of contention, but I’ve recently debugged and documented a case in which unexpected long-lived readers using this lock can result in egregious latency spikes for other readers.

As an interesting aside, I looked up the documentation for the kernel's reader/writer locks, and discovered this advice:

If your code divides neatly along reader/writer lines, and the lock is held by readers for significant lengths of time, using [reader/writer] locks can help.

Based on the phenomenon described in this post, I believe this advice to be overly simplistic at best, and actively harmful at worst.

MongoDB MMAPv1 storage engine

MongoDB’s classic storage engine (known as “mmapv1”) traditionally used a single reader/writer lock to guard access to each database. By design, the lock should never be held for extended periods of time, and readers are supposed to periodically release (“yield”) the lock to allow other processes to make progress.

However, readers holding the lock for too long has been, and remains, a recurring source of performance-limiting concurrency problems; here's just one example from their issue tracker describing exactly this issue, and I've debugged many others in the course of operating a large MongoDB deployment.

SQL

SQL databases, even ones with powerful MVCC concurrency engines, tend to use reader/writer locks in the implementation, and it’s not hard to unintentionally lock entire tables against all access for extended periods of time. Let’s look at an example that works on (at least) both PostgreSQL and MySQL:

We’ll create a small table with some data:

CREATE TABLE rwlock (n integer);
INSERT INTO rwlock VALUES (1), (2), (3);

Then, in one client, we’ll simulate a slow read-only analytical query:

SELECT pg_sleep(60) from rwlock; -- PostgreSQL
SELECT sleep(60) from rwlock; -- MySQL

While that runs, we execute a schema migration:

ALTER TABLE rwlock ADD COLUMN i integer;

This command hangs, waiting for the SELECT to complete. To me this behavior seems mildly disappointing, but also not that surprising. However, if we then attempt to execute a read that should be fast:

SELECT * from rwlock;

We find that this read also hangs until the initial query completes!

This is a surprising result to me: running a slow analytical query alongside online queries is something I understood to be safe, and I also understood running an ADD COLUMN to be a safe operation to perform online. However, combining them lets us accidentally lock up our entire database!

MySQL's ALTER TABLE supports the option to specify ALGORITHM=INSTANT, documented as:

Operations only modify metadata in the data dictionary. No exclusive metadata locks are taken on the table during preparation and execution, and table data is unaffected, making operations instantaneous.

But testing reveals that even that option is not sufficient to prevent the potentially-unbounded blocking behavior described here.

I am, of course, far from the first to note this behavior: Among others, GoCardless wrote a good postmortem describing a run-in with exactly this problem.

What to do about it?

So, we've seen that reader/writer locks can behave in surprising ways, leading to high latency and to readers effectively holding exclusive locks. What can we do about it?

At a high level, my experiences and this exploration have made me much more wary of designs that rely on reader/writer locks. Because of the behavior explored here, their additional concurrency can effectively become a false promise, but one that works just well enough for systems designers and users to come to rely on it. That's in many ways the worst possible outcome, setting systems up for unexpected failures when that concurrency vanishes.

With that in mind, here are some recommendations. None of these is appropriate for every situation, but hopefully together they provide some helpful starting points.

  • Use finer-grained concurrency and normal exclusive locks. A vanilla exclusive lock offers lower concurrency, but does so in a more consistent way — contention is always bad, as opposed to sporadically bad, and so performance is more predictable.
    • As a concrete example, the RadixVM research paper replaces the mmap_sem with a radix tree; nodes in the tree are locked using a trivial spinlock, but by design contention is extremely rare, and so the system as a whole achieves greatly improved concurrency.
  • Various concurrency schemes can guarantee that readers never block, offering an even stronger form of reader concurrency than reader/writer locks:
    • read-copy-update is a technique used in the Linux kernel which allows for completely synchronization-free reads.
    • Various schemes using immutable data structures and atomic pointer updates can provide similar properties in other environments (a small sketch of this approach appears after this list).
    • Some optimistic concurrency schemes, like Michael-Scott queues, offer progress guarantees even in the face of contention.
  • Time out writer acquisitions.
    • If potentially starving writers is acceptable, adding a timeout to write-side acquisitions will bound how long readers can be forced to pause. GoCardless’ library for online Rails migrations implements this option for Postgres, and also discusses the challenges and implications a bit.
  • If you do have to use a reader/writer lock, think carefully about how long a reader or writer can hold the lock, and instrument the system so that long-lived critical sections are observable before they become problematic.
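As a concrete illustration of the immutable-snapshot approach mentioned above, here is a small Go sketch of my own (not code from any of the systems discussed): readers perform a single atomic pointer load and can hold their snapshot for as long as they like without blocking anyone, while writers copy the current snapshot, modify the copy, and publish it with one pointer swap.

package snapshot

import (
	"sync"
	"sync/atomic"
)

// Config is treated as immutable once published: writers never modify a
// published value in place.
type Config struct {
	Limits map[string]int
}

// Store holds the current snapshot behind an atomic pointer.
type Store struct {
	current atomic.Pointer[Config]
	writeMu sync.Mutex // serializes writers only; readers never touch it
}

func NewStore() *Store {
	s := &Store{}
	s.current.Store(&Config{Limits: map[string]int{}})
	return s
}

// Load never blocks: it is a single atomic pointer read, so a slow reader
// holding an old snapshot cannot delay writers or other readers.
func (s *Store) Load() *Config {
	return s.current.Load()
}

// Update copies the current snapshot, applies the change, and publishes the
// new version with one atomic store. Concurrent readers keep seeing the old
// snapshot until their next Load.
func (s *Store) Update(key string, limit int) {
	s.writeMu.Lock()
	defer s.writeMu.Unlock()
	old := s.current.Load()
	next := &Config{Limits: make(map[string]int, len(old.Limits)+1)}
	for k, v := range old.Limits {
		next.Limits[k] = v
	}
	next.Limits[key] = limit
	s.current.Store(next)
}

And in the spirit of the last recommendation, a tiny wrapper (again a hypothetical sketch, with invented names) can make long read-side critical sections observable before they happen to coincide with a waiting writer:

package instrumented

import (
	"log"
	"sync"
	"time"
)

// RWMutex wraps sync.RWMutex and logs read-side critical sections that hold
// the lock for longer than Threshold.
type RWMutex struct {
	mu        sync.RWMutex
	Threshold time.Duration
}

// WithRLock runs f under the read lock and reports how long the lock was held.
func (m *RWMutex) WithRLock(name string, f func()) {
	m.mu.RLock()
	start := time.Now()
	f()
	held := time.Since(start)
	m.mu.RUnlock()
	if held > m.Threshold {
		log.Printf("read section %q held rwlock for %v (threshold %v)", name, held, m.Threshold)
	}
}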

Conclusion

A number of people I've talked to have described this phenomenon as unsurprising, and I think it is when explained explicitly. But the number of times I've run into performance problems related to long-lived readers in reader/writer locks makes me believe that it's not necessarily obvious a priori, or perhaps not obvious in the right ways. Dealing with these problems and writing this post has made me much more cautious about ever reaching for an rwlock as a solution to a problem, and I hope it helps raise awareness of this problem and potential performance pitfall.

