
InnoDB Data Locking – Part 2.5 “Locks” (Deeper dive)

source link: https://mysqlserverteam.com/innodb-data-locking-part-2-5-locks-deeper-dive/

All together now

Let’s now put together all that we’ve learned in InnoDB Data Locking – Part 2 “Locks” about table and record locks to understand the following situation:

mysql> BEGIN;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM t FOR SHARE;
+----+
| id |
+----+
|  5 |
| 10 |
| 42 |
+----+
3 rows in set (0.00 sec)
mysql> DELETE FROM t WHERE id=10;
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO t VALUES (4);
Query OK, 1 row affected (0.00 sec)
mysql> SELECT INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE
       FROM performance_schema.data_locks WHERE OBJECT_NAME='t';
+------------+-----------+------------------------+---------------+
| INDEX_NAME | LOCK_TYPE | LOCK_DATA              | LOCK_MODE     |
+------------+-----------+------------------------+---------------+
| NULL       | TABLE     | NULL                   | IS            |
| PRIMARY    | RECORD    | supremum pseudo-record | S             |
| PRIMARY    | RECORD    | 5                      | S             |
| PRIMARY    | RECORD    | 10                     | S             |
| PRIMARY    | RECORD    | 42                     | S             |
| NULL       | TABLE     | NULL                   | IX            |
| PRIMARY    | RECORD    | 10                     | X,REC_NOT_GAP |
| PRIMARY    | RECORD    | 4                      | S,GAP         |
+------------+-----------+------------------------+---------------+
8 rows in set (0.00 sec)

We see that:

  • the first SELECT * FROM t FOR SHARE; created S locks (each covering both the record and the gap before it) on 5, 10, 42 and the supremum pseudo-record. Essentially this means that the whole axis is covered with locks, and this is exactly what one needs to prevent any other transaction from modifying the result set of this query. Also, this required an IS table lock on t to begin with.
  • the DELETE FROM t WHERE id=10; first obtained an IX table lock to demonstrate its intention to modify the table, and then acquired X,REC_NOT_GAP to modify just the record with id=10.
  • finally, the INSERT INTO t VALUES (4); saw that it already has IX, so it proceeded to perform the insert. This is a very tricky operation, and would require me to talk about details we’ve abstracted away. It starts by temporarily latching (note the word: “latching”, not “locking”!) the page to see if this is the correct place to put the record, then latching the Lock System queue for the point to the right of the insertion point and checking if there is any *,GAP, S or X lock there. In our case there were none, so we immediately proceeded to insert the record (which has an implicit lock, as it has the id of our transaction in the “last modified by” field, which I hope explains why there is no explicit X,REC_NOT_GAP lock on 4). In the opposite case, when there is some conflicting lock, a waiting INSERT_INTENTION lock is created to track the conflict explicitly, so that the operation can be retried once it becomes GRANTED. The final step is to deal with the fact that inserting a new point on the axis splits an already existing gap into two. Thus whatever locks already existed for the old gap have to be inherited by the newly created gap on the left side of the insertion point. This is the reason we can see S,GAP on row 4: it is inherited from the S lock on row 5.
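The last two steps above (checking the lock queue for the point to the right of the insertion point, and inheriting gap locks when a gap is split) can be sketched in a toy model. This is an illustrative sketch with made-up names, not InnoDB’s actual data structures:

```python
# Toy model of the "axis": each record can hold a lock on the record
# itself ("REC") and/or on the gap before it ("GAP"). Inserting a new
# point splits an existing gap, and any gap lock covering the old gap
# is inherited by the newly created gap to the left of the new point.

class AxisLocks:
    def __init__(self):
        # record value -> set of lock parts, e.g. {"REC", "GAP"};
        # "GAP" on record r protects the open interval (prev(r), r)
        self.locks = {}

    def lock(self, rec, parts):
        self.locks.setdefault(rec, set()).update(parts)

    def insert_record(self, new_rec, existing_recs):
        # find the point immediately to the right of the insertion point
        right = min((r for r in existing_recs if r > new_rec), default=None)
        if right is not None and "GAP" in self.locks.get(right, set()):
            # the old gap is split in two; the left sub-gap, which now
            # ends at the new record, inherits the gap lock
            self.lock(new_rec, {"GAP"})

axis = AxisLocks()
for r in (5, 10, 42):
    axis.lock(r, {"REC", "GAP"})    # S locks taken by SELECT ... FOR SHARE

axis.insert_record(4, [5, 10, 42])  # INSERT INTO t VALUES (4)
print(axis.locks[4])                # {'GAP'}: the inherited S,GAP on row 4
```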

This is just the tip of the iceberg of the real complications involved (we haven’t talked about inheritance of locks from deleted rows, secondary indexes, uniqueness checks…), but there are some high-level ideas to take from this:

  • in general, to provide serializability, you need to “lock what you saw”, which includes not only points but also gaps between them – you saw them, too! If you can imagine how the query accesses the table when scanning, you can mostly guess what it will have to lock. This means it is important to have good indexes, so that you can jump straight to the point you want to lock, instead of having to lock the whole scanned range.
  • the opposite is also true: if you don’t care about serializability, you can flirt with the idea of not locking something. For example, in READ COMMITTED, which is a lower isolation level, we try to avoid locking the gaps between rows (so other transactions can insert rows in between them, which leads to the so-called “phantom rows” effect, which should now be obvious to you)
  • in InnoDB all the rows, including those “being inserted” and “being deleted”, are physically present in the index and thus appear on the axis and split it into gaps. This is in contrast to some other engines which keep “ongoing” changes in a “scratch pad area” and only merge them into the common picture on commit. This means that even though conceptually there is no interaction between concurrent transactions (for example, we should not see rows being inserted by a transaction until it’s committed), in the low-level implementation they interact quite a lot (for example, a transaction can have a waiting lock on a row which officially does not yet exist). So, do not be surprised to see performance_schema.data_locks reporting rows which are not yet inserted or were already deleted (the latter will eventually be purged)

Record Locks’ compression (and missing LOCK_DATA)

In the examples above you saw a useful LOCK_DATA column which shows you the row values for the columns of the index on which a record lock is placed. This is very useful for humans analysing the situation, but storing this LOCK_DATA explicitly in memory objects would be wasteful, so this data is actually reconstructed on the fly when you query the performance_schema.data_locks table, from the compressed information available in the Lock System’s memory joined with the data available in the buffer pool pages. That is, the Lock System identifies a record lock by the <space_id, page_no> of the page on which the record resides and a heap_no, which is the number of the record within the page (these numbers do not in general have to be in the same order as the record values on the page, as they are assigned by a small heap allocator which tries to reuse space within the page the best it can as you remove, insert, and resize rows). This approach has the nice advantage that a point is described using three fixed-length numbers: space_id, page_no, heap_no. Moreover, as it is common for a query to lock several rows on the same page, all locks which differ only by heap_no are (usually) stored together in a single object which has a bitmap long enough that the heap_no-th bit can signify whether a given record is covered by this lock instance or not. (There is some trade-off here, because we “waste” space for a whole bitmap even if all we need is to lock a single record. Thankfully, the number of records per page is usually small enough that you can afford n/8 bytes.)
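A minimal sketch of this bitmap scheme (names and sizes are illustrative, not InnoDB’s real lock_t layout):

```python
# One lock object per (space_id, page_no, mode); records on that page
# that differ only by heap_no share the object via a bitmap.

class RecordLockObject:
    def __init__(self, space_id, page_no, mode, n_heap_slots=64):
        self.space_id = space_id
        self.page_no = page_no
        self.mode = mode                    # e.g. "S" or "X,REC_NOT_GAP"
        self.bitmap = bytearray(n_heap_slots // 8)   # n/8 bytes of bits

    def set(self, heap_no):
        self.bitmap[heap_no // 8] |= 1 << (heap_no % 8)

    def is_locked(self, heap_no):
        return bool(self.bitmap[heap_no // 8] & (1 << (heap_no % 8)))

# A single object covers every S-locked record on the page
# (in real pages, heap_no 1 is the supremum pseudo-record):
lock = RecordLockObject(space_id=23, page_no=4, mode="S")
for heap_no in (1, 2, 3, 4, 5):
    lock.set(heap_no)

print(lock.is_locked(3))   # True:  this heap_no's bit is set
print(lock.is_locked(7))   # False: record not covered by this object
```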

So, even though performance_schema.data_locks reports each record lock separately, they often correspond just to different bits in the same object, and you can see that by looking into OBJECT_INSTANCE_BEGIN column:

> CREATE TABLE t(id INT PRIMARY KEY);
> insert into t values (1),(2),(3),(4);
> delete from t where id=3;
> insert into t values (5);
> BEGIN;
> SELECT * FROM t FOR SHARE;
+----+
| id |
+----+
|  1 |
|  2 |
|  4 |
|  5 |
+----+
> SELECT OBJECT_INSTANCE_BEGIN,INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE
  FROM performance_schema.data_locks WHERE OBJECT_NAME='t';
+-----------------------+------------+-----------+------------------------+-----------+
| OBJECT_INSTANCE_BEGIN | INDEX_NAME | LOCK_TYPE | LOCK_DATA              | LOCK_MODE |
+-----------------------+------------+-----------+------------------------+-----------+
|         3011491641928 | NULL       | TABLE     | NULL                   | IS        |
|         3011491639016 | PRIMARY    | RECORD    | supremum pseudo-record | S         |
|         3011491639016 | PRIMARY    | RECORD    | 1                      | S         |
|         3011491639016 | PRIMARY    | RECORD    | 2                      | S         |
|         3011491639016 | PRIMARY    | RECORD    | 5                      | S         |
|         3011491639016 | PRIMARY    | RECORD    | 4                      | S         |
+-----------------------+------------+-----------+------------------------+-----------+

Notice that SELECT..FROM t.. returned rows in their semantic order (ascending by id), which means that the easiest way to scan through the primary index indeed visits rows in the order of their primary key, because they form a linked list inside the page’s heap. However, the SELECT..FROM performance_schema.data_locks reveals some hints at the internal implementation: the newly inserted row with id=5 got into the hole left by the deleted row with id=3. We see that all the record locks are stored in the same object instance, and we can guess that this instance’s bitmap has bits set for the heap_no‘s corresponding to all the real rows and the supremum pseudo-record.

Now, let’s demonstrate that the Lock System doesn’t really know what the values of the columns are and thus has to consult the content of actual pages in the buffer pool to fill in the LOCK_DATA column. The buffer pool can be thought of as a cache for actual pages on disk (sorry for the oversimplification: actually it can be fresher than the data on disk, as it also contains patches to the page stored as redo log deltas). Performance_schema only uses data from the buffer pool, not from disk, so if it can’t find the page there, it will not attempt to fetch the data from disk, instead reporting NULL in the LOCK_DATA column. How can we force eviction of a page from the buffer pool? In general: I don’t know. But what seems to work is to push so many new pages into the buffer pool that its capacity is reached and it has to evict the oldest pages. For this purpose I’ll open up a new client and create a table so huge it doesn’t fit in the buffer pool. How huge?

con2> SELECT @@innodb_buffer_pool_size;
+---------------------------+
| @@innodb_buffer_pool_size |
+---------------------------+
|                 134217728 |
+---------------------------+

OK, we need to push 128MB of data. (This experiment could be simplified by resizing the buffer pool to something smaller, which in general can be done dynamically; unfortunately, the default size of “a chunk” is so large that we can’t get below 128MB anyway.)

con2> CREATE TABLE big(
        id INT PRIMARY KEY AUTO_INCREMENT,
        blah_blah CHAR(200) NOT NULL
      );
con2> INSERT INTO big VALUES (1,REPEAT('a',200));
con2> INSERT INTO big (blah_blah) SELECT blah_blah FROM big;
con2> INSERT INTO big (blah_blah) SELECT blah_blah FROM big;
con2> INSERT INTO big (blah_blah) SELECT blah_blah FROM big;
      (each such INSERT doubles the row count; repeat until...)
con2> INSERT INTO big (blah_blah) SELECT blah_blah FROM big;
Query OK, 262144 rows affected (49.14 sec)
Records: 262144  Duplicates: 0  Warnings: 0

..should be enough. Let’s see the performance_schema.data_locks again:

> SELECT OBJECT_INSTANCE_BEGIN,INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE
  FROM performance_schema.data_locks WHERE OBJECT_NAME='t';
+-----------------------+------------+-----------+------------------------+-----------+
| OBJECT_INSTANCE_BEGIN | INDEX_NAME | LOCK_TYPE | LOCK_DATA              | LOCK_MODE |
+-----------------------+------------+-----------+------------------------+-----------+
|         3011491641928 | NULL       | TABLE     | NULL                   | IS        |
|         3011491639016 | PRIMARY    | RECORD    | supremum pseudo-record | S         |
|         3011491639016 | PRIMARY    | RECORD    | NULL                   | S         |
|         3011491639016 | PRIMARY    | RECORD    | NULL                   | S         |
|         3011491639016 | PRIMARY    | RECORD    | NULL                   | S         |
|         3011491639016 | PRIMARY    | RECORD    | NULL                   | S         |
+-----------------------+------------+-----------+------------------------+-----------+

Ha! You see? Now we have NULLs in the LOCK_DATA column. But don’t be afraid, it’s just how the information is presented to humans – the Lock System still knows which heap_no‘s on which page are locked, so if you try to access one of these records from another client you’ll have to wait:

con2> DELETE FROM t WHERE id = 2;
⌛

So, don’t panic if you see NULLs in LOCK_DATA. It just means the page is currently not available in the buffer pool.
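The way LOCK_DATA gets reconstructed (or not) can be sketched like this; the buffer pool is modelled as a plain dictionary and all names are made up for illustration:

```python
# The Lock System only knows (space_id, page_no, heap_no); the column
# values shown in LOCK_DATA must be read from the page, and only if the
# page is currently cached in the buffer pool -- never from disk.

buffer_pool = {
    # (space_id, page_no) -> {heap_no: column values of the record}
    (23, 4): {2: "1", 3: "2", 4: "5", 5: "4"},
}

def lock_data(space_id, page_no, heap_no):
    page = buffer_pool.get((space_id, page_no))
    if page is None:
        return None   # page not cached: report NULL rather than read disk
    return page.get(heap_no)

print(lock_data(23, 4, 3))   # '2'  (page is in the buffer pool)
print(lock_data(23, 9, 3))   # None (page evicted -> LOCK_DATA is NULL)
```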

As you may expect, running the DELETE brought the page back into memory, so you can now see the data without problems:

> SELECT ENGINE_TRANSACTION_ID,INDEX_NAME,LOCK_DATA,LOCK_MODE,LOCK_STATUS
  FROM performance_schema.data_locks
  WHERE OBJECT_NAME='t' AND LOCK_TYPE='RECORD';
+-----------------------+------------+------------------------+---------------+-------------+
| ENGINE_TRANSACTION_ID | INDEX_NAME | LOCK_DATA              | LOCK_MODE     | LOCK_STATUS |
+-----------------------+------------+------------------------+---------------+-------------+
|                  2775 | PRIMARY    | 2                      | X,REC_NOT_GAP | WAITING     |
|       284486501679344 | PRIMARY    | supremum pseudo-record | S             | GRANTED     |
|       284486501679344 | PRIMARY    | 1                      | S             | GRANTED     |
|       284486501679344 | PRIMARY    | 2                      | S             | GRANTED     |
|       284486501679344 | PRIMARY    | 5                      | S             | GRANTED     |
|       284486501679344 | PRIMARY    | 4                      | S             | GRANTED     |
+-----------------------+------------+------------------------+---------------+-------------+

Lock splitting

As hinted earlier, the mental model of “axis” with “points” and “gaps between points” could (in theory) be modelled in two different ways in the Lock System:

  • Option A: Two separate kinds of resources. So, a gap (15,33) is one resource, and the point [33] is a different resource. Each can be independently requested and granted using a simple set of access modes (say, only X and S)
  • Option B: A single resource for a combination of the record and preceding gap, and a wider set of access modes which encode what you want to do with gap and record (X,  X,REC_NOT_GAP, X,GAP, S, S,REC_NOT_GAP, S,GAP,…)

InnoDB (currently) uses Option B. The main benefit I see is that in the most common case (when a transaction needs to lock both the gap and the record during a scan) it requires just a single object in memory instead of two, which not only saves space, but also requires fewer memory lookups and often lets us use a fast path for a single object in a list.

However, this design decision is not set in stone, as conceptually it holds that X = X,GAP + X,REC_NOT_GAP and S = S,GAP + S,REC_NOT_GAP and InnoDB 8.0.18 got the ability to exploit these equations via so called “lock splitting” technique described below.

A common reason a transaction had to wait, or even deadlock, was when it already had a lock on a record but not on the gap before it (say, it had X,REC_NOT_GAP) and had to “upgrade” it to also cover the gap before the record (say, it requested X), alas it had to wait for another transaction (say, this other transaction was waiting for S,REC_NOT_GAP). (The reason why transactions in general can not ignore requests which are themselves still WAITING is to avoid starving the waiters. You can see such scenarios described in detail in deadlock_on_lock_upgrade.test )

The “lock splitting” technique employs the equations given above, and derives from them that needed - possessed = missing, in our case:
X - X,REC_NOT_GAP = X,GAP,
so the transaction request for X is silently transformed to a more modest request: just for X,GAP. In this particular situation this means the request can be immediately granted (recall that *,GAP requests do not ever have to wait for anything) and thus waiting and deadlock are avoided.
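The arithmetic of lock splitting can be sketched by treating each mode as the set of things it covers (a simplification: real InnoDB lock modes carry more state than this):

```python
# "Lock splitting": needed - possessed = missing. Each mode is modelled
# as the set of resources it covers; the request is narrowed to only
# the part the transaction does not hold yet.

MODES = {
    "X":             {"X:REC", "X:GAP"},
    "X,REC_NOT_GAP": {"X:REC"},
    "X,GAP":         {"X:GAP"},
}

def split_request(needed, possessed):
    missing = MODES[needed] - MODES[possessed]
    # map the missing coverage back to a mode name
    for mode, cover in MODES.items():
        if cover == missing:
            return mode
    return needed   # nothing narrower fits: fall back to the full request

# Transaction already holds X,REC_NOT_GAP and asks for X -- the request
# is silently narrowed to X,GAP, which is always immediately grantable:
print(split_request("X", "X,REC_NOT_GAP"))   # X,GAP
```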

Secondary indexes

As mentioned before, each index can be seen as a separate axis, with its own points and gaps, which can be locked, which slightly complicates the picture. You could probably figure out yourself which points and gaps have to be locked for a given query, by following some common sense rules. Basically, you want to make sure that if a transaction modifies something which impacts the result set of another transaction, then locks needed by this reading transaction have to conflict with the locks needed by the transaction doing modification regardless of the query plan. There are several ways one could design the rules to achieve this goal.

For example, consider a simple table:

CREATE TABLE point2D(
  x INT NOT NULL PRIMARY KEY,
  y INT NOT NULL UNIQUE
);
INSERT INTO point2D (x,y) VALUES
  (0,3),      
        (1,2),
                    (3,1),
             (2,0);

Let’s try to figure out what locks should be needed by:

DELETE FROM point2D WHERE x=1;

There are two axes: x and y. It seems reasonable that we should lock at least the point (1) on axis x. What about axis y? Can we avoid locking anything on axis y? Honestly, I believe it depends on the database implementation, but consider how

SELECT COUNT(*) FROM point2D WHERE y=2 FOR SHARE;

would have to operate if locks were stored only on the x axis. Reasonably, this SELECT would start by using the index on column y to find the matching row, but then, to know whether it is locked or not, it would have to learn its x value. This is still a reasonable requirement: indeed InnoDB does store the primary key’s columns (x in our case) in each secondary index entry, so looking up the value of x in the index for y is not a big deal. However, recall that in InnoDB locks are not really tied to the value of x (which could be a rather long string, for example), but rather to the heap_no (a short number which we use as an offset in the bitmap) – you need to know the heap_no to check for the lock’s existence. Thus you would now have to dive into the primary index and load up the page containing the record just to learn its heap_no. Not very fast.

So, an alternative approach is to make sure that no matter which index is used to find the row with x=1, a lock on it will be spotted without having to consult any other index. This could be accomplished by locking the point on the y axis with y=2. This way the SELECT query mentioned above will see that it is locked when it tries to acquire its own locks. What locks should this SELECT take? Again, this could be implemented in several ways: it could lock just the point on axis y with y=2, or it could also dive into the primary index and lock the point on axis x with x=1, too. As I’ve already said, for performance reasons the first approach seems faster, as it avoids a lookup in the primary index.

So, let’s see if our suspicion matches reality. First, let’s examine the locks held by a transaction which does a select through a secondary index. (It may occasionally happen that the optimizer chooses a query plan which scans through the primary index instead of using a secondary index, even for queries where you’d think that’s insane – there is an explore/exploit trade-off in decisions like these. Also, our human intuitions about what is faster might be wrong.)

con1> BEGIN;
con1> SELECT COUNT(*) FROM point2D WHERE y=2 FOR SHARE;
con1> SELECT INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE
      FROM performance_schema.data_locks WHERE OBJECT_NAME='point2D';
+------------+-----------+-----------+---------------+
| INDEX_NAME | LOCK_TYPE | LOCK_DATA | LOCK_MODE     |
+------------+-----------+-----------+---------------+
| NULL       | TABLE     | NULL      | IS            |
| y          | RECORD    | 2, 1      | S,REC_NOT_GAP |
+------------+-----------+-----------+---------------+

Which matches our expectations: we see an intention lock on the whole table (IS), and a lock on specific record but not gap before it (S,REC_NOT_GAP), both “shared”. Note that the LOCK_DATA column describes this record as 2,1, because it lists the columns in the same order they are stored in the secondary index entry for this row: first the indexed columns (y) and then the missing pieces of primary key (x). So 2,1 means <y=2,x=1>.

Let’s ROLLBACK this transaction to return to original state, and now let’s check what locks are taken by the DELETE alone:

con1> COMMIT;
con1> BEGIN;
con1> DELETE FROM point2D WHERE x=1;
con1> SELECT INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE FROM performance_schema.data_locks WHERE OBJECT_NAME='point2D';
+------------+-----------+-----------+---------------+
| INDEX_NAME | LOCK_TYPE | LOCK_DATA | LOCK_MODE     |
+------------+-----------+-----------+---------------+
| NULL       | TABLE     | NULL      | IX            |
| PRIMARY    | RECORD    | 1         | X,REC_NOT_GAP |
+------------+-----------+-----------+---------------+

Huh, this is puzzling: we see the expected intention lock on the whole table (IX), we see the lock on the primary index record itself, both “exclusive”, but we don’t see any lock on secondary index. If DELETE only takes locks on the primary index, and SELECT only takes locks on the secondary index, then how does InnoDB prevent the two from executing concurrently?

Let’s keep this DELETE transaction open, and fire up another client, and see if it will be able to see the deleted row or not:

con2> BEGIN;
con2> SELECT COUNT(*) FROM point2D WHERE y=2 FOR SHARE;
⌛

Hm..the SELECT became blocked (which is good), but how?
Let’s inspect the performance_schema.data_locks to figure out what’s the situation:

con1> SELECT ENGINE_TRANSACTION_ID trx_id,INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE,LOCK_STATUS
      FROM performance_schema.data_locks WHERE OBJECT_NAME='point2D';
+-----------------+------------+-----------+-----------+---------------+-------------+
|          trx_id | INDEX_NAME | LOCK_TYPE | LOCK_DATA | LOCK_MODE     | LOCK_STATUS |
+-----------------+------------+-----------+-----------+---------------+-------------+
| 283410363307272 | NULL       | TABLE     | NULL      | IS            | GRANTED     |
| 283410363307272 | y          | RECORD    | 2, 1      | S             | WAITING     |
|            1560 | NULL       | TABLE     | NULL      | IX            | GRANTED     |
|            1560 | PRIMARY    | RECORD    | 1         | X,REC_NOT_GAP | GRANTED     |
|            1560 | y          | RECORD    | 2, 1      | X,REC_NOT_GAP | GRANTED     |
+-----------------+------------+-----------+-----------+---------------+-------------+

Ha! Our transaction (283410363307272) is WAITING to get an S lock on secondary index record <y=2,x=1> (and the gap before it), and we can see that the probable reason it has to wait is that the transaction doing the DELETE (1560) is locking the same <y=2,x=1> with X,REC_NOT_GAP.
But… we hadn’t seen any such lock just a second before, when we were inspecting the locks held by 1560 – this lock appeared only now. How come? This is even more puzzling given that 1560 is not “actively doing anything” at the moment – how could it acquire the lock?

Recall that performance_schema.data_locks only shows you explicit locks, but not the implicit locks, and that implicit locks can get converted to explicit ones as soon as they are needed to keep track of who has to wait for whom. What this means in practice is that when 283410363307272 asks the Lock System to grant it an S lock on <y=2,x=1>, the Lock System first checks if there is any implicit lock on this record which it could deduce. This is a rather complicated process (you can try to follow it through our source code starting from lock_sec_rec_some_has_impl “lock system: Does someone have an implicit lock on this secondary index record?”):

  • check the value of page_get_max_trx_id(page) – for each page we store the maximum id of all transactions which ever modified this secondary index page. The DELETE operation did “bump” it to its own id (unless it was already larger).
  • We then compare this max_trx_id to some trx_rw_min_trx_id() which tracks the minimum id among transactions which are still active. In other words we try to heuristically decide if it is even possible that some active transaction has an implicit lock on this secondary index, and we make some trade-offs here:
    • On a secondary index, we don’t track max_trx_id per record, we track it per whole page, so less storage is used, but we can spuriously think that it is plausible that our record was modified even though in reality the change was applied to some other record on the same page
    • we don’t check if this trx id belongs to the set of active transactions very carefully, just compare it to the minimum id among them (frankly, we have to do it this way to stay correct in the light of the previous simplification: we don’t know the actual id of the transaction modifying the row, just some upper bound for it)
  • if the heuristic says it’s impossible that anyone has an implicit lock on this record, because no active transaction has an id less than or equal to the maximum of the ids of transactions which modified records on this page, we can stop here. This means we didn’t have to consult the primary index, hurray!
  • otherwise, things get messy. We descend into row_vers_impl_x_locked, which will:
    • locate the record in primary index (which in some rare cases might already be missing due to a race with the purge thread)
    • retrieve trx_id of the last transaction to modify this particular row (note how this is a more precise analogue of the first heuristic above) and
    • check if trx_id is still active (note how this is a more precise analogue of the second heuristic above)
    • if the transaction is still active, it can still be the case that it does not have an implicit lock *on the secondary index*. You see, it could for example have modified some non-indexed column, in which case the secondary index entry is conceptually not affected, so no implicit lock is required. To check for that, we have to tediously retrieve previous version(s) of the row and precisely check whether any of the indexed columns was affected in a way which conceptually means a lock was needed. This is mega-complicated. I’m not going to explain it here, but if you are curious, you can read my comment at row_clust_vers_matches_sec and row_vers_impl_x_locked_low
  • finally, if an implicit lock is deemed necessary, then it gets converted to an explicit lock (it’s always of X,REC_NOT_GAP type) on behalf of its rightful owner (the trx_id from the primary index record header)
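The cheap first step of this procedure can be sketched as a single comparison (the trx-id scheme is heavily simplified here for illustration):

```python
# Sketch of the fast path of the implicit-lock check on a secondary
# index page (the lock_sec_rec_some_has_impl idea): compare the page's
# max_trx_id with the minimum id among all still-active transactions.

def may_have_implicit_lock(page_max_trx_id, min_active_trx_id):
    """False means: no active transaction can possibly hold an implicit
    lock on any record of this page, so the expensive dive into the
    primary index and undo log can be skipped entirely."""
    return page_max_trx_id >= min_active_trx_id

# Every transaction that ever touched the page committed long ago:
print(may_have_implicit_lock(page_max_trx_id=1500, min_active_trx_id=2000))
# -> False: cheap exit, no primary index lookup needed

# Some active transaction *might* have modified a record on the page:
print(may_have_implicit_lock(page_max_trx_id=2100, min_active_trx_id=2000))
# -> True: fall through to the expensive row_vers_impl_x_locked path
```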

The main takeaway here is that in the worst case you need to retrieve not only the primary index record, but also its previous versions from the undo log, just to determine whether there is an implicit lock. In the best-case scenario, you just look at the secondary index page and say “nah”.

OK, so it looks like the thread doing DELETE is a bit lazy, and the thread doing SELECT has to do some extra work to make explicit what DELETE left implicit.

But this should make you curious: what if the SELECT was performed first, and then we started the DELETE – given that the SELECT only locks the secondary index, and the DELETE seems not to acquire any secondary index locks, how could it possibly get blocked by an uncommitted SELECT? Do we perform some implicit-to-explicit conversion in this scenario, too? Seems implausible, given that a SELECT does not modify any rows, thus does not put its trx_id in row or page headers, so there is no trace of it from which to infer an implicit lock.

Perhaps we’ve found a bug? Let’s ROLLBACK

con1> ROLLBACK;
con2> ROLLBACK;

and check this new scenario:

con2> BEGIN;
con2> SELECT COUNT(*) FROM point2D WHERE y=2 FOR SHARE;
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+

and now in another client the DELETE:

con1> BEGIN;
con1> DELETE FROM point2D WHERE x=1;
⌛

Seems there’s no bug, as the DELETE is WAITING. Let’s see the explicit locks:

> SELECT ENGINE_TRANSACTION_ID trx_id,INDEX_NAME,LOCK_TYPE,LOCK_DATA,LOCK_MODE,LOCK_STATUS
  FROM performance_schema.data_locks WHERE OBJECT_NAME='point2D';
+-----------------+------------+-----------+-----------+---------------+-------------+
| trx_id          | INDEX_NAME | LOCK_TYPE | LOCK_DATA | LOCK_MODE     | LOCK_STATUS |
+-----------------+------------+-----------+-----------+---------------+-------------+
|            2077 | NULL       | TABLE     | NULL      | IX            | GRANTED     |
|            2077 | PRIMARY    | RECORD    | 1         | X,REC_NOT_GAP | GRANTED     |
|            2077 | y          | RECORD    | 2, 1      | X,REC_NOT_GAP | WAITING     |
| 283410363307272 | NULL       | TABLE     | NULL      | IS            | GRANTED     |
| 283410363307272 | y          | RECORD    | 2, 1      | S,REC_NOT_GAP | GRANTED     |
+-----------------+------------+-----------+-----------+---------------+-------------+

(Technical note for super-perceptive readers: 283410363307272 is not only a suspiciously long number, it’s also exactly the same id we saw in one of the previous examples. The explanation of both mysteries is simple: for read-only transactions InnoDB doesn’t waste time assigning a real monotonic transaction id, instead deriving one ad hoc from the trx’s memory address.)

Cool, we’ve got a somewhat symmetric result to the previous one, but this time it is the SELECT that has the GRANTED lock and the DELETE has a WAITING one. (Another difference is that this time the SELECT has S,REC_NOT_GAP instead of S, and frankly, I do not recall why we also needed to lock the gap in the previous case; fortunately, this doesn’t matter for our story.)

OK, so how come that the transaction doing DELETE now has an explicit WAITING lock, even though we saw that doing DELETE alone did not create such a lock?

The answer is: DELETE does try to take a lock on the secondary index (by calling lock_sec_rec_modify_check_and_lock), but this involves a tricky optimization: when the Lock System determines this lock could be GRANTED because there are no conflicting locks already, instead of creating an explicit lock, it refrains from doing so, because it is informed by the caller that an implicit lock can be deduced if needed. (Why? Probably to avoid having to allocate a lock_t object: consider a single DELETE operation affecting many rows forming a contiguous range on the primary key – the secondary index entries corresponding to them could be all over the place, so they could not benefit from the compression mechanism. Also, as long as there is any place in InnoDB which uses implicit locks, you have to check for them anyway, and if you have to check for implicit locks anyway, then you might as well use them wherever applicable, as the cost of checking for them is already being paid.)

In our case Lock System determines there is a conflict and thus creates an explicit WAITING lock to track it.
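The decision described above can be sketched roughly like this (simplified Python with a naive conflict test; the real lock_sec_rec_modify_check_and_lock is far more involved):

```python
# Sketch of the choice made when a modifying statement touches a
# secondary index record: if nothing conflicts, rely on the free
# implicit lock (the record's "last modified by" trx id); otherwise
# create an explicit WAITING lock to track the conflict.

def modify_check_and_lock(existing_locks, my_trx_id):
    conflict = any(lock["mode"].startswith(("S", "X")) and
                   lock["trx_id"] != my_trx_id
                   for lock in existing_locks)
    if not conflict:
        # grantable: no lock_t is allocated; the implicit lock suffices
        return {"explicit": False, "status": "GRANTED"}
    return {"explicit": True, "status": "WAITING", "mode": "X,REC_NOT_GAP"}

# DELETE runs first, no one else holds a lock on <y=2,x=1>:
print(modify_check_and_lock([], my_trx_id=2077))
# -> implicit lock only, nothing stored in the Lock System

# SELECT ... FOR SHARE ran first and holds S,REC_NOT_GAP:
print(modify_check_and_lock(
    [{"trx_id": 283410363307272, "mode": "S,REC_NOT_GAP"}], my_trx_id=2077))
# -> an explicit WAITING X,REC_NOT_GAP lock is created
```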

OK, so to summarize: which solution does the current version of InnoDB use to prevent conflicts between a DELETE and a SELECT on a secondary index? Is it

  • DELETE locking both indexes and SELECT just one? Or
  • DELETE locking just the primary and SELECT checks both?

I’d say it’s complicated, but it’s more like the first approach, with the caveat that the locks taken by DELETE on secondary indexes are implicit whenever possible.

OK, now we are more than ready to talk about deadlock detection, which is our next topic.

Thank you for using MySQL!

