When and How to Invalidate Cache

source link: https://blog.the-pans.com/when-and-how-to-invalidate-cache/
In this post, I will talk about one way to figure out when to invalidate cache entries. I will use the following setup as an example, which should be general enough, and at the end of the post, I will briefly talk about how to generalize it even more.

Example

Say a system has the following components (a very typical web application setup):

  • Many stateless clients (web servers)
  • One cache server (say Memcache)
  • One database (say Postgres)

In the database we store a table (a set of relations) called friends, whose schema looks something like this:

+---------------------+------------------+------+-----+---------+-------+
| Field               | Type             | Null | Key | Default | Extra |
+---------------------+------------------+------+-----+---------+-------+
| user_id             | bigint unsigned  | NO   | PRI | 0       |       |
| friend_id           | bigint unsigned  | NO   | PRI | 0       |       |
+---------------------+------------------+------+-----+---------+-------+

Now in cache we want to store a set of relations of friends-of-friends because in our application we have code like getFriendsofFriends($bob). It's very common/natural to cache a list of friends-of-friends keyed by user-id in services like Redis or Memcache. Say the key in Memcache is user id (e.g. Bob's user id), and the value is a list of user ids who are friends of Bob's friends. That's it.
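For concreteness, here is a minimal read-through sketch of such a cache. This is only an illustration of the setup above, assuming psycopg2 and pymemcache as the client libraries and a fof:<user_id> key format of my own choosing:

import json

import psycopg2
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = psycopg2.connect("dbname=app")


def get_friends_of_friends(user_id):
    """Read-through cache: try Memcache first, fall back to Postgres."""
    key = f"fof:{user_id}"  # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    with db.cursor() as cur:
        # friends-of-friends: the friends of each of user_id's friends
        cur.execute(
            """
            SELECT DISTINCT f2.friend_id
            FROM friends f1
            JOIN friends f2 ON f2.user_id = f1.friend_id
            WHERE f1.user_id = %s AND f2.friend_id != %s
            """,
            (user_id, user_id),
        )
        fof = [row[0] for row in cur.fetchall()]

    cache.set(key, json.dumps(fof))
    return fof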

Definition

  • I will define "cache" in this post as any materialized view of what's in the source of truth, and assume this definition unless otherwise specified.
  • By "invalidate" I mean the action of updating or dropping the relevant cache entries when the source of truth mutates, so that no stale data stays in cache indefinitely.

Requirements

When you are using a cache, you are by definition dealing with a distributed system – there is the cache, and there is the source of truth. That means we need to be clear about how state and knowledge are distributed between the two. Too many systems approach caching in an ad-hoc way, without realizing they are dealing with a distributed system, which is what gives cache invalidation its reputation for being uniquely hard. If a client runs a query against Postgres and stores the result in its own local cache without the database knowing, it obviously can't always invalidate that cache correctly: another client can mutate Postgres underneath it, and Postgres has no idea the first client ever cached anything from its earlier query.

In order to deal with a distributed system correctly, the cache and the database need to be aware of each other. That is to say, the database needs to know that some queries are being cached; and the cache needs to know where the data is coming from (and especially how mutations are ordered – if there is a single keyword in distributed systems, it would be order).

More specifically, the cache data schema must be made known to the entire data system, not just to the cache itself (which is, after all, a single running process somewhere). Without these requirements, reliable cache invalidation can't be done in general (one might still come up with an ad-hoc solution for a specific use case).

Mutations / Write Transactions

Say someone comes along and INSERTs a new friendship between Bob and Alice:

BEGIN;

INSERT INTO friends (user_id, friend_id)
VALUES
  ($bob, $alice), ($alice, $bob);

COMMIT;

Our goal here is to figure out when and how to invalidate the friends-of-friends cache entries reliably. Postgres supports returning data from the modified rows (the RETURNING clause), which is handy here: we want to know the user_ids whose friends-of-friends lists could have changed. The write transaction would now look something like this:

BEGIN;

INSERT INTO friends (user_id, friend_id)
VALUES
  ($bob, $alice), ($alice, $bob) RETURNING user_id;

COMMIT;

In this simple example, it's easy to figure out which user_ids were mutated (you could achieve the same thing with a SELECT after the INSERT), but for more complex queries Postgres's RETURNING feature becomes very handy.

What if we are caching a set of relations that depend on joined results between two tables, and one of them is friends? Then after you get back the list of user_ids, you can run the same join/filter (or other relational algebra operations) on the list of user_ids, which would then tell you which set of cache keys needs to be updated. More details at the end of the post.

When the DB client executes this transaction, it receives the list of user_ids whose friendships have changed. Not only do Alice's and Bob's friends-of-friends lists need to be updated in cache, but so do those of Alice's friends and Bob's friends. So once we have the user_ids, we run a SELECT to get all the users affected by this mutation:

SELECT friend_id
FROM friends
WHERE user_id IN ($user_ids);

Now we have the complete list of user ids to invalidate. But what if the client crashes after receiving the commit success from the DB, and never gets a chance to invalidate the cache? To guard against that, we make sure the order of operations is:

  • client starts a transaction
  • client runs any mutations as needed
  • client collects user_ids whose cache entries need to be invalidated
  • client invalidates cache
  • client commits the transaction

In this case, if the client crashes after DB commit, it would still be fine because we invalidate cache first. This works when there's only one database (single shard, single primary).
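Putting the list together, the write path could look roughly like the sketch below (the same hypothetical psycopg2/pymemcache setup and fof:<user_id> keys as in the earlier sketch; note that the invalidation happens before the commit, exactly as in the ordering above):

def add_friendship(db, cache, user_a, user_b):
    """Insert a friendship and invalidate the affected friends-of-friends
    cache entries before committing, per the ordering above."""
    with db.cursor() as cur:
        # 1. the mutation, RETURNING the users whose friendships changed
        cur.execute(
            """
            INSERT INTO friends (user_id, friend_id)
            VALUES (%s, %s), (%s, %s)
            RETURNING user_id
            """,
            (user_a, user_b, user_b, user_a),
        )
        changed = [row[0] for row in cur.fetchall()]

        # 2. everyone whose friends-of-friends list is affected:
        #    the changed users themselves plus all of their friends
        cur.execute(
            "SELECT friend_id FROM friends WHERE user_id = ANY(%s)",
            (changed,),
        )
        affected = set(changed) | {row[0] for row in cur.fetchall()}

        # 3. invalidate cache entries *before* committing
        for uid in affected:
            cache.delete(f"fof:{uid}")

    # 4. commit last; a crash before this point leaves no stale cache behind
    db.commit()

Deleting (rather than updating) the entries keeps the invalidation path simple: the next getFriendsofFriends call repopulates them through the read path.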

What's nice is that the steps of collecting user_ids and invalidating cache can be encapsulated in your DB client, so you don't have to worry about someone forgetting to invalidate the cache. Notice that the DB client in this case is aware of the cache schema (that's how it knows which user_ids to collect, etc.). This is where the earlier requirement kicks in:

cache data schema must be made known to the entire data system.

In practice, whoever adds a new cache schema should also update the DB client to handle its invalidation.

Generalization

Now let's generalize the previous setup – single cache, single database.

What if the cache schema is complicated?

Say we have a cache that stores friends-of-friends who live in the US. Besides the friends table, we now have a home table that stores where each user lives. Now updates to either friends or home can affect our cache. Remember, one key requirement we listed earlier is that cache schemas are made known to the entire data system. What we can do, for example, is the following (sketched right after the list):

  • get the list of user_ids affected by both friends mutations and home mutations
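As a sketch of the home half (a friends mutation is handled exactly as before): when a user moves, any user within two friendship hops may see that user appear in, or drop out of, their cached list, so those are the keys to invalidate. The query below only needs the friendship graph; the function name and parameters are illustrative assumptions:

def affected_by_home_change(db, moved_user_ids):
    """Users whose 'friends-of-friends living in the US' entry may change when
    any of moved_user_ids changes their home country: everyone who has a mover
    among their friends-of-friends, i.e. the movers' friends-of-friends
    (friendship rows are stored in both directions)."""
    with db.cursor() as cur:
        cur.execute(
            """
            SELECT DISTINCT f2.friend_id
            FROM friends f1
            JOIN friends f2 ON f2.user_id = f1.friend_id
            WHERE f1.user_id = ANY(%s)
            """,
            (list(moved_user_ids),),
        )
        return {row[0] for row in cur.fetchall()}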

What if I have a lot of cache replicas, and a lot of cache schemas and data?

The aforementioned solution handles cache invalidation synchronously, so the write amplification (to cache) happens synchronously too. If you have a lot of cache replicas (not uncommon, since caches are usually introduced to scale read traffic), that synchronous write amplification is pretty terrible: it slows the write transaction down, and it tends to be error prone (higher error rate). One tradeoff you can make is to do cache invalidation asynchronously – record the user_ids to e.g. a log, and process the invalidations from that log asynchronously. The price is that users can sometimes observe stale data from cache.
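One way to record that log is a plain table written in the same transaction as the mutation and drained by a background worker. The cache_invalidations(id bigserial primary key, user_id bigint) table and the worker below are illustrative assumptions layered on top of the setup, not something from the post:

import time


def record_invalidations(cur, user_ids):
    """Called inside the write transaction: enqueue the affected user_ids
    instead of deleting cache keys synchronously. The queue rows commit
    atomically with the mutation, so invalidations are delayed but never lost."""
    cur.execute(
        "INSERT INTO cache_invalidations (user_id) SELECT unnest(%s::bigint[])",
        (list(user_ids),),
    )


def invalidation_worker(db, cache):
    """Background loop: drain the queue in batches and delete the keys.
    Processing is at-least-once, which is fine; deleting a key twice is harmless."""
    while True:
        with db.cursor() as cur:
            cur.execute(
                """
                DELETE FROM cache_invalidations
                WHERE id IN (SELECT id FROM cache_invalidations
                             ORDER BY id LIMIT 100)
                RETURNING user_id
                """
            )
            for (uid,) in cur.fetchall():
                cache.delete(f"fof:{uid}")
        db.commit()
        time.sleep(0.1)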

What if I have a read-only db replica?

Here be dragons. After a cache entry is invalidated (following a DB mutation), if you have a DB replica (which is always somewhat behind the primary), there's always a chance that the cache will refill from a stale replica. A concrete example looks like this:

  • DB has data A
  • client updates A -> B and invalidates the cache
  • DB replica still has data A
  • cache fills from DB replica and puts A back in
  • DB replica catches up and updates A -> B
  • now the cache will have A (stale data) indefinitely

I will not go into details, but there are a few ways to solve this, e.g. (a toy sketch follows the list):

  • send cache invalidation with version data
  • keep track of version data in cache, so it never goes back in time (e.g. after processing B's invalidate, it should never go back to A again)
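A toy sketch of the second idea, assuming every fill and every invalidation carries a version taken from the database (e.g. a commit timestamp or LSN). The class is a single-process illustration of the comparison only; a real cache would need the compare-and-write to be atomic (e.g. memcached CAS):

class VersionedCache:
    """Each entry carries the version of the data it was built from; fills and
    invalidations that would move an entry backwards in time are dropped."""

    def __init__(self):
        self._entries = {}  # key -> (version, value or None)

    def put(self, key, version, value):
        current = self._entries.get(key)
        if current is not None and current[0] > version:
            return False  # stale fill, e.g. from a lagging replica: drop it
        self._entries[key] = (version, value)
        return True

    def invalidate(self, key, version):
        current = self._entries.get(key)
        if current is not None and current[0] >= version:
            return False  # already reflects this write or a newer one
        # keep the invalidation's version as a tombstone, so an older fill
        # arriving later cannot resurrect stale data
        self._entries[key] = (version, None)
        return True

In the replica scenario above, the invalidation for B installs a tombstone at B's version, so the later fill of A from the lagging replica (which carries an older version) is rejected.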

What if I have multiple DB shards, and I have a cache that materializes data from multiple DB shards?

The solution mostly remains the same; the only problem you need to solve is, again, ordering (the single most important keyword in distributed systems). Depending on your system, you might be able to use vector clocks, or if you have Spanner (lucky you), you can just use TrueTime to resolve the ordering issues so that cache state never goes back in time.
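For the vector-clock option, here is a minimal sketch of the comparison, assuming each cache entry remembers the per-shard versions it was built from and each invalidation carries the per-shard versions of the write that produced it (the names are mine):

def dominates(a, b):
    """True if vector clock `a` is causally >= `b`:
    a[shard] >= b[shard] for every shard appearing in either clock."""
    return all(a.get(s, 0) >= b.get(s, 0) for s in set(a) | set(b))


def should_apply_invalidation(entry_clock, invalidation_clock):
    """Apply a cross-shard invalidation unless the entry was already built from
    data at least as new on every shard; concurrent clocks are treated
    conservatively (the invalidation is applied)."""
    return not dominates(entry_clock, invalidation_clock)


# e.g. shard2 has moved past what the entry was built from, so invalidate:
should_apply_invalidation({"shard1": 5, "shard2": 2}, {"shard1": 4, "shard2": 3})  # True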

What if I don't store data in Postgres?

When designing a cache, you need to be aware of the features that your database supports. This is where the earlier requirement kicks in:

In order to deal with a distributed system correctly, the cache and the database need to be aware of each other.

When you are dealing with caches, you are dealing with a distributed system. Treat it as a distributed system. If your database doesn't support returning modified data synchronously, but does expose a binlog (e.g. MySQL), what you can do is (sketched after the list):

  • tail the binlog on mutations (where you can collect mutated user_ids)
  • send invalidation from the tailer
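A sketch of such a tailer; both binlog_events (a stand-in iterator of (table, row) pairs from whatever binlog/CDC client your database offers) and friends_of (a callback returning a user's friend ids) are hypothetical:

def tail_and_invalidate(binlog_events, cache, friends_of):
    """For each change to a friends row, invalidate the two users involved and
    all of their friends, exactly as in the synchronous scheme."""
    for table, row in binlog_events:
        if table != "friends":
            continue
        changed = {row["user_id"], row["friend_id"]}
        affected = set(changed)
        for uid in changed:
            affected |= set(friends_of(uid))
        for uid in affected:
            cache.delete(f"fof:{uid}")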

How about write amplifications?

Now, in order to keep caches updated, you might be paying a huge cost in write amplification (async or sync). This is where we need to make tradeoffs, depending on the workload (reads and writes). Sometimes using a TTL (and hence getting rid of invalidation completely) can help save capacity.
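With pymemcache, for example, that is just an expiry on the fill; the TTL value, key format, and sample data below are arbitrary illustrations:

import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

user_id, fof = 42, [7, 8, 9]  # sample data
FOF_TTL_SECONDS = 300  # tune to how much staleness the product can tolerate

# Trade freshness for capacity: let the entry expire instead of being invalidated.
cache.set(f"fof:{user_id}", json.dumps(fof), expire=FOF_TTL_SECONDS)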

