4

The race condition that led to bankruptcy

 1 year ago
source link: https://vladmihalcea.com/race-condition/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Last modified: Jun 13, 2022

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.

So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!

Introduction

It’s hard to imagine that a race condition bug could lead to the bankruptcy of a given online service, isn’t it?

In this article, I’m going to show you how a race condition led to the bankruptcy of Flexcoin in 2014.

What was Flexcoin

Flexcoin was a bitcoin digital waller that allowed users to easily receive or send funds.

According to the Wayback Machine, this is what Flexcoin was supposed to do:

Using Flexcoin, you can even send bitcoins to an e-mail address. It’s designed for anyone to use bitcoins without technical knowledge.

Flexcoin is an important part of the bitcoin infrastructure. Our technology allows for instant transfers of bitcoins to a username as compared to the next block wait to a huge bitcoin address.

And so it did!

Flexcoin got hacked

On the 2nd of March 2014, Flexcoin got hacked, and the attacker stole 896 bitcoins. The Wayback Machine recorded the following announcement:

On March 2nd 2014 Flexcoin was attacked and robbed of all coins in the hot wallet. The attacker made off with 896 BTC, dividing them into these two addresses: …

As Flexcoin does not have the resources, assets, or otherwise to come back from this loss, we are closing our doors immediately.

After some investigation, the owner published a new message describing how the theft was orchestrated:

The attacker logged into the flexcoin front end … under a newly created username and deposited to address …

The attacker then successfully exploited a flaw in the code that allows transfers between flexcoin users. By sending thousands of simultaneous requests, the attacker was able to “move” coins from one user account to another until the sending account was overdrawn, before balances were updated.

After the attacker stole all the available BTC, the company had no choice but to shut down the service. This story shows how terrible things can go when the concurrency control strategy has flaws.

Replicating the theft

From the official announcements, we can conclude that the theft was caused by a race condition, which is a situation when a shared data registry is modified by multiple concurrent threads without a rigorous synchronization mechanism.

So, let’s try to emulate this problem with the following transfer method:

void transfer(String fromIban, String toIban, long transferCents) {
long fromBalance = getBalance(fromIban);
if(fromBalance >= transferCents) {
addBalance(fromIban, (-1) * transferCents);
addBalance(toIban, transferCents);
}
}

The getBalance is implemented as follows:

long getBalance(final String iban) {
return doInJDBC(connection -> {
try(PreparedStatement statement = connection.prepareStatement("""
SELECT balance
FROM account
WHERE iban = ?
""")
) {
statement.setString(1, iban);
ResultSet resultSet = statement.executeQuery();
if(resultSet.next()) {
return resultSet.getLong(1);
}
}
throw new IllegalArgumentException(
"Can't find account with IBAN: " + iban
);
});
}

And, the addBalance like this:

void addBalance(final String iban, long balance) {
doInJDBC(connection -> {
try(PreparedStatement statement = connection.prepareStatement("""
UPDATE account
SET balance = balance + ?
WHERE iban = ?
""")
) {
statement.setLong(1, balance);
statement.setString(2, iban);
statement.executeUpdate();
}
});
}

And we have two users, Alice and Bob:

| iban      | balance | owner |
|-----------|---------|-------|
| Alice-123 | 10      | Alice |
| Bob-456   | 0       | Bob   |

To validate the transfer method, rewrite the following integration test:

assertEquals(10L, getBalance("Alice-123"));
assertEquals(0L, getBalance("Bob-456"));
transfer("Alice-123", "Bob-456", 5L);
assertEquals(5L, getBalance("Alice-123"));
assertEquals(5L, getBalance("Bob-456"));
transfer("Alice-123", "Bob-456", 5L);
assertEquals(0L, getBalance("Alice-123"));
assertEquals(10L, getBalance("Bob-456"));
transfer("Alice-123", "Bob-456", 5L);
assertEquals(0L, getBalance("Alice-123"));
assertEquals(10L, getBalance("Bob-456"));

And, when running it, we can see that it works just fine:

  • First, Alice sends 5 cents to Bob, so she has 5 left, and Bob has now a balance of 5.
  • Alice does the second transfer of 5 cents, so she doesn’t have any cents left while Bob has now 10.
  • The third transfer from Alice doesn’t do anything since Alice has no money left, leaving the state unaffected.

However, this integration test runs in the context of the same Java thread in a serial execution fashion while the Flexcoin theft was done using simultaneous concurrent requests.

So, let’s see how the transfer works when running it using multiple concurrent threads:

assertEquals(10L, getBalance("Alice-123"));
assertEquals(0L, getBalance("Bob-456"));
int threadCount = 8;
CountDownLatch startLatch = new CountDownLatch(1);
CountDownLatch endLatch = new CountDownLatch(threadCount);
for (int i = 0; i < threadCount; i++) {
new Thread(() -> {
awaitOnLatch(startLatch);
transfer("Alice-123", "Bob-456", 5L);
endLatch.countDown();
}).start();
}
LOGGER.info("Starting threads");
startLatch.countDown();
LOGGER.info("Main thread waits for all transfer threads to finish");
awaitOnLatch(endLatch);
LOGGER.info("Alice's balance: {}", getBalance("Alice-123"));
LOGGER.info("Bob's balance: {}", getBalance("Bob-456"));

We will start a number of Java Threads that execute the transfer at the same time.

We are using two CountDownLatch objects to coordinate the executions of the main and transfer threads:

  • the startLatch is used so that all transfer threads start at once
  • the endLatch is used so that the main thread can wait for all transfer threads to finish

After all transfer threads finished running, we are going to log Alice’s and Bob’s account balances, and this is what we will get:

Alice's balance: -30
Bob's balance: 40

That’s not good!

If we check the SQL statement log, we can see exactly what happened:

/*
All transfer threads read Alice’s balance
*/
[Thread-5]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-3]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-2]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-7]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-0]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-4]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-1]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-6]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
/*
Since Alice’s balance is 10 and the transfer amount is 5
all transfer threads decide to do the transfer
First, the Alice’s account is debited
*/
[Thread-5]
/* Alice balance: 5 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-3]
/* Alice balance: 0 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-2]
/* Alice balance: -5 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-7]
/* Alice balance: -10 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-0]
/* Alice balance: -15 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-4]
/* Alice balance: -20 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-1]
/* Alice balance: -25 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-6]
/* Alice balance: -30 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
/*
Second, the Bob’s account is credited
*/
[Thread-5]
/* Bob balance: 5 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-3]
/* Bob balance: 10 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-2]
/* Bob balance: 15 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-7]
/* Bob balance: 20 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-0]
/* Bob balance: 25 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-4]
/* Bob balance: 30 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-1]
/* Bob balance: 35 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-6]
/* Bob balance: 40 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'

It makes sense now!

Here’s the flow of statements:

  • The SELECT statements are issued by the transfer Threads right after they are started.
  • All the transfer Threads will see that Alice has enough money to issue the transfer, and the if branch will evaluate to true.
  • The transfer is started by all Threads.
  • Alice’s account is going to be debited by all Threads (e.g., 10 – (8 Threads x 5 cents+ = 10 – 40 = -30).
  • Bob’s account is going to be debited by all Threads (e.g., 0 + (8 Threads * 5 cents) = 0 + 40 = 40).

While relational database systems offer ACID guarantees, those only work if reads and writes are executed in the context of the same database transaction.

In our case, there were three transactions per transfer:

  • one that was selecting Alice’s account balance
  • one that was debiting Alice’s account
  • and another one that was crediting Bob’s account

The reason why we got three transactions per transfer instead of just one is that the doInJDBC methods are running the provided callbacks in a newly acquired database connection:

void doInJDBC(ConnectionVoidCallable callable) {
try {
Connection connection = null;
try {
connection = dataSource().getConnection();
connection.setAutoCommit(false);
callable.execute(connection);
connection.commit();
} catch (SQLException e) {
if(connection != null) {
connection.rollback();
}
throw e;
} finally {
if(connection !=  null) {
connection.close();
}
}
} catch (SQLException e) {
throw new IllegalStateException(e);
}
}
<T> T doInJDBC(ConnectionCallable<T> callable) {
try {
Connection connection = null;
try {
connection = dataSource().getConnection();
connection.setAutoCommit(false);
T result = callable.execute(connection);
connection.commit();
return result;
} catch (SQLException e) {
if(connection != null) {
connection.rollback();
}
throw e;
} finally {
if(connection !=  null) {
connection.close();
}
}
} catch (SQLException e) {
throw new IllegalStateException(e);
}
}

The transfer should execute in a single database transaction so that the read and write operations are wrapped in an atomic unit of work.

Replicating the theft using the default transaction guarantees

So, lets’ change the code so that the transfer is done in the context of the same database transaction:

assertEquals(10L, getBalance("Alice-123"));
assertEquals(0L, getBalance("Bob-456"));
int threadCount = 8;
String fromIban = "Alice-123";
String toIban = "Bob-456";
long transferCents = 5L;
CountDownLatch startLatch = new CountDownLatch(1);
CountDownLatch endLatch = new CountDownLatch(threadCount);
for (int i = 0; i < threadCount; i++) {
new Thread(() -> {
try {
doInJDBC(connection -> {
setIsolationLevel(connection);
awaitOnLatch(startLatch);
long fromBalance = getBalance(connection, fromIban);
if(fromBalance >= transferCents) {
addBalance(connection, fromIban, (-1) * transferCents);
addBalance(connection, toIban, transferCents);
}
});
} catch (Exception e) {
LOGGER.error("Transfer failure", e);
}
endLatch.countDown();
}).start();
}
LOGGER.info("Starting threads");
startLatch.countDown();
awaitOnLatch(endLatch);
LOGGER.info("Alice's balance: {}", getBalance("Alice-123"));
LOGGER.info("Bob's balance: {}", getBalance("Bob-456"));

This time, the transfer is done in the context of a single database transaction.

However, when running this new transfer logic using multiple concurrent threads, we will get the following output:

Alice's balance: -15
Bob's balance: 25

So, the problem persists. And, it doesn’t really matter what database we are using. It could be Oracle, SQL Server, PostgreSQL, or MySQL. By default, this problem will happen unless we are doing something explicitly to prevent it.

If you look at the application log, you will see that Alice’s account is debited even after it goes into a negative balance:

[Thread-1]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-1]
/* Alice balance: 5 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-1]
/* Bob balance: 5 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-2]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-3]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-2]
/* Alice balance: 0 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-4]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-7]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-2]
/* Bob balance: 10 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-3]
/* Alice balance: -5 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-3]
/* Bob balance: 15 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-8]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-4]
/* Alice balance: -10 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-5]
/* Alice balance: -15 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-6]
SELECT balance
FROM account
WHERE iban = 'Alice-123'
[Thread-4]
/* Bob balance: 20 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'
[Thread-7]
/* Alice balance: -20 */
UPDATE account
SET balance = balance + (-5)
WHERE iban = 'Alice-123'
[Thread-7]
/* Bob balance: 25 */
UPDATE account
SET balance = balance + 5
WHERE iban = 'Bob-456'

Even if the transfer is done in the context of a database transaction, that doesn’t mean the database has to run it in a serializable fashion unless you explicitly tell the database to do so.

The default isolation level of the top most relational database systems is either Read Committed (e.g., Oracle, SQL Server, PostgreSQL) or Repeatable Read (e.g., MySQL), and this anomaly we’re facing here is not prevented by any of them.

The Lost Update anomaly

The anomaly that causes this race condition we’re facing here is called Lost Update, and it looks as follows:

Lost Update Anomaly

Both users manage to read the account balance of 5, but the second UPDATE will assume it changes the balance from 5 to 0 while, in reality, it changes it from 0 to -5 since the first UPDATE managed to execute first.

The reason why this flow is not Serializable is that the transaction schedule
interleaves read and write operations that belong to different transactions.

Since the SQL standard does not mention the Lost Update anomaly, my High-Performance Java resistance book covers this topic, and so this is how the Lost Update anomaly is prevented by various isolation levels depending on the underlying relational database system:

| Isolation Level | Oracle | SQL Server | PostgreSQL | MySQL |
|-----------------|--------|------------|------------|-------|
| Read Committed  | Yes    | Yes        | Yes        | Yes   |
| Repeatable Read | N/A    | No         | No         | Yes   |
| Serializable    | No     | No         | No         | No    |

So, if we’re using PostgreSQL and change the isolation level to Repeatable Read, then we can see the problem gets fixed, and Bob never gets more than the initial Alice’s account balance:

Alice's balance: 0
Bob's balance: 10

Behind the scenes, the PostgreSQL transaction engine manages to prevent the issue by aborting the transactions that would otherwise cause the Lost Update anomaly:

[Thread-3]
Transfer failure - org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update

While this is one way of preventing the Lost Update anomaly, there are many other solutions as well:

  • we could use optimistic locking, as explained in this article
  • we could use a pessimistic locking approach by locking Alice’s account record using a FOR UPDATE directive, as explained in this article

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

A race condition can have terrible effects on data integrity, and you should design your system to prevent such anomalies.

The first step is to write integration tests that can assert data integrity even when modifications are done by multiple concurrent requests.

The second step is to employ a concurrency control mechanism that can ensure that the logical units of work run atomically and get serialized as well.

Transactions and Concurrency Control eBook

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK