Understanding the Stellar Consensus Protocol

The Stellar Consensus Protocol (SCP) was first described in a whitepaper by David Mazières in 2015. It is a “federated Byzantine agreement system” that allows decentralized, leaderless computing networks to reach consensus on a decision efficiently. The Stellar payment network uses SCP to provide a consistent view of the network’s transaction history to all participants.

Consensus protocols have a reputation for being difficult to understand. SCP is simpler than most but still shares that reputation — due in part to the mistaken idea that “federated voting,” which the whitepaper spends its first half describing, is SCP. But it’s not! Instead, it’s an essential building block used by the second half of the whitepaper to construct the actual Stellar Consensus Protocol.

In this article, we’ll give some brief background about what an “agreement system” even is, what can make one “Byzantine,” and why you’d want to make a Byzantine one “federated.” We’ll then explain the federated voting procedure described by the SCP whitepaper, and finally explain SCP itself.

Agreement systems

An agreement system allows a group of participants to reach the same decision about something — for example, what to order for lunch.

At the offices of Interstellar, we have implemented our own lunch-agreement system: we order whatever our operations manager, John, says. It’s a simple and effective agreement system. We all trust John to order something interesting and nutritious each day.

But what if John were to abuse that trust? He could unilaterally decide we must all become vegans. After a week or two of that we’d probably depose him and give his authority to Elizabeth, but maybe she’s on an avocado-and-anchovy-sandwich kick and thinks we all should be too. Power corrupts, we might decide, and so we would seek some more democratic method: some way to make sure different preferences are heard while still reaching a timely, unambiguous outcome, so that we don’t end up with no one ordering lunch, or five of us placing competing lunch orders, or no decision about what to order until 4pm.

It might seem that the solution is simple: just conduct a vote! But this is deceptive. Who gets to collect ballots and report the results? And why should the rest of us trust what they say? Perhaps we could first vote on a leader whom we all trust to run the vote — but then who gets to run that vote? What if we can’t agree on a single leader? Or, what if we can agree, but then that leader gets stuck in a meeting, or goes home sick?

Similar problems are common in distributed computing networks. All the participants, or nodes, must agree on some decision, such as whose turn it is to update a shared file or pull a task from a processing queue. In a cryptocurrency network, nodes must repeatedly decide what the complete history of the shared ledger looks like, choosing from among multiple possible versions that occasionally conflict. This network-wide agreement allows the recipient of a crypto coin to have faith that the coin is both (a) valid (not counterfeit) and (b) not already spent elsewhere. It also assures them that they’ll be able to spend it in the future, because the new recipient will have the same faith in it, for the same reasons.

Any agreement system in a distributed computing network needs to be fault-tolerant: it must produce consistent results despite errors like slow communication links, unresponsive nodes, and misordered messages. A Byzantine agreement system is additionally tolerant of “Byzantine” faults: nodes that give false information, whether due to error or in a deliberate attempt to subvert the system or gain some advantage.¹

Consider the owner, Alice, of a crypto coin, who has to choose between buying a delicious gelato with it from Bob, or paying it to Carol to settle a debt. Alice might like to have it both ways by fraudulently paying the same coin to both Bob and Carol. To do so she must convince Bob’s computer that the coin was never paid to Carol, and she must convince Carol’s computer that the coin was never paid to Bob. A Byzantine agreement system can make this effectively impossible using a form of majority rule called a quorum. A node in such a network refuses to commit to a particular version of history until it sees that enough of its peers (a quorum) are also prepared to commit. Once that happens, they’ve formed a voting bloc large enough to force the remaining nodes in the network to agree with their decision. Alice might be able to cause some nodes to lie on her behalf, but if the network’s large enough, her attempt will be overwhelmed by the votes of honest nodes.

How many nodes does it take to form a quorum? At least a majority and more typically a supermajority to combat errors and fraud. But knowing when you have a majority means knowing how many total participants you have. In the Interstellar office, or in a county election, those numbers are easy to know. But if your collection of participants is a loosely defined network that members can join and leave at will, without needing to coordinate with any central authority, then you need a federated Byzantine agreement system: one that can determine quorums not from some predetermined roster of nodes, but dynamically, from an ever-changing and inevitably incomplete snapshot of membership at a point in time.

It might not seem possible to construct a quorum from only the limited perspective of a single node in a sprawling network, but it is. This quorum can even create confidence in the outcome of a decentralized vote. The SCP whitepaper shows how to do this using a procedure called federated voting.

For the impatient

The rest of this article describes federated voting and the Stellar Consensus Protocol in some detail. To serve as a guide to what follows — or if you don’t care about the detail and just want the tl;dr — here’s an overview of the process.

Nodes conduct rounds of federated voting on “nominees.” A round of federated voting means:
• A node casts a vote for some statement, such as “I nominate value V”;
• The node listens to votes from its peers until it finds one it can “accept”;
• The node seeks a “quorum” that also accepts the statement. This “confirms” the statement.
As soon as a node can confirm one or more nominees, it starts trying to “prepare” a “ballot” via more rounds of federated voting.
As soon as a node can confirm that a ballot is prepared, it starts trying to “commit” the ballot via still more rounds of federated voting.
Once a node can confirm that a ballot is committed, it can “externalize” the value in that ballot, using it as the outcome of consensus.

These steps involve multiple rounds of federated voting that collectively form a single round of SCP. To understand what each step means, why so many are needed, how it all works, and what can go wrong, read on!
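As a rough mental model, the overview above can be read as a small state machine in which each phase ends when federated voting confirms that phase’s statement. The sketch below is only an illustration of that ordering: the names `Phase` and `next_phase` are our own, and the real protocol tracks far more state than a single flag.

```python
from enum import Enum


class Phase(Enum):
    """Phases of one round of SCP, in the order the overview lists them."""
    NOMINATE = 1
    PREPARE = 2
    COMMIT = 3
    EXTERNALIZE = 4


def next_phase(phase, confirmed):
    """Move to the following phase once the current statement is confirmed.

    `confirmed` stands in for "federated voting confirmed this phase's
    statement"; how that confirmation is reached is not modeled here.
    """
    if confirmed and phase is not Phase.EXTERNALIZE:
        return Phase(phase.value + 1)
    return phase


phase = Phase.NOMINATE
for confirmed in (False, True, True, True):
    phase = next_phase(phase, confirmed)
print(phase.name)  # EXTERNALIZE
```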

Federated voting

Federated voting is a procedure for discovering whether a network of participants can agree on a proposal. In a round of federated voting, each node must choose one of potentially many possible values as the outcome of that round. It cannot do so until it is sure that the other nodes in the network won’t choose any different outcome. To be sure of that, the nodes exchange a flurry of messages back and forth allowing each of them to confirm that a quorum of nodes accepts the same vote. The rest of this section explains the terms in that sentence and how such a confirmation can be achieved.

Quorums and quorum slices

Let’s start with identifying a quorum. As we discussed above, in a decentralized network with dynamic membership, it’s impossible to know ahead of time how many nodes there are, and therefore how many make up a majority. Federated voting solves this by introducing the novel idea of a quorum slice: a small collection of peers that a node trusts to convey information about the state of voting in the rest of the network. Every node defines its own quorum slice (of which it is also ipso facto a member).

To form a quorum, start with a quorum slice. For each member, add the members of its quorum slice. Then add the members of those members’ slices, and so on. As you continue you’ll encounter more and more nodes that you can’t add because they’re already included. When there are no more new nodes to add, stop: you have formed a quorum from a “transitive closure” of the starting node’s quorum slice.

To find a quorum from a given node…

…add the members of its quorum slice…

…then add the members of those nodes’ slices.

Continue until there are no nodes to add.

No remaining nodes to add. This is a quorum.
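The procedure above is an ordinary graph traversal. Here is a minimal sketch, assuming each node has exactly one slice and that we already know everyone’s slices; the function name `quorum_from` and the six-node network are made up for illustration.

```python
from collections import deque


def quorum_from(start, slice_of):
    """Return the transitive closure of `start`'s quorum slice.

    `slice_of` maps each node to the set of peers in its single slice.
    A node is implicitly a member of its own slice.
    """
    quorum = {start}
    pending = deque([start])
    while pending:
        node = pending.popleft()
        for peer in slice_of[node]:
            if peer not in quorum:  # skip nodes that are already included
                quorum.add(peer)
                pending.append(peer)
    return quorum


# Hypothetical network: F trusts A, but nobody has F in a slice.
slices = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "D"},
    "D": {"C", "E"},
    "E": {"D"},
    "F": {"A"},
}

print(sorted(quorum_from("A", slices)))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(quorum_from("F", slices)))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Note that the result depends on where you start: F’s quorum includes everyone, but no one else’s quorum includes F.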

In fact each node may have more than one quorum slice. To form a quorum, choose just one of the slices and add the members; then choose any one slice for each of the members and add those members, and so on. This means that each node is a member of many possible quorums.

Choose just one quorum slice at each step.

One possible quorum. Alternatively…

…choosing different slices…

…(when possible)…

…produces a different quorum.
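To see how the choice of slice changes the resulting quorum, here is a small sketch in which every node may list several slices and a caller-supplied function picks one slice per node. The names `quorum_with_choices`, `first`, and `last`, and the four-node network, are assumptions for illustration only.

```python
def quorum_with_choices(start, slices_of, choose):
    """Build one quorum by picking a single slice for each node visited.

    `slices_of` maps a node to a list of slices (each a set of peers).
    `choose(node, options)` returns the slice to follow for that node.
    """
    quorum = {start}
    pending = [start]
    while pending:
        node = pending.pop()
        for peer in choose(node, slices_of[node]):
            if peer not in quorum:
                quorum.add(peer)
                pending.append(peer)
    return quorum


# Hypothetical network: only A has more than one slice to choose from.
slices_of = {
    "A": [{"B"}, {"C"}],
    "B": [{"A"}],
    "C": [{"A", "D"}],
    "D": [{"C"}],
}


def first(node, options):
    return options[0]


def last(node, options):
    return options[-1]


print(sorted(quorum_with_choices("A", slices_of, first)))  # ['A', 'B']
print(sorted(quorum_with_choices("A", slices_of, last)))   # ['A', 'C', 'D']
```

Both results are quorums that contain A, which is what it means for a node to belong to many possible quorums.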

How does a node know the membership of another node’s quorum slices? The same way it knows anything else about other nodes: from the broadcasts that each node sends to the network whenever its voting state changes. Each broadcast includes the details of the sending node’s slices.²

Recall that in a non-federated Byzantine agreement system, a quorum is defined as a majority of all nodes.³ Once a proposal passes the quorum threshold, the rest of the network members are convinced that any competing proposals will fail. This is how the network converges on an outcome.

But in a federated Byzantine agreement system, not only can there be no majority (because no one knows the total size of the network), but the concept of majority is not even useful! If membership in the system is open, then someone could gain a majority simply by conducting a so-called Sybil attack: joining the network many times using multiple nodes. So what is it about the transitive closure of a node’s quorum slice that makes it into a quorum, and what makes that able to overwhelm competing proposals?

Technically, nothing! Imagine a network containing Alice, Bob, Carol, Dave, Elsie, and Frank. Alice has Bob and Carol in her quorum slice. Bob has Alice and Carol, Carol has Alice and Bob. Meanwhile, Dave, Elsie, and Frank all have one another in their respective quorum slices. The Alice-Bob-Carol subgroup can reach a decision that the Dave-Elsie-Frank group will never hear about, and vice versa. There is no way for this network to achieve consensus (except by accident).

So SCP requires that, in order for federated voting to work (and for the paper’s important theorems to apply), the network must enjoy a property called quorum intersection. In a network with this property, any two quorums you can construct always overlap in at least one node. For determining the prevailing sentiment of the network, this is as good as having a majority. Intuitively, it means that if any quorum agrees to statement X, no other quorum can ever agree to not-X, because it will necessarily include some node from the first quorum that has already voted for X.

If the network has “quorum intersection”…

…then any two quorums you might construct…

…will always overlap.
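For a network small enough to enumerate, quorum intersection can be checked by brute force. The sketch below treats a quorum as any non-empty set of nodes that contains at least one complete slice for each of its members, which is our reading of the whitepaper’s definition. It tries every subset, so it is only practical for toy networks like the Alice-through-Frank example above; the function names are our own.

```python
from itertools import combinations


def is_quorum(candidate, slices_of):
    """True if every member has at least one slice fully inside `candidate`."""
    return bool(candidate) and all(
        any(s | {node} <= candidate for s in slices_of[node])
        for node in candidate
    )


def all_quorums(slices_of):
    """Enumerate every quorum by testing every non-empty subset of nodes."""
    nodes = sorted(slices_of)
    return [
        set(group)
        for size in range(1, len(nodes) + 1)
        for group in combinations(nodes, size)
        if is_quorum(set(group), slices_of)
    ]


def has_quorum_intersection(slices_of):
    """True if every pair of quorums shares at least one node."""
    return all(a & b for a, b in combinations(all_quorums(slices_of), 2))


# The split network from the text: two groups that never hear each other.
split = {
    "Alice": [{"Bob", "Carol"}],
    "Bob": [{"Alice", "Carol"}],
    "Carol": [{"Alice", "Bob"}],
    "Dave": [{"Elsie", "Frank"}],
    "Elsie": [{"Dave", "Frank"}],
    "Frank": [{"Dave", "Elsie"}],
}
print(has_quorum_intersection(split))  # False

# Hypothetical repair: Dave also requires Carol in his slice.
bridged = dict(split, Dave=[{"Elsie", "Frank", "Carol"}])
print(has_quorum_intersection(bridged))  # True
```

In the repaired network, Elsie and Frank both depend on Dave, and Dave depends on Carol, so every quorum ends up containing Alice, Bob, and Carol.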

(Of course it could be that the overlapping nodes are all Byzantine, that is, lying or otherwise misbehaving. In that case, having quorum intersection doesn’t help the network agree at all. For that reason, many of the results in the SCP whitepaper rely on explicitly stated assumptions, such as that the network enjoys quorum intersection even if misbehaving nodes are removed from the network. For the sake of clarity we’ll leave those assumptions implicit for the remainder of this article.)

It might seem unreasonable to expect a collection of independent nodes to organize their slices in such a way that the network will reliably enjoy quorum intersection. But there are two reasons why this isn’t so far-fetched.

The first reason is the existence of the Internet itself. The Internet is the perfect example of a network of independent nodes with quorum intersection. Most nodes on the Internet connect to just a few other local nodes, but those small sets overlap enough that every node is reachable from every other node by one route or another.

The second reason is specific to the Stellar payment network (the most widespread application of SCP). Each asset type in the Stellar network has an issuer, and Stellar best practices require each issuer to designate one or more nodes in the network for handling redemption requests. It’s in your interest to include those nodes in your quorum slices, directly or indirectly, for each asset you care about. The quorums for all nodes interested in a given asset will then overlap in at least those redemption nodes. Nodes interested in multiple assets will include in their quorum slices all the relevant issuers’ redemption nodes, and these will tend to bridge all assets together. Further, any assets that are not connected in this way to others on the network don’t need to be: it’s OK for the network to lack quorum intersection there. (Think of the way that banks operating in dollars sometimes want to trade with banks operating in euros and banks operating in pesos, so they are on a network together, but none of them care about the separate network of kids trading baseball cards.)

Of course, expecting that the network should enjoy quorum intersection is not the same as a guarantee. Other Byzantine agreement systems owe much of their complexity to making guarantees about quorums. An important innovation of SCP is that it removes the responsibility for making quorums from the consensus algorithm itself and pushes it into the application layer. This suggests that, although federated voting is general enough to work with any kind of value being voted on, in fact its robustness depends critically on the broader meaning of those values. Some hypothetical uses might not lend themselves as readily to producing well-connected networks as others.

Voting, accepting, and confirming

In a round of federated voting, a node optionally begins by casting a vote for some value V. This means broadcasting a message to the network saying, “I am node N, my quorum slices are Q, and I vote V.” When a node votes in this way, it promises that it has never voted against V and never will.

A node can see how its peers are voting from their broadcast messages. Once the node collects enough such messages, it can traverse the quorum slices in them to find quorums. If it can see a quorum of peers that also vote for V, then it can move to accepting V, and it broadcasts this new message to the network. (“I am node N, my quorum slices are Q, and I accept V.”) Accepting provides a stronger guarantee than mere voting. When a node votes for V, it can never vote for not-V. But when a node accepts V, no node in the network will ever accept not-V. (Theorem 8 in the SCP whitepaper proves this.)

Of course, there’s a good chance that N won’t see a quorum of nodes agreeing with its V vote right off the bat. Other nodes may vote for other values. But there is another way for a node to advance from mere voting to accepting. N can accept a different value, W, even if N didn’t vote for it, and even if it doesn’t see a quorum voting for it, as long as it sees a blocking set accepting it. A blocking set is just one node chosen from each of N’s quorum slices. As its name suggests, it is capable of blocking any other value. If all nodes in such a set accept W, then (by Theorem 8) it will never be possible to form a quorum accepting not-W, and so it’s safe for N to accept W too.

A node, N, with three quorum slices.

B-D-F is a blocking set for N: it includes one node from each of N’s slices.

B-E is also a blocking set for N, because E appears in two of N’s slices.
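Checking whether a group of nodes is a blocking set is a one-liner: the group must touch every one of N’s slices. The slices below are an assumption chosen to match the captions (the actual diagram may differ), N itself is left out of its own slices for brevity, and `is_blocking_set` is our own name.

```python
def is_blocking_set(candidate, slices):
    """True if `candidate` contains at least one node from every slice."""
    return all(candidate & s for s in slices)


# Assumed slices for N: E appears in two of the three.
n_slices = [{"A", "B"}, {"D", "E"}, {"E", "F"}]

print(is_blocking_set({"B", "D", "F"}, n_slices))  # True
print(is_blocking_set({"B", "E"}, n_slices))       # True
print(is_blocking_set({"B", "D"}, n_slices))       # False: misses {E, F}
```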

But a blocking set is not a quorum. It would be too easy for someone to fool node N into accepting a value when it shouldn’t, if they can just subvert one node in each of N’s slices. So accepting a value is not the end of voting. Instead, N must confirm the value, meaning it sees a quorum of nodes all accepting it. If it gets this far, then as the SCP whitepaper proves (in Theorem 11), the rest of the network will also eventually confirm the same value, and so N has reached the end of federated voting with the value as its outcome.
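Putting the three steps together, here is a minimal sketch of how one node might decide its next state for a single value, given what it has heard from its peers. It assumes a peer that has already accepted still counts toward a voting quorum, leaves each node out of its own slice lists, and ignores message handling and competing values entirely. All names here are ours, not the whitepaper’s.

```python
def largest_quorum_within(members, slices_of):
    """Shrink `members` until every remaining node has a slice inside it."""
    members = set(members)
    while True:
        keep = {
            n for n in members
            if any(s | {n} <= members for s in slices_of[n])
        }
        if keep == members:
            return members  # may be empty: no quorum inside `members`
        members = keep


def is_blocked_by(node, group, slices_of):
    """True if `group` includes a node from each of `node`'s slices."""
    return all(group & s for s in slices_of[node])


def advance(node, states, slices_of):
    """Return the state `node` can move to: none, vote, accept, or confirm."""
    supporters = {n for n, s in states.items() if s in ("vote", "accept")}
    accepters = {n for n, s in states.items() if s == "accept"}
    if states[node] != "accept":
        by_quorum = node in largest_quorum_within(supporters, slices_of)
        by_blocking = is_blocked_by(node, accepters, slices_of)
        return "accept" if by_quorum or by_blocking else states[node]
    if node in largest_quorum_within(accepters, slices_of):
        return "confirm"
    return "accept"


# Hypothetical network seen from N, which has two slices.
slices_of = {
    "N": [{"A", "B"}, {"C"}],
    "A": [{"N", "B"}],
    "B": [{"N", "A"}],
    "C": [{"A", "B"}],
}

# A quorum containing N votes for the value, so N can accept it.
print(advance("N", {"N": "vote", "A": "vote", "B": "vote", "C": "none"}, slices_of))
# N never voted, but A and C (a blocking set for N) accept, so N accepts too.
print(advance("N", {"N": "none", "A": "accept", "B": "none", "C": "accept"}, slices_of))
# A quorum containing N accepts, so N can confirm.
print(advance("N", {"N": "accept", "A": "accept", "B": "accept", "C": "vote"}, slices_of))
```

The three calls print `accept`, `accept`, and `confirm` respectively, one for each of the routes described in this section.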
