Learning Notes on Designing Data-Intensive Applications (ix)

Chapter 9. Consistency and Consensus

E.Y.
10 min read · Sep 1, 2021

This is a series of learning notes on Designing Data-Intensive Applications.

Consistency Guarantees

  • Linearizability vs Serializability
  • Relying on Linearizability
  • Implementing Linearizable Systems
  • The Cost of Linearizability

Ordering Guarantees

  • Ordering and Causality
  • Sequence Number Ordering

Total Order Broadcast

  • Using total order broadcast
  • Implementing linearizable storage using total order broadcast
  • Implementing total order broadcast using linearizable storage

Distributed Transactions and Consensus

  • Atomic Commit and Two-Phase Commit (2PC)
  • Distributed Transactions in Practice
  • Fault-Tolerant Consensus
  • Membership and Coordination Services

Consensus is one of the tricky abstractions that help with getting all of the nodes to agree on something.

Consistency Guarantees

Most replicated databases provide at least eventual consistency.

Linearizability

Linearizability is a strict consistency model that makes a system appear as if there were only one copy of the data, and all operations on it are atomic.

It is also a recency guarantee, meaning that the value read must be the most recent, up-to-date value. Once any read has returned a new value, all following reads (on any client) must also return the new value.

Linearizability vs Serializability:

  • Serializability is an isolation property of transactions that guarantees that reads and writes to multiple objects behave the same as if they had executed in some serial order (preventing anomalies such as write skew).
  • Linearizability is a recency guarantee on reads and writes of a single object.

A database may provide both guarantees (known as strong one-copy serializability or strict serializability). Two-phase locking and actual serial execution are typically linearizable, while serializable snapshot isolation is not, because reads from a snapshot do not include writes that are more recent than the snapshot. A stale read is not a violation of serializability, though, as long as the transactions still appear to execute in some serial order (even if they do not).

Relying on Linearizability:

Linearizability is important in a few areas, such as:

  • leader election using a lock (all nodes must agree on who the leader is);
  • constraints and uniqueness guarantees; and
  • cross-channel timing dependencies (if there is an additional communication channel between systems, e.g. system A receives updates from C faster than system B does, the data seen through A and B can become out of sync).

Implementing Linearizable Systems:

  • Single-leader replication is potentially linearizable (reads from the leader), as long as there is no split brain.
  • Consensus algorithms are linearizable.
  • Multi-leader replication is not linearizable. (Writes go concurrently to multiple nodes and are asynchronously copied to the other nodes; conflicts may need application-level resolution.)
  • Leaderless replication is probably not linearizable.

Downsides of linearizability:

  • If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then those replicas cannot process requests while they are disconnected (they are unavailable).
  • If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas.

The unhelpful CAP theorem

CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. But it is better read as: either Consistent or Available when Partitioned.

CAP only considers one consistency model (linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs. CAP has been historically influential, but nowadays has little practical value for designing systems.

The main reason for dropping linearizability is performance, not fault tolerance. Linearizability is slow all the time, not only during a network fault.

Although linearizability is a useful guarantee, few systems are actually linearizable in practice.

Ordering Guarantees

Ordering helps preserve causality.

Total vs. partial order:

  • With a total order, any two elements can always be compared: you can always say which one is greater and which is smaller.
  • With a partial order, we can sometimes compare two elements and say which is bigger or smaller, but in other cases the elements are incomparable. For example, mathematical sets are not totally ordered: you can’t compare {a, b} with {b, c} (see the sketch below).
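
This can be checked directly in Python, where <= on sets is the subset relation, a partial order under which neither of the two sets above is smaller than the other:

```python
# Subset (<=) on Python sets is a partial order: some pairs are incomparable.
a = {"a", "b"}
b = {"b", "c"}

print(a <= b)  # False: {a, b} is not a subset of {b, c}
print(b <= a)  # False: {b, c} is not a subset of {a, b}
# Neither set is "smaller", so the two are incomparable under the subset order.

# Integers, by contrast, are totally ordered: any two can always be compared.
print(3 <= 5, 5 <= 3)  # True False
```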

This difference between a total order and a partial order is reflected when we compare linearizability and causality as consistency models:

  • Causality uses partial order, where two operations are ordered if they are causally related (i.e. we can say which happened before the other), but are incomparable if they are concurrent. With concurrent operations, we can’t say that one happened before the other.
  • Linearizability uses a total order, where the system behaves as if there is only one copy of the data, and can always say which operation happened first on a timeline.

The version history of a system like Git is similar to a graph of causal dependencies. One commit often happens after another, but sometimes commits branch off, and we create merges when those concurrently created commits are combined. (Each commit has a timestamp, but it comes from the system clock, which can be inaccurate.)

Causal consistency is the strongest possible consistency model that doesn’t slow down or fail due to network failures or delays.

Capturing causal dependencies

Causal consistency needs to track causal dependencies across the entire database, and version vectors can be used for that. However, keeping track of all dependencies can become an overhead, so a better approach is to use sequence numbers or timestamps (from a logical clock) to order events instead.
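
As a rough illustration of how version vectors capture causal dependencies, here is a minimal sketch (the replica IDs and counters are made up, and the comparison rules are simplified):

```python
# Minimal version-vector sketch: each replica keeps one counter per replica ID.
def dominates(v1, v2):
    """True if v1 has seen at least everything v2 has seen (component-wise >=)."""
    return all(v1.get(node, 0) >= count for node, count in v2.items())

def relation(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "identical"
    if dominates(v1, v2):
        return "v1 happened after v2"
    if dominates(v2, v1):
        return "v2 happened after v1"
    return "concurrent"  # neither dominates: the writes are causally unrelated

# Two writes that both started from {"A": 1} and then diverged on different replicas:
print(relation({"A": 2, "B": 0}, {"A": 1, "B": 1}))  # concurrent
print(relation({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # v1 happened after v2
```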

Sequence Number Ordering

The best known way of generating sequence numbers for causal consistency is Lamport timestamps. The idea here is that each node has a unique identifier, and also keeps a counter of the number of operations it has processed. The node includes them in every request it makes. The Lamport timestamp is then a pair of (counter, nodeID). When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

https://people.cs.rutgers.edu/~pxk/417/notes/clocks/index.html

Lamport timestamps provide a total ordering: if there are two timestamps, the one with the greater counter value is the greater timestamp; if the counter values are the same, then we pick the one with the greater node ID as the greater timestamp.
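
A minimal sketch of a Lamport clock following these rules (the node IDs and sequence of events are made up for illustration):

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event or outgoing request: bump the counter and stamp it."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """On receiving a request/response, fast-forward to the max counter seen."""
        counter, _ = timestamp
        self.counter = max(self.counter, counter)
        return self.tick()

# Total order: compare by counter first, then break ties by node ID.
a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()          # (1, 'A')
t2 = b.observe(t1)     # (2, 'B')  -- b caught up to a's counter, then incremented
t3 = a.tick()          # (2, 'A')  -- concurrent with t2; tie broken by node ID
print(sorted([t1, t2, t3]))  # [(1, 'A'), (2, 'A'), (2, 'B')]
```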

But be careful with Lamport timestamps and causality. On the one hand, if two events are causally related, their Lamport timestamps always obey that causality (the later event gets a greater timestamp). On the other hand, if one event has a smaller Lamport timestamp than another, it does not follow that the two events are causally related; they may simply be concurrent.

Timestamp ordering is not sufficient:

They only define a total order of operations after all operations have been finalised (i.e. reported their timestamps back). If an operation needs to decide right now (e.g. which of two concurrent writes should win a uniqueness constraint), it might need to check with every other node that there is no concurrently executing operation that could affect its decision.

Total Order Broadcast

In order to get a total ordering of operations across multiple nodes, we can use total order broadcast, a message-exchange protocol that guarantees:

  • reliable delivery
  • totally ordered delivery

Total order broadcast is used for database replication, serializable transactions, creating a log of messages, and implementing a lock service that grants fencing tokens.

Total order broadcast is based on a principle known as state machine replication: If every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other.
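
A toy sketch of that principle: two replicas that apply the same deterministic writes in the same order end up in the same state (the list below stands in for the log of messages that total order broadcast would deliver):

```python
# Toy state machine replication: replicas apply the same ordered log of writes.
log = [("set", "x", 1), ("incr", "x", 2), ("set", "y", 5)]

def apply_op(state, op):
    kind, key, value = op
    if kind == "set":
        state[key] = value
    elif kind == "incr":
        state[key] = state.get(key, 0) + value
    return state

replica_1, replica_2 = {}, {}
for op in log:                  # same messages, same order, on every replica
    apply_op(replica_1, op)
    apply_op(replica_2, op)

print(replica_1 == replica_2)   # True: deterministic ops + same order = same state
```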

Distributed Transactions and Consensus

The goal of consensus is to get several nodes to agree on something.

Situations where consensus is important:

  • Leader election: In a single-leader setup, all the nodes need to agree on which node is the leader.
  • Atomic commit: For a transaction that spans several nodes or partitions, all the nodes must agree on the outcome of the transaction.

Note that, according to the FLP result, there is no algorithm that is always able to reach consensus in an asynchronous system model. However, consensus becomes achievable in practice when timeouts and crash detection are allowed.

Atomic Commit and Two-Phase Commit (2PC)

  • Atomic commit is easy on a single node, as it just depends on the order in which data is durably written to disk (a write-ahead log that can be used to recover from a crash).
  • For multiple nodes, it is not sufficient to simply send a commit request to all the nodes and have each commit the transaction independently: some nodes might commit while others abort, which violates atomicity.

Two-phase commit (or 2PC) is an algorithm used for achieving atomic transaction commit when multiple nodes are involved.

  • Prepare phase: The coordinator sends a prepare request to all the nodes participating in the transaction, for which the nodes have to respond with essentially a ‘YES’ or ‘NO’ message.
  • Commit phase: If all the participants reply ‘YES’, then the coordinator will send a commit request in the second phase for them to actually perform the commit. However, if any of the nodes reply ‘NO’, the coordinator sends an abort request to all the participants.
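
A minimal, in-process sketch of these two phases (a real implementation involves a network protocol, durable logging of each decision, and crash recovery; the Participant class here is hypothetical):

```python
# Minimal 2PC sketch (in-process, no durability or crash recovery).
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes

    def prepare(self):
        # Phase 1: a participant that votes YES must be able to commit later.
        return self.will_vote_yes

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # prepare phase
    if all(votes):                                # commit phase
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("db"), Participant("queue")]))         # committed
print(two_phase_commit([Participant("db"), Participant("queue", False)]))  # aborted
```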

The problem with 2PC is that it can get stuck waiting for the coordinator to recover from a crash; in the meantime, the participants cannot release the locks they are holding, which can cause larger parts of the application to become unavailable. This can be resolved either by manual human intervention or by automated heuristic decisions.

Distributed Transactions in Practice

There are two types of distributed transaction:

  • Database-internal distributed transactions: This refers to transactions performed by a distributed database that spans multiple replicas or partitions. All the nodes participating in the transaction are running the same database software.
  • Heterogeneous distributed transactions: Here, the participants are two or more different technologies, e.g. a database and a message broker.

Exactly-once message processing:

An advantage of atomically committing a message together with the side effects of its processing is that it ensures that the message is effectively processed exactly once. If the transaction fails, the effects of processing the message can simply be rolled back.

However, this is only possible if all the systems involved in the transaction are able to use the same atomic commit protocol. For example, if a side effect involves sending an email and the email server does not support two-phase commit, it will be difficult to roll back the email.
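
A rough sketch of the exactly-once idea, under the assumption that the record of processed message IDs and the side effect live in the same database so they commit (or roll back) together; sqlite3 and the table names are used purely for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE account (balance INTEGER)")
db.execute("INSERT INTO account VALUES (0)")
db.commit()

def handle(msg_id, amount):
    """Apply the message's effect and mark it processed in one transaction."""
    try:
        with db:  # one transaction: both statements commit or roll back together
            db.execute("INSERT INTO processed VALUES (?)", (msg_id,))
            db.execute("UPDATE account SET balance = balance + ?", (amount,))
    except sqlite3.IntegrityError:
        pass      # duplicate delivery: msg_id already processed, effect not reapplied

handle("m1", 100)
handle("m1", 100)   # redelivered message is ignored
print(db.execute("SELECT balance FROM account").fetchone())  # (100,)
```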

XA Transactions:

XA (eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies.

XA is a C API for interacting with a transaction coordinator, but bindings for the API exist in other languages. The coordinator is usually just a library that’s loaded into the same process as the application issuing the transactions.

It assumes that communication between your application and the participant databases/messaging services is done through a network driver (like JDBC) or a client library which supports XA. If the driver does support XA, it will call the XA API to find out whether an operation should be part of a distributed transaction — and if so, it sends the necessary information to the participant database server. The driver also exposes callbacks needed by the coordinator to interact with the participant, through which it can ask a participant to prepare, commit, or abort.

Fault-Tolerant Consensus

Consensus algorithm must satisfy the following properties:

  • Uniform agreement: No two nodes decide differently.
  • Integrity: No node decides twice.
  • Validity: If a node decides the value v, then v was proposed by some node.
  • Termination: Every node that does not crash eventually decides some value.

The termination property formalises the idea of fault tolerance. It means that a consensus algorithm cannot sit idle and do nothing forever, i.e. it must make progress. Even if some nodes fail, the other nodes must still reach a decision.

Consensus algorithms and total order broadcast

The most popular fault-tolerant consensus algorithms are Paxos, Zab, Raft, and Viewstamped Replication. They decide on a sequence of values, which makes them total order broadcast algorithms.

  • Due to the agreement property, all nodes decide to deliver the same messages in the same order.
  • Due to integrity, messages are not duplicated.
  • Due to validity, messages are not corrupted.
  • Due to termination, messages are not lost.

Single-leader replication and consensus

Epoch numbering and quorums

Every time the current leader is thought to be dead, a vote is started among a quorum of nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered. If there is a conflict, the leader with the higher epoch number wins (maybe the “dead” leader comes back to life but has a smaller number).

A node cannot trust its own judgement. It must collect votes. For every decision that a leader wants to make, it must send votes again.

There are two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal.

The biggest difference from 2PC is that 2PC requires a “yes” vote from every participant, whereas fault-tolerant consensus algorithms only require votes from a majority of nodes.
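
A simplified sketch of the epoch idea: a node tracks the highest epoch it has seen and rejects proposals from a leader with a stale epoch (real consensus protocols also compare log positions, which is omitted here):

```python
# Sketch: a node rejects proposals from a leader whose epoch is stale (lower).
class Node:
    def __init__(self):
        self.current_epoch = 0

    def vote_for_leader(self, epoch):
        """Grant a vote only for an election with a higher epoch than any seen so far."""
        if epoch > self.current_epoch:
            self.current_epoch = epoch
            return True
        return False

    def accept_proposal(self, epoch, value):
        """A 'dead' leader coming back with an old epoch gets rejected."""
        if epoch < self.current_epoch:
            return False
        self.current_epoch = epoch
        return True

n = Node()
n.vote_for_leader(1)                                  # leader elected in epoch 1
n.vote_for_leader(2)                                  # new election after a suspected failure
print(n.accept_proposal(1, "write from old leader"))  # False: stale epoch, rejected
print(n.accept_proposal(2, "write from new leader"))  # True
```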

Limitations of consensus

A potential downside of consensus systems is that they require a strict majority to operate. Another challenge is that they rely on timeouts to detect failed nodes, which can lead to frequent leader elections that harm system performance.

Membership and Coordination Services

Zookeeper and etcd are typically described as “distributed key-value stores” or “coordination and configuration services”. They are designed to hold small amounts of data that can fit entirely in memory. Data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm.

ZooKeeper is modeled after Google’s Chubby lock service and it provides some useful features:

  • Linearizable atomic operations: Using an atomic compare-and-set operation, you can implement a lock (see the sketch after this list).
  • Total ordering of operations: When some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause. The fencing token is some number that monotonically increases every time the lock is acquired.
  • Failure detection: Clients maintain a long-lived session on ZooKeeper servers and exchange heartbeats with them. Even if a ZooKeeper node fails, the session remains active; but when heartbeats cease for longer than the session timeout, ZooKeeper declares the session dead, and all locks held by that session are automatically released.
  • Change notifications: Not only can one client read locks and values, it can also watch them for changes.
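
A minimal sketch of the first two features using the kazoo Python client (this assumes a ZooKeeper server reachable at 127.0.0.1:2181; the znode paths and identifiers are made up):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Linearizable atomic operation: compare-and-set using the znode version number.
zk.ensure_path("/app/leader")
data, stat = zk.get("/app/leader")
try:
    # Succeeds only if nobody else has written since we read version stat.version.
    zk.set("/app/leader", b"node-42", version=stat.version)
except BadVersionError:
    print("lost the race: another client updated the znode first")

# Built-in lock recipe: blocks until acquired, released on exiting the block.
lock = zk.Lock("/app/locks/partition-7", identifier="node-42")
with lock:
    print("holding the lock for partition 7")

zk.stop()
```

The znode version used in the compare-and-set is itself a monotonically increasing number, which is the kind of value that can serve as a fencing token.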

Allocating work to nodes

Zookeeper typically manages data that is quite slow-changing, e.g. the node running on IP address 10.1.1.23 is the leader for partition 7.

ZooKeeper, etcd, and Consul are also often used for service discovery: finding out which IP address you need to connect to in order to reach a particular service.

Happy Reading!
