Learning Notes on Designing Data-Intensive Applications (Part V: Replication)
This is a series of learning notes on Designing Data-Intensive Applications.
Reasons for replication:
- Keep data geographically close to your users (reduce latency)
- Allow the system to keep working even if some of its parts fail (increase availability)
- Scale out the number of machines that can serve read queries (increase read throughput)
The challenge is not storing the data, but handling changes to replicated data. Popular algorithms for replicating changes between nodes: single-leader, multi-leader, and leaderless replication.
Leaders and followers
Each node that stores a copy of the database is called a replica.
- One of the replicas is designated the leader. Clients that want to write to the database must send their requests to the leader.
- Other replicas are known as followers. The leader sends the data change to all of its followers as part of a replication log or change stream.
- Reads can query the leader or any of the followers, while writes are only accepted on the leader.
MySQL, Oracle Data Guard, SQL Server’s AlwaysOn Availability Groups, MongoDB, RethinkDB, and Espresso are examples of this kind of database; Kafka and RabbitMQ also use leader-based replication.
Synchronous vs asynchronous
The difference between sync and async replication: in synchronous replication, the leader waits until the follower confirms it received the write before reporting success to the user; in asynchronous replication, the leader sends the change but does not wait for the follower’s response. Synchronous replication guarantees the follower has an up-to-date copy, but a single unresponsive follower blocks all writes; asynchronous replication keeps writes fast, but a write may be lost if the leader fails before the change reaches any follower.
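A minimal sketch of the difference in Python, using plain dicts as stand-in replicas (write_sync and write_async are illustrative names, not any real database API):

```python
leader = {}
followers = [{}, {}]

def write_sync(key, value):
    """Synchronous: report success only after every follower has applied the write."""
    leader[key] = value
    for f in followers:          # in reality each step blocks on a network round trip
        f[key] = value
    return "ok"                  # the write is now durable on all replicas

def write_async(key, value):
    """Asynchronous: report success immediately; followers catch up in the background."""
    leader[key] = value          # the change is queued for followers, not yet applied;
    return "ok"                  # if the leader crashes now, this write may be lost

write_sync("a", 1)
write_async("b", 2)
print(followers)                 # [{'a': 1}, {'a': 1}]; 'b' has not arrived yet
```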
Setting up new followers
Setting up a follower can usually be done without downtime. The process looks like:
- Take a snapshot of the leader’s database
- Copy the snapshot to the follower node
- Follower requests data changes that have happened since the snapshot was taken
- Once the follower has processed the backlog of data changes since the snapshot, it has caught up (a sketch of the whole process follows).
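A toy sketch of these four steps, assuming the leader keeps a replication log of (sequence number, key, value) entries; the data and the snapshot position are made up for illustration:

```python
# The leader's replication log: (sequence number, key, value) entries.
leader_log = [(1, "a", "v1"), (2, "b", "v2"), (3, "a", "v3")]

def take_snapshot():
    # A consistent snapshot, tagged with the log position it reflects
    # (here it was taken before entry 3 was written).
    return {"a": "v1", "b": "v2"}, 2

def bootstrap_follower():
    data, snapshot_pos = take_snapshot()     # steps 1 and 2: snapshot + copy
    for seq, key, value in leader_log:       # step 3: request the backlog
        if seq > snapshot_pos:
            data[key] = value                # step 4: apply it
    return data                              # the follower has caught up

print(bootstrap_follower())                  # {'a': 'v3', 'b': 'v2'}
```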
Follower failure: catch-up recovery
The follower can connect to the leader and request all the data changes that occurred while it was disconnected.
Leader failure: failover
If a leader fails, a follower has to be promoted to a leader, and clients need to send write requests to the new leader. The major steps are as follows:
- Determining that the leader has failed, e.g. if a node does not respond for some period of time, it is assumed to be dead.
- Choosing a new leader, normally the replica with the most up-to-date changes from the old leader.
- Reconfiguring the system to use the new leader, e.g. making the old leader a follower when it comes back (a sketch of the first two steps follows).
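A sketch of the first two steps under assumed bookkeeping (a heartbeat timestamp and a replication log position per node); no real system exposes exactly this:

```python
import time

# Simulated state: last heartbeat time and replication log position per node.
last_heartbeat = {"leader": 0.0, "f1": time.time(), "f2": time.time()}
log_position = {"f1": 1041, "f2": 1037}   # f1 has the most recent changes

TIMEOUT = 30.0   # seconds; too short risks unnecessary failovers

def leader_is_dead(now):
    # Step 1: declare the leader dead if it hasn't responded within the timeout.
    return now - last_heartbeat["leader"] > TIMEOUT

def elect_new_leader():
    # Step 2: promote the follower with the most up-to-date log position.
    return max(log_position, key=log_position.get)

if leader_is_dead(time.time()):
    print("new leader:", elect_new_leader())   # new leader: f1
```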
Issues that could arise:
- If asynchronous replication is used, the new leader may not have received all of the old leader’s writes; if the old leader rejoins, its writes may conflict with writes the new leader has accepted since. The most common solution is to discard the old leader’s unreplicated writes.
- Discarding writes is dangerous if other storage systems outside of the database need to be coordinated with the database contents.
- It could happen that both leaders believe that they are the leader (split brain).
- A timeout that is too short may result in unnecessary failovers.
As a result, some operations teams prefer to perform failovers manually, even if the software supports automatic failover.
Replication logs
Statement-based replication: the leader logs every write request (every INSERT, UPDATE, or DELETE statement) and sends that statement log to its followers.
Problems:
- Non-deterministic functions such as NOW() or RAND() will generate different values on each replica.
- Statements that depend on existing data, like auto-increments, must be executed in exactly the same order on each replica.
- Statements with side effects (triggers, stored procedures) may produce different results on each replica.
A solution is for the leader to replace any nondeterministic function call with a fixed return value before logging the statement, so that all followers get the same value (sketched below).
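A sketch of that idea: rewrite the statement on the leader before logging it, so every replica executes identical, deterministic SQL. The string/regex rewriting here is a simplification; a real database would rewrite at the parser level:

```python
import random
import re
import time

def rewrite_statement(sql: str) -> str:
    """Replace nondeterministic calls with literals before logging,
    so every replica executes exactly the same statement."""
    sql = sql.replace("NOW()", repr(time.strftime("%Y-%m-%d %H:%M:%S")))
    sql = re.sub(r"RAND\(\)", lambda _: str(random.random()), sql)
    return sql

print(rewrite_statement("INSERT INTO events VALUES (NOW(), RAND())"))
# e.g. INSERT INTO events VALUES ('2024-01-01 12:00:00', 0.639426...)
```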
Write-ahead log (WAL) shipping
- The log is an append-only sequence of bytes containing all writes to the database. The leader can send it to its followers.
- The problem is that the log describes the data at a very low level, so replication is closely coupled to the storage engine. The WAL may not be parsable if the leader and the followers run different versions of the database software.
Logical (row-based) log replication
- To fix that, a logical log is a sequence of records describing writes to database tables at the granularity of a row.
- For an inserted row, the log contains the new values of all columns.
- For a deleted row, it contains enough information to uniquely identify that row.
- For an updated row, it contains enough information to uniquely identify that row, plus the new values of the changed columns.
- Since a logical log is decoupled from the storage engine internals, it’s easier to keep compatible across versions, and easier for external applications to parse, which is useful for data warehouses, custom indexes, and caches (change data capture). A sketch of such records follows.
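A sketch of what logical log records could look like and how a follower might apply them; the record shapes are hypothetical (real formats such as MySQL's row-based binlog differ):

```python
# Hypothetical record shapes: rows, not statements.
log = [
    {"op": "insert", "table": "users", "row": {"id": 7, "name": "Ada"}},
    {"op": "update", "table": "users", "key": {"id": 7},
     "new": {"name": "Ada Lovelace"}},
    {"op": "delete", "table": "users", "key": {"id": 7}},
]

def apply(db, record):
    """Apply one logical log record to a follower's table (a dict keyed by id)."""
    table = db.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["new"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

db = {}
for rec in log:
    apply(db, rec)
print(db)   # {'users': {}}: the row was inserted, updated, then deleted
```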
Trigger-based replication
- Move replication to the application layer.
- A trigger registers custom application code that is automatically executed when a data change occurs. The trigger can write the data to a separate table, which can then be read by an external process and replicated to another system.
- The downsides are that this approach has greater overheads and is more prone to bugs and limitations than the database’s built-in replication.
Problems with replication lag
In a read-scaling architecture, you can increase the capacity for serving read-only requests simply by adding more followers. However, with asynchronous replication, a follower may not have up-to-date information.
Eventual consistency: if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind the leader. The inconsistency is temporary; the followers will eventually catch up.
The delay between a write happening on the leader and being reflected on a follower is the replication lag.
Reading your own writes
Read-after-write consistency, aka read-your-writes consistency, is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves; other users may still see a stale version of the data. Ways to implement it:
- Read from the leader when reading something the user may have modified recently. A simple rule: always read the user’s own profile from the leader (see the sketch after this list).
- Track the timestamp of the user’s most recent write and ensure all reads are served by replicas that are at least that up to date.
- If replicas are distributed across multiple datacenters, route any such request to the datacenter that contains the leader.
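A sketch of the first approach, using the rule of reading from the leader for one minute after the user's last write; the centralised last_write_at map is an assumption for illustration:

```python
import time

last_write_at = {}   # user_id -> timestamp of the user's last write (assumed centralised)

def route_read(user_id, now=None):
    """Read from the leader for one minute after the user's last write;
    otherwise any follower is fine."""
    now = now if now is not None else time.time()
    if now - last_write_at.get(user_id, 0.0) < 60.0:
        return "leader"
    return "any-follower"

last_write_at["alice"] = time.time()
print(route_read("alice"))   # leader
print(route_read("bob"))     # any-follower
```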
When the same user accesses the service from multiple devices, we need cross-device read-after-write consistency:
- Tracking the timestamp of the user’s most recent update becomes harder, as this metadata needs to be centralised across devices.
- There is no guarantee that connections from different devices will be routed to the same datacenter, so you may need to route requests from all of a user’s devices to the same datacenter (which does nothing to spread the read load).
Monotonic reads
- If a user makes several reads in sequence, they will not see time go backward, i.e. they will not read older data after having previously read newer data.
- The anomaly happens when replica A is more out of date than replica B and the user reads from B first and then from A.
- One way to achieve monotonic reads: make sure each user always reads from the same replica, e.g. chosen by a hash of the user ID (sketched below).
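A sketch of pinning each user's reads to one replica by hashing the user ID (if that replica fails, the user would have to be rerouted, and monotonicity may be broken):

```python
import hashlib

replicas = ["replica-0", "replica-1", "replica-2"]

def replica_for(user_id: str) -> str:
    """Always route a given user's reads to the same replica,
    so successive reads never move backward in time."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[digest[0] % len(replicas)]

print(replica_for("alice"))   # the same replica on every call
```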
Consistent prefix reads
- If a sequence of writes happens in a certain order, then all reads will see them appear in the same order.
- The anomaly happens in partitioned (sharded) databases, where there is no global ordering of writes: a write to partition A may logically cause a write to partition B, yet a reader may see B’s write before A’s.
- One solution is to make sure that any writes that are causally related to each other are written to the same partition.
Solutions for replication lag
- Transactions exist for this reason: they are a way for the database to provide stronger guarantees so that the application can be simpler.
Multi-leader replication
Leader-based replication has one major downside: there is only one leader, and all writes must go through it. A natural extension is to allow more than one node to accept writes, where each leader simultaneously acts as a follower to the other leaders.
Use cases for multi-leader replication
A. Multi-datacenter operation:
One leader in each datacenter. Within each datacenter, regular leader-follower replication is used. Between datacenters, each datacenter leader replicates its changes to the leaders in other datacenters.
Compared to a single-leader replication:
- Performance. Writes are processed in the local datacenter and replicated asynchronously to the other datacenters, so the inter-datacenter network delay is hidden from users.
- Tolerance of outages. In multi-leader, each datacenter can continue operating independently from others.
- Tolerance of network problems. Multi-leader with asynchronous replication can tolerate network problems better.
- HOWEVER: Problems can arise when some data is concurrently modified in multiple datacenters.
B. Clients with offline operation
- For applications that need to continue working while disconnected from the internet.
- Every device has a local database that acts as a leader and accepts writes while offline. These writes are then asynchronously replicated to the other nodes when the network is back.
- CouchDB is designed for this mode of operation.
C. Collaborative editing
- Real-time collaborative editing allows several people to edit a document simultaneously, like Google Docs.
- When a user edits a document, the changes are instantly applied to their local replica and asynchronously replicated to the other users.
- To guarantee no editing conflicts, an application must obtain a lock on the document before editing. For faster collaboration, you want the unit of change to be very small (like a single keystroke) and to avoid locking.
Handling write conflicts
The biggest problem with multi-leader replication is conflict resolution.
Synchronous vs asynchronous conflict detection
Asynchronous: if two users change the same record, both writes may be successfully applied to their local leaders. The conflict is only detected later, when the writes are asynchronously replicated.
Synchronous: in theory, the second writer could be blocked until the first write completes, but then you lose the main advantage of multi-leader replication: letting each replica accept writes independently.
Conflict avoidance
- The simplest strategy for dealing with conflicts is to avoid them.
- Conflicts can be avoided by ensuring that all the writes for a particular record go through the same leader (e.g. when a user edits his own data)
Converging toward a consistent state
- On a single-leader database, the last write determines the final value of the field. In a multi-leader configuration, there is no defined ordering of writes, so replicas must converge toward the same final value some other way.
Different ways of achieving convergence:
- Give each write a unique ID (timestamp, UUID, or a hash of the key and value), pick the write with the highest ID as the winner, and throw away the other writes. This is last write wins (LWW) and is prone to data loss (sketched after this list).
- Give each replica a unique ID, writes that originated at a higher-numbered replica always take precedence. This also implies data loss.
- Somehow merge the values together, e.g. ordering them alphabetically
- Record the conflict and write application code that resolves it at some later time.
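A sketch of last write wins: each write carries a (timestamp, unique ID) pair and the merge keeps only the highest one, which is exactly where the data loss comes from:

```python
import time
import uuid

def lww_merge(writes):
    """Last write wins: keep the write with the highest (timestamp, id) pair
    and silently discard the rest, which is why LWW can lose data."""
    return max(writes, key=lambda w: (w["ts"], w["id"]))

writes = [
    {"ts": time.time(),       "id": str(uuid.uuid4()), "value": "A"},
    {"ts": time.time() + 0.1, "id": str(uuid.uuid4()), "value": "B"},
]
print(lww_merge(writes)["value"])   # B; the write of "A" is lost
```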
Custom conflict resolution
- On write. As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler.
- On read. All the conflicting writes are stored. On read, multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict.
Automatic conflict resolution is difficult but feasible:
- Conflict-free replicated datatypes (CRDTs) — Used in Riak 2.0
- Mergeable persistent data structure — Similar to Git. Tracks history explicitly
- Operational transformation: Algorithm behind Google Docs.
Multi-leader replication topologies
- A replication topology describes the communication paths along which writes are propagated from one node to another.
- The most general topology is all-to-all, in which every leader sends its writes to every other leader. MySQL uses a circular topology, where each node receives writes from one node and forwards those writes (plus its own) to one other node. Another popular topology is the star: one designated root node forwards writes to all of the other nodes.
- The all-to-all topology is more fault tolerant than circular and star topologies, where one failed node can interrupt the flow of replication messages between the other nodes, leaving them unable to replicate until the node is fixed.
Leaderless replication
Any replica can directly accept writes from clients. Databases like Amazon’s in-house Dynamo, Riak, Cassandra, and Voldemort work this way.
- For writes: the client sends the write to all replicas in parallel.
- For reads: the client sends the read request to all replicas in parallel. It may get different responses from different nodes; version numbers are used to determine which value is newer and to prevent stale reads.
After an unavailable node comes back online, there are two mechanisms for it to catch up:
- Read repair. When a client detects a stale response (e.g. one with a lower version number), it writes the newer value back to that replica (see the sketch after this list). This works well for frequently read values, but data that is rarely read may stay stale for a long time.
- Anti-entropy process. A background process constantly looks for differences in the data between replicas and copies any missing data from one replica to another. It does not copy writes in any particular order, and there may be a significant delay before data is copied.
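A sketch of read repair, modelling each replica as a dict that maps a key to a (version, value) pair; the tuple comparison stands in for real version comparison:

```python
def read_with_repair(replicas, key):
    """Read from all replicas and write the newest (version, value) pair
    back to any replica that returned a stale or missing one."""
    responses = [(r, r.get(key)) for r in replicas]       # (replica, (version, value))
    newest = max(resp for _, resp in responses if resp)   # highest version wins
    for replica, resp in responses:
        if resp != newest:
            replica[key] = newest                         # the repair step
    return newest[1]

a, b, c = {"k": (2, "new")}, {"k": (1, "old")}, {}
print(read_with_repair([a, b, c], "k"))   # 'new'; b and c now store (2, 'new')
```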
Quorums for reading and writing
Quorum reads and writes refer to the minimum number of votes for a read or a write to be valid.
If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading.
A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2, rounded up.
Quorums don’t necessarily have to be majorities; what matters is that the sets of nodes used by the read and write operations overlap in at least one node (demonstrated below).
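A tiny demonstration of the overlap property: with n = 5 and w = r = 3, any acknowledged write set and any read set must share at least one node:

```python
import random

n, w, r = 5, 3, 3            # w + r > n, so read and write sets must overlap
nodes = list(range(n))

write_set = set(random.sample(nodes, w))   # nodes that acknowledged the write
read_set = set(random.sample(nodes, r))    # nodes queried by the read

# By the pigeonhole principle, w + r > n guarantees at least one common node,
# so every read contacts at least one node holding the latest value.
assert write_set & read_set
print("overlap:", write_set & read_set)
```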
Limitations:
- With a sloppy quorum, the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap.
- If two writes occur concurrently, it is not clear which one happened first, so the database may wrongly return the staler one. If a winner is picked with last write wins, writes can be lost due to clock skew.
- If a write happens concurrently with a read, the write may be reflected on only some of the replicas.
- In a non-transaction model, if a write succeeds on some replicas but fails on others, it is not rolled back on the replicas where it succeeded.
Dynamo-style databases are generally optimised for use cases that can tolerate eventual consistency.
Sloppy quorums and hinted handoff
When a client cannot reach some of the database nodes during a network interruption, how should we handle writes?
- Return errors to all requests for which we cannot reach a quorum of w or r nodes?
- Or accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?
The latter is known as sloppy quorum: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value. Once the network interruption is fixed, any writes are sent to the appropriate “home” nodes (hinted handoff).
Sloppy quorums are useful for increasing write availability: as long as any w nodes are available, the database can accept writes.
Multi-datacenter operation
Each write from a client is sent to all replicas, but the client usually only waits for acknowledgement from a quorum of nodes within its local datacenter, so that it is unaffected by delays and interruptions on the cross-datacenter links.
Detecting concurrent writes
In order to become eventually consistent, the replicas should converge toward the same value:
- Last write wins (discarding concurrent writes). If losing data is not acceptable, LWW is a poor choice for conflict resolution.
- The “happens-before” relationship and concurrency. An operation A happens before operation B if B knows about A or depends on A; two operations are concurrent if neither happens before the other.
Capturing the happens-before relationship
Whenever we have two operations A and B, there are three possibilities:
- Either A happened before B
- Or B happened before A
- Or A and B are concurrent.
In a single database replica, version numbers are used to determine concurrency.
- Each key is assigned a version number, and that version number is incremented every time that key is written, and the database stores the version number along with the value written. That version number is returned to a client.
- A client must read a key before writing. When it reads a key, the server returns the latest version number together with the values that have not been overwritten.
- When a client wants to write a new value, it includes the version number from the prior read alongside the write.
- When the server receives a write with a particular version number, it can overwrite all values with that version number or below, since the new write was aware of those values at the time of writing.
- It must, however, keep all values with a higher version number, because those values are concurrent with the incoming write (a sketch of this scheme follows).
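A sketch of this single-replica scheme; the VersionedStore class and its method names are invented for illustration:

```python
class VersionedStore:
    """A sketch of the single-replica version-number scheme described above."""

    def __init__(self):
        self.latest = {}   # key -> latest version number handed out
        self.values = {}   # key -> {version: value} for values not yet overwritten

    def read(self, key):
        # Return the latest version number plus all surviving siblings.
        return self.latest.get(key, 0), dict(self.values.get(key, {}))

    def write(self, key, value, based_on):
        # Overwrite every value the writer had seen (version <= based_on),
        # but keep concurrent values with a higher version number as siblings.
        version = self.latest.get(key, 0) + 1
        self.latest[key] = version
        siblings = {v: val for v, val in self.values.get(key, {}).items()
                    if v > based_on}
        siblings[version] = value
        self.values[key] = siblings
        return version

store = VersionedStore()
store.write("cart", ["milk"], 0)             # client 1 writes, knowing nothing
store.write("cart", ["eggs"], 0)             # client 2 writes concurrently
version, siblings = store.read("cart")       # (2, {1: ['milk'], 2: ['eggs']})
merged = sorted(sum(siblings.values(), []))  # application-level sibling merge
store.write("cart", merged, version)         # overwrites both siblings
print(store.read("cart"))                    # (3, {3: ['eggs', 'milk']})
```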
Merging concurrently written values
These concurrent values are called siblings. Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication. A simple approach is to pick one of the values based on a version number or timestamp (last write wins), but you may need to do something smarter in application code to avoid losing data.
Version vectors
We need a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The collection of version numbers from all of the replicas is called a version vector (a comparison sketch follows).
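A sketch of comparing two version vectors (dicts mapping replica ID to counter) to decide whether one value supersedes another or they are concurrent siblings:

```python
def compare(a, b):
    """Compare two version vectors. A value supersedes another only if its
    vector is greater-or-equal in every component; otherwise they are siblings."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"
    if b_le_a:
        return "b happened before a"
    return "concurrent"           # both values must be kept as siblings

print(compare({"r1": 2, "r2": 1}, {"r1": 2, "r2": 3}))   # a happened before b
print(compare({"r1": 3}, {"r2": 1}))                     # concurrent
```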
That’s all for this chapter!
Happy Reading!