Learning Notes on Designing Data-Intensive Applications (viii)

Chapter 8. The Trouble with Distributed Systems

E.Y.
6 min read · Aug 31, 2021

This is a series of learning notes on Designing Data-Intensive Applications.

Unreliable Networks

  • Network Faults in Practice
  • Detecting Faults
  • Timeouts and Unbounded Delays
  • Synchronous Versus Asynchronous Networks

Unreliable Clocks

  • Monotonic vs Time-of-Day Clocks
  • Clock Synchronization and Accuracy
  • Relying on Synchronized Clocks
  • Process Pauses

Knowledge, Truth, and Lies

  • The Truth is Defined by the Majority
  • Byzantine Faults
  • System Model and Reality
  • System Model for Timing Assumptions

A program on a single computer is deterministic. In a distributed system there are many ways for things to go wrong, and we should assume that they will.

Unreliable networks

Focusing on shared-nothing systems, the network is the only way machines communicate.

Most networks are asynchronous packet networks: a message is sent, but many things can go wrong along the way, and it is usually impossible to tell why. The usual way of handling this is a timeout.

Detecting faults quickly ensures that:

  • A load balancer can stop sending requests to a dead node.
  • In single-leader replication, a follower can be promoted to leader if the current leader fails.

Timeouts and unbounded delays

How long should a timeout be? A long timeout means a long wait until a node is declared dead. A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when it is in fact only experiencing a temporary slowdown.

A reasonable timeout value could be 2d + r, where d is the maximum delay for a packet and r is the node's processing time. However, most networks have unbounded delays: there is no upper limit on the time it may take to deliver a message.

Systems can continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution.
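
As a rough sketch (in Python, with made-up numbers), a timeout can be derived from an exponentially weighted moving average of observed response times and their deviation, similar in spirit to how TCP computes its retransmission timeout:

```python
class AdaptiveTimeout:
    """Adjust a timeout from observed response times (EWMA of mean and jitter)."""

    def __init__(self, initial_rtt=0.1, alpha=0.125, beta=0.25):
        self.srtt = initial_rtt        # smoothed response-time estimate (seconds)
        self.rttvar = initial_rtt / 2  # smoothed deviation (jitter) estimate
        self.alpha = alpha
        self.beta = beta

    def observe(self, sample_rtt):
        # Update the jitter estimate first, then the smoothed mean.
        self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - sample_rtt)
        self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_rtt

    @property
    def timeout(self):
        # Declare a request failed only well beyond the expected response time.
        return self.srtt + 4 * self.rttvar


t = AdaptiveTimeout()
for rtt in [0.08, 0.12, 0.35, 0.09]:  # observed response times in seconds (made up)
    t.observe(rtt)
print(f"current timeout: {t.timeout:.3f}s")
```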

Network congestion and queueing

Queueing is the most common cause of network packet delays:

  • Different nodes try to send packets to the same destination;
  • If CPU cores are busy, the request is queued by the operating system;
  • TCP performs flow control, in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node.

TCP vs UDP:

TCP is more reliable than UDP because it provides:

  • Flow Control
  • Acknowledgement and Retransmission
  • Sequencing: messages arrive in the right order

UDP is used in latency-sensitive applications such as videoconferencing, where delayed data is mostly worthless, so UDP does not try to retransmit it.

Unreliable clocks

Systems need clocks to measure durations and to describe points in time.

Each machine on the network has its own clock, running slightly faster or slower than the other machines'. Clocks can be synchronized with the Network Time Protocol (NTP), which adjusts a machine's clock according to the time reported by a group of servers.

Monotonic vs Time-of-Day Clocks:

Modern computers have at least two different kinds of clocks:

  • Time-of-day clocks: return the current date and time according to some calendar. If the local clock is too far ahead of the NTP server, it may be reset and appear to jump back to a previous point in time. This makes it unsuitable for measuring elapsed time.
  • Monotonic clocks: guaranteed to always move forward, which makes them suitable for measuring a duration such as a timeout or a response time (see the sketch below).
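
In Python, for instance, the two kinds of clocks are exposed as separate functions; a small sketch of why only the monotonic one should be used for durations:

```python
import time

start_wall = time.time()        # time-of-day clock: seconds since the epoch, may jump if NTP resets it
start_mono = time.monotonic()   # monotonic clock: only meaningful as a difference, never goes backwards

time.sleep(0.5)                 # stand-in for some work

elapsed_wall = time.time() - start_wall        # risky for durations: can be wrong after a clock step
elapsed_mono = time.monotonic() - start_mono   # safe for measuring the elapsed ~0.5 s

print(f"wall-clock elapsed: {elapsed_wall:.3f}s, monotonic elapsed: {elapsed_mono:.3f}s")
```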

If software requires synchronized clocks, it's essential to monitor the clock offsets between all machines.

Timestamps for ordering events

It is tempting, but dangerous, to rely on clocks for ordering events across multiple nodes. Doing so usually implies last write wins (LWW), a strategy often used in both multi-leader replication and leaderless databases such as Cassandra and Riak, and it can cause data loss.

The definition of “recent” also depends on a local time-of-day clock, which may well be incorrect. Even NTP’s synchronization accuracy is limited by the network round-trip time, as well as by quartz drift.

Logical clocks, which are based on counters, are a safer alternative for ordering events. They do not measure the time of day or elapsed time, only the relative ordering of events. This contrasts with time-of-day and monotonic clocks.
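
As an illustration, here is a minimal sketch of a Lamport clock, one common kind of logical clock (the counter-merging rule is the standard one; the two-node setup is invented for the example):

```python
class LamportClock:
    """Logical clock: assigns counters that respect causality, not wall-clock time."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        # Attach the current counter to an outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, message_counter):
        # Merge: take the max of the local and received counters, then increment.
        self.counter = max(self.counter, message_counter) + 1
        return self.counter


node_a, node_b = LamportClock(), LamportClock()
t_send = node_a.send()           # A sends a message carrying counter t_send
t_recv = node_b.receive(t_send)  # B's counter jumps past A's, so the receive is ordered after the send
assert t_recv > t_send
```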

Clock readings have a confidence interval

It makes sense to think of a clock reading as a range of times within a confidence interval: for example, being 95% confident that the time now is between 10.3 and 10.5.

Google Spanner implements snapshot isolation using the clock’s confidence interval:

A = [A_earliest, A_latest]
B = [B_earliest, B_latest]

If those two intervals do not overlap (that is, A_latest < B_earliest), then B definitely happened after A.
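
A sketch of that comparison, with intervals modelled as plain tuples and made-up values:

```python
def definitely_before(a, b):
    """True only if interval a ends before interval b begins, i.e. A happened before B for sure."""
    a_earliest, a_latest = a
    b_earliest, b_latest = b
    return a_latest < b_earliest


A = (10.30, 10.35)  # transaction A's commit-timestamp interval (made-up values)
B = (10.40, 10.50)  # transaction B's interval

print(definitely_before(A, B))  # True: the intervals don't overlap, so B is after A
print(definitely_before(B, A))  # False
```

If the intervals overlap, the clocks alone cannot tell which of the two happened first.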

Process pauses

How does a node know that it is still the leader? A node must assume that its execution can be paused for a significant amount of time at any point, without noticing that it was asleep until it resumes and checks the clock later.

There are many reasons a process can be paused, e.g.:

  • Garbage collection (stop the world)
  • On laptops, execution may be suspended (e.g., when the lid is closed)
  • Operating system context-switches
  • Swapping to disk (paging)
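
As a sketch of why this is dangerous, consider a request loop that checks a lease before processing each request (the Lease class and handler below are toy stand-ins, not any real lock service's API):

```python
import time

class Lease:
    """Toy lease with an absolute expiry time; stands in for a real lock service."""
    def __init__(self, duration=30.0):
        self.expiry_time = time.time() + duration
    def renew(self):
        return Lease()
    def is_valid(self):
        return time.time() < self.expiry_time

def handle_requests(lease, requests, process):
    for request in requests:
        # Renew if fewer than 10 seconds of the lease remain.
        if lease.expiry_time - time.time() < 10.0:
            lease = lease.renew()
        # DANGER: if the process is paused here (GC, paging, VM suspension) for longer
        # than the remaining lease time, the lease may have expired -- and another node
        # may already hold it -- by the time the next line executes.
        if lease.is_valid():
            process(request)

handle_requests(Lease(), ["req-1", "req-2"], print)
```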

Knowledge, truth and lies

Since a node cannot necessarily trust its own judgement of a situation, many distributed systems rely on a quorum, that is, voting among the nodes. The quorum is an absolute majority of more than half of the nodes.
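
Concretely, the smallest strict majority of n nodes is ⌊n/2⌋ + 1:

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest number of nodes that constitutes a strict majority."""
    return n_nodes // 2 + 1

assert quorum_size(3) == 2
assert quorum_size(5) == 3
```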

Fencing tokens

To prevent a node that wrongly believes it still holds a lock (for example, because its lease expired during a pause) from corrupting data, we can use a fencing token. Each time the lock server grants a lock or a lease, it also generates an ever-increasing number as the fencing token. We can then require that any client which wants to send a write request to the storage service must include its current fencing token.

The storage service then checks the token on every write request and rejects it if it has already processed a request with a higher token number. (A client whose lease has expired will present an outdated, lower number.)
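
A minimal sketch of that check on the storage side (the class and token values below are invented for illustration):

```python
class Storage:
    """Toy storage service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.data = {}
        self.highest_token_seen = -1

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token_seen:
            # A write with a higher token was already processed: this client's
            # lock or lease must have expired, so the request is rejected.
            raise PermissionError(f"stale fencing token {fencing_token}")
        self.highest_token_seen = fencing_token
        self.data[key] = value


storage = Storage()
storage.write("file", "from client 1", fencing_token=33)
storage.write("file", "from client 2", fencing_token=34)
# storage.write("file", "late write from client 1", fencing_token=33)  # would raise PermissionError
```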

If ZooKeeper is used as the lock service, the transaction ID zxid or the node version cversion can be used as a fencing token.

Byzantine faults

Fencing tokens can detect and block a node that is inadvertently acting in error (unreliable but honest). The problem becomes much harder if there is a risk that nodes may “lie” (a Byzantine fault).

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. Typical examples are aerospace environments and systems with multiple participating organizations that do not fully trust each other.

This can be tackled with algorithms that require a supermajority of more than two-thirds of the nodes to be working correctly.

Weak forms of lying:

Even if we trust our nodes, there are still weak forms of lying, such as hardware issues, software bugs, or misconfiguration. Fortunately, these can be tolerated with checksums at the TCP or application level, for example, and with input validation.
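
As a rough illustration of an application-level checksum (a hand-rolled sketch, not any particular library's API), a sender can attach a digest to each message and the receiver can verify it before trusting the contents:

```python
import hashlib
import json

def wrap(payload: dict) -> bytes:
    """Serialize a message and append a SHA-256 checksum of its body."""
    body = json.dumps(payload, sort_keys=True).encode()
    return body + b"|" + hashlib.sha256(body).hexdigest().encode()

def unwrap(message: bytes) -> dict:
    """Verify the checksum before trusting the body; raise if it does not match."""
    body, _, digest = message.rpartition(b"|")
    if hashlib.sha256(body).hexdigest().encode() != digest:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return json.loads(body)

msg = wrap({"key": "x", "value": 42})
print(unwrap(msg))                        # round-trips fine
corrupted = msg.replace(b"42", b"43")
# unwrap(corrupted) would raise ValueError
```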

System Model and Reality

System models with regards to timing assumptions:

  • Synchronous model: not realistic, as it assumes delays never exceed a fixed upper bound;
  • Partially synchronous model: assumes the system behaves like a synchronous system most of the time, but occasionally exceeds the bounds; this is the most realistic model;
  • Asynchronous model: makes no assumptions about timing. It does not have a clock, so it cannot use timeouts. This model is very restrictive.

System models with regards to node failures:

  • Crash-stop faults: a node can fail only by crashing, and once it is gone it never comes back;
  • Crash-recovery faults: a node can crash, but may start responding again after some unknown time;
  • Byzantine faults: nodes can do anything, including trying to trick other nodes.

The most useful model in real systems is the partially synchronous model with crash-recovery.

Happy Reading!
