Overview

10 Distributed consensus

Distributed consensus is presented as a foundational building block for reliable, scalable systems: it lets multiple processes advance in lockstep and act as a single fault-tolerant unit. The chapter explains why agreement on a single value (or a sequence of values) is both essential and hard in realistic settings with crashes and unreliable networks. It clarifies the safety (Validity, Integrity, Agreement) and liveness (Termination) goals, relates consensus to earlier atomic commitment work, and situates the problem amid the FLP impossibility result, which limits guaranteed liveness in fully asynchronous systems. The payoff is state machine replication: if identical deterministic replicas process the same ordered log of commands, they remain consistent and can mask failures, making consensus the mechanism that orders those commands.

To implement consensus in practice, the text contrasts an idealized single decision-maker with robust, failure-tolerant designs built from leaders and quorums. Majority quorums (> N/2) ensure overlapping knowledge, prevent split-brain, and yield the familiar N = 2f + 1 rule to tolerate f failures. Historically, Paxos introduced fault-tolerant consensus for a single decision and inspired Multi-Paxos for sequences; Viewstamped Replication independently addressed the same problem earlier; and modern systems commonly combine a leader that proposes values with quorum acknowledgments for commitment. This pairing preserves progress and consistency despite message loss, duplication, reordering, and crashes.

The chapter then focuses on Raft, a consensus protocol designed for understandability while managing a replicated log. Raft structures time into terms, each consisting of a leader election phase and a log replication phase; term numbers serve both as a logical clock and as fencing tokens, and each node maintains a commit index and log-consistency rules that protect State Machine Safety. Leaders accept, append, propagate, and commit entries after quorum acknowledgments; only candidates with the most up-to-date logs (compared by last entry’s term, then by index) can win elections, ensuring that committed entries are never lost. Through puzzles, the text highlights subtle implications: entries that exist only on a leader can be lost if the leader fails before replication, but once an entry reaches a quorum—even if the leader crashes before advancing its commit index—election rules ensure the next leader preserves and completes the commitment, maintaining safety and keeping replicas consistent.
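The “most up-to-date log” election rule described above can be sketched as a small comparison function. This is an illustrative sketch, not code from the chapter; the names `Entry` and `is_up_to_date` are assumptions.

```python
# Hypothetical sketch of Raft's "at least as up-to-date" check used during
# elections: compare the last entry's term first, then log length.
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    command: str

def is_up_to_date(candidate_log, voter_log):
    """Return True if the candidate's log is at least as up-to-date as the
    voter's, so the voter may grant its vote."""
    c_term = candidate_log[-1].term if candidate_log else 0
    v_term = voter_log[-1].term if voter_log else 0
    if c_term != v_term:
        return c_term > v_term                    # higher last term wins
    return len(candidate_log) >= len(voter_log)   # then the longer log wins

# A voter holding an entry from term 2 rejects a candidate stuck in term 1,
# no matter how long the candidate's log is:
voter = [Entry(1, "x=1"), Entry(2, "x=2")]
candidate = [Entry(1, "x=1"), Entry(1, "y=3"), Entry(1, "z=4")]
print(is_up_to_date(candidate, voter))  # False: term beats length
```

Because any committed entry lives on a quorum, and any election also requires a quorum of votes, at least one voter holds the committed entry and will block any candidate whose log lacks it.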

Figure captions from this chapter:

  • Transforming a process into a fault-tolerant process
  • State machine replication in action
  • Raft’s log abstraction
  • The Raft consensus protocol advances in terms, where each term consists of a leader election phase and a log replication phase.
  • Node states and state transitions of the Leader Election Protocol
  • Message flow of the Log Replication Protocol
  • Puzzles 1–3
  • Left: leader crashes before commit; right: leader crashes after commit.

Summary

  • Distributed Consensus allows a group of redundant processes to advance in lockstep via State Machine Replication.
  • State Machine Replication achieves identical outputs by applying identical inputs in identical order to a group of identical processes.
  • Achieving consensus in realistic systems with process and network failures is notoriously challenging. Consensus algorithms like Viewstamped Replication, Paxos, and Raft address these challenges.
  • The Raft protocol is a popular consensus protocol, often praised for its emphasis on understandability. Yet Raft remains a complex protocol.
  • Raft divides the consensus problem into two subproblems: leader election and log replication.

FAQ

What is distributed consensus and why is it difficult to achieve in real systems?
Distributed consensus is the process by which multiple processes agree on a single value so they can advance in lockstep and act as one. It is easy in an idealized model with no failures and perfectly reliable, ordered messaging, but difficult in realistic environments where nodes can crash and networks can delay, drop, duplicate, or reorder messages.

What safety and liveness properties define a consensus algorithm?
Safety: Validity (only proposed values can be decided), Integrity (a process decides at most once), and Agreement (no two correct processes decide different values). Liveness: Termination (every non-failed process eventually decides).

What does the FLP impossibility theorem imply for consensus?
FLP shows that in a fully asynchronous system without clocks, no algorithm can guarantee consensus termination under all conditions if even one node may fail. Practically, systems assume additional timing constraints (e.g., clocks, partial synchrony, or failure detectors) to achieve consensus with high probability and preserve safety.

How does state machine replication use consensus to provide fault tolerance?
State machine replication runs identical deterministic processes that start from the same state and apply the same inputs in the same order. A consensus protocol agrees on a single, ordered log of commands for all replicas, ensuring identical outputs and enabling the group to mask individual node crashes.

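The determinism argument above can be demonstrated in a few lines. This is a minimal sketch, not the chapter’s code: assuming a toy key-value state machine whose only command is `("set", key, value)`, every replica that applies the same log reaches the same state.

```python
# Minimal state-machine-replication sketch (illustrative, not Raft code):
# identical deterministic replicas applying the same ordered log of commands
# end up in identical states.
def apply_log(log):
    state = {}
    for op, key, value in log:   # same commands, in the same order
        if op == "set":          # deterministic transition
            state[key] = value
    return state

log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
replicas = [apply_log(log) for _ in range(3)]
assert replicas[0] == replicas[1] == replicas[2] == {"x": 3, "y": 2}
```

Note that reordering the log (e.g., applying `("set", "x", 1)` after `("set", "x", 3)`) would yield a different final state, which is exactly why consensus must fix a single order of commands.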
What is a quorum and why is N = 2f + 1 needed to tolerate f failures?
A quorum is a strict majority of nodes (size greater than N/2) whose acknowledgments are required to make a decision. Any two quorums intersect, preventing split-brain and preserving consistency; to tolerate f failures you need at least f + 1 live nodes to form a majority, hence N = 2f + 1.

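The two quorum properties above—overlap of any two majorities, and survival of a majority after f crashes—can be checked exhaustively for a small cluster. A brief sketch (the helper `quorum_size` is illustrative):

```python
# Illustrative check of the N = 2f + 1 rule: with strict-majority quorums,
# any two quorums share at least one node, and after f crashes the f + 1
# survivors still form a quorum.
from itertools import combinations

def quorum_size(n):
    return n // 2 + 1   # strict majority: smallest size greater than n/2

f = 2
n = 2 * f + 1           # 5 nodes tolerate 2 failures
nodes = set(range(n))
q = quorum_size(n)      # 3

# every pair of quorums overlaps in at least one node (no split-brain)
assert all(set(a) & set(b)
           for a in combinations(nodes, q)
           for b in combinations(nodes, q))

# after f crashes, the remaining f + 1 nodes still form a majority
assert n - f >= q
print(n, q)  # 5 3
```

The overlap is what carries knowledge forward: any decision acknowledged by one quorum is visible to at least one member of every future quorum.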
Why combine a leader with quorums in protocols like Raft, Multi-Paxos, and Viewstamped Replication?
A leader orders proposals (simplifying sequencing), while quorum acknowledgments ensure durability and agreement despite failures. This pairing avoids single points of failure inherent in a naive “benevolent dictator” approach and copes with unreliable networks by requiring majority confirmation to commit.

In Raft, what are terms and how do they ensure only one node can act as leader?
Time is divided into monotonically increasing terms, each with a leader election phase followed by log replication. Term numbers act like fencing tokens: nodes reject messages from lower terms, so even if multiple nodes momentarily believe they’re leaders, only the node in the highest term can make progress.

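The fencing behavior of term numbers can be sketched as a simple rejection rule. This is an assumption-laden toy (the class `Node` and method `handle` are illustrative, not Raft’s RPC interface):

```python
# Sketch of term numbers as fencing tokens: a node rejects any message
# carrying a term lower than its own current term, so a deposed leader
# from an old term cannot make progress.
class Node:
    def __init__(self):
        self.current_term = 0

    def handle(self, msg_term):
        if msg_term < self.current_term:
            return False              # stale leader: reject the message
        self.current_term = msg_term  # adopt the newer (or equal) term
        return True

node = Node()
assert node.handle(3) is True   # message from the term-3 leader: accepted
assert node.handle(2) is False  # late message from a term-2 leader: rejected
```

A real deposed leader that keeps receiving such rejections learns the higher term from the responses and steps down to follower.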
How does Raft ensure log consistency and its State Machine Safety property?
Raft elects only candidates with the most up-to-date logs, comparing last-entry term first and then log length. Once an entry is stored on a quorum, any future leader must contain that entry, ensuring committed entries are never lost or overwritten and that no two processes apply different entries at the same index.

What are the steps of Raft’s log replication from client request to commit?
  1. The leader accepts a client request and appends it to its local log.
  2. The leader sends AppendEntries RPCs to the followers.
  3. Followers append the entry and acknowledge.
  4. When the leader receives acknowledgments from a quorum (including itself), it advances the commit index, and replicas can apply the entry to their state machines.

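The four replication steps can be walked through with a toy, single-threaded model. This is a sketch under strong simplifying assumptions (no RPCs, no per-follower progress tracking, no retries); the class `Cluster` is illustrative:

```python
# Toy walk-through of log replication: leader appends, replicates to the
# reachable followers, and commits once a strict majority (including the
# leader itself) has acknowledged the entry.
class Cluster:
    def __init__(self, n):
        self.n = n
        self.leader_log = []
        self.follower_logs = [[] for _ in range(n - 1)]
        self.commit_index = -1

    def client_request(self, command, reachable_followers):
        self.leader_log.append(command)            # 1) leader appends locally
        acks = 1                                   # leader counts itself
        for i in reachable_followers:              # 2) replicate to followers
            self.follower_logs[i].append(command)  # 3) follower appends + acks
            acks += 1
        if acks > self.n // 2:                     # 4) quorum reached: commit
            self.commit_index = len(self.leader_log) - 1
        return acks

c = Cluster(3)
c.client_request("x=1", reachable_followers=[0])   # leader + 1 follower = quorum
assert c.commit_index == 0
c.client_request("y=2", reachable_followers=[])    # leader alone: no quorum
assert c.commit_index == 0                         # entry stays uncommitted
```

The second request shows the puzzle case discussed next: an entry that exists only on the leader is appended but never committed, and can be lost if the leader crashes.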
Can an uncommitted log entry be lost in Raft, and when must it be preserved?
An entry that exists only on the leader can be lost if the leader fails before replication. However, if the entry has been replicated to a quorum (e.g., leader and one follower in a three-node cluster), it must not be lost even if the leader crashes before marking it committed; the follower with the most up-to-date log will win leadership and finish committing it.
