Think Distributed Systems

Dominik Tornow

August 2025
ISBN 9781633436176
192 pages

Included with a Manning Online subscription

printed in black & white

available in Russian, Simplified Chinese

table of contents

1 Thinking in distributed systems: Models, mindsets, and mechanics

1.1 Software engineering and mental models

1.1.1 Mental models: The foundation of reasoning

1.1.2 Correct mental models

1.1.3 Complete mental models

1.2 Mental model of software systems

1.3 Different types of models

1.3.1 Different models describing the same aspects

1.3.2 Different models describing different aspects of a system

1.4 Thinking about distributed systems

1.4.1 Correctness

1.4.2 Scalability and reliability

1.4.3 Responsiveness

1.5 Two big ideas

1.5.1 Systems of systems

1.5.2 Global view vs. local view

1.6 Distributed Systems Incorporated

1.7 Navigating complexity

1.7.1 Simple yet complex

1.7.2 Emergent behavior

1.7.3 Changing perspective

1.7.4 Think globally; act locally

1.8 Thinking above the code

2 System models, order, and time

2.1 System models

2.1.1 Theory and practice

2.1.2 Synchronous distributed systems

2.1.3 Asynchronous distributed systems

2.1.4 Partially synchronous systems

2.1.5 Component and network behavior

2.1.6 Realistic system models

2.2 Order and time

2.2.1 The happened-before relationship

2.2.2 Time and clocks

2.2.3 Physical time and physical clocks

2.2.4 Logical time and logical clocks

2.2.5 Physical clocks vs. logical clocks

3 Failure tolerance

3.1 In theory

3.2 Types of failure tolerance

3.2.1 Masking failure tolerance

3.2.2 Nonmasking failure tolerance

3.2.3 Fail-safe failure tolerance

3.2.4 None of the above

3.3 In practice

3.3.1 System model

3.3.2 Failure handling

3.3.3 Failure classification

3.3.4 Failure detection

3.3.5 Failure mitigation

3.3.6 Putting everything together

4 Message delivery and processing

4.1 Exchanging messages

4.2 The uncertainty principle of message delivery and processing

4.2.1 Before sending the request

4.2.2 After sending the request and before receiving a response

4.2.3 After receiving a response

4.3 Silence and chatter

4.4 Exactly-once processing semantics

4.5 Idempotence

4.6 Case study: Charging a credit card

5 Transactions

5.1 Abstractions

5.2 The magic of transactions

5.2.1 Concurrency

5.2.2 Failure

5.3 The model of transactions

5.3.1 Correctness

5.3.2 Serializability

5.3.3 Completeness

5.3.4 Application-level abort

5.3.5 Platform-level abort

6 Distributed transactions

6.1 Atomic commitment: From a single RM to multiple RMs

6.1.1 Transaction on a single RM

6.1.2 Transaction on multiple RMs

6.1.3 Blocking and nonblocking

6.2 The essence of distributed transactions

6.3 Two-Phase Commit protocol

6.3.1 In the absence of failure

6.3.2 In the presence of failure

6.3.3 Improvement

7 Partitioning

7.1 Encyclopedias and volumes

7.2 Thinking in partitions

7.3 The mechanics of partitioning and balancing

7.4 (Re)partitioning

7.4.1 Types of partitioning

7.4.2 Data item to partition assignment strategies

7.5 Common item-based assignment strategies

7.5.1 Range partitioning

7.5.2 Hash partitioning

7.6 Repartitioning

7.6.1 Range partitioning

7.6.2 Hash partitioning

7.7 Consistent hashing

7.8 (Re)balancing and overpartitioning

8 Replication

8.1 Redundancy

8.2 Thinking about replication and consistency

8.3 Replication

8.4 The mechanics of replication

8.4.1 System model

8.4.2 Replication lag

8.4.3 Synchronous vs. asynchronous replication

8.4.4 State-based vs. log-based replication

8.4.5 Single-leader, multileader, and leaderless systems

9 Consistency

9.1 Consistency models

9.1.1 Common consistency models

9.1.2 Virtues and limitations

9.2 Linearizability

9.2.1 Queue and stack

9.2.2 Formal definition of linearizability

9.3 Eventual consistency

9.3.1 The shopping cart

9.3.2 Variants of eventual consistency

9.3.3 Implementation

9.4 Consistency, availability, and partition tolerance

9.4.1 History

9.4.2 Conjecture vs. theorem

9.4.3 CAP theorem

10 Distributed consensus

10.1 The challenge of reaching agreement

10.2 System model

10.3 State machine replication

10.4 The origin—and irony—of consensus

10.5 Implementing consensus

10.5.1 Leader-based consensus

10.5.2 Quorum-based consensus

10.5.3 Combining leader and quorum

10.6 Raft

10.6.1 The log

10.6.2 Terms

10.6.3 Leader Election protocol

10.6.4 Log Replication protocol

10.6.5 State machine safety

10.7 Raft puzzles

10.7.1 Puzzle 1

10.7.2 Puzzle 2

10.7.3 Puzzle 3

11 Durable executions

11.1 The pitfalls of partial executions

11.2 System model

11.2.1 Process definition

11.2.2 Process execution

11.3 The concept of failure-transparent recovery

11.4 Strategies of failure-transparent recovery

11.4.1 Restart

11.4.2 Resume

11.5 Implementation of failure-transparent recovery

11.5.1 Application-level implementation: Sagas

11.5.2 Platform-level implementation: Durable execution

12 Cloud and services

12.1 From proactive to reactive

12.2 Cloud computing

12.3 Cloud-native computing

12.4 Serverless computing

12.4.1 Traditional

12.4.2 Serverless

12.4.3 Cold path vs. hot path

12.5 Service

12.5.1 Global view vs. local view

12.5.2 Example recommendation service

12.6 Final thoughts

Overview

1 Thinking in distributed systems: Models, mindsets, and mechanics

This chapter argues that modern software is inherently distributed and frames the central question as how distributed an application needs to be. It motivates distribution as the only way to meet real-world fitness goals—correctness, scalability, and reliability—in the face of growing load and inevitable failures. A distributed system is presented as a set of concurrent components that communicate by exchanging messages, whose overall behavior and complexity emerge from the parts and their interactions. The author emphasizes moving from “knowing” to “understanding” through dependable mental models so we can reason with confidence about systems that are complex but unavoidable.

The core tool is the mental model: an internal representation of a system that should be both correct (no falsehoods) and complete (no relevant omissions). The chapter shows how multiple models can be equivalent or complementary, each illuminating different aspects, and recommends viewing distributed behavior as a state machine advancing one step at a time by a component or the network. It distinguishes global versus local viewpoints—an all-knowing observer versus components with only local state—and introduces “systems of systems” (holons/holarchies) to fluidly zoom between atomic components and higher-order subsystems. Correctness is framed via safety (nothing bad happens) and liveness (something good eventually happens); scalability and reliability are cast as responsiveness—meeting SLOs—formalized through SLIs, SLOs, error rates, and error budgets.

To make the mechanics tangible, the “Distributed Systems Inc.” analogy maps components to rooms, the network to pneumatic tubes, and the external interface to a mailbox, making it easy to reason about message loss, duplication, reordering, and crash semantics. Several AHA moments follow: interesting properties like scalability and reliability are emergent; different valid models exist for the same system; and the core challenge is to think globally while acting locally—designing global algorithms from local steps and limited knowledge. Finally, the chapter advocates “thinking above the code,” generalizing concepts like race conditions as incorrect subsets of possible interleavings (and connecting to serializability), setting up a disciplined mindset and vocabulary for the deeper, formal treatment that follows.
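
The idea of a race condition as an incorrect subset of the possible interleavings can be made concrete with a small sketch. The Python below enumerates every interleaving of two hypothetical processes, A and B, that each read a shared counter and then write back the value plus one, and counts the interleavings that end in the wrong state. The process names, the step encoding, and the correctness criterion (a final counter of 2) are assumptions made for illustration and are not taken from the book.

```python
from itertools import combinations

# Each hypothetical process performs two steps in order:
# read the shared counter, then write back (value read + 1).
STEPS = ("read", "write")

def interleavings(n_a, n_b):
    """Yield every merge of n_a steps of A and n_b steps of B
    that preserves the order of steps within each process."""
    total = n_a + n_b
    for a_positions in combinations(range(total), n_a):
        a_set = set(a_positions)
        yield tuple("A" if i in a_set else "B" for i in range(total))

def run(schedule):
    """Execute one interleaving and return the final counter value."""
    counter = 0
    local = {"A": None, "B": None}   # value each process last read
    pc = {"A": 0, "B": 0}            # next step index per process
    for proc in schedule:
        if STEPS[pc[proc]] == "read":
            local[proc] = counter
        else:
            counter = local[proc] + 1
        pc[proc] += 1
    return counter

all_schedules = list(interleavings(2, 2))
incorrect = [s for s in all_schedules if run(s) != 2]
print(len(all_schedules), len(incorrect))  # prints "6 4"
```

Under these assumptions, only the two interleavings in which one process finishes before the other starts give the expected result; the other four lose an update.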

Mental model and system

Different models describing the same aspects of a system (the set of facts of each model totally overlaps)

The network as the buffer of inflight messages

The components as the buffer for inflight messages

Different models describing different aspects of a system (the set of facts of each model partially overlaps)

A distributed system as a set of concurrent, communicating components (local state of network not shown)

Behavior of a system as a sequence of states

Safety and liveness

Behavior space of a distributed transaction with two participants

A distributed system as a set of concurrent, communicating subsystems

Holons and holarchies

Two different holarchies, representing the same system

Global point of view

C1’s point of view

Distributed Systems Incorporated

Black box versus white box, a global point of view

Local point of view

Splitbrain

Reasoning about race conditions

Reasoning about serializability

Summary

A mental model is the internal representation of the target system and is the basis of comprehension and communication.
Striving for a deep understanding of distributed systems is better than merely knowing about their concepts.
A distributed system is a set of concurrent components that communicate by sending and receiving messages over a network.
The core challenge in designing distributed systems is creating a coherent system that functions as a whole despite each component having only local knowledge.
Ultimately, we are interested in the guarantees a system provides. We reason about these guarantees in terms of correctness—that is, in terms of safety and liveness guarantees as well as scalability and reliability guarantees.
Distributed systems can be visualized as a corporation, where rooms represent concurrent components, pneumatic tubes represent the network, and a mailbox represents the external interface.

FAQ

Why distribute applications if it adds complexity?

Because a single component cannot handle unbounded load or survive inevitable failures. We distribute to achieve correctness at scale: the system must do the right thing even as load increases (scalability) and components fail (reliability). Multiple collaborating components are necessary to meet these goals.

What is a distributed system in this chapter?

A distributed system is a set of concurrent components that communicate by exchanging messages over a network. Each component and the network have their own local state. System behavior and complexity emerge from component behaviors and their interactions.

What are mental models, and why do they matter?

Mental models are internal representations we use to understand and communicate about systems. Good models are both correct (no falsehoods) and complete (no relevant omissions). They move us from “knowing” terms to truly understanding behavior and trade-offs.

What makes a mental model correct and complete?

Correct: Every fact in the model is true of the system.
Complete: Every relevant system fact appears in the model. “Relevant” depends on the question you’re answering.
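
One way to read these two definitions is as set relations between the facts of a model and the facts of the system. The sketch below assumes facts can be written as plain strings and that the question being asked selects a set of relevant facts; the function names and the example facts are hypothetical.

```python
def is_correct(model, system):
    """Correct: every fact in the model is a fact of the system."""
    return model <= system

def is_complete(model, system, relevant):
    """Complete: every relevant fact of the system appears in the model."""
    return (system & relevant) <= model

# Hypothetical facts about a system.
system = {
    "messages may be lost",
    "messages may be duplicated",
    "components may crash",
}
# A model that knows about message loss only.
model = {"messages may be lost"}
# Facts relevant to a question about message delivery.
relevant = {"messages may be lost", "messages may be duplicated"}

print(is_correct(model, system))             # True
print(is_complete(model, system, relevant))  # False
```

The same model can be complete for one question and incomplete for another, because only the set of relevant facts changes.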

How does the chapter model system behavior?

As a state machine: behavior is a sequence of states, each produced by a discrete step of one component or the network. Steps can be external (send/receive) or internal (local computation). At any moment, exactly one entity takes exactly one step.
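
A minimal sketch of this view, assuming a toy system in which component C1 sends a fixed number of messages to component C2. The state layout, the step names, and the random scheduler are assumptions for illustration. As a simplification, a single "deliver" step moves a message out of the network and into C2, rather than modeling the receive as a separate step of C2.

```python
import copy
import random

def initial_state(n_messages):
    return {
        "C1": {"to_send": n_messages},  # local state of component C1
        "C2": {"received": 0},          # local state of component C2
        "network": [],                  # local state of the network: in-flight messages
    }

def enabled_steps(state):
    """List the steps that some entity could take in this state."""
    steps = []
    if state["C1"]["to_send"] > 0:
        steps.append(("C1", "send"))
    if state["network"]:
        steps.append(("network", "deliver"))
    return steps

def take_step(state, step):
    """Return the next state; exactly one entity takes exactly one step."""
    nxt = copy.deepcopy(state)
    _entity, action = step
    if action == "send":
        nxt["C1"]["to_send"] -= 1
        nxt["network"].append("ping")
    else:
        nxt["network"].pop(0)
        nxt["C2"]["received"] += 1
    return nxt

def behavior(n_messages, seed=0):
    """A behavior is the sequence of states produced by successive steps."""
    rng = random.Random(seed)
    states = [initial_state(n_messages)]
    while (steps := enabled_steps(states[-1])):
        states.append(take_step(states[-1], rng.choice(steps)))
    return states

trace = behavior(3)
print(len(trace))  # 7: the initial state plus 3 sends and 3 deliveries
print(trace[-1])
```

Different seeds choose different orders of steps, so the same system yields different behaviors, while every behavior in this toy system ends with all messages received.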

How is correctness defined (safety vs. liveness)?

Safety: Something bad never happens (prevents incorrect states).
Liveness: Something good eventually happens (prevents getting stuck).

A system is correct if every possible behavior satisfies both.
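
These definitions can be sketched as checks over behaviors, where a behavior is a list of states. The example assumes a transaction with two participants, p1 and p2, and treats each behavior as finite and complete, which is a simplification: liveness is properly a statement about what eventually happens and cannot in general be decided from a finite prefix. The predicates and state encoding are hypothetical.

```python
def satisfies_safety(behavior, bad):
    """Safety: no state in the behavior is a bad state."""
    return not any(bad(state) for state in behavior)

def satisfies_liveness(behavior, good):
    """Liveness (finite approximation): some state in the behavior is good."""
    return any(good(state) for state in behavior)

def is_correct(behaviors, bad, good):
    """Correct: every possible behavior satisfies both properties."""
    return all(
        satisfies_safety(b, bad) and satisfies_liveness(b, good)
        for b in behaviors
    )

# Hypothetical predicates for a transaction with two participants.
def disagree(state):
    return {state["p1"], state["p2"]} == {"committed", "aborted"}

def decided(state):
    return state["p1"] != "working" and state["p2"] != "working"

def s(p1, p2):
    return {"p1": p1, "p2": p2}

both_commit = [s("working", "working"), s("committed", "working"), s("committed", "committed")]
split = [s("working", "working"), s("committed", "working"), s("committed", "aborted")]
stuck = [s("working", "working"), s("committed", "working")]

print(is_correct([both_commit], disagree, decided))         # True
print(is_correct([both_commit, split], disagree, decided))  # False, safety violated
print(is_correct([both_commit, stuck], disagree, decided))  # False, liveness violated
```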

What do scalability and reliability mean here?

They’re framed as responsiveness: the ability to meet Service Level Objectives (SLOs). Scalability is being responsive under load; reliability is being responsive under failure. Formally, a system is responsive when its error rate stays within its error budget, a condition the chapter expresses in terms of Service Level Indicators (SLIs), SLOs, and error budgets.
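
The arithmetic behind this can be sketched as follows. The sketch assumes one particular SLI, the fraction of requests that succeed within a latency threshold, and takes the error budget to be one minus the SLO; the function names, the request encoding, and the numbers are illustrative assumptions.

```python
def sli(requests, threshold_ms):
    """Assumed SLI: fraction of requests that succeeded within the threshold.
    Each request is a (latency_ms, ok) pair; requests must be non-empty."""
    good = sum(1 for latency_ms, ok in requests if ok and latency_ms <= threshold_ms)
    return good / len(requests)

def is_responsive(requests, threshold_ms, slo):
    """Responsive: the observed error rate stays within the error budget."""
    error_rate = 1.0 - sli(requests, threshold_ms)
    error_budget = 1.0 - slo
    return error_rate <= error_budget

# 99 fast, successful requests and one slow one.
requests = [(20, True)] * 99 + [(900, True)]

print(sli(requests, threshold_ms=100))                      # 0.99
print(is_responsive(requests, threshold_ms=100, slo=0.95))  # True
print(is_responsive(requests, threshold_ms=100, slo=0.999)) # False
```

The same measurements meet a 95% objective and miss a 99.9% one, which shows that responsiveness is always relative to the stated objective.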

Why are multiple models of the same system useful?

Different models can be equivalent (express the same facts differently) or complementary (focus on different aspects). Studying several models gives a more holistic understanding and helps reveal omissions or misconceptions in your own thinking.

What is the global vs. local view challenge?

An all-knowing observer can see the global system state; a component only sees its own state and messages. The core challenge is to think globally (design global guarantees) while acting locally (each component executes a local algorithm with limited knowledge).

How does the “Distributed Systems Inc.” analogy help?

Rooms are components (local state), pneumatic tubes are the network (message delivery), and the mailbox is the external interface. The analogy makes failures (absences) and delivery semantics (loss, duplication, reordering) concrete, helping you reason about consequences and mitigations.
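
The delivery semantics named above can be simulated in a few lines. The sketch below models the pneumatic tubes as a function that may drop, duplicate, and reorder messages; the function name, the probabilities, and the use of a seeded random generator are assumptions for illustration.

```python
import random

def deliver_through_tubes(messages, rng, p_loss=0.2, p_dup=0.2):
    """Return the messages as the receiver might see them."""
    in_flight = []
    for message in messages:
        if rng.random() < p_loss:
            continue                   # loss: the message never arrives
        in_flight.append(message)
        if rng.random() < p_dup:
            in_flight.append(message)  # duplication: the message arrives twice
    rng.shuffle(in_flight)             # reordering: arrival order is arbitrary
    return in_flight

rng = random.Random(7)
sent = ["m1", "m2", "m3", "m4", "m5"]
print(deliver_through_tubes(sent, rng))
```

A receiver written against this function cannot assume that every message arrives, that it arrives once, or that it arrives in order, which is the situation the analogy is meant to make tangible.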

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $31.19

you save $16.80 (35%)

include audio $24.99 $16.24
