Partitioning is presented as a core technique for scaling distributed systems by breaking a single logical dataset into multiple disjoint physical pieces and spreading them across nodes. Using the encyclopedia–volumes analogy, the text shows how thoughtful placement enables efficient find and fetch operations, while highlighting pitfalls such as uneven distribution, uneven demand, and cross-references that span partitions. Beyond sheer data size, partitioning helps overcome any single-node bottleneck (including request volume), and in a sense every distributed system is intrinsically partitioned because each component owns exclusive local state.
The mechanics center on two mappings: assigning items to partitions and assigning partitions to nodes. Choices like static versus dynamic partitioning influence elasticity and operational complexity, while horizontal (row-wise) and vertical (column/type-wise) partitioning can be combined to scale different concerns independently (for example, splitting user profile text from images, then sharding each). Real workloads (social media hot users, IoT time-based writes) expose skew and evolving requirements, making repartitioning and online migration necessary. Balancing and rebalancing move partitions among nodes to match demand; over-partitioning (more partitions than nodes) increases flexibility to scale in or out by reassigning existing partitions without reshaping the data layout.
Item-to-partition assignment strategies trade simplicity for control. Stateless item-based methods (range or hash) are easy to operate but coarse-grained: range partitioning risks hotspots, while hashing balances variance well yet forces many relocations when the partition count changes. Stateful directory-based methods enable fine-grained placement but introduce a potential bottleneck in the directory itself. The chapter evaluates strategies using variance (uniform spread) and relocation (data moved on resize) and motivates consistent hashing, which preserves good balance while moving only a small, proportional fraction of items when partitions change—supporting elastic growth with fewer disruptions. The overarching message is to pick and evolve partitioning, assignment, and balancing tactics to match a system’s unique access patterns, growth modes, and operational constraints.
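To make the relocation metric concrete, here is a small illustrative sketch (function and key names are my own, not the chapter's) of naive modulo hash partitioning. Growing from 8 to 9 partitions changes the modulus, so most items land in a different partition and must move:

```python
import hashlib

def mod_partition(key: str, num_partitions: int) -> int:
    """Naive hash partitioning: partition = hash(key) mod m."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

keys = [f"user-{i}" for i in range(10_000)]

before = {k: mod_partition(k, 8) for k in keys}
after = {k: mod_partition(k, 9) for k in keys}

# An item stays put only when hash % 8 == hash % 9, which is rare:
# on average roughly 8/9 of all items relocate.
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of items relocated")
```

The low variance of hashing comes for free, but the relocation cost on resize is what motivates consistent hashing later in the chapter.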
Encyclopedias and volumes
Thinking about partitioning (replication is covered in chapter 8).
Partitioning a key-value store
Assignment of data items to partitions and the assignment of partitions to nodes
Partitioning—that is, assignment of data items to partitions
Horizontal and vertical partitioning
Partitioning user information
Item-based and directory-based assignment
Balancing—that is, assignment of partitions to nodes
Partitioning (left) and over-partitioning (right)
Summary
Partitioning improves the scalability of distributed systems by distributing data across multiple resources, overcoming the limitations of a single resource.
Partitioning assigns items to partitions, while balancing assigns partitions to nodes.
Static partitioning uses a fixed number of partitions, offering simplicity but lacking elasticity, while dynamic partitioning adapts to changing demands with a variable number of partitions, adding complexity.
Horizontal partitioning (or sharding) divides data by rows, and vertical partitioning divides data by columns; both can be combined to manage different data types and scale aspects of the application independently.
Item-based assignment is a partitioning strategy that assigns each data item to a partition based solely on its own characteristics. Directory-based assignment is a partitioning strategy using a separate component called a directory or lookup table.
Consistent hashing minimizes uneven distribution and data relocation during partition changes.
Designing an adequate partitioning strategy requires considering the system's unique characteristics and requirements, and the strategy may need to change as the system evolves.
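The consistent-hashing idea summarized above can be sketched as follows. This is an illustrative toy under my own assumptions (class, method, and partition names are not the book's): each partition owns several virtual points on a hash ring, a key maps to the first point clockwise from its hash, and adding a partition relocates only the keys falling in that partition's new ring segments:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps keys to partitions on a hash ring. Each partition owns several
    virtual points (replicas) so that keys spread evenly (low variance)."""

    def __init__(self, partitions, replicas: int = 64):
        self._replicas = replicas
        self._points = []  # sorted list of (hash, partition)
        for p in partitions:
            self.add_partition(p)

    def add_partition(self, partition: str) -> None:
        for r in range(self._replicas):
            bisect.insort(self._points, (_hash(f"{partition}:{r}"), partition))

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, (_hash(key), "")) % len(self._points)
        return self._points[idx][1]

ring = ConsistentHashRing([f"p{i}" for i in range(8)])
keys = [f"user-{i}" for i in range(10_000)]

before = {k: ring.lookup(k) for k in keys}
ring.add_partition("p8")

# On average only about n/m items move (here roughly 1/9), versus
# roughly 8/9 for naive modulo hashing on the same resize.
moved = sum(1 for k in keys if before[k] != ring.lookup(k))
print(f"{moved / len(keys):.1%} of items relocated")
```

The virtual points are one common way to keep variance low with few partitions; without them, a single point per partition makes segment sizes very uneven.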
FAQ
What is partitioning in a distributed system, and how does it differ from replication?
Partitioning represents a single logical object as multiple, disjoint physical objects (partitions) spread across nodes to distribute load and improve scalability. Replication, by contrast, creates redundant copies to improve reliability. Partitioning tackles scalability limits; replication tackles reliability limits.

What challenges commonly arise when partitioning data?
Typical challenges include uneven distribution (some partitions hold far more data than others), uneven demand (some partitions receive disproportionately more traffic), and cross-references (operations that need data from multiple partitions). The “right” strategy depends on your system’s requirements and access patterns.

What is the difference between partitioning, repartitioning, balancing, and rebalancing?
Partitioning assigns items to partitions. Repartitioning reassigns previously placed items to different partitions (e.g., when the number of partitions changes). Balancing assigns partitions to nodes. Rebalancing reassigns partitions to different nodes (e.g., in response to demand or cluster size changes).

What are static and dynamic partitioning, and what are their trade-offs?
Static partitioning uses a fixed number of partitions that changes only via administrative, typically offline, operations: low complexity but not elastic. Dynamic partitioning allows the number of partitions to change online as a normal operation: elastic, but significantly more complex (especially around safe repartitioning).

What is the difference between horizontal and vertical partitioning, and can they be combined?
Horizontal partitioning (sharding) splits data by rows (items). Vertical partitioning splits data by columns (fields/attributes). They can be combined; for example, store text profile data and images in separate systems (vertical), then shard each of those across multiple partitions (horizontal) to scale independently.

How do item-based and directory-based assignment strategies compare?
Item-based assignment computes a partition from the item itself (e.g., its key or a hash of the key). It is stateless, simple, and good for coarse-grained balancing, but it cannot steer specific items to specific partitions (risking hot-spot collisions). Directory-based assignment uses a stateful lookup table to map items to partitions, enabling fine-grained placement but adding operational complexity and a potential scalability and reliability bottleneck in the directory.

What do “variance” and “relocation” mean when evaluating partitioning strategies?
Variance measures how evenly items are distributed across partitions (lower is better). Relocation measures how many items must move when the number of partitions changes (less is better). Good strategies aim to minimize both.

How do range partitioning and hash partitioning compare?
Range partitioning assigns items based on key ranges; it is simple but often yields uneven distribution (high variance) when keys are skewed. Hash partitioning selects a partition from a hash of the key; it typically balances load well (low variance) but can force many items to move when the partition count changes (high relocation).

What is consistent hashing, and why is it useful?
Consistent hashing assigns items to partitions so that when the number of partitions changes, only about n/m items (on average) need to move, where n is the number of items and m the number of partitions. It keeps both variance and relocation low, making scaling up or down far less disruptive than naive hashing or range schemes.

What is over-partitioning, and when should I use it?
Over-partitioning creates more partitions than nodes, so each node hosts multiple partitions. This gives the system flexibility to rebalance: add nodes and spread partitions out as demand grows, or remove nodes and reassign partitions as demand shrinks. The trade-off is that the number of partitions caps the maximum number of usable nodes.
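A toy sketch of over-partitioning (node and partition identifiers are made up for illustration): twelve fixed partitions initially live on three nodes, and scaling out to a fourth node moves whole partitions, never individual items, because the item-to-partition mapping stays untouched:

```python
# 12 fixed partitions assigned round-robin to 3 nodes (4 partitions each).
# Items map to partitions by e.g. hash(key) % 12, which never changes here.
assignment = {p: f"n{p % 3}" for p in range(12)}

def add_node(assignment: dict[int, str], new_node: str) -> list[int]:
    """Rebalance by moving whole partitions from over-loaded nodes
    to the new node until every node is near the target load."""
    nodes = sorted(set(assignment.values()) | {new_node})
    target = len(assignment) // len(nodes)
    load = {n: 0 for n in nodes}
    for node in assignment.values():
        load[node] += 1
    moved = []
    for part, node in sorted(assignment.items()):
        if load[new_node] >= target:
            break
        if load[node] > target:
            assignment[part] = new_node
            load[node] -= 1
            load[new_node] += 1
            moved.append(part)
    return moved

moved = add_node(assignment, "n3")
print(f"partitions moved to n3: {moved}")  # [0, 1, 2] — one from each node
```

Only three of twelve partitions change nodes, and no item is rehashed; the flip side is that twelve partitions can never usefully occupy more than twelve nodes.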