Overview

8 Replication

This chapter motivates replication through the lens of durability in transactional systems: once a system “makes a promise,” it must not backtrack—even in the face of crashes and network faults. To avoid single points of failure, we add redundancy, which is not just duplication but duplication plus coordination. Redundancy can increase reliability and sometimes scalability, yet the relationship is nuanced. The text distinguishes static (unchanging) from dynamic (evolving) redundancy, and uses classic majority voting to show how coordinated replicas can mask individual failures while preserving the behavior of a single logical component.

Before formal models, the chapter reframes replication as representing “one logical thing” with multiple physical instances, highlighting the ambiguity of identity and equivalence. Using the example of books, editions, and copies, it shows how different perspectives yield different answers to what counts as “the same,” a question at the heart of replication and consistency. Replication transparency—the system’s ability to hide many replicas behind the illusion of one—requires careful balance between concealing and exposing details, trading off among consistency, availability, and latency.

The mechanics center on stateful replication, where complexity arises from change: updates must be propagated, and not all changes are equal (monotonic updates preserve knowledge; non-monotonic ones can invalidate it). The system model embraces partial synchrony, failures, and partitions, introducing inherent and imposed replication lag. The chapter contrasts synchronous and asynchronous replication and common quorum hybrids; state-based versus log-based (with deterministic state machines), noting industry preference for log-based; and topology choices—single-leader, multi-leader, and leader-less—each requiring conflict resolution (for example, last-write-wins or CRDTs) with practical pitfalls. Finally, it cautions that follower reads may lag, so fresh reads often require the leader, underscoring the core trade-offs that shape replication strategies.

Redundancy as duplication and coordination
Library inventory of Structure and Interpretation of Computer Programs
Replication represents a single logical object by multiple, identical physical objects.
A replicated key-value store
The network as point-to-point communication links between components
Replication lag: Instantaneous propagation of changes is impossible, resulting in an inherent lag.
Synchronous replication
Asynchronous replication
Single-leader, multi-leader, and leader-less.

Summary

  • Redundancy aims to improve the reliability of a system, growing beyond the reliability limits of a single resource.
  • Redundancy refers to the duplication and coordination of subsystems, so that an increase in the duplication factor results in increased reliability.
  • Static redundancy refers to redundancy where the set of components and their interactions do not change, while dynamic redundancy refers to redundancy where they do change.
  • Replication, the employment of multiple instances of “the same thing,” is the most common implementation of duplication.
  • Replication improves the reliability of distributed systems by distributing data across multiple resources, overcoming the limitations of a single resource.
  • Replication lag is an inherent aspect of distributed systems and complicates replication transparency and consistency.
  • Synchronous replication ensures consistency but may impact latency and availability, while asynchronous replication improves latency and availability but may impact consistency.
  • State-based replication propagates the current state of the system, while log-based replication propagates the sequence of operations leading to the state.

FAQ

What is redundancy in distributed systems, and what are its main types?

Redundancy is the duplication and coordination of subsystems to improve reliability and/or scalability. There are two types: (1) Static redundancy: the set of components and their interactions do not change during the system’s lifetime (common in hardware). (2) Dynamic redundancy: components and their interactions can change over time (common in software).

How does redundancy relate to scalability?

Redundancy often aids scalability by distributing data across multiple nodes and enabling load partitioning (for example, round-robin across replicas). However, the relationship is not straightforward—coordination overheads or consistency requirements can also decrease scalability.

Why is durability important, and how does redundancy help achieve it?

Durability ensures that once a transaction is committed, its effects are permanent—no “backtracking.” In real systems subject to failures, a single component can fail and break promises (for example, shipping goods without actually capturing payment). Redundancy avoids single points of failure and helps uphold durable promises despite crashes or recoveries.

What does “duplication and coordination” mean in practice?

Duplicating components alone does not create a reliable system; the duplicates must be coordinated so they behave like one coherent component. Coordination can be simple (load balancing) or complex (consensus). For example, three replicated logic gates coordinated by a majority vote can tolerate one failure while preserving correct output.

What is replication and what is replication transparency?

Replication represents a single logical object by multiple, identical physical objects (replicas) to improve reliability. Replication transparency is the system’s ability to hide the fact that multiple replicas exist and present the illusion of a single object, balancing consistency, availability, and latency.

What system model assumptions matter for replication?

The system is partially synchronous and components can suffer Crash-Stop, Omission, and Crash-Recovery failures. Communication happens over unreliable, point-to-point links, which makes it natural to reason about network partitions (temporary, intermittent, or permanent link failures leading to message loss).

What is replication lag, and why is it unavoidable?

Replication lag is the delay between applying a change on one replica and that change being visible on others. Because only one component or the network takes a step at a time, updates cannot be applied simultaneously across replicas. There is inherent lag (fundamental to distribution) and imposed lag (from partitions and failures). Lag can break replication transparency, especially for reads on followers.

How do synchronous, asynchronous, and quorum (hybrid) replication differ?
  • Synchronous: an operation completes only after all replicas acknowledge the change. Pro: immediate consistency. Con: higher latency and reduced availability if any replica is slow or unreachable.
  • Asynchronous: an operation completes after the initial node processes it; replication happens in the background. Pro: low latency and higher availability. Con: replicas can be stale.
  • Quorum (hybrid): wait for acknowledgments from a majority (quorum) in the foreground; replicate to others asynchronously. Balances latency, availability, and consistency.
What’s the difference between state-based and log-based replication?

Assuming a deterministic state machine: (1) State-based replicates the current state (or a diff), regardless of the operations that produced it. (2) Log-based replicates the sequence of operations that led to the state. In practice, log-based is often preferred because it preserves ordering and enables deterministic replays.

How do single-leader, multi-leader, and leader-less replication compare, and how are conflicts handled?
  • Single-leader: one node accepts operations and propagates to followers—simple “chain of command,” but the leader can become a single point of failure.
  • Multi-leader: multiple leaders accept operations—avoids a single bottleneck but introduces concurrent updates that can conflict.
  • Leader-less: any node accepts operations—maximizes availability but requires conflict resolution.

Conflicts are resolved via strategies like last-write-wins (simple but can cause unexpected overwrites) or CRDTs (designed to converge without conflicts). Also note: reads from followers can be stale due to replication lag; read from the leader for the freshest data.

pro $24.99 per month

  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose one free eBook per month to keep
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime

lite $19.99 per month

  • access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more


choose your plan

team

monthly
annual
$49.99
$399.99
only $33.33 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Think Distributed Systems ebook for free
choose your plan

team

monthly
annual
$49.99
$399.99
only $33.33 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Think Distributed Systems ebook for free
choose your plan

team

monthly
annual
$49.99
$399.99
only $33.33 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Think Distributed Systems ebook for free