Overview

16 Disaster management

Disaster management in Kafka focuses on anticipating failures, minimizing business impact, and restoring service with integrity. The chapter outlines how outages can translate into financial loss, customer dissatisfaction, compliance exposure, and reputational damage, and emphasizes distinguishing critical from noncritical scenarios to prioritize recovery strategies. It highlights both technical and human causes of incidents, advocates for strong monitoring and runbooks, and underscores Kafka’s resilience—durable logs and asynchronous design—while reminding teams that application behavior during outages (especially producers) ultimately determines end-to-end reliability.

Core failure modes include network, compute, storage, and full datacenter events. Client–broker network issues cause producer backlogs and require application choices (block, buffer, drop, or fail fast), while inter-broker partitions can disrupt replication and leadership until connectivity returns. Compute failures are mitigated by replication and write guarantees—using appropriate replication factor, min.insync.replicas, and acks=all—plus straightforward broker recovery with the same broker.id. Storage risks center on full disks and failed volumes; proactive capacity monitoring, stopping Kafka before exhaustion, and using multiple log directories help limit loss. For datacenter failures, a stretched cluster across at least three sites with rack-aware placement (broker.rack) improves tolerance, and client.rack can reduce cross-zone traffic by consuming from local replicas, though costs and latency constrain multi-region setups; where multi-DC is not viable, distribute brokers across independent racks within a single site.
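The write guarantees above map to a small set of broker/topic and producer settings. A minimal sketch with illustrative values (the timeouts and buffer size are assumptions, not recommendations):

```properties
# --- topic/broker side (topic created with replication factor 3) ---
# A write is acknowledged only once this many replicas are in sync.
min.insync.replicas=2

# --- producer side ---
# Wait for all in-sync replicas before considering a send successful.
acks=all
# Fail fast instead of blocking indefinitely when brokers are unreachable:
# maximum time send() may block waiting for metadata or buffer space.
max.block.ms=10000
# Total time a record may be retried before the send is reported as failed.
delivery.timeout.ms=120000
# In-memory buffer that can absorb short broker or network outages.
buffer.memory=33554432
```

Whether a producer blocks, buffers, drops, or fails fast is then a matter of tuning these timeouts and of how the application reacts to the resulting send errors.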

Backup strategies are nuanced: some deployments can re-import from source systems or treat Kafka as transient, while others need stronger recovery plans. Filesystem snapshots are familiar but can be inconsistent across brokers and store redundant replicas; continuous exports via Kafka Connect (for example to object storage) capture data but often miss consumer offsets; tiered storage remains limited for certain topic types and predictability. Mirroring with MirrorMaker 2 provides continuous protection and portability by replicating topics, ACLs, and consumer offsets. Common topologies include active–passive (simple failover but post-event re-seeding complexity), active–active (bi-directional mirroring with remote topics to avoid loops and read-only protections, requiring care to prevent double-processing via idempotence and transactions), and hub-and-spoke (central aggregation plus selective fan-out to edges), enabling resilient operations even under intermittent connectivity.

The figures in this chapter illustrate:

  • The many different failure cases we need to consider when operating Kafka or other distributed systems.
  • A stretched cluster, in which our Kafka brokers are distributed among multiple data centers.
  • An active-passive cluster pairing with MirrorMaker.
  • An active-active cluster pairing with MirrorMaker 2.
  • A hub-and-spoke topology that uses MirrorMaker 2 (Kafka Connect) to replicate topics from the spokes to the hub and to replicate the commands topic from the hub to the spokes.

Summary

  • Disaster management in Kafka focuses on strategies to handle failures and minimize the likelihood of disasters.
  • Failures in Kafka fall into three main categories: network issues, broker problems, and persistent storage failures, all of which are often exacerbated by human error.
  • Network failures are common in distributed systems. Individual client connection issues can arise, but these are typically resolved quickly.
  • Broker issues can lead to data loss if messages are not properly committed before a broker failure. Ensuring acks=all and a sufficient number of in-sync replicas is critical for data delivery assurance.
  • Persistent storage failures are one of the most severe problems in Kafka. Ideally, these failures only affect a single broker, but they can lead to irreversible data corruption if not handled properly.
  • Conventional backups are not very practical in Kafka due to continuous message production and consumption, which can lead to potential data loss and inconsistencies.
  • Stretched clusters reduce the likelihood of total failure by operating a Kafka cluster across multiple data centers, mitigating risks from data center outages.
  • In an active-passive pairing, a passive cluster mirrors the active one and takes over in case of a failure; traffic does not automatically revert to the original cluster afterward.
  • Active-active pairing consists of two equally capable clusters mirroring each other, allowing continuous operation without needing to rebuild clusters during a failure.
  • In active-active configurations, consumers can read from both clusters and need to aggregate data appropriately.
  • Remote topics and partitions, introduced in MirrorMaker 2, prevent endless loops of topic mirroring while ensuring seamless data availability during failures.
  • The hub-and-spoke topology features a central cluster that aggregates data from smaller local clusters, allowing independent operation even if the central cluster is down.
  • Both active-active and hub-and-spoke topologies aim to maintain data consistency even during failures.
  • It is recommended to avoid active-passive pairings and prefer active-active configurations for better resilience and performance.
  • Confluent Replicator provides a proprietary alternative to MirrorMaker for mirroring clusters.

FAQ

What kinds of failures should I plan for when operating Kafka?
Plan for four technical domains—network, compute (brokers and controllers), storage (disks and volumes), and entire datacenter outages—plus human error (misconfigurations and operational mistakes). Each requires its own detection, mitigation, and recovery playbook.
How should producers and consumers respond to network outages?
If clients cannot reach brokers, consumers typically just wait and resume; producers must choose to block, buffer, drop, or fail fast. Inter-broker issues can disrupt replication and trigger leader elections; the main remedy is restoring connectivity. Strong monitoring, alerting, and clear runbooks are essential; Kafka’s durability usually prevents data loss once the network recovers.
How do replication factor, min.insync.replicas, and acks affect fault tolerance?
High availability for writes requires producers to use acks=all and topics to be configured with an appropriate min.insync.replicas (minISR). Example: with RF=3 and minISR=2, the cluster can lose one broker without impacting producers; a second loss blocks acks=all producers, but consumers can still read from the surviving leader. Raising RF (e.g., RF=4, minISR=2) increases tolerance but also cost; for higher resilience, consider multi-cluster patterns.
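The failure arithmetic in this answer can be made explicit. A minimal sketch (the helper names are ours, not part of Kafka):

```python
def write_failures_tolerated(rf: int, min_isr: int) -> int:
    """Broker losses an acks=all producer can tolerate: writes succeed
    as long as at least min_isr of the rf replicas remain in sync."""
    return rf - min_isr


def read_failures_tolerated(rf: int) -> int:
    """Broker losses consumers can tolerate: a single surviving replica
    can still serve reads as leader."""
    return rf - 1


# RF=3, minISR=2: one broker may fail without impacting producers.
print(write_failures_tolerated(3, 2))  # 1
# RF=4, minISR=2: two brokers may fail before acks=all producers block.
print(write_failures_tolerated(4, 2))  # 2
# RF=3: consumers keep reading until the last replica is gone.
print(read_failures_tolerated(3))      # 2
```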
What’s the correct way to recover a failed broker?
Bring the broker back with the same broker ID. If hardware is replaced and the disks are empty, Kafka recognizes the ID, the broker resumes responsibility for its previously assigned replicas, and it automatically replicates the missing data from the leaders until it is in sync and can reassume preferred leadership.
How do I prevent and handle storage-related incidents like full disks or disk loss?
Never let Kafka disks fill up—Linux handles full volumes poorly and data in page cache may be lost. Enforce monitoring, alerting, and headroom; stop brokers before volumes fill. In Kubernetes/virtualized setups, expand disks; on bare metal, replace hardware. Configure multiple log.dirs (without RAID) to survive a single disk failure and let the broker resync the lost partitions. Central storage failures may require fixes and restarts, with some risk of loss.
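The multiple-log-directory setup described here is a one-line broker setting; the paths below are placeholders for independent physical disks:

```properties
# server.properties: spread partitions across independent disks (no RAID).
# If one disk fails, only the partitions stored on it go offline on this
# broker; they are re-replicated from the other brokers after repair.
log.dirs=/var/kafka/data-1,/var/kafka/data-2,/var/kafka/data-3
```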
Do I need to back up Kafka? What options exist and what are the pitfalls?
Some deployments can re-ingest from sources and may not need Kafka backups. If backups are required, filesystem snapshots are simple but can miss partitions when moves occur and may store redundant replicas. Continuous offloads (e.g., via Kafka Connect to object storage) capture data but often miss consumer offsets. Tiered storage isn’t a full backup (no compacted topics or offsets, unpredictable movement). Commercial tools can provide end-to-end backup/restore and cloning.
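A continuous offload to object storage is typically set up as a Kafka Connect sink. The sketch below assumes Confluent's S3 sink connector is installed; the connector name, topics, bucket, and region are placeholders:

```properties
# Connect sink: continuously copy topic data to S3 (data only; consumer
# group offsets are NOT captured by this approach).
name=s3-backup-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=orders,payments
s3.bucket.name=kafka-backup-bucket
s3.region=eu-central-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
# Records per object written to the bucket.
flush.size=1000
```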
What is a stretched Kafka cluster and when should I use it?
A stretched cluster spreads brokers across multiple nearby datacenters/availability zones (latency ~≤30 ms). Use at least three locations to maintain coordination-quorum majority. Set broker.rack to distribute replicas evenly; if cross-AZ costs are high, use client.rack (KIP-392) to consume from local replicas. If multi-DC isn’t possible, distribute brokers across independent racks within a single DC.
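A sketch of the rack-aware settings mentioned here; the zone labels are placeholders, and the selector class is the one Kafka ships for KIP-392:

```properties
# --- each broker's server.properties ---
# Label the broker with its datacenter/availability zone so that replicas
# of a partition are spread across zones.
broker.rack=az-1
# Allow consumers to fetch from the closest replica (KIP-392).
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# --- consumer configuration ---
# Matches the broker.rack labels; this consumer prefers a replica in az-1.
client.rack=az-1
```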
How does MirrorMaker 2 work and what does it copy?
MirrorMaker 2 (on Kafka Connect) uses three connectors: MirrorSourceConnector (topics and optionally ACLs), MirrorCheckpointConnector (consumer offsets), and MirrorHeartbeatConnector (link health). It creates “remote topics” with source-cluster prefixes and enforces read-only ACLs to prevent loops and accidental writes.
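A minimal one-directional MirrorMaker 2 configuration might look as follows; the cluster aliases and addresses are placeholders:

```properties
# mm2.properties: replicate everything from "primary" to "backup".
clusters = primary, backup
primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# Enable the primary -> backup flow; mirrored topics are created on the
# target as remote topics with the source prefix, e.g. "primary.orders".
primary->backup.enabled = true
primary->backup.topics = .*

# Translate and sync consumer group offsets to the target cluster.
primary->backup.emit.checkpoints.enabled = true
primary->backup.sync.group.offsets.enabled = true
```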
Active-passive vs. active-active mirroring: which should I choose and how do failovers work?
Active-passive is simple: all traffic goes to the active cluster while the passive receives mirrored data; on failure, switch client endpoints. After failover, re-establish a fresh passive (or re-seed) to restore redundancy. Active-active runs in both sites with bidirectional mirroring; consumers often use regex subscriptions for local and remote topics. Beware duplicate processing for services that consume and then produce—use idempotence and transactional IDs with deduplication.
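The safeguards against double-processing mentioned above correspond to a handful of client settings; the transactional ID below is a placeholder:

```properties
# --- producer of the consume-then-produce service ---
# Broker-side deduplication of retried sends.
enable.idempotence=true
# A stable ID enables transactional, exactly-once writes across restarts.
transactional.id=order-processor-1
acks=all

# --- downstream consumers ---
# Only deliver records from committed transactions.
isolation.level=read_committed
```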
What is a hub-and-spoke Kafka topology and when is it useful?
A central “hub” cluster aggregates data from multiple “spoke” clusters via MirrorMaker, and can also push commands/configuration back to spokes. It suits multi-site organizations needing local autonomy during link outages; sites keep operating with stale commands at worst, and data resynchronizes once connectivity is restored.
