16 Disaster management
Disaster management in Kafka focuses on anticipating failures, minimizing business impact, and restoring service with integrity. The chapter outlines how outages can translate into financial loss, customer dissatisfaction, compliance exposure, and reputational damage, and emphasizes distinguishing critical from noncritical scenarios to prioritize recovery strategies. It highlights both technical and human causes of incidents, advocates for strong monitoring and runbooks, and underscores Kafka’s resilience—durable logs and asynchronous design—while reminding teams that application behavior during outages (especially producers) ultimately determines end-to-end reliability.
Core failure modes include network, compute, storage, and full datacenter events. Client–broker network issues cause producer backlogs and require application choices (block, buffer, drop, or fail fast), while inter-broker partitions can disrupt replication and leadership until connectivity returns. Compute failures are mitigated by replication and write guarantees—using appropriate replication factor, min.insync.replicas, and acks=all—plus straightforward broker recovery with the same broker.id. Storage risks center on full disks and failed volumes; proactive capacity monitoring, stopping Kafka before exhaustion, and using multiple log directories help limit loss. For datacenter failures, a stretched cluster across at least three sites with rack-aware placement (broker.rack) improves tolerance, and client.rack can reduce cross-zone traffic by consuming from local replicas, though costs and latency constrain multi-region setups; where multi-DC is not viable, distribute brokers across independent racks within a single site.
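The write-durability settings above map to a handful of producer and topic configurations. Below is a minimal sketch using the standard Kafka Java clients; the broker addresses, the `orders` topic, and all sizing values are placeholders, not recommendations from the chapter.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableWrites {
    public static void main(String[] args) throws Exception {
        String bootstrap = "broker1:9092,broker2:9092,broker3:9092"; // placeholder addresses

        // Replication factor 3 with min.insync.replicas=2 keeps acknowledged
        // writes safe through the loss of any single broker.
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", bootstrap);
        try (Admin admin = Admin.create(adminProps)) {
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(orders)).all().get();
        }

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait for all in-sync replicas before treating a write as committed.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence prevents duplicates from internal retries after transient failures.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Bound how long send() may block when brokers are unreachable, so the
        // application can choose to buffer, drop, or fail fast instead of hanging.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "payload"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // This callback is where the block/buffer/drop/fail-fast
                            // decision discussed above has to be made.
                            System.err.println("Send failed: " + exception.getMessage());
                        }
                    });
        }
    }
}
```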
Backup strategies are nuanced: some deployments can re-import from source systems or treat Kafka as transient, while others need stronger recovery plans. Filesystem snapshots are familiar but can be inconsistent across brokers and store redundant replicas; continuous exports via Kafka Connect (for example to object storage) capture data but often miss consumer offsets; and tiered storage is limited in which topic types it supports and in how predictably it can serve as a backup. Mirroring with MirrorMaker 2 provides continuous protection and portability by replicating topics, ACLs, and consumer offsets. Common topologies include active-passive (simple failover, but re-seeding the old active after an event is complex), active-active (bi-directional mirroring, with remote topics to avoid loops and read-only protections, and care required to prevent double-processing via idempotence and transactions), and hub-and-spoke (central aggregation plus selective fan-out to the edges), enabling resilient operations even under intermittent connectivity.
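To make the active-active pattern concrete, the sketch below shows a consumer on one cluster reading both the local topic and its mirrored remote counterpart. It assumes MirrorMaker 2's default replication policy and hypothetical cluster aliases `east` and `west`, so the east cluster's `orders` topic appears on the west cluster as the remote topic `east.orders`; the addresses, group, and rack names are placeholders.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ActiveActiveConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "west-broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-aggregator");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Fetch from a replica in the consumer's own zone where possible; this
        // takes effect when the brokers use a rack-aware replica selector.
        props.put("client.rack", "west-1a");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "orders" holds locally produced events; "east.orders" is the read-only
            // remote topic mirrored from the east cluster. Subscribing to both
            // aggregates the full data set, and the alias prefix is what prevents
            // mirroring loops between the two clusters.
            consumer.subscribe(Pattern.compile("^(east\\.)?orders$"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```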
The chapter's figures (not reproduced here) illustrate:

- The many different failure cases to consider when operating Kafka or any other distributed system.
- A stretched cluster, with Kafka brokers distributed across multiple data centers.
- An active-passive cluster pairing with MirrorMaker.
- An active-active cluster pairing with MirrorMaker 2.
- A hub-and-spoke topology that uses MirrorMaker 2 (Kafka Connect) to replicate topics from the spokes to the hub, and the commands topic from the hub back to the spokes.

Summary
- Disaster management in Kafka focuses on strategies to handle failures and keep them from escalating into disasters.
- There are three types of failures in Kafka: network issues, broker problems, and persistent storage failures, often exacerbated by human error.
- Network failures are common in distributed systems. Individual client connection issues can arise, but these are typically resolved quickly.
- Broker issues can lead to data loss if messages are not properly committed before a broker failure. Ensuring acks=all and a sufficient number of in-sync replicas is critical for data delivery assurance.
- Persistent storage failures are one of the most severe problems in Kafka. Ideally, these failures only affect a single broker, but they can lead to irreversible data corruption if not handled properly.
- Conventional backups are not very practical in Kafka due to continuous message production and consumption, which can lead to potential data loss and inconsistencies.
- Stretched clusters reduce the likelihood of total failure by operating a Kafka cluster across multiple data centers, mitigating risks from data center outages.
- In an active-passive pairing, the active cluster is continuously mirrored to a passive cluster that takes over in case of failure; after a failover, the roles are not simply reverted. Because MirrorMaker 2 also mirrors consumer offsets, groups can resume on the surviving cluster, as in the sketch after this list.
- Active-active pairing consists of two equally capable clusters mirroring each other, allowing continuous operation without needing to rebuild clusters during a failure.
- In active-active configurations, consumers can read from both clusters and need to aggregate data appropriately.
- Remote topics and partitions, introduced in MirrorMaker 2, prevent endless loops of topic mirroring while ensuring seamless data availability during failures.
- The hub-and-spoke topology features a central cluster that aggregates data from smaller local clusters, which can continue to operate independently even if the central cluster is down.
- Both active-active and hub-and-spoke topologies aim to maintain data consistency even during failures.
- It is recommended to avoid active-passive pairings and prefer active-active configurations for better resilience and performance.
- Confluent Replicator provides a proprietary alternative to MirrorMaker for mirroring clusters.
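Because MirrorMaker 2 mirrors consumer-group offsets through checkpoint topics, a consumer group that fails over between clusters can resume near where it left off. The sketch below uses `RemoteClusterUtils` from the `connect-mirror-client` library to translate offsets after a failover from a hypothetical `east` cluster to `west`; it assumes MM2 checkpointing is enabled so the checkpoint topics exist on the target cluster, and all aliases, addresses, and the group name are assumptions.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class FailoverOffsets {
    public static void main(String[] args) throws Exception {
        // Client configuration for the surviving (west) cluster; placeholder address.
        Map<String, Object> westProps = new HashMap<>();
        westProps.put("bootstrap.servers", "west-broker:9092");

        // Translate the group's offsets, mirrored from the failed east cluster,
        // into offsets that are valid for the remote topics on the west cluster.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(
                        westProps, "east", "orders-aggregator", Duration.ofSeconds(30));

        Map<String, Object> consumerProps = new HashMap<>(westProps);
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-aggregator");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            // Commit the translated offsets so the group resumes on the west
            // cluster close to where it stopped on the east cluster.
            consumer.commitSync(translated);
        }
    }
}
```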