18 Kafka’s role in modern enterprise architectures
In modern enterprise architectures, Kafka serves as a real-time, scalable backbone that connects operational and analytical systems while decoupling teams. The chapter surveys how organizations use Kafka to integrate heterogeneous domains, act on streams for analytics and automation, and modernize legacy landscapes. It balances enthusiasm with caution, emphasizing that Kafka delivers the most value when paired with clear ownership, good governance, and fit-for-purpose patterns, and that it is not a universal solution for every data challenge.
In a data mesh, departments publish “data products” and own their schemas, while a platform team supplies self-service infrastructure and guardrails; Kafka underpins this with high-throughput topics, schema enforcement, access controls, and economical retention via tiered storage. The chapter shows how to liberate data from core systems using change data capture (CDC), reshape raw, normalized records into consumable streams, and choose where to perform heavy joins (often in a relational database feeding Kafka via an outbox table). It also revisits Kafka’s big-data heritage of moving vast clickstreams reliably and buffering load spikes, capabilities that remain central for bulk data movement with low operational overhead.
For industrial IoT, Kafka aggregates device and sensor data for predictive maintenance, factory automation, supply-chain visibility, and energy management, often bridging protocols like MQTT and supporting both real-time and batch consumers; tiered storage, self-service access, and careful multi-cluster choices round out the operating model. Equally important are the antipatterns: Kafka is not a relational database, not for synchronous request–response, and not a file-transfer bus; messages should be denormalized, and large documents belong in object storage with Kafka carrying references. For small, single-team workloads, simpler tools may suffice. Above all, success depends on sound architecture, disciplined governance, and pragmatic use of Kafka where it truly fits.
The core principles of a data mesh.

A central team of data engineers ensures that all data from all services is collected in a data lake. Because this team cannot accurately assess the quality of the data it ingests, the data lake often turns into a data swamp.

Kafka in a data mesh architecture.

With Kafka, we can cost-effectively provide data from mainframes or core systems to other services.

Debezium accesses the commit log of the database directly.

Kafka was originally developed at LinkedIn to move large volumes of activity and tracking data from the website and apps into its Hadoop cluster quickly, reliably, and efficiently.

Kafka as an interface between Message Queuing Telemetry Transport (MQTT) and other applications.

Do not use Kafka for synchronous request-response communication. Instead, acknowledge the request immediately, process it asynchronously, and notify the user once the process is complete.

Do not use Kafka to exchange files between systems. Store files in an object storage solution, for example, and use Kafka to notify consumers that a file has been uploaded or changed.
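
A minimal sketch of this pattern, assuming a local broker, a hypothetical `invoices.uploaded` topic, and an object storage bucket addressed by an `s3://invoices/...` URI; the event carries only a reference to the uploaded PDF, never the file itself:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class InvoiceUploadedNotifier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local cluster
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The PDF itself lives in object storage; the event only carries a reference to it.
            String event = """
                {"invoiceId":"INV-1001",
                 "uri":"s3://invoices/2025/INV-1001.pdf",
                 "sizeBytes":482133,
                 "contentType":"application/pdf"}""";
            producer.send(new ProducerRecord<>("invoices.uploaded", "INV-1001", event));
        }
    }
}
```

Consumers that actually need the document fetch it from object storage using the URI; everything else can react to the event without the file ever passing through Kafka.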

Summary
- The data mesh decentralizes data management, empowering departments to take ownership of their data products while the data team focuses on technical support.
- Key principles of the data mesh include treating data as a product, domain ownership, self-service data infrastructure, and decentralized governance.
- Apache Kafka acts as a central hub for real-time data exchange within a data mesh, enhancing data quality and accelerating data movement across the organization.
- Core systems are essential but rigid; liberating data from them with Kafka enables agile services while minimizing direct interaction and maintenance costs.
- Debezium facilitates real-time data exchange using change data capture (CDC), but the raw, normalized change events should be reshaped into new, consumable data structures to avoid complexity and improve usability for other applications.
- Kafka was designed for efficiently transferring large volumes of data in big data environments, utilizing a log-based architecture for reliable delivery and buffering against load spikes.
- Kafka efficiently handles the growing data volumes from industrial applications, enabling real-time monitoring, predictive maintenance, and centralized data collection for various use cases.
- Tiered storage in Kafka allows organizations to balance performance and cost by retaining frequently accessed data while offloading historical data to lower-cost storage, streamlining data management.
- Protocols such as Message Queuing Telemetry Transport (MQTT) provide reliable data transmission from devices, while centralized Kafka clusters simplify management and, together with self-service tools, give teams controlled access to the data.
- Kafka is not a relational database, making it unsuitable for complex queries or maintaining the current state, as it lacks full ACID guarantees, particularly transactional isolation.
- Kafka is not a synchronous communication interface; it operates asynchronously, so immediate feedback must come from a database or a synchronous API instead.
- Kafka is not a file exchange platform, as it is not designed for large files like PDFs; it is more effective to send machine-readable data or links to files stored externally.
- Kafka is not ideal for small applications with low data volumes. In these cases, simpler solutions like a database may be more effective.
FAQ
What problems in traditional data management does a data mesh address?
- Centralized data teams become bottlenecks for ETL and quality control.
- Frequent source changes (e.g., CRM/ERP schema tweaks) break pipelines.
- Lack of domain ownership creates poor documentation and inconsistent data.
- Data lakes drift into “data swamps” without governance and standards.
What are the four core principles of a data mesh?
- Domain ownership: business domains own and publish their data.
- Data as a product: data is treated like a product with clear contracts and consumers.
- Self-service data platform: domain-agnostic tooling to publish/consume safely.
- Decentralized governance: shared standards for security, semantics, and interoperability.
How does Kafka support a data mesh in practice?
- Real-time publishing and consumption of data products at scale.
- Schema enforcement for reliable, compatible data exchange (see the producer sketch after this list).
- Integration with governance tools for ACLs, quotas, and compliance.
- Cost-effective long-term retention with tiered storage.
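
To make the schema-enforcement point concrete (referenced in the list above), here is a minimal producer sketch assuming the Confluent Avro serializer is on the classpath, a Schema Registry runs at `http://localhost:8081`, and a hypothetical `customers` topic serves as the data product; the serializer registers the Avro schema, and the registry rejects incompatible changes to it:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class CustomerDataProductProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption: local cluster
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081"); // assumption: local Schema Registry

        // Hypothetical schema for the "customers" data product.
        Schema schema = new Schema.Parser().parse("""
            {"type":"record","name":"Customer","namespace":"example",
             "fields":[{"name":"id","type":"string"},
                       {"name":"email","type":"string"}]}""");

        GenericRecord customer = new GenericData.Record(schema);
        customer.put("id", "42");
        customer.put("email", "jane@example.com");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customers", "42", customer));
        }
    }
}
```

Consumers use the matching Avro deserializer, and the registered schema serves as the published contract of the data product.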
Who is responsible for what in a Kafka-backed data mesh?
- Producing business units: define, document, and publish data products; manage lifecycle and access; honor schema contracts.
- Consuming business units: request access; comply with schemas, security, and governance.
- Data platform team: provide the platform, tooling, and guardrails; enable self-service.
How can Kafka help “liberate” data from core systems?
- Use CDC (e.g., Debezium) to stream changes from core databases into Kafka.
- Avoid exposing raw, normalized core schemas broadly; build consumable, denormalized data products (see the sketch after this list).
- Derive consolidated streams with Kafka Streams (noting complexity of large joins) or perform joins in a separate relational DB and publish results via outbox/CDC.
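
As referenced above, a minimal sketch of turning raw change events into a consumable data product, assuming a Debezium connector writes JSON change events (converter schemas disabled, so each value carries `before`, `after`, and `op` fields) to a hypothetical `crm.public.customers` topic; the field names and target topic are illustrative:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CustomerDataProductBuilder {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "customer-data-product");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(List.of("crm.public.customers")); // hypothetical CDC topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) continue;          // tombstones
                    JsonNode change = mapper.readTree(record.value());
                    JsonNode after = change.get("after");          // row state after the change
                    if (after == null || after.isNull()) continue; // skip deletes in this sketch

                    // Publish a reduced, consumer-friendly view instead of the raw table layout.
                    String dataProduct = mapper.createObjectNode()
                            .put("customerId", after.get("id").asText())
                            .put("email", after.get("email").asText())
                            .toString();
                    producer.send(new ProducerRecord<>("customers", after.get("id").asText(), dataProduct));
                }
            }
        }
    }
}
```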
When should you avoid large joins in Kafka Streams, and what are alternatives?
- Avoid when joins span many tables and require heavy state; operational complexity and cost rise sharply (see the join sketch after this list).
- Prefer pre-joining in a relational DB fed by CDC, then publish curated results via outbox.
- Design messages to be self-contained and denormalized to minimize downstream joins.
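
For reference, the join sketch mentioned above: a single enrichment join in Kafka Streams, assuming `orders` and `customers` topics that are both keyed by customer ID and carry string values. Even this one join materializes the customers table as local state backed by a changelog topic, and each additional table joined this way adds more state to build, store, and restore:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class OrderEnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The customers table is kept as local state; every further joined table adds more state.
        KTable<String, String> customers = builder.table("customers");
        KStream<String, String> orders = builder.stream("orders"); // assumption: keyed by customer ID

        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("orders-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```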
What IIoT use cases fit Kafka, and how do devices send data?
- Use cases: predictive maintenance, factory automation, supply chain optimization, energy management, real-time production analytics.
- Ingestion: write directly to Kafka if devices/networks are robust; otherwise use MQTT brokers that bridge into Kafka (a minimal bridge sketch follows this list).
- Security: avoid direct broker access from untrusted devices; handle outages with buffering; apply topic-level ACLs and consider a Kafka proxy for finer control.
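
The bridge sketch referenced above uses the Eclipse Paho MQTT client together with a plain Kafka producer; the broker addresses, the `factory/+/sensors/#` topic filter, and the `iot.sensor-readings` topic are assumptions, and a production bridge (or a Kafka Connect MQTT connector) would add reconnection, error handling, and security:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttCallback;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;

import java.util.Properties;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local Kafka cluster
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        MqttClient mqtt = new MqttClient("tcp://localhost:1883", "kafka-bridge"); // assumption: local MQTT broker
        mqtt.setCallback(new MqttCallback() {
            @Override
            public void connectionLost(Throwable cause) { /* reconnect/alert in a real bridge */ }

            @Override
            public void messageArrived(String topic, MqttMessage message) {
                // Use the MQTT topic as the record key so readings from one sensor stay ordered.
                producer.send(new ProducerRecord<>("iot.sensor-readings", topic, message.getPayload()));
            }

            @Override
            public void deliveryComplete(IMqttDeliveryToken token) { }
        });
        mqtt.connect();
        mqtt.subscribe("factory/+/sensors/#"); // hypothetical topic filter
    }
}
```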
How should organizations handle Kafka data retention and historical data?
- Retain long enough for all consumers to process.
- Use tiered storage to keep hot data on brokers and cold data on cheaper storage (see the topic-configuration sketch after this list).
- Plan for historical access patterns (e.g., ML training) and retrieval by data age.
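
A sketch of how such a retention policy might be declared per topic with the Kafka Admin API, assuming brokers on which tiered storage (remote log storage) is already enabled; the topic name, partition counts, and retention values are illustrative:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SensorTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local cluster

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic readings = new NewTopic("iot.sensor-readings", 12, (short) 3)
                    .configs(Map.of(
                            // Total retention: roughly one year, so training jobs can replay history.
                            "retention.ms", String.valueOf(365L * 24 * 60 * 60 * 1000),
                            // Offload older segments (requires a remote storage plugin on the brokers).
                            "remote.storage.enable", "true",
                            // Keep only about three days on local broker disks.
                            "local.retention.ms", String.valueOf(3L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(readings)).all().get();
        }
    }
}
```

With these settings, roughly the last three days of data stay on broker disks, while the rest of the one-year retention window is served from the cheaper remote tier.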
When does it make sense to run multiple Kafka clusters?
- Performance isolation for workloads with different latency/throughput needs.
- Data security and regulatory isolation (e.g., HR vs. manufacturing).
- Geographic distribution for latency and data sovereignty.
- Extreme scale and fault tolerance by splitting domains.
- Otherwise, prefer a single cluster to reduce operational overhead.
What is Kafka not well-suited for, and what are common antipatterns?
- Not a relational database: avoid current-state queries and heavy multi-table joins.
- Not for synchronous request–response: use DB/REST for sync, and CDC/Outbox for async workflows.
- Not a file exchange platform: store files in object storage; send references or split large payloads into events.
- Often overkill for small, single-team, low-volume monoliths.
- Not a substitute for sound architecture and domain understanding.