1 Getting to know Kafka as an architect
This chapter positions Apache Kafka as a foundational platform for event-driven architecture, explaining how it decouples producers from consumers so that a single published event can be turned into low-latency actions by many independent consumers. It highlights Kafka’s evolution from high-throughput messaging to a durable, real-time event streaming ecosystem that enables use cases like fraud detection, personalized recommendations, operational alerts, and predictive maintenance.
Beyond technology mechanics, the chapter centers on architectural judgment: when Kafka is a good fit, how it reshapes system design and operations, and what governance and event design practices ensure sustainable adoption. It emphasizes that success depends less on code and more on patterns, trade-offs, and integration strategies appropriate to enterprise environments.
- Principles of event-driven architecture
- Overview of the Kafka ecosystem
- Utilizing Kafka in enterprise environments
Key takeaways
- Kafka serves as a durable, scalable backbone for real-time event processing, enabling many consumers to react to a single published event without brittle point-to-point integrations.
- Event-driven systems unlock time-sensitive value across industries; latency reductions translate directly to business impact.
- Adoption requires architectural rigor: event modeling, topic design, consumer patterns, operational considerations, and governance.
- The chapter guides architects (and technical leads) in comparing Kafka to alternatives, assessing fit, and making deliberate design choices rather than focusing on language-specific implementation details.
Outcomes for readers
- Confidently evaluate Kafka’s suitability for a given problem and environment.
- Design event-driven systems that leverage Kafka’s decoupling, durability, and scalability.
- Establish governance and patterns that support long-term maintainability and enterprise integration.
1.1 How an architect sees Kafka
Traditional system integration often relied on synchronous request-response communication (commonly REST) between narrowly focused services. While straightforward and well-tooled, chaining synchronous calls introduces tight coupling, coordination complexity, fragility, and a higher risk of cascading failures. Modern requirements push architects toward more autonomous, flexible interactions where components react to changes independently.
The request-response design pattern: a caller invokes a service directly and waits for its reply.

1.1.1 Event-driven architecture
Event-driven architecture (EDA) centers on producing, detecting, consuming, and reacting to events via an intermediary channel. With Kafka as the backbone, services publish events describing changes, and interested consumers react asynchronously. For example, instead of calling CustomerService for each order, OrderService subscribes to address-change events and maintains a local copy, enabling loose coupling, scalability, and resilience. Multiple consumers can subscribe to the same event stream, and offline components can catch up later, provided delivery is reliable.
The EDA style of communication: systems publish events that describe changes, allowing others to react asynchronously.

EDA introduces trade-offs: eventual consistency (temporary state divergence), concerns about ordering and duplicates, and the need for idempotency and robust error handling. It can add latency and operational complexity, which architects must weigh against the benefits of decoupling and autonomy.
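To make the mechanics concrete, the following is a minimal sketch (not from the chapter) of the producing side of such an address-change event using the Java client; the topic name, broker address, and JSON payload are illustrative assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AddressChangePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = customer ID, value = a JSON payload describing the change.
            ProducerRecord<String, String> event = new ProducerRecord<>(
                    "customer-address-changed",                   // hypothetical topic name
                    "customer-42",
                    "{\"customerId\":\"customer-42\",\"city\":\"Berlin\"}");

            // Fire-and-forget with an optional acknowledgment callback.
            producer.send(event, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();                  // real code would retry or alert
                } else {
                    System.out.printf("Acked at %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```

Note that the producer has no knowledge of which services consume the event; OrderService would simply subscribe to the same topic with its own consumer.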
1.1.2 Handling high volumes of data
For high-volume, low-latency scenarios—such as user behavior analytics, log aggregation, fraud detection, and predictive maintenance—event-driven pipelines are often the only practical choice. Kafka excels at sustained throughput and low latency, helping architects implement reliable, scalable EDA and efficiently handle high-rate message streams. The chapter concludes by setting up a scenario where a team evaluates adopting EDA with Kafka.
1.2 Field notes: Journey of an event-driven project
An account manager brings a wave of change requests to a team already stretched by a sluggish data warehouse. Despite modern practices and upgrades, the nightly load now takes 14 hours due to fivefold data growth and poor source data quality, delaying reports and eroding trust.
Amid this pressure, the business proposes a high-profile Customer 360 initiative—cloud-based and backed by marketing—to unify customer views across touchpoints. While the lead architect anticipates added complexity and cost, the senior data engineer suggests a pivot: adopt an event-driven approach with Kafka.
Eva, the senior data engineer, frames Kafka as a low-latency, durable event log enabling fan-out to multiple consumers. Customer 360 can subscribe to streams and build its own projections, potentially reducing batch delays and decoupling downstream needs from upstream constraints. The account manager remains skeptical but open, asking for a concrete proposal with estimates comparable to the traditional approach. The team agrees to explore the Kafka-based design in the next meeting, setting the stage for a shift toward event-driven architecture.
1.3 Key players in the Kafka ecosystem
This section introduces Kafka’s core roles and runtime architecture so architects can evaluate the costs and benefits of adopting Kafka as a foundational system. Although Kafka has grown into a distributed platform for processing real-time events, its essence remains a reliable, high-throughput message broker focused on transporting and persisting messages.
At the data plane, producers push messages to a cluster of brokers, which acknowledge delivery, persist data to disk (with optional tiered storage for older data), and make messages available to consumers. Kafka uses a pull model: consumers subscribe to topics and fetch records on demand. An application can simultaneously act as both producer and consumer. Brokers work in a fault-tolerant cluster, distributing load and reporting their health via heartbeats.
At the control plane, Kafka uses controllers (KRaft) to manage cluster metadata and coordinate broker operations. Any server may run as a broker, a controller, or both. One controller is active while others are hot standbys; if the active controller fails, a new one is elected. Controllers maintain the metadata log (topics, partitions, broker registrations), replicate this metadata for durability, and monitor broker liveness via heartbeats to detect and handle failures.
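As a rough illustration of the KRaft control plane, a minimal server.properties sketch for a single node acting as both broker and controller might look like the following; the node ID, hostnames, ports, and paths are placeholders, not values from the chapter.

```properties
# A server acting as both broker and controller (KRaft combined mode).
process.roles=broker,controller
node.id=1

# Voters in the controller quorum, in id@host:port form (placeholder values).
controller.quorum.voters=1@localhost:9093

# Listeners for client traffic (broker) and quorum traffic (controller).
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
advertised.listeners=PLAINTEXT://localhost:9092

# Where the commit log (including the metadata log) is stored on disk.
log.dirs=/var/lib/kafka/kraft-combined-logs
```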
- Producers: push messages to brokers; receive delivery acknowledgments.
- Brokers: persist and serve messages, replicate data, balance load, and ensure durability.
- Consumers: pull messages by subscribing to topics; control read pace and position.
- Controllers (KRaft): own cluster metadata, coordinate operations, elect leaders, and handle failover.
- Storage: messages are always persisted; tiered storage can offload older data to cheaper media.
1.4 Architectural principles
This section outlines Kafka’s core architectural principles for event-driven enterprise systems. Kafka goes beyond traditional messaging by combining decoupled communication, durable storage, and replayable event streams—enabling scalable, fault-tolerant, and auditable data flows across services.
1.4.1 The publish-subscribe pattern
Kafka uses publish-subscribe to decouple producers and consumers. Producers publish events in a fire-and-forget style with optional acknowledgments and without direct knowledge of subscribers. Multiple consumers can independently receive and process the same event. In practice, producers emit events driven by explicit business needs, while owning and stabilizing the event schema to maximize reuse.
Example: When a customer address changes, CustomerService updates its database and emits an event. Both BillingService and AnalyticalService can subscribe and process that event independently as part of a Customer 360 scenario.
Publish–subscribe example: CustomerService publishes a “customer updated” event to a channel; all subscribers receive it independently.

Data governance and the organization of event pipelines are addressed later (see Chapter 6).
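Returning to the example above, a minimal sketch of the subscriber side with the Java consumer client might look like this; the topic name, broker address, and group IDs are assumptions. Running the same program with two different group IDs (say, billing-service and analytical-service) gives each service its own independent copy of the event stream.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CustomerUpdatedSubscriber {
    public static void main(String[] args) {
        // Pass "billing-service" or "analytical-service" as the group ID:
        // each consumer group receives its own full copy of the stream.
        String groupId = args.length > 0 ? args[0] : "billing-service";

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");           // start from the oldest retained event

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-updated"));  // hypothetical topic name
            while (true) {
                // Pull model: the consumer fetches records at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("[%s] customer %s updated: %s%n",
                            groupId, record.key(), record.value());
                }
            }
        }
    }
}
```

Because each group tracks its own read position, adding a new downstream service is just a new group ID; the producer is unaffected.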
1.4.2 Reliable delivery
Reliability is achieved through a Kafka cluster of cooperating brokers with replicated data for fault tolerance. Producers receive acknowledgments upon durable write; if not received in time, they retry. Messages are persisted on disk with replication, ensuring survival of broker failures.
Consumers track processed messages and can recover after outages by resuming from prior positions. Kafka enables replay: consumers may re-read retained messages to rebuild state or reprocess data intentionally. Client-side configurations balance throughput and durability, with best practices discussed in later chapters.
Acknowledgments: Once the cluster accepts a message, it sends an acknowledgment to the service. If no acknowledgment arrives within the timeout, the service treats the send as failed and retries.
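The producer-side settings behind this behavior can be sketched as follows; the configuration keys come from the standard Java client, while the broker address and timeout value are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait until the leader and all in-sync replicas have persisted the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry automatically; with idempotence enabled, retries cannot create duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // Total time budget for a send, including retries, before it is reported as failed.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

        return new KafkaProducer<>(props);
    }
}
```

On the consumer side, offset commits play the complementary role: a consumer advances its committed position only after it has processed a batch, so it can resume from that position after an outage.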

1.4.3 The commit log
Kafka organizes data as an append-only commit log, preserving the exact arrival order of messages for durability and traceability—similar to database write-ahead logs. “Commit” can mean safe storage confirmation for producers or recorded progress for consumers.
Events are immutable; incorrect data cannot be altered or deleted individually. Corrections are issued as new events, and consumers can replay the log to reconstruct current state, benefiting both recovery and intentional reprocessing within defined retention periods.
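A possible sketch of intentional replay with the Java consumer, assuming a single-partition customer-address-changed topic: the consumer rewinds to the beginning of the retained log and folds the events into a map of the latest address per customer.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class AddressLogReplay {
    public static Map<String, String> rebuildCurrentAddresses() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("group.id", "address-replay");                     // hypothetical group ID
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, String> latestAddressByCustomer = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("customer-address-changed", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));             // replay from the start of the log

            long end = consumer.endOffsets(List.of(partition)).get(partition);
            while (consumer.position(partition) < end) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Later events overwrite earlier ones, reconstructing the current state.
                    latestAddressByCustomer.put(record.key(), record.value());
                }
            }
        }
        return latestAddressByCustomer;
    }
}
```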
1.5 Designing and managing data flows
This section reframes Kafka’s role in microservice architectures from simple messaging to data-first integration. Events become explicit data contracts that must be modeled, validated, versioned, and governed. It introduces a trio of supporting components—Schema Registry, Kafka Connect, and stream processing frameworks—that help define message structure, replicate data at scale, and transform streams where appropriate.
Key architectural questions addressed:
- What structure and guarantees do messages provide?
- How do producers and consumers evolve safely as schemas change?
- Can Kafka replicate data between systems reliably and at scale?
- Where should transformations live: producer, consumer, or processing layer?
1.5.1 Schema Registry: handling data contracts
Kafka brokers treat messages as opaque bytes for performance and flexibility, so structure isn’t enforced by Kafka itself. Schema Registry externalizes data contracts:
- Serves as a central source of truth for message schemas.
- Producers register schemas and embed a schema ID in each message.
- Consumers fetch the schema by ID to deserialize accurately.
- Schemas are immutable; changes create new versions with compatibility checks.
Working with Schema Registry: Schemas are managed by a separate Schema Registry cluster; messages carry only a schema ID, which clients use to fetch (and cache) the writer schema.

This model enables teams to treat message definitions as durable contracts and evolve them without breaking consumers, provided compatibility rules are observed.
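For illustration, a producer using Confluent's Avro serializer might look like the sketch below. The schema, topic, broker, and registry URL are assumptions, and the io.confluent.kafka.serializers.KafkaAvroSerializer class comes from the Confluent platform libraries rather than Apache Kafka itself.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroAddressProducer {
    private static final String ADDRESS_SCHEMA = """
            {
              "type": "record",
              "name": "CustomerAddressChanged",
              "fields": [
                {"name": "customerId", "type": "string"},
                {"name": "city",       "type": "string"}
              ]
            }""";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer registers the schema and embeds its ID in every message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");         // assumed registry address

        Schema schema = new Schema.Parser().parse(ADDRESS_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("customerId", "customer-42");
        event.put("city", "Berlin");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customer-address-changed", "customer-42", event));
        }
    }
}
```

Consumers use the matching Avro deserializer, which resolves the embedded schema ID against the registry and caches the result.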
1.5.2 Kafka Connect: data replication without code
Instead of writing bespoke producers and consumers to sync data across services, Kafka Connect offers configuration-driven pipelines:
- Runs as a separate cluster with pluggable source and sink connectors.
- Moves data between Kafka and external systems (databases, warehouses, storage) at scale.
- Example: stream address updates from a CustomerService database into Kafka via a JDBC source connector; deliver them to OrderService’s database via a sink connector.
Kafka Connect architecture: connectors integrate Kafka with external systems—moving data in and out.

This reduces custom code and standardizes operational patterns for ingestion and delivery.
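A hypothetical JDBC source connector configuration for the address example above might look like this; all names, credentials, and column choices are placeholders, and the connector class assumes the Confluent JDBC connector plugin is installed.

```json
{
  "name": "customer-addresses-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://customer-db:5432/customers",
    "connection.user": "connect",
    "connection.password": "********",
    "table.whitelist": "addresses",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "customerservice."
  }
}
```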
1.5.3 Data transformation: streaming frameworks
Different consumers need differently shaped data. While Kafka Connect can do simple, stateless changes, richer logic belongs in a processing layer. Architectural options:
- Producer-side branching (more upstream complexity).
- Consumer-side filtering (wasteful at scale).
- Dedicated processing services using stream frameworks.
Frameworks like Kafka Streams and Apache Flink enable content-based routing, filtering, joins, aggregations, windowing, and stateful processing with strong delivery guarantees.
An example streaming application: RoutingService implements content-based routing. It consumes messages from Addresses and, based on their contents (e.g., address type), publishes them to ShippingAddresses or BillingAddresses.
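A minimal Kafka Streams sketch of such a RoutingService, assuming String-serialized JSON values that carry a type field; the predicate logic, topic names, and broker address are illustrative assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class RoutingService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "routing-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> addresses = builder.stream("Addresses");

        // Content-based routing: inspect the payload and forward to the matching topic.
        addresses.filter((key, value) -> value.contains("\"type\":\"shipping\""))
                 .to("ShippingAddresses");
        addresses.filter((key, value) -> value.contains("\"type\":\"billing\""))
                 .to("BillingAddresses");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```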

The net result: Kafka becomes not just a transport for events, but the backbone for well-governed, evolvable data flows—where structure is explicit, movement is automated, and transformations are applied in the right place.
1.6 Impacting operations and infrastructure
Architects must design solutions that meet functional needs while remaining operable, supportable, and maintainable over time. Effective monitoring, preventive maintenance, and cost-aware operational practices are essential to avoid failures and ensure long-term sustainability. A key strategic choice is where to run Kafka—on-premise or in the cloud—balancing cost, scalability, performance, security, and compliance against project objectives.
1.6.1 Kafka tuning and maintenance
Even when day-to-day operations sit with DevOps, architects need to understand how Kafka fits into enterprise infrastructure and its implications for deployment, security, and data protection. Early requirement gathering directly impacts cost and feasibility.
- Sizing and SLAs: Determine hardware needs for brokers and clients, define key metrics, and shape service level agreements.
- Testing strategy: Plan functional and performance testing and prepare for disaster recovery.
- Observability and troubleshooting: Decoupled systems increase debugging complexity; ensure robust monitoring, tracing, and tooling to locate data loss or bottlenecks.
- Evolution and scalability: Anticipate growth and change; be ready to extend or restructure clusters as requirements evolve.
1.6.2 On-premise and cloud options
On-premise deployments remain common, offering maximum control at the cost of operational responsibility and infrastructure overhead.
- On-premise: Full control over configuration and operations; requires investment in hardware, setup, physical security, monitoring, maintenance, and skilled personnel.
- Managed cloud service: Simplifies provisioning and administration with provider-backed SLAs, but imposes constraints:
  - Kafka versions and upgrade timing are governed by the provider.
  - Limited availability of some ecosystem components (e.g., commercially licensed tools such as Schema Registry), depending on the provider.
  - Restricted low-level broker tuning; baseline performance is the provider’s responsibility.
  - Constrained choice of cluster management tools.
1.6.3 Solutions from other cloud providers
Competing messaging systems—such as Apache Pulsar, Amazon Kinesis, Google Pub/Sub, and Azure Event Hubs—offer different trade-offs and deployment models. Some support the Kafka protocol, enhancing interoperability. Choosing between on-premise, managed, or hybrid approaches influences enterprise capabilities and costs; the book further examines these options to guide selection.
1.7 Applying Kafka in enterprise
Kafka is not a one-size-fits-all solution. Architects should evaluate its fit per project. This section highlights two primary enterprise uses—reliable message delivery and long-lived state storage—and outlines what makes Kafka distinct.
1.7.1 Using Kafka for sending messages
- In event-driven systems, services publish events that signal internal state changes.
- Kafka persists events and replicates them across brokers for durability and availability.
- Retention ensures that subscribers which fail can catch up and reprocess events after recovery; once processed, events are typically no longer relevant to consumers.
1.7.2 Using Kafka for storing data
- With infinite retention, Kafka can act as an immutable log for event sourcing.
- Example: CustomerService emits address-change events forming a change log; consumers rebuild current state by replaying from the start and may materialize it in memory or a local database.
- Supports real-time processing, such as joining an orders stream with an address change-log topic for enrichment (sketched after this list).
- Kafka is not a full database replacement: ad hoc queries (e.g., geospatial lookups) are inefficient because clients must read, deserialize, and compute logic outside the broker.
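As an illustration of the enrichment case, a Kafka Streams sketch might materialize the change log as a table and join it with the orders stream. The topic names and broker address are assumptions, and both topics are presumed to be keyed by customer ID (a co-partitioning requirement for this kind of join).

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class OrderEnrichment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The address change log, materialized as a table of the latest address per customer.
        KTable<String, String> addresses = builder.table("customer-address-changed");
        // The live stream of orders, keyed by customer ID.
        KStream<String, String> orders = builder.stream("orders");

        // Enrich each order with the customer's current address and publish the result.
        orders.join(addresses, (order, address) -> order + " | shipTo=" + address)
              .to("orders-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```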
1.7.3 How Kafka is different
- Designed as a distributed commit log: an immutable sequence enabling high throughput over large data volumes.
- Disk persistence and inter-broker replication provide durability and fault tolerance.
- Historical retention supports event sourcing and replay.
- Part of a growing ecosystem with connectors and stream processing frameworks for real-time integration and transformations.
1.8 Field notes: Getting started with a Kafka project
This field-note conversation distills the initial planning needed to kick off a Kafka initiative for a Customer 360 use case while staying lean and budget-conscious.
- Objective: Start small with clear use cases and a pragmatic, cost-aware plan.
- Use cases: Prioritize the key customer events to capture and process for a holistic customer view.
- Data sources: Inventory event producers (e.g., transactions, customer interactions, social mentions) to understand variety and volume.
- Event modeling: Define robust, scalable event schemas to ensure compatibility across services and over time.
- Performance and scalability: Plan topics, partitions, replication factors, and cluster sizing to meet current and future throughput/availability needs.
- Cluster management and deployment: Establish monitoring, scaling, and maintenance practices; decide between self-managed clusters and managed Kafka services.
- Security: Enforce encryption, authentication, and authorization to protect customer data end to end.
- Integration: Ensure smooth data flow between Kafka and existing databases, applications, and analytics tools.
- Cost estimation:
  - Infrastructure: Hardware, storage, and network sized to projected volumes; include potential third-party or managed-service costs.
  - Development and operations: Building producers/consumers, plus ongoing monitoring, troubleshooting, and upgrades.
  - Licensing: Align required features with appropriate Kafka licensing options.
Action items:
- Define and rank use cases and data sources.
- Draft initial event schemas.
- Estimate performance needs and cluster sizing.
- Choose management model (self-managed vs managed service) and monitoring approach.
- Design security controls.
- Map integrations with existing systems.
- Compile infrastructure, development, operations, and licensing cost estimates.
1.9 Online resources
This section curates key references to help architects deepen their Kafka expertise, spanning official documentation, platform-specific materials, events, books, and articles on event-driven architecture and microservices.
- Apache Kafka Documentation: Official guides and API references for configuration, development, and operations.
- Confluent Platform Documentation: Ecosystem guidance covering platform components such as connectors, ksqlDB, and Schema Registry.
- The Data Streaming Event: Community and industry event offering trends, real-world case studies, and best practices in data streaming.
- Kafka: The Definitive Guide: Comprehensive book on Kafka’s architecture, core concepts, and operational know-how.
- Kafka in Action: Practical patterns and hands-on examples for building Kafka-based applications.
- Microservices, Apache Kafka, and Domain-Driven Design: Guidance on aligning event streams with bounded contexts and microservice design.
- Building Event-Driven Microservices: Strategies and patterns for designing and implementing event-driven systems.
- What is Event-Driven Architecture?: Introductory article explaining event-first thinking and its impact on system design.
Together, these resources provide a balanced path from foundational understanding to practical implementation of Kafka-centric, event-driven architectures.