Flink in Action
Sameer B. Wadkar, Hari Rajaram
  • MEAP began September 2016
  • Publication in March 2017 (estimated)
  • ISBN 9781617293924
  • 375 pages (estimated)
  • printed in black & white
We regret that Manning Publications will not be publishing this title.

True high-velocity, high-volume data stream processing requires more than just handling each record as it arrives. Unlike batch processing, where all the data is available up front, stream processing must handle incomplete data and late or out-of-order arrivals while remaining resilient to failure, all without compromising performance or accuracy. You've also got to incorporate event-time processing to make sure your stream processing system is every bit as accurate as a batch processing system. And you need one system that performs both stream and batch processing. It's a tall order, and Apache Flink is your solution.
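To make that concrete, here is a minimal illustrative sketch (ours, not the book's) of event-time windowing with Flink's DataStream API. It assumes Flink 1.x and a hypothetical stream of (pageId, timestampMillis) click events; the watermark extractor keeps results correct even when events arrive up to five seconds late:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use each event's own timestamp, not the machine clock.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Hypothetical source: (pageId, eventTimeMillis) clicks, possibly out of order.
        DataStream<Tuple2<String, Long>> clicks = env.fromElements(
                Tuple2.of("home", 1000L), Tuple2.of("news", 1500L), Tuple2.of("home", 900L));

        clicks
            // Watermarks: tolerate events arriving up to 5 seconds late.
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> click) {
                        return click.f1;
                    }
                })
            // Count one per click so the windowed sum yields clicks per page.
            .map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Long> click) {
                    return Tuple2.of(click.f0, 1);
                }
            })
            .keyBy(0)                                             // group by pageId
            .window(TumblingEventTimeWindows.of(Time.minutes(1))) // one-minute event-time windows
            .sum(1)                                               // clicks per page per window
            .print();

        env.execute("Event-time windowing sketch");
    }
}

Because the windows are driven by each event's own timestamp rather than the wall clock, the out-of-order "home" click at 900 ms still lands in the correct one-minute window.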

Flink in Action makes the complex topic of stream processing with Flink easy to understand and apply. Starting with lots of use cases and crystal-clear explanations, this book explains how batch and streaming event processing differ. Then you'll get the big picture of how Flink works, along with crucial topics like windowing and reprocessing. Next, you'll get hands-on by creating your own Flink project using step-by-step instructions and lots of annotated images. With the basics well in hand, you'll move on to advanced topics like the Flink API, Kafka, and fault tolerance. The last part of the book covers using Flink alongside external tools and libraries. By the end, you'll have a strong foundation in the concepts and challenges of implementing streaming systems that handle high-velocity, high-volume data, and in how Flink meets those challenges.
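As a quick taste of the batch/streaming distinction the book opens with, here is a small sketch (ours, not the book's, assuming Flink 1.x) that runs the same counting logic through both the DataSet and DataStream APIs:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVersusStreamSketch {
    // Shared mapper: each word becomes a (word, 1) pair.
    static class ToPair implements MapFunction<String, Tuple2<String, Integer>> {
        @Override
        public Tuple2<String, Integer> map(String word) {
            return Tuple2.of(word, 1);
        }
    }

    public static void main(String[] args) throws Exception {
        // Batch: the DataSet API sees a finite input and produces one final count per word.
        ExecutionEnvironment batch = ExecutionEnvironment.getExecutionEnvironment();
        batch.fromElements("flink", "kafka", "flink")
             .map(new ToPair())
             .groupBy(0)
             .sum(1)
             .print(); // (flink,2) (kafka,1)

        // Streaming: the DataStream API applies the same logic to an unbounded stream,
        // emitting an updated count as each element arrives.
        StreamExecutionEnvironment stream = StreamExecutionEnvironment.getExecutionEnvironment();
        stream.fromElements("flink", "kafka", "flink")
              .map(new ToPair())
              .keyBy(0)
              .sum(1)
              .print();
        stream.execute("Streaming count sketch");
    }
}

The batch job prints one final count per word; the streaming job prints a running count that updates with each element. Treating the first as a special case of the second is exactly the unification Flink offers (see section 1.6 below).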

Table of Contents

Part 1: Stream Processing using Flink

1. Introducing Apache Flink

1.1. Stream event processing

1.2.1. The need for a durable event store like Kafka

1.4. Time-based windows

1.4.1. Event time-based windows

1.4.2. Processing time-based windows

1.4.3. Ingestion time-based windows

1.4.4. Tumbling versus sliding time windows

1.5. Count-based windows

1.6. Batch processing as a special case of stream processing

1.7. Pipelined processing and backpressure handling

1.7.1. Backpressure handling

1.8. Failure recovery and exactly-once processing using checkpoints

1.9. Reprocessing using savepoints

1.10. Real-world example: news website

1.10.1. Event Schema

1.11. Summary

2. Getting started with Flink

2.1.1. Software Requirements

2.4.1. Example dataset

2.4.3. ExecutionEnvironment

2.4.4. Reading from the source

2.4.5. Transforming the dataset

2.4.6. Applying the aggregation operators

2.4.7. Executing the program

2.5.2. StreamExecutionEnvironment

2.5.3. Reading from the source

2.5.4. Transforming the DataStream

2.5.5. Applying the aggregation operators

2.5.6. Executing the program

2.6. Introducing the Table API

2.7. Summary

3. Batch processing using the DataSet API

3.1. Batch processing operations

3.1.1. The NEWSFEED schema

3.2. Newsfeed event parser

3.3. Single row operators

3.3.1. Using the Map operator

3.3.2. Using the Project operator

3.3.3. Using the mapPartition operator

3.3.4. Using the Filter operator

3.4. Working with grouped datasets

3.4.1. Using the Reduce operator

3.4.2. Using the ReduceGroup operator

3.4.3. Using the ReduceGroup operator with a secondary sort

3.4.4. Optimizing group reduce using groupCombine

3.5. Using the Aggregation operators

3.5.1. Combining Aggregation operators

3.6. Join Operators

3.6.1. Joining two DataSets of domain objects

3.6.2. Joining a DataSet of domain objects with a DataSet of Tuples

3.6.3. Joining two DataSets of Tuples

3.6.4. Flattening a join of two Tuple DataSets using projection

3.6.5. Joins with hints based on the size of the DataSets

3.6.6. Outer Joins

3.7. Data sources and sinks

3.7.1. Data Sources

3.7.2. Data Sinks

3.8. Summary

4. Stream processing using the DataStream API

4.1. Source streams and Kafka

4.1.1. Reading from a Kafka topic

4.1.2. Custom SourceFunction for Unit Testing

4.2. Writing a basic streaming program

4.3. Introducing windows

4.3.1. Different types of windows and their uses

4.3.2. Defining our use case

4.4. Global windows

4.5. Count windows

4.5.1. Tumbling count windows

4.5.2. Sliding count windows

4.6. Time-based windows

4.6.1. Processing time windows

4.7. Summary

5. Basics of event time processing

Part 2: Advanced Stream Processing using Flink

6. Session windows and custom windows

7. Using the Flink API in practice

8. Using Kafka with Flink

9. Fault tolerance in Flink

Part 3: Out in the wild

10. Domain-specific libraries in Flink: CEP and Streaming SQL

11. Apache Beam and Flink

Appendixes

A.1. Software Requirements

Appendix B: Installing Apache Kafka

B.1. What is Apache Kafka?

B.2. Download Kafka and execute it locally on Linux

B.3. Download Kafka and execute it locally on Windows

B.4. Create topic and partitions for Chapter 4

What's inside

  • Full of everyday use cases
  • Implementing streaming and batch solutions
  • Using windowing based on count, time, or custom criteria
  • Building failure-resilient distributed streaming applications
  • Developing distributed iterative applications at large data scales
  • Using libraries like Streaming SQL and Complex Event Processing

About the reader

For Java developers with backend experience. Prior experience with MapReduce programming in Hadoop or Spark is desirable.

About the authors

Sameer Wadkar has more than 15 years of experience implementing high-volume distributed systems for clients in the commercial and federal markets. For the past five years he has been engaged in implementing big data systems that handle more than 5 billion transactions a day.

Hari Rajaram is Chief Architect and Big Data Practice Leader at Arcogent and has more than 17 years of experience in information technology, spanning the finance, media, newspaper, and grants management industries.


Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.