Data Pipelines with Apache Airflow, Second Edition

Orchestration for data and AI

Julian de Ruiter, Ismael Cabral, Kris Geusebroek, Daniel van der Ende, Bas Harenslak

MEAP began December 2024
Last updated November 2025
Publication in February 2026 (estimated)

ISBN 9781633436374
450 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean, Russian, Simplified Chinese


Table of contents

PART 1: GETTING STARTED

1 Meet Apache Airflow

1.1 Introducing data pipelines

1.1.1 Data pipelines as graphs

1.1.2 Executing a pipeline graph

1.1.3 Pipeline graphs vs. sequential scripts

1.1.4 Running pipelines using workflow managers

1.2 Introducing Airflow

1.2.1 Defining pipelines flexibly in (Python) code

1.2.2 Integration with external systems

1.2.3 Scheduling and executing pipelines

1.2.4 Monitoring and handling failures

1.2.5 Incremental loading and backfilling

1.3 When to use Airflow

1.3.1 Reasons to choose Airflow

1.3.2 Reasons not to choose Airflow

1.4 The rest of this book

1.5 Summary

2 Anatomy of an Airflow DAG

2.1 Collecting data from numerous sources

2.1.1 Exploring the data

2.2 Writing your first Airflow DAG

2.2.1 Tasks vs. operators

2.2.2 Running arbitrary Python code

2.3 Running a DAG in Airflow

2.3.1 Running Airflow in a Python environment

2.3.2 Running Airflow with Docker

2.3.3 Inspecting the DAG in Airflow

2.4 Running at regular intervals

2.5 Handling failing tasks

2.6 Dag Versioning

2.7 Summary

3 Time-based scheduling in Airflow

3.1 An example: processing user events

3.2 The basic components of an Airflow schedule

3.3 Running regularly using trigger-based schedules

3.3.1 Defining a daily schedule

3.3.2 Using Cron expressions

3.3.3 Using shorthand expressions

3.3.4 An alternative: using frequency-based timetables

3.3.5 Summarizing trigger timetables

3.4 Incremental processing with data intervals

3.4.1 Processing data incrementally

3.4.2 Defining incremental schedules with data intervals

3.4.3 Defining intervals using time deltas

3.4.4 Summarizing interval-based schedules

3.5 Handling irregular intervals

3.6 Managing backfilling of historical data

3.7 Designing well-behaved tasks

3.7.1 Atomicity

3.7.2 Idempotency

3.8 Summary

4 Asset-aware scheduling

4.1 Challenges of scaling time-based schedules

4.2 Introducing asset-aware scheduling

4.3 Producing asset events

4.4 Consuming asset events

4.5 Adding extra information to events

4.6 Skipping updates

4.7 Consuming multiple assets

4.8 Combining time- and asset-based schedules

4.9 Summary

5 Templating tasks using the Airflow context

5.1 Inspecting data for processing with Airflow

5.1.1 Determining how to load incremental data

5.2 Task context and Jinja templating

5.2.1 Templating operator arguments

5.2.2 Templating the PythonOperator

5.2.3 Passing additional variables to the PythonOperator

5.2.4 Inspecting templated arguments

5.2.5 What is available for templating?

5.3 Bringing it all together

5.4 Summary

PART 2: BEYOND THE BASICS

6 Defining dependencies between tasks

6.1 Basic dependencies

6.1.1 Linear dependencies

6.1.2 Fan-in/-out dependencies

6.2 Branching

6.2.1 Branching within tasks

6.2.2 Branching within the DAG

6.3 Conditional tasks

6.3.1 Conditions within tasks

6.3.2 Making tasks conditional

6.3.3 Using built-in operators

6.4 More about trigger rules

6.4.1 What is a trigger rule?

6.4.2 The effect of failures

6.4.3 Other trigger rules

6.5 Sharing data between tasks

6.5.1 Sharing data using XComs

6.5.2 When (not) to use XComs

6.5.3 Using custom XCom backends

6.5.4 XCom cleanup

6.6 Chaining Python tasks with the Taskflow API

6.6.1 Simplifying Python tasks with the Taskflow API

6.6.2 Using the TaskFlow API to define a new DAG

6.6.3 When (not) to use the Taskflow API

6.7 Summary

7 Triggering workflows with external input

7.1 Polling conditions with sensors

7.1.1 Polling custom conditions

7.1.2 Sensors outside the happy flow

7.2 Starting workflows with REST/CLI

7.3 Triggering workflows with messages

7.4 Summary

8 Communicating with external systems

8.1 Installing additional operators

8.2 Developing a machine learning model

8.2.1 Use case: classifying handwritten digits

8.2.2 Setting up the pipeline

8.2.3 Developing locally with external systems

8.3 Moving data between systems

8.3.1 Use case: Analyzing Airbnb listings

8.3.2 Implementing a PostgresToS3Operator

8.3.3 Outsourcing the heavy work

8.4 Summary

9 Extending Airflow with custom operators and sensors

9.1 Starting with a PythonOperator

9.1.1 Simulating a movie rating API

9.1.2 Fetching ratings from the API

9.1.3 Building the actual DAG

9.2 Building a custom hook

9.2.1 Designing a custom hook

9.2.2 Building our DAG with the MovielensHook

9.3 Building a custom operator

9.3.1 Defining a custom operator

9.3.2 Building an operator for fetching ratings

9.4 Building custom sensors

9.5 Building custom deferrable operator

9.5.1 Executing Asynchronous Tasks Using the Triggerer

9.5.2 Running our Movielens sensor asynchronously

9.6 Packaging your components

9.6.1 Bootstrapping a Python package

9.6.2 Installing your package

9.6.3 Sharing your package with others

9.7 Summary

10 Testing

10.1 Getting started with testing

10.1.1 Integrity testing all DAGs

10.1.2 Setting up a CI/CD pipeline

10.1.3 Writing unit tests

10.1.4 Pytest project structure

10.1.5 Testing with files on disk

10.2 Working with DAGs and task context in tests

10.2.1 Working with external systems

10.3 Using tests for development

10.4 Testing Complete DAGs

10.4.1 Using dag.test() to test your whole DAG

10.4.2 Emulate production environments with Whirl

10.4.3 Create DTAP environments

10.5 Summary

PART 3: AIRFLOW IN PRACTICE

11 Running tasks in containers

11.1 Challenges of many different operators

11.1.1 Operator interfaces and implementations

11.1.2 Complex and conflicting dependencies

11.1.3 Moving toward a generic operator

11.2 Introducing containers

11.2.1 What are containers?

11.2.2 Running our first Docker container

11.2.3 Creating a Docker image

11.2.4 Persisting data using volumes

11.3 Containers and Airflow

11.3.1 Tasks in containers

11.3.2 Why use containers?

11.4 Running tasks in Docker

11.4.1 Introducing the DockerOperator

11.4.2 Creating container images for tasks

11.4.3 Building a DAG with Docker tasks

11.4.4 Docker-based workflow

11.5 Running tasks in Kubernetes

11.5.1 Introducing Kubernetes

11.5.2 Setting up Kubernetes

11.5.3 Using the KubernetesPodOperator

11.5.4 Diagnosing Kubernetes-related issues

11.5.5 Differences with Docker-based workflows

11.6 Summary

12 Best practices

12.1 Writing clean DAGs

12.1.1 Use style conventions

12.1.2 Manage credentials centrally

12.1.3 Specify configuration details consistently

12.1.4 Avoid doing any computation in your DAG definition

12.1.5 Use factories to generate common patterns

12.1.6 Group related tasks using task groups

12.1.7 Be explicit when specifying your DAG schedule

12.1.8 Use Dynamic Task Mapping to dynamically generate tasks

12.2 Designing reproducible tasks

12.2.1 Always require tasks to be idempotent

12.2.2 Ensure task results are deterministic

12.2.3 Design tasks using functional paradigms

12.3 Handling data efficiently

12.3.1 Limit the amount of data being processed

12.3.2 Load/process data incrementally

12.3.3 Cache intermediate data

12.3.4 Don’t store data on local file systems

12.3.5 Offload work to external/source systems

12.4 Managing concurrency using pools

12.5 Summary

13 Project: Finding the fastest way to get around NYC

13.1 Use case: investigating traffic in New York

13.2 Understanding the data

13.2.1 Yellow Cab file share

13.2.2 Citi Bike REST API

13.2.3 Deciding on a plan of approach

13.3 Extracting the data

13.3.1 Downloading Citi Bike data

13.3.2 Downloading Yellow Cab data

13.4 Applying similar transformations to data

13.5 Structuring a data pipeline

13.6 Developing idempotent data pipelines

13.7 Summary

PART 4: AIRFLOW IN PRODUCTION

14 Project: Keeping family traditions alive with Airflow and Generative AI

14.1 Use case: bringing family recipes to life

14.2 Fine-tuning an existing LLM

14.3 RAG to the rescue

14.4 Uploading recipes to the Recipe Vault

14.5 Preprocess the recipes with DockerOperator

14.6 Creating a collection to store our recipes

14.6.1 Defining how to vectorize our text

14.6.2 Creating a schema for the collection

14.6.3 Preparing our collection of recipes

14.7 Updating and creating new records in the Vector database

14.8 Deleting outdated records from the vector database

14.9 Adding recipes to the vector database

14.10 RAG in action

14.10.1 The R is for retrieving

14.10.2 Structuring our questions with prompt templates

14.10.3 Searching for recipes

14.11 Summary

15 Operating Airflow in production

15.1 Revisiting the Airflow architecture

15.2 Choosing the executor

15.2.1 Overview of different executor types

15.2.2 Which executor is right for you?

15.2.3 Installing each executor

15.3 Configuring the metastore

15.4 Configuring the scheduler

15.4.1 Configuring scheduler components

15.4.2 Running multiple schedulers

15.4.3 System performance configurations

15.4.4 Controlling the maximum number of running tasks

15.5 Configuring the DAG Processor Manager

15.6 Capturing logs

15.6.1 Capturing API server output

15.6.2 Capturing scheduler output

15.6.3 Capturing task logs

15.6.4 Sending logs to remote storage

15.7 Visualizing and monitoring Airflow metrics

15.7.1 Collecting metrics from Airflow

15.7.2 Configuring Airflow to send metrics

15.7.3 Configuring Prometheus to collect metrics

15.7.4 Creating dashboards with Grafana

15.7.5 What should you monitor?

15.8 Setting up alerts

15.9 Scaling Airflow beyond a single instance

15.10 Summary

16 Securing Airflow

16.1 Role Based Access in the Airflow UI

16.1.1 Adding users

16.1.2 Configuring the RBAC interface

16.2 Encrypting data at rest

16.2.1 Creating a Fernet key

16.3 Connecting with an LDAP service

16.3.1 Understanding LDAP

16.3.2 Fetching users from an LDAP service

16.4 Encrypting traffic to the webserver

16.4.1 Understanding HTTPS

16.4.2 Configuring a certificate for HTTPS

16.5 Fetching credentials from secret management systems

16.6 Summary

17 Airflow deployment options

17.1 Managed Airflow

17.1.1 Astronomer

17.1.2 Google Cloud Composer

17.1.3 Amazon Managed Workflows for Apache Airflow

17.2 Airflow on Kubernetes

17.2.1 Preparing our Kubernetes cluster

17.2.2 Connecting to your Kubernetes cluster

17.2.3 Deploying with The Apache Airflow Helm chart

17.2.4 Changing the default deployment configuration

17.2.5 Changing the apiserver secret key

17.2.6 Using an external database for Airflow metadata

17.2.7 DAG deployment

17.2.8 Python library deployment

17.2.9 Configuring the Executor(s)

17.3 Choosing a deployment strategy

17.4 Summary

Appendices

Appendix A: Running code samples

A.1 Code structure

A.2 Running the examples

A.2.1 Starting the Docker environment

A.2.2 Inspecting running services

A.2.3 Tearing down the environment

Appendix B: Prometheus metric mapping

Overview

1 Meet Apache Airflow

Modern organizations depend on ever-growing volumes of high-quality data, which makes reliable, well-orchestrated pipelines essential. This chapter introduces Apache Airflow as an orchestration platform that coordinates tasks across diverse systems—such as data ingestion, transformation, analytics, and ML—so that teams can produce trustworthy results efficiently. The book takes a practical, production-minded approach, aiming to equip readers to build Airflow pipelines thoughtfully, evaluate when Airflow is a good fit, and take their first steps with confidence.

The chapter first frames data pipelines as directed acyclic graphs (DAGs), where tasks are nodes and dependencies are directed edges. This acyclic structure avoids deadlocks and enables a straightforward execution algorithm that schedules tasks when their upstream dependencies are satisfied. Compared with monolithic scripts, DAGs make dependencies explicit, allow parallel execution of independent branches, and support targeted reruns after failures. Airflow is then positioned within the broader ecosystem of workflow managers, highlighting trade-offs in how workflows are defined (code vs. static files) and which features are built in (scheduling, monitoring, UI), underscoring that tool selection should align with specific requirements.
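The execution algorithm described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Airflow's scheduler: the task names are hypothetical stand-ins for the weather dashboard example, and "running" a task is reduced to recording its name.

```python
# Hypothetical task names modeled on the weather dashboard example.
# Each task maps to the set of upstream tasks it depends on.
dependencies = {
    "fetch_forecast": set(),
    "clean_forecast": {"fetch_forecast"},
    "push_to_dashboard": {"clean_forecast"},
}


def run_pipeline(dependencies):
    """Run tasks in dependency order and return the order of execution."""
    completed = []
    pending = set(dependencies)
    while pending:
        # A task is ready once all of its upstream tasks have completed.
        ready = sorted(t for t in pending if dependencies[t] <= set(completed))
        if not ready:
            raise ValueError("no runnable task left: the graph contains a cycle")
        for task in ready:
            completed.append(task)  # stand-in for actually executing the task
            pending.remove(task)
    return completed


execution_order = run_pipeline(dependencies)
print(execution_order)
```

Each pass through the `while` loop corresponds to one loop of the algorithm: it picks every task whose upstream dependencies are satisfied, runs it, and repeats until nothing is pending.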

Airflow’s core strengths include defining pipelines as code (primarily in Python) with dynamic construction, extensive integrations via providers, and robust scheduling for recurring or event-driven runs. Its architecture—DAG Processor, scheduler, workers, triggerer, and API server/metastore—coordinates parsing, scheduling, execution, and observability, while the web UI (graph and grid views) aids monitoring, debugging, retries, and selective reruns. Airflow also excels at incremental processing and backfilling across time intervals. It is a strong choice for batch-oriented, time- or event-triggered workflows, broad system integrations, and teams applying software engineering best practices; it is less suited to real-time streaming, low-code teams, or scenarios demanding built-in lineage/versioning. The chapter closes by outlining the book’s path from fundamentals to advanced patterns and deployment operations.

For this weather dashboard, weather data is fetched from an external API and fed into a dynamic dashboard.

Graph representation of the data pipeline for the weather dashboard. Nodes represent tasks and directed edges represent dependencies between tasks (with an edge pointing from task 1 to task 2, indicating that task 1 needs to be run before task 2).

Cycles in graphs prevent task execution due to circular dependencies. In acyclic graphs (top), there is a clear path to execute the three different tasks. However, in cyclic graphs (bottom), there is no longer a clear execution path due to the interdependency between tasks 2 and 3.

Using the DAG structure to execute tasks in the data pipeline in the correct order. The figure depicts each task’s state during each loop of the algorithm, demonstrating how this leads to the completed execution of the pipeline (end state).

Overview of the umbrella demand use case, in which historical weather and sales data are used to train a model that predicts future sales demands depending on weather forecasts

Independence between sales and weather tasks in the graph representation of the data pipeline for the umbrella demand forecast model. The two sets of fetch/cleaning tasks are independent as they involve two different data sets (the weather and sales data sets). This independence is indicated by the lack of edges between the two sets of tasks.

Airflow pipelines are defined as DAGs using Python code in DAG files. Each DAG file typically defines one DAG, which describes the different tasks and their dependencies. Besides this, the DAG also defines a schedule interval that determines when the DAG is executed by Airflow.

The main components involved in Airflow are the Airflow API server, scheduler, DAG processor, triggerer and workers.

Developing and executing pipelines as DAGs using Airflow. Once the user has written the DAG, the DAG Processor and scheduler ensure that the DAG is run at the right moment. The user can monitor progress and output while the DAG is running at all times.

The login page for the Airflow web interface. In the code examples accompanying this book, a default user “airflow” is provided with the password “airflow”.

The main page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.

The DAGs page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.

The graph view in Airflow’s web interface, showing an overview of the tasks in an individual DAG and the dependencies between these tasks

Airflow’s grid view, showing the results of multiple runs of the umbrella sales model DAG (most recent + historical runs). Each column shows the status of one execution of the DAG, and each row shows the status of all executions of a single task. Colors (which you can see in the e-book version) indicate the result of the corresponding task. Users can also click on the task “squares” for more details about a given task instance, or to manage the state of a task so that it can be rerun by Airflow, if desired.

Summary

Directed acyclic graphs (DAGs) are a visual way to represent the workflows in data processing pipelines. Each node in a DAG denotes a task to be performed, and the edges define the dependencies between tasks. Compared to a single monolithic script, this representation is easier to understand, simplifies debugging and rerunning, and makes it possible to exploit parallelism.
In Airflow, DAGs are defined using Python files. Airflow 3.0 introduced the option of using other languages. In this book we will focus on Python. These scripts outline the order of task execution and their interdependencies. Airflow parses these files to construct and understand the DAG's structure, enabling task orchestration and scheduling.
Although many workflow managers have been developed over the years for executing graphs of tasks, Airflow has several key features that make it uniquely suited for implementing efficient, batch-oriented data pipelines.
Airflow excels as a workflow orchestration tool due to its intuitive design, scheduling capabilities, and extensible framework. It provides a rich user interface for monitoring and managing tasks in data processing workflows.
Airflow comprises five key components:

DAG Processor: Reads and parses the DAG files and stores the resulting serialized DAGs in the metastore for use by (among others) the scheduler.
Scheduler: Reads the DAGs parsed by the DAG Processor, determines whether their schedule intervals have elapsed, and queues their tasks for execution.
Worker: Executes the tasks assigned to it by the scheduler.
Triggerer: Handles the execution of deferred tasks, which are waiting for external events or conditions.
API Server: Among other things, presents a user interface for visualizing and monitoring DAGs and their execution status. The API server also acts as the interface between all Airflow components.

Airflow lets you set a schedule for each DAG, specifying when the pipeline should be executed. In addition, Airflow’s built-in mechanisms can handle task failures automatically.
Airflow is well-suited for batch-oriented data pipelines, offering sophisticated scheduling options that enable regular, incremental data processing jobs. On the other hand, Airflow is not the right choice for streaming workloads or for implementing highly dynamic pipelines where the DAG structure changes from one day to the next.

FAQ

What is Apache Airflow and what problem does it solve?

Airflow is an open source orchestrator for building, scheduling, and monitoring data pipelines. It represents workflows as directed acyclic graphs (DAGs) of tasks and coordinates work across systems to produce reliable, high‑quality data.

What is a DAG in Airflow, and why must it be acyclic?

A DAG is a directed acyclic graph where nodes are tasks and edges represent dependencies. The “acyclic” property prevents circular dependencies that would deadlock execution (for example, task A waiting on task B while B also waits on A).
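To make the deadlock concrete, here is a small sketch that checks a dependency graph for cycles using Python's standard `graphlib` module. It is an illustration only and says nothing about how Airflow itself validates DAGs; the task names and the `find_cycle` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return the tasks forming a cycle, or None if the graph is acyclic."""
    try:
        # prepare() raises CycleError when the graph contains a cycle.
        TopologicalSorter(dependencies).prepare()
    except CycleError as error:
        # The second element of args holds the nodes on the detected cycle.
        return error.args[1]
    return None


# Each task maps to the set of tasks it waits on.
acyclic = {"task_1": set(), "task_2": {"task_1"}, "task_3": {"task_2"}}
cyclic = {"task_1": set(), "task_2": {"task_1", "task_3"}, "task_3": {"task_2"}}

print(find_cycle(acyclic))
print(find_cycle(cyclic))
```

In the cyclic graph, task 2 waits on task 3 while task 3 waits on task 2, so neither can ever start.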

How does Airflow execute a pipeline end to end?

- DAG files are parsed by the DAG Processor and stored in the metastore (via the API server).
- The Scheduler checks each DAG’s schedule; when due, it evaluates task dependencies and queues ready tasks.
- Workers pick up queued tasks and run them; results and logs are recorded in the metastore.
- The Triggerer monitors certain async/deferred tasks so workers can focus on runnable work.
- Users observe and manage runs through the API server’s web interface.

Why represent pipelines as graphs instead of a single sequential script?

Graphs make dependencies explicit, enable parallelism for independent branches, and isolate failures so you only rerun failed tasks (and downstream dependents) instead of re-executing a monolithic script. They also scale better as pipelines grow in complexity.
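Both benefits can be sketched with the standard library alone. The example below assumes hypothetical task names loosely based on the umbrella demand use case; `parallel_batches` groups tasks that could run side by side, and `tasks_to_rerun` finds a failed task's downstream dependents. Neither function is part of Airflow.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dependencies = {
    "fetch_weather": set(),
    "clean_weather": {"fetch_weather"},
    "fetch_sales": set(),
    "clean_sales": {"fetch_sales"},
    "join_datasets": {"clean_weather", "clean_sales"},
    "train_model": {"join_datasets"},
}


def parallel_batches(dependencies):
    """Group tasks into batches whose members can run side by side."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstream is done
        batches.append(ready)
        sorter.done(*ready)
    return batches


def tasks_to_rerun(dependencies, failed_task):
    """Return the failed task plus every task downstream of it."""
    affected = {failed_task}
    changed = True
    while changed:
        changed = False
        for task, upstream in dependencies.items():
            if task not in affected and upstream & affected:
                affected.add(task)
                changed = True
    return affected


print(parallel_batches(dependencies))
print(sorted(tasks_to_rerun(dependencies, "clean_sales")))
```

If `clean_sales` fails, only it and the tasks that depend on its output need to run again; the weather branch is left untouched.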

How are pipelines defined in Airflow?

You write DAGs as code in Python DAG files that declare tasks, dependencies, and scheduling metadata. Airflow 3.0 also introduces options for non‑Python languages, but Python remains the primary way to author DAGs and the language Airflow itself is written in.

How does Airflow integrate with external systems?

Because tasks are implemented in Python, Airflow leverages a large ecosystem of “providers” and hooks/operators to connect to databases, big data platforms, cloud services, and more—letting you orchestrate work across many technologies in one pipeline.

How does scheduling work in Airflow?

You assign each DAG a schedule (hourly, daily, weekly, cron-like, or event-driven patterns). The scheduler evaluates when each run is due and triggers tasks once their upstream dependencies are satisfied. Airflow’s time semantics also track previous and next intervals, which power incremental processing and backfills.
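As a rough illustration of the interval bookkeeping, the sketch below computes the most recent fully elapsed interval for a fixed-frequency schedule. It is a simplification under stated assumptions: the function name `latest_due_interval` is hypothetical, and Airflow's real scheduler and timetables handle many more cases (cron expressions, catchup, time zones).

```python
from datetime import datetime, timedelta, timezone


def latest_due_interval(start, interval, now):
    """Return the most recent fully elapsed (interval_start, interval_end), or None."""
    elapsed = (now - start) // interval  # number of complete intervals so far
    if elapsed < 1:
        return None  # the first interval has not finished yet
    interval_end = start + elapsed * interval
    return interval_end - interval, interval_end


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 3, 6, 0, tzinfo=timezone.utc)
due = latest_due_interval(start, timedelta(days=1), now)
print(due)
```

Here a daily schedule that started on January 1 is checked at 06:00 on January 3, so the most recent complete interval runs from January 2 to January 3.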

How do I monitor runs and handle failures?

Use the web UI (via the API server). The Graph view shows task structure and statuses; the Grid view shows historical and current runs. Tasks can be configured with retries and delays; failures are logged, can trigger notifications, and you can clear specific tasks to rerun them (including dependent tasks).
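The retry behavior can be pictured with a small plain-Python sketch. The helper `run_with_retries` and the flaky task are hypothetical; in Airflow you would configure this declaratively on the task (for example through its `retries` and `retry_delay` arguments) rather than writing the loop yourself.

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0):
    """Call task(); on failure, wait and retry up to `retries` more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception as error:
            if attempts > retries:
                raise  # retries exhausted: surface the failure
            print(f"attempt {attempts} failed ({error}); retrying")
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky_task():
    """Hypothetical task that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary outage")
    return "done"


result, attempts_used = run_with_retries(flaky_task, retries=2)
print(result, attempts_used)
```

With two retries allowed, the task gets three attempts in total, which is just enough for this example to succeed.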

What are incremental loading and backfilling in Airflow?

Incremental loading processes only the data for a DAG run’s time interval (the delta) instead of reprocessing everything. Backfilling lets you run a DAG for past intervals to build or rebuild historical datasets—useful after code changes or when initializing new tables.
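The following sketch shows both ideas on a toy data set. The event records and the functions `process_interval` and `backfill` are hypothetical; in a real DAG the interval boundaries would come from the run's context rather than from a hand-written loop.

```python
from datetime import date, timedelta

# Hypothetical event records: (event date, value).
events = [
    (date(2025, 1, 1), 10),
    (date(2025, 1, 1), 5),
    (date(2025, 1, 2), 7),
    (date(2025, 1, 3), 2),
]


def process_interval(events, interval_start, interval_end):
    """Sum only the events that fall in [interval_start, interval_end)."""
    return sum(value for day, value in events if interval_start <= day < interval_end)


def backfill(events, first_day, last_day):
    """Run the daily job once per past day; keying results by day lets reruns overwrite."""
    results = {}
    day = first_day
    while day <= last_day:
        results[day] = process_interval(events, day, day + timedelta(days=1))
        day += timedelta(days=1)
    return results


daily_totals = backfill(events, date(2025, 1, 1), date(2025, 1, 3))
print(daily_totals)
```

Each run touches only its own day's records, and rerunning a day replaces that day's result instead of duplicating it.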

When is Airflow a good fit—and when isn’t it?

Good fit: batch or regularly scheduled workflows (or irregular, event-driven schedules), strong integration needs, applying software engineering best practices to pipelines, and easy backfilling at scale. Not a fit: real-time streaming/event-at-a-time processing, teams with little coding experience preferring UI-only tools, or cases needing built-in lineage/versioning (you’d combine Airflow with specialized tools for those).
