Table of contents

Part 1 Getting started

1 Meet Apache Airflow

1.1 Introducing data pipelines

1.1.1 Drawing a pipeline as a graph

1.1.2 Executing a pipeline graph

1.1.3 Pipeline graphs vs. sequential scripts

1.1.4 Running pipelines using workflow managers

1.2 Introducing Airflow

1.2.1 Defining pipelines flexibly in (Python) code

1.2.2 Integrating with external systems

1.2.3 Scheduling and executing pipelines

1.2.4 Monitoring and handling failures

1.2.5 Incremental loading and backfilling

1.3 When to use Airflow

1.3.1 Reasons to choose Airflow

1.3.2 Reasons not to choose Airflow

1.4 The rest of this book

2 Anatomy of an Airflow DAG

2.1 Collecting data from numerous sources

2.2 Writing your first Airflow DAG

2.2.1 Tasks vs. operators

2.2.2 Running arbitrary Python code

2.3 Running a DAG in Airflow

2.3.1 Running Airflow in a Python environment

2.3.2 Running Airflow with Docker

2.3.3 Inspecting the DAG in Airflow

2.4 Running at regular intervals

2.5 Handling failing tasks

2.6 DAG versioning

3 Time-based scheduling

3.1 Processing user events

3.2 The basic components of an Airflow schedule

3.3 Running regularly using trigger-based schedules

3.3.1 Defining a daily schedule

3.3.2 Using cron expressions

3.3.3 Using shorthand expressions

3.3.4 Using frequency-based timetables

3.3.5 Summarizing trigger timetables

3.4 Incremental processing with data intervals

3.4.1 Processing data incrementally

3.4.2 Defining incremental schedules with data intervals

3.4.3 Defining intervals using frequencies

3.4.4 Summarizing interval-based schedules

3.5 Handling irregular intervals

3.6 Managing backfilling of historical data

3.7 Designing well-behaved tasks

3.7.1 Atomicity

3.7.2 Idempotency

4 Asset-aware scheduling

4.1 Challenges of scaling time-based schedules

4.2 Introducing asset-aware scheduling

4.3 Producing asset events

4.4 Consuming asset events

4.5 Adding extra information to events

4.6 Skipping updates

4.7 Consuming multiple assets

4.8 Combining time- and asset-based schedules

5 Templating tasks using the Airflow context

5.1 Inspecting data for processing with Airflow

5.2 Task context and Jinja templating

5.2.1 Templating operator arguments

5.2.2 Templating the PythonOperator

5.2.3 Passing additional variables to the PythonOperator

5.2.4 Inspecting templated arguments

5.3 What is available for templating

5.4 Bringing it all together

6 Defining dependencies between tasks

6.1 Basic dependencies

6.1.1 Linear dependencies

6.1.2 Fan-in/fan-out dependencies

6.2 Branching

6.2.1 Branching within tasks

6.2.2 Branching within the DAG

6.3 Conditional tasks

6.3.1 Conditions within tasks

6.3.2 Making tasks conditional

6.3.3 Using built-in operators

6.4 Exploring trigger rules

6.4.1 What is a trigger rule?

6.4.2 The effect of failures

6.4.3 Other trigger rules

6.5 Sharing data between tasks

6.5.1 Sharing data using XComs

6.5.2 When and when not to use XComs

6.5.3 Using custom XCom backends

6.5.4 XCom cleanup

6.6 Chaining Python tasks with the Taskflow API

6.6.1 Simplifying Python tasks with the Taskflow API

6.6.2 Using the Taskflow API to define a new DAG

6.6.3 When and when not to use the Taskflow API

Part 2 Beyond the basics

7 Triggering workflows with external input

7.1 Polling conditions with sensors

7.1.1 Polling custom conditions

7.1.2 Working with sensors outside the happy flow

7.2 Starting workflows with the REST API and CLI

7.3 Triggering workflows with messages

8 Communicating with external systems

8.1 Installing additional operators

8.2 Developing a machine learning model

8.2.1 Use case: Classifying handwritten digits

8.2.2 Setting up the pipeline

8.2.3 Developing locally with external systems

8.3 Moving data between systems

8.3.1 Use case: Analyzing Airbnb listings

8.3.2 Implementing a PostgresToS3Operator

8.3.3 Outsourcing the heavy work

9 Extending Airflow with custom operators and sensors

9.1 Starting with a PythonOperator

9.1.1 Simulating a movie-rating API

9.1.2 Fetching ratings from the API

9.1.3 Building the actual DAG

9.2 Building a custom hook

9.2.1 Designing a custom hook

9.2.2 Building a DAG with the MovielensHook

9.3 Building a custom operator

9.3.1 Defining a custom operator

9.3.2 Building an operator to fetch ratings

9.4 Building custom sensors

9.5 Building a custom deferrable operator

9.5.1 Executing asynchronous tasks using the triggerer

9.5.2 Running the Movielens sensor asynchronously

9.6 Packaging the components

9.6.1 Bootstrapping a Python package

9.6.2 Installing the package

9.6.3 Sharing the package with others

10 Testing

10.1 Getting started with testing

10.1.1 Integrity testing all DAGs

10.1.2 Setting up a CI/CD pipeline

10.1.3 Writing unit tests

10.1.4 Creating the pytest project structure

10.1.5 Testing with files on disk

10.2 Working with external systems

10.3 Using tests for development

10.4 Testing complete DAGs

10.4.1 Using dag.test() to test the whole DAG

10.4.2 Emulating production environments with Whirl

10.4.3 Creating DTAP environments

11 Running tasks in containers

11.1 Challenges of different operators

11.1.1 Operator interfaces and implementations

11.1.2 Complex and conflicting dependencies

11.1.3 Moving toward a generic operator

11.2 Introducing containers

11.2.1 What are containers?

11.2.2 Running a first Docker container

11.2.3 Creating a Docker image

11.2.4 Persisting data using volumes

11.3 Containers and Airflow

11.3.1 Tasks in containers

11.3.2 Why use containers?

11.4 Running tasks in Docker

11.4.1 Introducing the DockerOperator

11.4.2 Creating container images for tasks

11.4.3 Building a DAG with Docker tasks

11.4.4 Docker-based workflow

11.5 Running tasks in Kubernetes

11.5.1 Introducing Kubernetes

11.5.2 Setting up Kubernetes

11.5.3 Using the KubernetesPodOperator

11.5.4 Diagnosing Kubernetes-related issues

11.5.5 Differences between Kubernetes- and Docker-based workflows

Part 3 Airflow in practice

12 Best practices

12.1 Writing clean DAGs

12.1.1 Using style conventions

12.1.2 Managing credentials centrally

12.1.3 Specifying configuration details consistently

12.1.4 Avoiding computation in your DAG definition

12.1.5 Using factories to generate common patterns

12.1.6 Grouping related tasks with task groups

12.1.7 Being explicit when specifying your DAG schedule

12.1.8 Using Dynamic Task Mapping to generate tasks dynamically

12.2 Designing reproducible tasks

12.2.1 Requiring tasks to be idempotent

12.2.2 Ensuring that task results are deterministic

12.2.3 Designing tasks using functional paradigms

12.3 Handling data efficiently

12.3.1 Limiting the amount of data being processed

12.3.2 Loading/processing data incrementally

12.3.3 Caching intermediate data

12.3.4 Avoiding storing data on local filesystems

12.3.5 Offloading work to external/source systems

12.4 Managing concurrency using pools

13 Project: Finding the fastest way to get around NYC

13.1 Use case: Investigating traffic in New York City

13.2 Understanding the data

13.2.1 Yellow Cab file share

13.2.2 Citi Bike REST API

13.2.3 Deciding on a plan of approach

13.3 Extracting the data

13.3.1 Downloading Citi Bike data

13.3.2 Downloading Yellow Cab data

13.4 Applying similar transformations to data

13.5 Structuring a data pipeline

13.6 Developing idempotent data pipelines

14 Project: Keeping family traditions alive with Airflow and generative AI

14.1 Use case: Bringing family recipes to life

14.2 Fine-tuning an existing LLM

14.3 RAG to the rescue

14.4 Uploading recipes to the Recipe Vault UI

14.5 Preprocessing the recipes with DockerOperator

14.6 Creating a collection to store our recipes

14.6.1 Defining how to vectorize our text

14.6.2 Creating a schema for the collection

14.6.3 Preparing our collection of recipes

14.7 Updating and creating new records in the vector database

14.8 Deleting outdated records from the vector database

14.9 Adding recipes to the vector database

14.10 RAG in action

14.10.1 The R is for retrieving

14.10.2 Structuring our questions with prompt templates

14.10.3 Searching for recipes

Part 4 Airflow in production

15 Operating Airflow in production

15.1 Revisiting the Airflow architecture

15.2 Choosing the executor

15.2.1 Overview of executor types

15.2.2 Which executor is right for you?

15.2.3 Installing each executor

15.3 Configuring the metastore

15.4 Configuring the scheduler

15.4.1 Configuring scheduler components

15.4.2 Running multiple schedulers

15.4.3 Configuring system performance

15.4.4 Controlling the maximum number of running tasks

15.5 Configuring the DAG processor manager

15.6 Capturing logs

15.6.1 Capturing API server output

15.6.2 Capturing scheduler output

15.6.3 Capturing task logs

15.6.4 Sending logs to remote storage

15.7 Visualizing and monitoring Airflow metrics

15.7.1 Collecting metrics from Airflow

15.7.2 Configuring Airflow to send metrics

15.7.3 Configuring Prometheus to collect metrics

15.7.4 Creating dashboards with Grafana

15.7.5 What should you monitor?

15.8 Setting up alerts

15.9 Scaling Airflow beyond a single instance

16 Securing Airflow

16.1 Role-based access in the Airflow UI

16.1.1 Adding users

16.1.2 Configuring the RBAC interface

16.2 Encrypting data at rest

16.3 Connecting with a directory service

16.3.1 Understanding LDAP

16.3.2 Fetching users from an LDAP service

16.4 Encrypting traffic to the web server

16.4.1 Understanding HTTPS

16.4.2 Configuring a certificate for HTTPS

16.5 Fetching credentials from secrets-management systems

17 Airflow deployment options

17.1 Managed Airflow

17.1.1 Astronomer

17.1.2 Google Cloud Composer

17.1.3 Amazon Managed Workflows for Apache Airflow

17.2 Airflow on Kubernetes

17.2.1 Preparing the Kubernetes cluster

17.2.2 Connecting to your Kubernetes cluster

17.2.3 Deploying with the Apache Airflow Helm Chart

17.2.4 Changing the default deployment configuration

17.2.5 Changing the apiserver secret key

17.2.6 Using an external database for Airflow metadata

17.2.7 Deploying DAGs

17.2.8 Deploying a Python library

17.2.9 Configuring the executor(s)

17.3 Choosing a deployment strategy

Appendices

Appendix A: Running code samples

A.1 Understanding the code structure

A.2 Running the examples

A.2.1 Starting the Docker environment

A.2.2 Inspecting running services

A.2.3 Tearing down the environment

Appendix B: Prometheus metric mapping

Overview

6 Defining dependencies between tasks

This chapter deepens the reader’s understanding of how Airflow expresses and enforces relationships between tasks so pipelines run in the right order and make efficient use of parallelism. It starts from simple linear chains and expands to fan-out/fan-in patterns, explaining how explicit dependencies let the scheduler start tasks only when prerequisites have succeeded, propagate failures downstream, and exploit parallel execution where branches are independent. The discussion emphasizes that clear dependency modeling is more robust than time-based orchestration and that retries and error propagation are core to reliable execution.

The chapter then introduces dynamic control flow: branching and conditional execution. It contrasts “branching inside a task” (embedding if/else logic in operators) with “branching in the DAG” using a dedicated branching task that selects downstream paths explicitly, making behavior visible in the UI and enabling specialized operators. A key theme is trigger rules—the conditions that determine when a task can run. Using the default all_success with branching can inadvertently skip downstream work; rules such as none_failed or adding an explicit join task resolve that. Beyond branching, the chapter shows how to make tasks conditional (e.g., deploy only on the latest run) via a guard task that skips downstream tasks, or with built-ins like LatestOnlyOperator or ShortCircuitOperator. It rounds out trigger rules with practical guidance on propagation and alternative rules for cleanup, early-failure signaling, and eager continuation.

For passing small pieces of state across tasks, the chapter covers XComs: how to push and pull values explicitly or implicitly, how templating can consume them, and why to use them sparingly. It cautions about hidden dependencies, serialization and size limits, and suggests custom XCom backends and cleanup strategies when needed. Finally, it presents the Taskflow API, which simplifies Python task definition and chaining with decorators, turning returned values into downstream inputs while making data dependencies explicit in the DAG. The API improves readability but still relies on XCom under the hood and covers a subset of operators, so mixing Taskflow with the regular operator API may be necessary and should be done thoughtfully.

Our rocket-picture-fetching DAG from chapter 2 (originally shown in figure 2.3) consists of three tasks: downloading metadata, fetching the images, and sending a notification.

Overview of the DAG from the umbrella use case in chapter 1

The umbrella DAG, as rendered by Airflow’s graph view. This DAG performs several tasks, including fetching and cleaning sales data, combining them into a data set, and using the data set to train a machine learning model. Note that the handling of sales/weather data happens in separate branches of the DAG, as these tasks are not directly dependent on each other.

The execution order of tasks in the umbrella DAG, with numbers indicating the order in which tasks are run. Airflow starts by executing the start task, after which it can run the sales/weather fetch and clean tasks in parallel (as indicated by the a/b suffix). Note that this means that the weather/sales paths run independently, meaning that 3b may, for example, start executing before 2a. After completing both clean tasks, the rest of the DAG proceeds linearly with the execution of the join, train, and deployment tasks.

A possible example of different sets of tasks between the two ERP systems. If there are a lot of commonalities between different cases, you may be able to get away with a single set of tasks and some internal branching. However, if there are many differences between the two flows (such as shown here for the two ERP systems), you may be better off taking a different approach.

Example run for a DAG that branches between two ERP systems within the fetch_sales and clean_sales tasks. Because this branching happens within these two tasks, it is not possible to see which ERP system was used in this DAG run from this view. This means we would need to inspect our code (or possibly our logs) to identify which ERP system was used.

Supporting two ERP systems using branching within the DAG, implementing different sets of tasks for both systems. Airflow can choose between these two branches using a specific branching task (here, “Pick ERP system”), which tells Airflow which set of downstream tasks to execute.

Combining branching with the wrong trigger rules will result in downstream tasks being skipped. In this example, the fetch_sales_old task is skipped. This results in all tasks downstream of the fetch_sales_old task also being skipped, which is not what we want.

Branching in the umbrella DAG using trigger rule none_failed for the join_datasets task, which allows it (and its downstream dependencies) to still execute after the branch

To make the branching structure clearer, you can add an extra join task after the branch, which ties the lineages of the branch together before continuing with the rest of the DAG. This extra task has the added advantage that you don’t have to change any trigger rules for other tasks in the DAG, as you can set the required trigger rule on the join task. (Note that this means you no longer need to set the trigger rule for the join_datasets task.)

Example run for umbrella DAG with a condition inside the deploy_model task, which ensures that the deployment is only performed for the latest run. Because the condition is checked internally within the deploy_model task, we cannot discern from this view whether the model was actually deployed.

An alternative implementation of the umbrella DAG with conditional deployment, in which the condition is included as a task in the DAG, making the condition much more explicit than in our previous implementation.

Result of our latest_only condition for three runs of our umbrella DAG. This tree view shows that our deployment task was only run for the most recent execution window, as the deploy_model task was skipped on previous executions. This shows that our condition indeed functions as expected.

Tracing the execution of the basic umbrella DAG (figure 6.4) using the default trigger rule all_success. (A) Airflow starts executing the DAG by running the only task that has no upstream tasks still waiting to complete successfully: the start task. (B) Once the start task has completed successfully, other tasks become ready for execution and are picked up by Airflow.

An upstream failure stops downstream tasks from being executed with the default trigger rule all_success, which requires all upstream tasks to be successful. Note that Airflow does continue executing tasks that do not have any dependency on the failed task (fetch_weather and clean_weather).

Overview of registered XCom values (under Browse > XComs in the web interface)

Implicit XComs from the PythonOperator are registered under the return_value key.

Subset of our previous DAG containing the train/deploy tasks, in which tasks and their dependencies are defined using the Taskflow API

Combining the Taskflow-style train/deploy tasks back into the original DAG, which also contains other (non-PythonOperator-based) operators

Summary

Airflow's basic task dependencies define linear and fan-in/fan-out structures in DAGs, maintaining task order and ensuring tasks execute only after their upstream dependencies have been completed.
With the default trigger rule, downstream tasks do not run when an upstream task fails, so the rest of that path waits until the failure is resolved.
Branching enables the definition of parallel workflows and multiple execution paths based on user-defined conditions.
The BranchPythonOperator enables the implementation of branches in DAGs, allowing the use of Python code to conditionally select the task_id of the next task (or tasks) to be executed.
Conditional tasks enable skipping tasks if a specific execution condition is not met, offering flexibility in task execution.
Explicitly encoding branches/conditions in your DAG structure provides substantial benefits in terms of the interpretability of how your DAG was executed.
The execution of Airflow tasks is governed by trigger rules, which dictate the consequences if an upstream task is skipped or fails. For example, by changing the default all_success rule to none_failed, it is possible to continue with the DAG workflow even if a task upstream was skipped.
XComs facilitate the sharing of small data chunks among tasks, which is particularly useful when the output of one task is relevant to the execution of downstream tasks.
XComs are not suitable for sharing large data among tasks. They work well for filenames, paths, or small API responses; for larger datasets, it's recommended to use external resources such as external databases or blob storage.
The Taskflow API simplifies DAGs by converting Python functions into Airflow tasks using decorators, streamlining task creation.

FAQ

How do I define basic dependencies (linear, fan-out, fan-in) between Airflow tasks?

Linear: chain tasks in order with the bitshift operator.

t1 >> t2 >> t3

Fan-out (one-to-many): one upstream to multiple downstream tasks.

start >> [fetch_weather, fetch_sales]

Fan-in (many-to-one): multiple upstream tasks converge to one downstream task.

[clean_weather, clean_sales] >> join_datasets

These structures let Airflow schedule tasks as soon as their upstream dependencies complete successfully.

Which tasks run in parallel, and how does Airflow decide execution order?

Airflow continuously checks tasks and schedules any whose upstream dependencies have met their trigger rules. Independent branches (e.g., weather vs. sales) can run in parallel if your executor has capacity. After fan-out, sibling tasks run concurrently; after fan-in, the downstream task waits until all required upstream tasks satisfy its trigger rule.
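The scheduling behavior described above can be approximated in a few lines of plain Python. This is only a sketch, not Airflow's scheduler: it assumes every task succeeds, applies the default all_success rule, and ignores executor capacity. The task names follow the umbrella DAG, and the execution_waves helper is a hypothetical name introduced for illustration.

```python
# Toy model of dependency-driven scheduling (not Airflow's actual scheduler).
# Maps each task to the set of its direct upstream tasks.
upstream = {
    "start": set(),
    "fetch_weather": {"start"},
    "fetch_sales": {"start"},
    "clean_weather": {"fetch_weather"},
    "clean_sales": {"fetch_sales"},
    "join_datasets": {"clean_weather", "clean_sales"},
    "train_model": {"join_datasets"},
    "deploy_model": {"train_model"},
}


def execution_waves(upstream):
    """Group tasks into waves; tasks in the same wave could run in parallel."""
    done, waves = set(), []
    while len(done) < len(upstream):
        # A task is ready once all of its upstream tasks have finished.
        ready = sorted(
            task for task, deps in upstream.items()
            if task not in done and deps <= done
        )
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        waves.append(ready)
        done.update(ready)
    return waves


for number, wave in enumerate(execution_waves(upstream), start=1):
    print(number, wave)
```

Real runs do not proceed in lockstep like these waves: a clean task may start as soon as its own fetch task finishes, regardless of how far the other branch has progressed.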

Branching inside a task vs. branching in the DAG: what’s the difference?

- In-task branching: put if/else logic inside a PythonOperator. Pros: simple for minor differences. Cons: logic is hidden in code/logs; you can’t leverage specialized operators; the UI can’t show which path ran.
- DAG-level branching: use BranchPythonOperator that returns the task_id(s) to run. Pros: explicit in the graph; works with specialized operators; easier to reason about and monitor.

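from airflow.providers.standard.operators.python import BranchPythonOperator

def choose(**context):
    # Return the task_id (or a list of task_ids) of the branch to follow.
    return "fetch_sales_new"

pick = BranchPythonOperator(task_id="pick_erp_system", python_callable=choose)
pick >> [fetch_sales_old, fetch_sales_new]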
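One way to keep the branch decision easy to test is to put it in a plain function that takes a date and returns a task_id. The sketch below assumes the choice between the two ERP systems depends on a cut-over date; ERP_CHANGE_DATE and its value are hypothetical, as is the pick_erp_system name. In a DAG, a thin callable would read the run's date from the task context and delegate to this function.

```python
from datetime import datetime, timezone

# Hypothetical cut-over date between the two ERP systems (an assumption).
ERP_CHANGE_DATE = datetime(2024, 1, 1, tzinfo=timezone.utc)


def pick_erp_system(logical_date):
    """Return the task_id of the branch that should run for this date."""
    if logical_date < ERP_CHANGE_DATE:
        return "fetch_sales_old"
    return "fetch_sales_new"


print(pick_erp_system(datetime(2023, 6, 1, tzinfo=timezone.utc)))
print(pick_erp_system(datetime(2024, 6, 1, tzinfo=timezone.utc)))
```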

Why are downstream tasks skipped after a BranchPythonOperator, and how do I fix it?

By default, tasks use trigger_rule="all_success". When branching, unchosen tasks are marked skipped, so a downstream join that expects all parents to succeed will be skipped too. Fix by using a trigger rule that tolerates skips, e.g. none_failed (or none_failed_min_one_success).

join_datasets = PythonOperator(
    task_id="join_datasets",
    ...,
    trigger_rule="none_failed",
)

Should I add a dedicated “join” task after branching?

Often yes. Insert an EmptyOperator that merges the branch lineages and apply the permissive trigger rule on that join, keeping the rest of the DAG unchanged and clearer.

from airflow.providers.standard.operators.empty import EmptyOperator

join_branch = EmptyOperator(task_id="join_erp_branch", trigger_rule="none_failed")
[clean_sales_old, clean_sales_new] >> join_branch
join_branch >> join_datasets

What are trigger rules, and which ones are most useful?

Trigger rules control when a task is ready to run based on upstream states.

- all_success (default): run when all parents succeeded.
- none_failed: run when no parent failed (success or skipped allowed).
- all_done: run when all parents finished (success/failed/skipped).
- one_success / one_failed: run as soon as at least one parent succeeds/fails.
- none_skipped: run when no parent was skipped.
- all_failed: run when all parents failed.

Use none_failed to join branches, all_done for cleanup, and eager rules (one_success/one_failed) to react early.
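To see why a join task is skipped under all_success but runs under none_failed, it can help to evaluate a few rules by hand. The following is a simplified model, not Airflow's implementation: it assumes every upstream task has already finished as "success", "failed", or "skipped", and it covers only four of the rules. The is_ready function is a hypothetical helper.

```python
def is_ready(trigger_rule, upstream_states):
    """Decide whether a task may run, given its upstream tasks' final states.

    Simplified: every upstream task is assumed to have finished with one of
    "success", "failed" or "skipped".
    """
    states = list(upstream_states)
    if trigger_rule == "all_success":
        return all(state == "success" for state in states)
    if trigger_rule == "none_failed":
        return all(state in ("success", "skipped") for state in states)
    if trigger_rule == "all_done":
        return True  # all parents are finished by assumption
    if trigger_rule == "one_success":
        return any(state == "success" for state in states)
    raise ValueError(f"rule not modelled in this sketch: {trigger_rule}")


# After a branch, one parent of the join succeeded and the other was skipped.
after_branch = ["success", "skipped"]
print(is_ready("all_success", after_branch))  # False
print(is_ready("none_failed", after_branch))  # True
```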

How do I run a task only for the most recent DAG run (e.g., deploy only once)?

Use LatestOnlyOperator to allow downstream tasks only on the latest run; older runs will mark downstream as skipped.

from airflow.providers.standard.operators.latest_only import LatestOnlyOperator

latest_only = LatestOnlyOperator(task_id="latest_only")
train_model >> latest_only >> deploy_model

Alternatively, insert a Python task that evaluates the condition and raises AirflowSkipException to skip downstream tasks when not the latest run.
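The condition behind such a guard task can be sketched without Airflow. The version below assumes a run counts as the latest when the current time falls in the window that starts at the end of the run's data interval and lasts one schedule interval. SkipTask is a stand-in for AirflowSkipException, and the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


class SkipTask(Exception):
    """Stand-in for AirflowSkipException so the sketch runs without Airflow."""


def latest_only(interval_end, interval_length, now):
    """Raise SkipTask unless `now` falls in the window after this run's interval."""
    left_window = interval_end
    right_window = interval_end + interval_length
    if not left_window < now <= right_window:
        raise SkipTask("not the most recent run")


day = timedelta(days=1)
now = datetime(2024, 1, 10, 12, tzinfo=timezone.utc)

# The most recent daily run passes the check silently.
latest_only(datetime(2024, 1, 10, tzinfo=timezone.utc), day, now)

# An older run (for example, one created by a backfill) is skipped.
try:
    latest_only(datetime(2024, 1, 5, tzinfo=timezone.utc), day, now)
except SkipTask as exc:
    print("skipped:", exc)
```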

How do XComs work, and what are common pitfalls?

- Push: context["task_instance"].xcom_push(key, value) or return a value from PythonOperator/@task.
- Pull: context["task_instance"].xcom_pull(task_ids="train_model", key="model_id") or via Jinja templates.

Pitfalls:

- Hidden dependencies (scheduler doesn’t enforce order); always add explicit task dependencies.
- Keep payloads small and serializable; XComs are stored in the metastore (limits and performance concerns).
- Avoid using XComs to pass expiring credentials or large datasets.
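The hidden-dependency pitfall is easier to see with a toy model. The class below is an in-memory stand-in for the XCom table, not Airflow's API: ToyXComStore and the "model-123" value are hypothetical, and only the (task_id, key) lookup and the return_value default key mirror what the text describes.

```python
class ToyXComStore:
    """In-memory stand-in for the XCom table (illustration only)."""

    def __init__(self):
        self._values = {}

    def push(self, task_id, value, key="return_value"):
        # Values are addressed by the pushing task and a key.
        self._values[(task_id, key)] = value

    def pull(self, task_ids, key="return_value"):
        # Nothing enforces that the producer ran first: a missing value is None.
        return self._values.get((task_ids, key))


store = ToyXComStore()

# Pulling before the producing task has run silently yields None.
print(store.pull(task_ids="train_model", key="model_id"))

store.push("train_model", "model-123", key="model_id")
print(store.pull(task_ids="train_model", key="model_id"))
```

Because a pull on its own does not order the two tasks, the consumer should also be declared downstream of the producer.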

Can I store larger or custom XCom payloads? How do I clean them up?

Yes. Implement a custom XCom backend by subclassing BaseXCom and defining serialize_value/deserialize_value, or use provider backends (e.g., object storage). Configure via the xcom_backend setting. For the default metastore, use airflow db clean to purge old XComs; with custom backends, implement your own retention/cleanup.
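The idea behind an object-storage backend is to keep the payload outside the metastore and store only a small reference in its place. The sketch below shows that idea with plain functions and a dictionary standing in for a bucket. It does not subclass BaseXCom, and the simplified signatures, FAKE_BUCKET, and the "bucket://" prefix are assumptions made for illustration.

```python
import json
import uuid

FAKE_BUCKET = {}  # stands in for object storage such as a blob store
PREFIX = "bucket://"  # hypothetical reference scheme


def serialize_value(value):
    """Store the payload externally and return a small reference string."""
    reference = f"{PREFIX}{uuid.uuid4()}"
    FAKE_BUCKET[reference] = json.dumps(value)
    return reference


def deserialize_value(reference):
    """Resolve a reference back into the original payload."""
    return json.loads(FAKE_BUCKET[reference])


ref = serialize_value({"rows": list(range(1000))})
print(ref.startswith(PREFIX), len(deserialize_value(ref)["rows"]))
```

Only the short reference string would need to be stored as the XCom value; removing stale objects from the external store remains the backend's own responsibility.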

What is the Taskflow API, and when should I use it?

The Taskflow API lets you define Python tasks with @task and wire dependencies by passing function outputs to inputs. Returned values are passed via XCom automatically.

from airflow.sdk import task

@task
def train_model():
    return "model_id"

@task
def deploy_model(model_id):
    print(model_id)

mid = train_model()
deploy_model(mid)

Use it to reduce boilerplate and make data flow explicit. You can mix with classic operators, but be mindful of differing dependency syntax and XCom serialization limits. Not all operators have Taskflow decorators; use the regular API where needed.
