Overview

1 Meet Apache Airflow

Modern organizations increasingly rely on data, and the rising scale of that data makes robust orchestration essential. This chapter introduces Apache Airflow as an orchestration platform that coordinates work across disparate systems rather than processing data itself. It frames pipelines as collections of time-bounded, batch tasks expressed in Python, highlighting Airflow’s strength as a flexible, extensible “conductor” that stitches together diverse technologies to run reliable, scheduled data workflows.

The chapter explains why modeling pipelines as directed acyclic graphs (DAGs) is powerful: tasks are nodes, dependencies are edges, and the acyclic property prevents deadlocks. This structure enables a straightforward scheduling algorithm, natural parallelism where branches are independent, and the ability to rerun only failed pieces instead of entire scripts. It also situates Airflow within the broader ecosystem of workflow managers, noting key differences such as code-first versus configuration- or UI-driven definitions and the breadth of built-in features like scheduling, monitoring, and backfilling.
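
To make that scheduling idea concrete, the sketch below (plain Python, not Airflow's actual scheduler, with hypothetical task names) repeatedly runs every task whose upstream dependencies have completed; if work remains but nothing is runnable, the graph must contain a cycle, which is exactly what the acyclic requirement rules out.

    # Minimal sketch of executing a DAG: the task names and dependency map
    # are illustrative, not taken from the book's example pipelines.
    dag = {
        "fetch_weather": [],                      # no upstream dependencies
        "clean_weather": ["fetch_weather"],
        "push_to_dashboard": ["clean_weather"],
    }

    completed = set()
    while len(completed) < len(dag):
        runnable = [
            task for task, upstream in dag.items()
            if task not in completed and all(dep in completed for dep in upstream)
        ]
        if not runnable:
            # Only possible if the remaining tasks form a cycle (circular dependency).
            raise RuntimeError("Deadlock: remaining tasks form a cycle")
        for task in runnable:  # independent runnable tasks could run in parallel
            print(f"running {task}")
            completed.add(task)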

Airflow’s core concepts and runtime are outlined: define DAGs in Python, leverage a large provider ecosystem to integrate with external systems, and use schedules to run pipelines on time- or event-driven cadences. The platform’s components (DAG processor, scheduler, workers, triggerer, and API server) collaborate to parse, schedule, execute, and track tasks, with a web UI for visualization, logs, retries, and failure handling. The chapter closes with guidance on fit: Airflow excels at batch and recurring workflows, time-based partitioning, integrations, and engineering best practices, while it’s less suited for real-time streaming, highly fluid DAG structures, or teams without Python experience; it focuses on orchestration rather than data lineage or versioning. The rest of the book takes a hands-on path from fundamentals to advanced patterns and deployment considerations.

Figures

  • For this weather dashboard, weather data is fetched from an external API and fed into a dynamic dashboard.
  • Graph representation of the data pipeline for the weather dashboard. Nodes represent tasks and directed edges represent dependencies between tasks (an edge pointing from task 1 to task 2 indicates that task 1 must run before task 2).
  • Cycles in graphs prevent task execution due to circular dependencies. In the acyclic graph (top), there is a clear path for executing the three tasks. In the cyclic graph (bottom), there is no longer a clear execution path because of the interdependency between tasks 2 and 3.
  • Using the DAG structure to execute tasks in the data pipeline in the correct order: each task’s state is shown during each loop of the algorithm, demonstrating how this leads to the completed execution of the pipeline (end state).
  • Overview of the umbrella demand use case, in which historical weather and sales data are used to train a model that predicts future sales demand depending on weather forecasts.
  • Independence between the sales and weather tasks in the graph representation of the umbrella demand forecast pipeline. The two sets of fetch/cleaning tasks are independent because they involve two different data sets (weather and sales); this independence is indicated by the lack of edges between them.
  • Airflow pipelines are defined as DAGs using Python code in DAG files. Each DAG file typically defines one DAG, which describes the different tasks and their dependencies, along with a schedule interval that determines when Airflow executes the DAG (see the minimal DAG sketch after this list).
  • The main components involved in Airflow are the Airflow API server, scheduler, DAG processor, triggerer, and workers.
  • Developing and executing pipelines as DAGs using Airflow. Once the user has written the DAG, the DAG processor and scheduler ensure that the DAG runs at the right moment. The user can monitor progress and output at all times while the DAG is running.
  • The login page for the Airflow web interface. In the code examples accompanying this book, a default user “airflow” is provided with the password “airflow”.
  • The main page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.
  • The DAGs page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.
  • The graph view in Airflow’s web interface, showing the tasks in an individual DAG and the dependencies between them.
  • Airflow’s grid view, showing the results of multiple runs of the umbrella sales model DAG (most recent and historical runs). Columns show the status of a single DAG execution, and rows show the status of all executions of a single task. Colors (visible in the e-book version) indicate the result of the corresponding task. Users can click the task “squares” for more details about a given task instance, or to manage a task’s state so that Airflow can rerun it if desired.
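
As a companion to the DAG-file caption above, here is a minimal, hypothetical DAG file in that style: one DAG, three tasks, explicit dependencies, and a daily schedule. Task names and commands are placeholders rather than the book's own code, and exact import paths can differ between Airflow versions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # import path may differ per Airflow version

    # Hypothetical weather-dashboard pipeline: fetch -> clean -> publish.
    with DAG(
        dag_id="weather_dashboard",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",  # the schedule determines when Airflow runs the DAG
    ) as dag:
        fetch_weather = BashOperator(
            task_id="fetch_weather",
            bash_command="echo 'fetch forecasts from the weather API'",
        )
        clean_weather = BashOperator(
            task_id="clean_weather",
            bash_command="echo 'clean and transform the raw forecasts'",
        )
        push_to_dashboard = BashOperator(
            task_id="push_to_dashboard",
            bash_command="echo 'publish the results to the dashboard'",
        )

        # Dependencies: run fetch first, then clean, then publish.
        fetch_weather >> clean_weather >> push_to_dashboard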

Summary

  • Directed Acyclic Graphs (DAGs) are a visual tool for representing data workflows in data processing pipelines. A node in a DAG denotes a task to be performed, and edges define the dependencies between tasks. Compared to a single monolithic script, this representation is not only easier to understand visually but also simplifies debugging and rerunning and makes it possible to exploit parallelism.
  • In Airflow, DAGs are defined in Python files (Airflow 3.0 introduced the option of using other languages, but this book focuses on Python). These files outline the tasks, the order in which they execute, and their interdependencies. Airflow parses them to construct the DAG's structure, enabling task orchestration and scheduling.
  • Although many workflow managers have been developed over the years for executing graphs of tasks, Airflow has several key features that make it uniquely suited for implementing efficient, batch-oriented data pipelines.
  • Airflow excels as a workflow orchestration tool due to its intuitive design, scheduling capabilities, and extensible framework. It provides a rich user interface for monitoring and managing tasks in data processing workflows.
  • Airflow is composed of five key components:
    1. DAG Processor: Reads and parses the DAG files and stores the resulting serialized DAGs in the metastore for use by (among others) the scheduler.
    2. Scheduler: Reads the DAGs parsed by the DAG processor, determines whether their schedule intervals have elapsed, and queues their tasks for execution.
    3. Workers: Execute the tasks assigned to them by the scheduler.
    4. Triggerer: Handles the execution of deferred tasks, which are waiting for external events or conditions.
    5. API Server: Among other things, serves the user interface for visualizing and monitoring DAGs and their execution status. The API server also acts as the interface between all Airflow components.
  • Airflow lets you set a schedule for each DAG, specifying when the pipeline should be executed. In addition, Airflow’s built-in mechanisms can handle task failures automatically, for example by retrying failed tasks (see the sketch after this summary).
  • Airflow is well suited for batch-oriented data pipelines, offering sophisticated scheduling options that enable regular, incremental data processing jobs. On the other hand, Airflow is not the right choice for streaming workloads or for highly dynamic pipelines whose structure changes from one run to the next.
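
As a small illustration of the automatic failure handling mentioned above, the hypothetical DAG below asks Airflow to retry a failed task a few times before marking the run as failed; retries and retry_delay are standard task arguments, while the task itself is a placeholder.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # import path may differ per Airflow version

    with DAG(
        dag_id="retry_example",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        default_args={
            "retries": 3,                         # retry a failed task up to 3 times
            "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
        },
    ) as dag:
        flaky_fetch = BashOperator(
            task_id="flaky_fetch",
            bash_command="echo 'call an occasionally unavailable API'",
        )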

FAQ

What is Apache Airflow, and what problem does it solve?
Apache Airflow is an open source platform for authoring, scheduling, and monitoring workflows as directed acyclic graphs (DAGs). It orchestrates tasks across systems rather than processing data itself. Airflow is especially suited to batch and time-based pipelines, coordinating work so tasks run in the right order and at the right time.

How does Airflow fit into the ecosystem of workflow managers?
Airflow is one of many workflow managers (e.g., Luigi, Dagster, Prefect, Argo, Oozie). Its strengths include defining workflows as code (Python), rich scheduling semantics, backfilling, horizontal scalability, and a powerful web UI. Compared to tools that use static formats (XML/YAML/JSON) or lack schedulers, Airflow offers flexible, code-driven pipelines with built-in scheduling and monitoring.

What is a DAG, and why must it be acyclic?
A DAG (directed acyclic graph) represents a pipeline as tasks (nodes) connected by directed dependencies (edges). “Acyclic” means there are no loops, which prevents circular dependencies that would deadlock execution. This structure lets Airflow determine a valid execution order and parallelize independent tasks safely.

How are pipelines defined in Airflow?
Pipelines are defined in Python DAG files that declare tasks and their dependencies, plus metadata such as schedules. Airflow parses these files to build the DAG in its metastore and then schedules and runs them. Starting with Airflow 3.0, other languages are possible, but Python remains the primary and most widely used option.

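Besides the classic operator style shown earlier, the same kind of pipeline can be sketched with Airflow's TaskFlow API, where decorated Python functions become tasks and passing their return values declares the dependencies. The example below is a hypothetical sketch; function names and values are placeholders.

    from datetime import datetime

    from airflow.decorators import dag, task  # TaskFlow API

    @dag(schedule="@daily", start_date=datetime(2025, 1, 1))
    def weather_taskflow():
        @task
        def fetch_weather() -> dict:
            return {"forecast": "sunny"}  # placeholder for a real API call

        @task
        def clean_weather(raw: dict) -> dict:
            return {key: str(value).lower() for key, value in raw.items()}

        @task
        def push_to_dashboard(clean: dict) -> None:
            print(f"publishing {clean}")

        # Passing return values between tasks defines the dependency chain.
        push_to_dashboard(clean_weather(fetch_weather()))

    weather_taskflow()
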
How does Airflow schedule and execute pipelines?
Airflow’s key components are the DAG Processor, Scheduler, Workers, Triggerer, and API Server. The DAG Processor parses DAGs into the metastore; the Scheduler evaluates schedules and task dependencies, queuing runnable tasks; Workers execute tasks; the Triggerer handles deferrable/asynchronous waits; and the API Server provides the web UI and metastore access. This loop repeats, enabling timely and dependency-aware execution.

What does the Airflow UI provide for monitoring and debugging?
The web UI shows an overview of DAGs, their health, and recent runs. Graph and Grid views reveal task structure and historical status, with easy access to logs. Airflow supports retries, notifications, and clearing task states so you can rerun failed or updated tasks selectively.

Why model pipelines as graphs instead of writing a single script?
Graphs make dependencies explicit, allow independent branches to run in parallel, and limit reruns to only failed tasks (plus downstream tasks), instead of re-executing an entire monolithic script. This improves readability, efficiency, and recoverability as pipelines grow in complexity.

What are incremental loading and backfilling, and why are they useful?
Incremental loading runs a pipeline for discrete schedule intervals (e.g., daily), processing only new/changed data since the last run. Backfilling lets you run historical intervals to rebuild or fill datasets after code changes or late data arrival. Together they save time and cost by avoiding full reprocessing and make recomputation straightforward.

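To make incremental loading concrete, the hypothetical DAG below uses Airflow's built-in data_interval_start and data_interval_end template variables so that each run only touches its own schedule interval; backfilling is then just a matter of running the same DAG for historical intervals. The command and catchup setting are illustrative.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # import path may differ per Airflow version

    with DAG(
        dag_id="incremental_example",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=True,  # let Airflow backfill past intervals automatically
    ) as dag:
        fetch_interval = BashOperator(
            task_id="fetch_interval",
            # Jinja templating fills in this run's interval boundaries.
            bash_command=(
                "echo 'fetching records from "
                "{{ data_interval_start }} to {{ data_interval_end }}'"
            ),
        )
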
When should you choose Airflow?
Airflow excels at time-based or event-triggered batch workflows, pipelines that benefit from explicit time intervals, and orchestrations that integrate many external systems. It’s a strong fit when you want workflows-as-code with robust scheduling, monitoring, retries, and easy backfills, and when your team is comfortable with Python and software engineering best practices.

When is Airflow not the right choice?
Avoid Airflow for real-time streaming of individual events, highly dynamic pipelines whose structure changes every run (and must be fully visualized), or teams without sufficient coding experience. Also note Airflow focuses on orchestration and doesn’t natively provide full data lineage or versioning—pair it with specialized tools if you need those features.
