Data Pipelines with Apache Airflow
Bas P. Harenslak and Julian Rutger de Ruiter
  • MEAP began September 2019
  • Publication in November 2020 (estimated)
  • ISBN 9781617296901
  • 325 pages (estimated)
  • printed in black & white

"I have looked for a number of other training materials on this subject and this one is the most comprehensive I've seen, giving many examples and best practices!"

Robert Gimbel
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodge-podge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

About the Technology

Data pipelines are used to extract, transform, and load data to and from multiple sources, routing it wherever it’s needed, whether that’s visualization tools, business intelligence dashboards, or machine learning models. But pipelines can be challenging to manage, especially when your data has to flow through a collection of application components, servers, and cloud services. That’s where Apache Airflow comes in! Airflow streamlines the whole process, giving you one tool for programmatically developing and monitoring batch data pipelines, and for integrating all the pieces you use in your data stack. Airflow lets you schedule, restart, and backfill pipelines, and its easy-to-use UI and Python-scripted workflows have users praising its incredible flexibility.
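The "pipeline as a graph" idea is easy to see in plain Python. The sketch below (hypothetical task names, standard library only; it illustrates the concept and is not Airflow's API) resolves a dependency graph into a valid execution order, which is essentially what Airflow's scheduler does for the tasks in a DAG:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the tasks it depends on.
# Airflow models pipelines the same way, as a Directed Acyclic Graph (DAG).
deps = {
    "fetch_sales": [],
    "fetch_weather": [],
    "clean_sales": ["fetch_sales"],
    "join_datasets": ["clean_sales", "fetch_weather"],
    "train_model": ["join_datasets"],
}

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because the dependencies form a graph rather than a linear script, independent branches (here, fetching sales and weather data) can also run in parallel, which is one of the advantages the book develops in Chapter 1.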

About the book

Data Pipelines with Apache Airflow is your essential guide to working with the powerful Apache Airflow pipeline manager. Expert data engineers Bas Harenslak and Julian de Ruiter take you through best practices for creating pipelines for multiple tasks, including data lakes, cloud deployments, and data science. Part desktop reference, part hands-on tutorial, this book teaches you the ins and outs of the Directed Acyclic Graphs (DAGs) that power Airflow, and how to write your own DAGs to meet the needs of your projects. You’ll learn how to automate moving and transforming data, manage pipelines by backfilling historical tasks, develop custom components for your specific systems, and set up Airflow in production environments. With complete coverage of both foundational and lesser-known features, when you’re done you’ll be set to start using Airflow for seamless data pipeline development and management.
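To give a flavor of the backfilling idea the book covers, here is a minimal standard-library sketch (the dates and interval length are made up, and this is not Airflow's actual scheduling code) of how a fixed-length schedule carves time into intervals that can each be processed, or re-processed, independently:

```python
from datetime import date, timedelta

def schedule_intervals(start: date, end: date, days: int = 1):
    """Yield (interval_start, interval_end) pairs, Airflow-style:
    each run processes the fixed-length window that has just closed."""
    cur = start
    step = timedelta(days=days)
    while cur + step <= end:
        yield cur, cur + step
        cur += step

# Backfilling a historical range is then just iterating over past intervals.
runs = list(schedule_intervals(date(2019, 1, 1), date(2019, 1, 4)))
for interval_start, interval_end in runs:
    print(f"process events from {interval_start} to {interval_end}")
```

If each run writes only its own interval's partition, re-running a past interval overwrites the same output, which is the idempotency property the book returns to repeatedly.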
Table of Contents

Part 1: Airflow Basics

1 Meet Apache Airflow

1.1 Introducing data pipelines

1.1.1 Data pipelines as graphs

1.1.2 Executing a pipeline graph

1.1.3 Pipeline graphs vs. sequential scripts

1.1.4 Running pipelines using workflow managers

1.2 Introducing Airflow

1.2.1 Defining pipelines flexibly in (Python) code

1.2.2 Scheduling and executing pipelines

1.2.3 Monitoring and handling failures

1.2.4 Incremental loading and backfilling

1.3 When to use Airflow

1.3.1 Reasons to choose Airflow

1.3.2 Reasons NOT to choose Airflow

1.4 The rest of this book

1.5 Summary

2 Anatomy of an Airflow DAG

2.1 Collecting data from numerous sources

2.1.1 Exploring the data

2.2 Writing your first Airflow DAG

2.2.1 Tasks vs operators

2.2.2 Running arbitrary Python code

2.3 Running a DAG in Airflow

2.4 Running at regular intervals

2.5 Handling failing tasks

2.6 Summary

3 Scheduling in Airflow

3.1 An example: processing user events

3.2 Running at regular intervals

3.2.1 Defining scheduling intervals

3.2.2 Cron-based intervals

3.2.3 Frequency-based intervals

3.3 Processing data incrementally

3.3.1 Fetching events incrementally

3.3.2 Dynamic time references using execution dates

3.3.3 Partitioning your data

3.4 Understanding Airflow’s execution dates

3.4.1 Executing work in fixed-length intervals

3.5 Using backfilling to fill in past gaps

3.5.1 Executing work back in time

3.6 Best practices for designing tasks

3.6.1 Atomicity

3.6.2 Idempotency

3.7 Summary

4 Templating Tasks Using the Airflow Context

4.1 Inspecting data for processing with Airflow

4.1.1 Determining how to load incremental data

4.2 Task context & Jinja templating

4.2.1 Templating operator arguments

4.2.2 What is available for templating?

4.2.3 Templating the PythonOperator

4.2.4 Providing variables to the PythonOperator

4.2.5 Inspecting template arguments

4.3 Hooking up other systems

4.4 Summary

5 Complex Task Orchestration

5.1 Basic dependencies

5.1.1 Linear dependencies

5.1.2 Fan-in/fan-out dependencies

5.2 Branching

5.2.1 Branching within tasks

5.2.2 Branching within the DAG

5.3 Conditional tasks

5.3.1 Conditions within tasks

5.3.2 Making tasks conditional

5.3.3 Using built-in operators

5.4 More about trigger rules

5.4.1 What is a trigger rule?

5.4.2 The effect of failures

5.4.3 Other trigger rules

5.5 Sharing data between tasks

5.5.1 Sharing data using XComs

5.5.2 When (not) to use XComs

5.6 Summary

6 Triggering Workflows

6.1 Polling conditions with sensors

6.1.1 Polling custom conditions

6.1.2 Sensors outside the happy flow

6.2 Triggering other DAGs

6.2.1 Backfilling with the TriggerDagRunOperator

6.2.2 Polling the state of other DAGs

6.3 Starting workflows with REST/CLI

6.4 Summary

Part 2: Beyond the basics

7 Building Custom Components

7.1.1 Simulating a movie rating API

7.1.2 Fetching ratings from the API

7.1.3 Building the actual DAG

7.2 Building a custom hook

7.2.1 Designing a custom hook

7.2.2 Building our DAG with the Movielens hook

7.3 Building a custom operator

7.3.1 Defining a custom operator

7.3.2 Building an operator for fetching ratings

7.4 Building custom sensors

7.5 Packaging your components

7.5.1 Bootstrapping a Python package

7.5.2 Installing your package

7.6 Summary

8 Testing

8.1 Getting started with testing

8.1.1 Integrity testing all DAGs

8.1.2 Setting up a CI/CD pipeline

8.1.3 Writing unit tests

8.1.4 pytest project structure

8.1.5 Testing with files on disk

8.2 Working with DAGs and task context in tests

8.2.1 Working with external systems

8.3 Using tests for development

8.3.1 Testing complete DAGs

8.4 Summary

9 Communicating with External Systems

9.1 Connecting to Cloud Services

9.1.1 Installing extra dependencies

9.1.2 Developing a machine learning model

9.1.3 Developing locally with external systems

9.2 Moving data between systems

9.2.1 Implementing a PostgresToS3Operator

9.2.2 Outsourcing the heavy work?

9.3 Summary

10 Best Practices

10.1 Writing clean DAGs

10.1.1 Use style conventions

10.1.2 Manage credentials centrally

10.1.3 Specify configuration details consistently

10.1.4 Avoid doing any computation in your DAG definition

10.1.5 Use factories to generate common patterns

10.1.6 Create new DAGs for big changes

10.2 Designing reproducible tasks

10.2.1 Always require tasks to be idempotent

10.2.2 Task results should be deterministic

10.2.3 Design tasks using functional paradigms

10.3 Handling data efficiently

10.3.1 Limit the amount of data being processed

10.3.2 Incremental loading/processing

10.3.3 Cache intermediate data

10.3.4 Don’t store data on local file systems

10.3.5 Offload work to external/source systems

10.4 Managing your resources

10.4.1 Manage concurrency using pools

10.4.2 Detect long-running tasks using SLAs and alerts

10.5 Summary

11 Running Tasks in Containers

11.1 Challenges of many different operators

11.1.1 Operator interfaces and implementations

11.1.2 Complex and conflicting dependencies

11.1.3 Moving towards a generic operator

11.2 Introducing containers

11.2.1 What are containers?

11.2.2 Running our first Docker container

11.2.3 Creating a Docker image

11.2.4 Persisting data using volumes

11.3 Containers and Airflow

11.3.1 Tasks in containers

11.3.2 Why use containers?

11.4 Running tasks in Docker

11.4.1 Introducing the DockerOperator

11.4.2 Creating container images for tasks

11.4.3 Building a DAG with dockerized tasks

11.4.4 Docker-based workflow

11.5 Running tasks in Kubernetes

11.5.1 Introducing Kubernetes

11.5.2 Setting up Kubernetes

11.5.3 Using the KubernetesPodOperator

11.5.5 Differences with Docker-based workflows

11.6 Summary

Part 3: Airflow operations

12 Operating Airflow in Production

12.1 Airflow architectures

12.1.1 Which executor is right for me?

12.1.2 Configuring a metastore for Airflow

12.1.3 A closer look at the scheduler

12.2 Installing each executor

12.2.1 Setting up the SequentialExecutor

12.2.2 Setting up the LocalExecutor

12.2.3 Setting up the CeleryExecutor

12.2.4 Setting up the KubernetesExecutor

12.3 Capturing logs of all Airflow processes

12.3.1 Capturing the webserver output

12.3.2 Capturing the scheduler output

12.3.3 Capturing task logs

12.3.4 Sending logs to remote storage

12.4 Visualizing and monitoring Airflow metrics

12.4.1 Collecting metrics from Airflow

12.4.2 Configuring Airflow to send metrics

12.4.3 Configuring Prometheus to collect metrics

12.4.4 Creating dashboards with Grafana

12.4.5 What should you monitor?

12.5 How to get notified of a failing task

12.5.1 Alerting within DAGs and operators

12.5.2 Defining service level agreements

12.6 Scalability and performance

12.6.1 Controlling the maximum number of running tasks

12.6.2 System performance configurations

12.7 Summary

13 Airflow in the Clouds

13.1 Designing (cloud) deployment strategies

13.2 AWS

13.2.1 Deploying in AWS

13.2.2 AWS-specific hooks and operators

13.2.3 Example: serverless movie ranking with AWS Athena

13.3 Azure

13.3.1 Deploying Airflow

13.3.2 Azure-specific hooks/operators

13.3.3 Example: serverless movie ranking with Azure Synapse

13.4 Google Cloud Platform

13.4.1 Deploying Airflow in GCP

13.4.2 GCP-specific hooks and operators

13.4.3 Example: serverless movie ranking on GCP

13.5 Managed services

13.6 Summary

14 Securing Airflow

14.1 Introducing the RBAC interface

14.1.1 Adding users to the RBAC interface

14.1.2 Configuring the RBAC interface

14.2 Encrypting data at rest

14.2.1 Creating a Fernet key

14.3 Connecting with an LDAP service

14.3.1 Understanding LDAP

14.3.2 Fetching users from an LDAP service

14.4 Encrypting traffic to the webserver

14.4.1 Understanding HTTPS

14.4.2 Configuring a certificate for HTTPS

14.5 Fetching credentials from secret management systems

14.6 Summary

15 Project: Fastest Method of Transportation in NYC

15.1 Understanding the data

15.2 Extracting the data

15.2.1 Downloading Citi Bike data

15.2.2 Downloading Yellow Taxi data

15.3 Applying similar transformations to data

15.4 Structuring a data pipeline

15.5 Developing idempotent data pipelines

15.6 Summary

What's inside

  • Framework foundation and best practices
  • Airflow's execution and dependency system
  • Testing Airflow DAGs
  • Running Airflow in production

About the reader

For data-savvy developers, DevOps and data engineers, and system administrators with intermediate Python skills.

About the authors

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies including Heineken and Unilever. Bas is a committer, and both Bas and Julian are active contributors to Apache Airflow.

Manning Early Access Program (MEAP): read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
Print book: $29.99 (list price $49.99), includes pBook + eBook + liveBook
Additional shipping charges may apply

eBook: $24.99 (list price $39.99), available in 3 formats + liveBook

