Data Pipelines with Apache Airflow
Bas P. Harenslak and Julian Rutger de Ruiter
  • MEAP began September 2019
  • Publication in Summer 2020 (estimated)
  • ISBN 9781617296901
  • 325 pages (estimated)
  • printed in black & white

A great introduction to Apache Airflow. It's well-written and well thought out.

Kent Spillner
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodge-podge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
Table of Contents

Part 1: Airflow Basics

1 Meet Apache Airflow

1.1 What is Apache Airflow?

1.2 Introducing workflow managers

1.2.1 Workflow as a series of tasks

1.2.2 Expressing task dependencies

1.2.3 Workflow management systems

1.2.4 Configuration as code

1.2.5 Task execution model of workflow management systems

1.3 An overview of the Airflow architecture

1.3.1 Directed Acyclic Graphs

1.3.2 Batch processing

1.3.3 Defined in Python code

1.3.4 Scheduling and backfilling

1.3.5 Handling failures

1.4 How to know if Airflow is right for you

1.4.1 When can Airflow go wrong?

1.4.2 Who will find Airflow useful?

1.5 Summary

2 Anatomy of an Airflow DAG

2.1 Collecting data from numerous sources

2.1.1 Exploring the data

2.2 Writing your first Airflow DAG

2.2.1 Tasks vs operators

2.2.2 Running arbitrary Python code

2.3 Running a DAG in Airflow

2.4 Running at regular intervals

2.5 Handling failing tasks

2.6 Summary

3 Scheduling in Airflow

3.1 An example: processing user events

3.2 Running at regular intervals

3.2.1 Defining scheduling intervals

3.2.2 Cron-based intervals

3.2.3 Frequency-based intervals

3.3 Processing data incrementally

3.3.1 Fetching events incrementally

3.3.2 Dynamic time references using execution dates

3.3.3 Partitioning your data

3.4 Understanding Airflow’s execution dates

3.4.1 Executing work in fixed-length intervals

3.5 Using backfilling to fill in past gaps

3.5.1 Executing work back in time

3.6 Best practices for designing tasks

3.6.1 Atomicity

3.6.2 Idempotency

3.7 Summary

4 Templating Tasks Using the Airflow Context

4.1 Inspecting data for processing with Airflow

4.1.1 Determining how to load incremental data

4.2 Task context & Jinja templating

4.2.1 Templating operator arguments

4.2.2 What is available for templating?

4.2.3 Templating the PythonOperator

4.2.4 Providing variables to the PythonOperator

4.2.5 Inspecting templated arguments

4.3 Hooking up other systems

4.4 Summary

5 Complex task dependencies

5.1 Basic dependencies

5.1.1 Linear dependencies

5.1.2 Fan in/out dependencies

5.2 Branching

5.2.1 Branching within tasks

5.2.2 Branching within the DAG

5.3 Conditional tasks

5.4 More about trigger rules

5.4.1 What is a trigger rule?

5.4.2 A short example

5.4.3 The effect of failures

5.4.4 Other trigger rules

5.5 Summary

6 Triggering workflows

6.1 Polling conditions with sensors

6.1.1 Polling custom conditions

6.1.2 Sensors outside the happy flow

6.2 Triggering other DAGs

6.2.1 Backfilling with the TriggerDagRunOperator

6.2.2 Polling the state of other DAGs

6.3 Starting workflows with REST/CLI

6.4 Summary

Part 2: Beyond the basics

7 Building Custom Components

7.1 Starting with a PythonOperator

7.1.1 Simulating a movie rating API

7.1.2 Fetching ratings from the API

7.1.3 Building the actual DAG

7.2 Building a custom hook

7.2.1 Designing a custom hook

7.2.2 Building our DAG with the Movielens hook

7.3 Building a custom operator

7.3.1 Defining a custom operator

7.3.2 Building an operator for fetching ratings

7.4 Building custom sensors

7.5 Packaging your components

7.5.1 Bootstrapping a Python package

7.5.2 Installing your package

7.6 Summary

8 Testing

9 Communicating with External Systems

10 Best Practices

11 Case studies

Part 3: Airflow operations

12 Running Airflow in production

13 Airflow in the clouds

14 Securing Airflow

15 Future developments

About the Technology

Data pipelines are used to extract, transform, and load data to and from multiple sources, routing it wherever it's needed: visualization tools, business intelligence dashboards, or machine learning models. But pipelines can be challenging to manage, especially when your data has to flow through a collection of application components, servers, and cloud services. That's where Apache Airflow comes in! Airflow streamlines the whole process, giving you one tool for programmatically developing and monitoring batch data pipelines and integrating all the pieces you use in your data stack. Airflow lets you schedule, restart, and backfill pipelines, and its easy-to-use UI and Python-scripted workflows have users praising its incredible flexibility.
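To give a feel for the scheduling and backfilling mentioned above: Airflow divides time into fixed-length intervals and creates one pipeline run per completed interval, so a pipeline whose start date lies in the past has a backlog of intervals to backfill. The following is a plain-Python sketch of that idea, not the Airflow API; the function name and dates are illustrative only.

```python
from datetime import date, timedelta

def completed_intervals(start, today, step=timedelta(days=1)):
    """List the (interval_start, interval_end) pairs that a daily
    schedule would have produced between `start` and `today`."""
    intervals = []
    current = start
    while current + step <= today:
        intervals.append((current, current + step))
        current += step
    return intervals

# A daily pipeline started three days in the past has three completed
# intervals waiting to be (back)filled.
runs = completed_intervals(date(2019, 1, 1), date(2019, 1, 4))
print(len(runs))   # 3
print(runs[0])     # (datetime.date(2019, 1, 1), datetime.date(2019, 1, 2))
```

In real Airflow the scheduler performs this bookkeeping for you; backfilling simply means executing the runs for those past intervals.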

About the book

Data Pipelines with Apache Airflow is your essential guide to working with the powerful Apache Airflow pipeline manager. Expert data engineers Bas Harenslak and Julian de Ruiter take you through best practices for creating pipelines for multiple tasks, including data lakes, cloud deployments, and data science. Part desktop reference, part hands-on tutorial, this book teaches you the ins and outs of the Directed Acyclic Graphs (DAGs) that power Airflow and how to write your own DAGs to meet the needs of your projects. You'll learn how to automate moving and transforming data, manage pipelines by backfilling historical tasks, develop custom components for your specific systems, and set up Airflow in production environments. With complete coverage of both foundational and lesser-known features, when you're done you'll be set to start using Airflow for seamless data pipeline development and management.
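The DAGs at the heart of the book are simply tasks plus dependency edges: a task may run only once everything upstream of it has finished. As a minimal standard-library sketch (the task names are made up, and this is not Airflow's actual API), the dependency structure and a valid execution order can be expressed like this:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on (its upstream tasks).
dag = {
    "fetch_weather": set(),
    "fetch_sales": set(),
    "clean_data": {"fetch_weather", "fetch_sales"},  # fan-in
    "train_model": {"clean_data"},
    "deploy_model": {"train_model"},
}

# static_order() yields the tasks in an order that respects every edge:
# both fetch tasks come before clean_data, which precedes train_model,
# which precedes deploy_model.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow expresses the same structure with operator objects and dependency arrows (e.g. `upstream >> downstream`), and its scheduler decides at runtime which tasks are ready to execute.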

What's inside

  • Framework foundation and best practices
  • Airflow's execution and dependency system
  • Testing Airflow DAGs
  • Running Airflow in production

About the reader

For data-savvy developers, DevOps and data engineers, and system administrators with intermediate Python skills.

About the authors

Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies including Heineken, Unilever, and Booking.com. Bas is an Apache Airflow committer, and both Bas and Julian are active contributors to the project.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
MEAP combo: $49.99 (pBook + eBook + liveBook)
MEAP eBook: $19.99, regularly $39.99 (pdf + ePub + Kindle + liveBook)

