Table of contents

Part 1: The building blocks of scalable computing

1 Why scalable computing matters

1.1 Why Dask?

1.2 Cooking with DAGs

1.3 Scaling out, concurrency, and recovery

1.3.1 Scale Up vs. Scale Out

1.3.2 Concurrency and resource management

1.3.3 Recovering from failures

1.4 Introducing the companion dataset

Summary

2 Introducing Dask

2.1 Hello Dask: A first look at the DataFrame API

2.2 Visualizing DAGs

2.3 Lazy Computations

2.3.1 Data Locality

Summary

Part 2: Working with structured data using Dask DataFrames

3 Introducing Dask DataFrames

3.1 Why Use DataFrames?

3.2 Dask and Pandas

3.2.1 Managing DataFrame Partitioning

3.2.2 What is the Shuffle?

3.3 Limitations of Dask DataFrames

Summary

4 Loading data into DataFrames

4.1 Reading data from text files

4.1.1 Using Dask Datatypes

4.1.2 Creating Schemas for Dask DataFrames

4.2 Reading data from relational databases

4.3 Reading data from HDFS and S3

4.4 Reading data in Parquet format

Summary

5 Cleaning and transforming DataFrames

5.1 Working with indexes and axes

5.2 Dealing with missing values

5.3 Recoding data

5.4 Elementwise operations

5.5 Filtering and reindexing DataFrames

5.6 Joining and concatenating DataFrames

5.7 Writing data to text files and Parquet files

5.7.1 Writing to delimited text files

5.7.2 Writing to Parquet files

Summary

6 Summarizing and analyzing DataFrames

6.1 Descriptive statistics

6.2 Built-in aggregate functions

6.3 Custom aggregate functions

6.4 Rolling (window) functions

Summary

7 Visualizing DataFrames with Seaborn

7.1 The prepare-reduce-collect-plot pattern

7.2 Visualizing continuous relationships with scatterplot and regplot

7.2.1 Creating a scatterplot with Dask and Seaborn

7.2.2 Adding a linear regression line to the scatterplot

7.2.3 Adding a nonlinear regression line to a scatterplot

7.3 Visualizing Categorical Relationships with violinplot

7.3.1 Creating a violinplot with Dask and Seaborn

7.3.2 Randomly sampling data from a Dask DataFrame

7.4 Visualizing two categorical relationships with heatmap

Summary

8 Visualizing Location Data with Datashader

8.1 What is Datashader and how does it work?

8.1.1 The five stages of the Datashader rendering pipeline

8.1.2 Creating a Datashader Visualization

8.2 Plotting location data as an interactive heatmap

8.2.1 Preparing geographic data for map tiling

8.2.2 Creating the interactive heatmap

Summary

Part 3: Extending and deploying Dask

9 Working with Bags and Arrays

9.1 Reading and parsing unstructured data with Bags

9.1.1 Selecting and viewing data from a Bag

9.1.2 Common parsing issues and how to overcome them

9.1.3 Working with delimiters

9.2 Transforming, filtering, and folding elements

9.2.1 Transforming elements with the map method

9.2.2 Filtering Bags with the filter method

9.2.3 Calculating descriptive statistics on Bags

9.2.4 Creating aggregate functions using the foldby method

9.3 Building Arrays and DataFrames from Bags

9.4 Using Bags for parallel text analysis with NLTK

9.4.1 The basics of bigram analysis

9.4.2 Extracting tokens and filtering stopwords

9.4.3 Analyzing the bigrams

Summary

10 Machine learning with Dask-ML

10.1 Building linear models with Dask-ML

10.1.1 Preparing the data with binary vectorization

10.1.2 Building a logistic regression model with Dask-ML

10.2 Evaluating and tuning Dask-ML models

10.2.1 Evaluating Dask-ML models with the score method

10.2.2 Building a naïve Bayes classifier with Dask-ML

10.2.3 Automatically tuning hyperparameters

10.3 Persisting Dask-ML models

Summary

11 Scaling and deploying Dask

11.1 Building a Dask cluster on Amazon AWS with Docker

11.1.1 Getting started

11.1.2 Creating a security key

11.1.3 Creating the ECS cluster

11.1.4 Configuring the cluster’s networking

11.1.5 Creating a shared data drive in Elastic File System

11.1.6 Allocating space for Docker images in Elastic Container Repository

11.1.7 Building and deploying images for scheduler, worker, and notebook

11.1.8 Connecting to the cluster

11.2 Running and monitoring Dask jobs on a cluster

11.3 Cleaning up the Dask cluster on AWS

Summary

Appendixes

Appendix A: Software Installation

A.1 Installing additional packages with Anaconda

A.2 Installing packages without Anaconda

A.3 Starting a Jupyter Notebook server

A.4 Configuring NLTK

Overview

3 Introducing Dask DataFrames

This chapter introduces Dask DataFrames as a scalable way to work with structured data—rows and columns—by coordinating many smaller Pandas DataFrames through Dask’s task graphs. It explains why DataFrames are a natural fit for common manipulation tasks compared to ad hoc Python structures, and emphasizes fundamental concepts such as axes (row-wise operations along axis 0 by default) and the index, which identifies rows and plays a central role in how Dask distributes work. The index underpins partitioning, enabling Dask to map pieces of a dataset across workers while preserving the semantics of familiar Pandas-like operations.

Dask builds on Pandas by splitting large datasets into partitions that can be processed in parallel across machines, trading a small amount of scheduling overhead for significant speedups at scale. The chapter covers practical partition management: defaults like a 64 MB blocksize when reading CSVs, specifying a target number of partitions, and inspecting layout via divisions and npartitions. It shows how to diagnose imbalance (for example, after filtering) with map_partitions and how to fix it via repartition, noting that changes are lazy until computed. A key performance theme is the “shuffle”—the network-heavy redistribution required by sorts, groupbys, joins, and reindexing—along with strategies to mitigate it: store data pre-sorted when possible, use sorted columns as indices to make lookups and joins efficient, and persist intermediate results to avoid re-shuffling.

The chapter closes with limitations and best practices. Dask DataFrames do not expose the full Pandas API. Because they are immutable, structure-altering operations (like insert/pop) are unavailable, while certain windowed functions (expanding/ewm) and complex reshapes (stack/unstack, melt) are restricted because they induce expensive shuffles. Relational-style operations (join/merge, groupby, rolling) are supported but can become bottlenecks unless aligned on an indexed, sorted key. Indexing itself can be costly if data must be globally sorted, and reset_index behaves per partition, yielding non-unique, restarted sequences. The guidance is to use Dask for ingestion, filtering, and parallelizable transforms, then move reduced results to Pandas for operations better suited to a single-machine workflow, all while following Pandas best practices to get the most from Dask.

The Data Science with Python and Dask workflow

An example of structured data

A Dask representation of the structured data example from Figure 3.1

Dask allows a single Pandas DataFrame to be worked on in parallel by multiple hosts

Processing data in parallel across several machines

A GroupBy operation that requires a shuffle

The result of calling reset_index on a Dask DataFrame

Summary

In this chapter you learned:

Dask DataFrames consist of rows (axis 0), columns (axis 1), and an index.
DataFrame methods tend to operate row-wise by default.
You can inspect how a DataFrame is partitioned by accessing its divisions attribute.
Filtering a DataFrame can cause an imbalance in the size of each partition. For best performance, partitions should be roughly equal in size. It’s a good practice to repartition a DataFrame using the repartition method after filtering a large amount of data.
For best performance, DataFrames should be indexed by a logical column, partitioned by their index, and the index should be pre-sorted.

FAQ

What is structured data and when should I use Dask DataFrames?

Structured data is organized into rows and columns (like spreadsheets or database tables). Use Dask DataFrames when you need to manipulate large, tabular datasets that don’t fit comfortably in memory or when you want to parallelize work across cores or a cluster. For small, in-memory datasets, plain Pandas is usually faster and simpler.

How do Dask DataFrames relate to Pandas and Dask’s Delayed/DAG model?

Dask DataFrames are composed of many smaller Pandas DataFrames (partitions) and operations on them build a task graph (DAG). Dask schedules these tasks across workers to execute in parallel, giving you a Pandas-like API with scalable, distributed execution.

What are axes and the index in a DataFrame, and why do they matter in Dask?

Axis 0 refers to rows (the default for most operations) and Axis 1 refers to columns. The index identifies each row; Dask does not enforce uniqueness, but the index is crucial because Dask uses it to define partition boundaries and to efficiently distribute and locate data across workers.

What is a partition in Dask DataFrames?

A partition is a relatively small Pandas DataFrame that forms one chunk of a Dask DataFrame. Dask processes partitions in parallel. When reading data (e.g., with read_csv), Dask uses a default blocksize of about 64 MB to create partitions; you can change it with the blocksize argument. When building a Dask DataFrame from an existing Pandas DataFrame (e.g., with from_pandas), you can specify a target number of partitions via the npartitions argument.

How can I inspect how my Dask DataFrame is partitioned?

Use .npartitions to see the number of partitions and .divisions to see index-based partition boundaries. You can also apply map_partitions(len).compute() to count rows per partition. In divisions, all but the last partition are left-closed/right-open intervals; the last partition includes its upper bound.

When should I repartition, and how?

Repartition when partitions become imbalanced (e.g., after heavy filtering) or when you want a different partition count. Call df.repartition(npartitions=k). Reducing partitions concatenates; increasing partitions splits. Repartitioning is lazy—no data moves until you trigger execution (e.g., with compute, head, or persist). Existing divisions are retained unless you explicitly update them.

What is a “shuffle” and which operations trigger it?

A shuffle redistributes rows across partitions/workers (broadcasting/rearranging data) so related rows end up together. It’s required for operations like set_index, sorting, many joins/merges (especially on non-index keys), and groupby aggregations. Shuffles are expensive because they move data over the network.

How can I minimize shuffle costs?

- Store data pre-sorted in the source system when possible.
- Use a sorted column as the index to make joins and lookups partition-aware.
- Design joins to align on the index.
- If you must shuffle, persist the result to avoid repeating the shuffle in downstream steps.

What limitations do Dask DataFrames have compared to Pandas?

Dask does not expose the full Pandas API. DataFrames are immutable (no in-place structural changes like insert/pop). Some window methods (e.g., expanding/ewm) and complex reshapes (stack/unstack, melt) are unsupported or very costly due to shuffling. Relational operations (join/merge, groupby, rolling) can be bottlenecks. As with Pandas, row-wise apply/iterrows are slow—prefer vectorized operations. A common pattern is to use Dask for heavy lifting and reduction, then switch to Pandas for complex, smaller-scale operations.

How does reset_index behave in Dask vs Pandas?

In Dask, reset_index is applied per partition (like a map_partitions), so each partition’s index restarts at 0. You do not get a unique, global sequential index across the whole DataFrame. Avoid relying on reset_index for keys you plan to join, group, or sort on.
