Modern data science is increasingly constrained by the limits of single-machine computing: long runtimes, unstable code, and fragile workflows emerge as datasets grow. While Python’s open data science stack (NumPy, Pandas, SciPy, scikit-learn) democratized analytics, it shines primarily on small datasets that fit in memory. The chapter frames a practical taxonomy—small, medium, and large data—to clarify when memory pressure, disk paging, and single-core execution become bottlenecks, motivating scalable computing as a core competency. It sets the stage for Dask as a way to carry familiar Python workflows beyond a single machine without forcing a wholesale shift in tools or mindset.
Dask is presented as a Python-native framework designed to scale the existing ecosystem by combining a task scheduler with low-level (Delayed, Futures) and high-level (Array, DataFrame, Bag) APIs. Its collections are composed of underlying NumPy and Pandas partitions, enabling parallel execution over chunks while keeping syntax familiar. The same code can run locally or on clusters with minimal configuration, and Dask’s lightweight setup, flexible parallelism, and fault tolerance make it effective for both medium data on a laptop and large data in distributed environments. A comparison with Spark highlights Dask’s advantages for Python practitioners—reduced JVM overhead, quicker learning curve, and greater flexibility for custom Python logic—while acknowledging Spark’s strengths for large-scale collection operations.
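For illustration, a minimal sketch of this Pandas-like workflow over partitioned data (the file pattern and column names are hypothetical placeholders):

```python
import dask.dataframe as dd

# Read a set of CSV files (hypothetical path) into one Dask DataFrame.
# Each partition is a regular Pandas DataFrame under the hood.
ddf = dd.read_csv("data/citations-*.csv")

# Familiar Pandas-style syntax; this only builds a task graph.
mean_fine = ddf.groupby("Registration State")["Fine Amount"].mean()

# Work happens when .compute() is called: the scheduler runs the
# per-partition tasks in parallel and combines the results.
print(mean_fine.compute())
```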
To explain how Dask orchestrates parallel work, the chapter introduces directed acyclic graphs (DAGs), using a cooking analogy to illustrate nodes, dependencies, directionality, and transitive reduction. DAGs allow the scheduler to prioritize tasks near the final result, keep workers busy, reduce memory pressure, and emit partial outputs sooner. The chapter also covers practical distributed computing concerns: choosing scale up versus scale out, handling concurrency and resource locks, and recovering from failures ranging from worker loss to data loss and even scheduler failure by replaying task lineage. It concludes by introducing a real-world companion dataset—New York City parking citations—used in subsequent chapters to apply these concepts to hands-on data preparation, analysis, and modeling with Dask.
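To make the DAG idea concrete, here is a small dask.delayed sketch of the recipe-style dependencies described above; the function names and return values are purely illustrative.

```python
from dask import delayed

@delayed
def boil_water():
    return "boiling water"

@delayed
def cook_pasta(water):
    # Depends on boil_water's output, creating an edge in the graph.
    return "cooked bucatini"

@delayed
def make_sauce():
    return "amatriciana sauce"

@delayed
def plate(pasta, sauce):
    return f"{pasta} with {sauce}"

# Calling the functions only records nodes and dependencies; no work runs yet.
dinner = plate(cook_pasta(boil_water()), make_sauce())

# dinner.visualize("recipe_dag.png")  # optional: renders the DAG (requires graphviz)
print(dinner.compute())  # the scheduler walks the graph, running independent
                         # branches (pasta vs. sauce) in parallel
```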
Figures in this chapter
- The components and layers that make up Dask
- My favorite recipe for bucatini all'Amatriciana
- A graph displaying nodes with dependencies
- An example of a cyclic graph demonstrating an infinite feedback loop
- The graph represented in figure 1.3 redrawn without transitive reduction
- The full directed acyclic graph representation of the bucatini all'Amatriciana recipe
- Scaling up replaces existing equipment with larger, faster, or more efficient equipment, while scaling out divides the work among many workers in parallel
- A graph with nodes distributed across many workers, depicting dynamic redistribution of work as tasks complete at different times
- An example of resource starvation
Summary
In this chapter you learned
- Dask can be used to scale popular data analysis libraries such as Pandas and NumPy, allowing you to analyze medium and large datasets with ease.
- Dask uses directed acyclic graphs (DAGs) to coordinate execution of parallelized code across CPU cores and machines (see the sketch after this list).
- Directed acyclic graphs are composed of nodes and have a clearly defined start and end, a single traversal path, and no looping.
- Upstream nodes must be completed before work can begin on any dependent downstream nodes.
- Scaling out can generally improve performance of complex workloads, but it creates additional overhead that might substantially reduce those performance gains.
- In the event of a failure, the steps to reach a node can be repeated from the beginning without disturbing the rest of the process.
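As a brief sketch of that coordination, the same task graph (here over a toy chunked array rather than real data) can be executed by different schedulers, from threads on local cores to a distributed cluster:

```python
import dask.array as da

# A chunked array: 10,000 x 10,000 random values in 1,000 x 1,000 blocks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
total = x.sum()

# The same graph, executed by different single-machine schedulers:
print(total.compute(scheduler="threads"))    # parallel across CPU cores
print(total.compute(scheduler="processes"))  # separate worker processes

# With a dask.distributed Client in scope, the same .compute() call is
# routed to that cluster's scheduler instead:
# from dask.distributed import Client
# client = Client()  # local cluster; or Client("tcp://<scheduler-address>:8786")
```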
FAQ
When should I consider using Dask instead of just Pandas/NumPy/Scikit-learn?
Dask becomes valuable when your dataset pushes beyond single-machine RAM, or when computations take painfully long, require paging (spilling to disk), or are hard to parallelize. The Python Open Data Science Stack excels on small, in-memory data; Dask extends those familiar tools to medium and large datasets with native parallelism and distributed execution.
How does the book define small, medium, and large datasets?
Small: up to roughly 2–4 GB, fits comfortably in RAM. Medium: about 10 GB to 2 TB, fits on local disk but not in RAM (often incurs paging and benefits from parallelism). Large: greater than roughly 2 TB, doesn't fit in RAM or on a single machine's disk. Boundaries are fuzzy and depend on hardware; think in orders of magnitude.
What are Dask's main components and layers?
At the core is a task scheduler that executes and monitors computations. Low-level APIs (Dask Delayed and Dask Futures) express work as individual tasks. High-level collections (Dask DataFrame, Array, Bag, etc.) build on these to provide familiar NumPy/Pandas-like interfaces whose operations translate into many parallel low-level tasks.
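A minimal sketch of the low-level Futures API beneath those layers (the `clean` function and its inputs are made up for illustration):

```python
from dask.distributed import Client

def clean(record):
    # Stand-in for arbitrary Python logic.
    return record.strip().lower()

if __name__ == "__main__":
    client = Client()  # starts a local scheduler and workers by default

    # Submit tasks and get Futures back immediately; the scheduler
    # tracks and executes them in the background.
    futures = client.map(clean, ["  Alpha ", " Beta", "GAMMA  "])
    print(client.gather(futures))  # ['alpha', 'beta', 'gamma']

    client.close()
```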
What makes Dask stand out for scalable computing?
- It's Python-native and scales NumPy, Pandas, and Scikit-learn by chunking data into partitions of real underlying objects.
- It works well for medium data on a single machine and large data on a cluster, with minimal code changes (see the sketch after this list).
- It can parallelize general Python workflows via Delayed/Futures, not just collection operations.
- It has low setup and maintenance overhead (pip/conda install, easy cluster images, sensible defaults).
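The sketch below illustrates the "minimal code changes" point: only the Client line differs between a laptop and a cluster, while the analysis code stays the same. The scheduler address and the toy data are hypothetical.

```python
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd

if __name__ == "__main__":
    # On a laptop: a one-line local cluster with sensible defaults.
    client = Client()

    # On a real cluster: only this line changes (address is hypothetical).
    # client = Client("tcp://dask-scheduler.example.com:8786")

    # The analysis code itself is identical in both cases.
    pdf = pd.DataFrame({"borough": ["BX", "BK", "BX"], "fine": [65, 115, 45]})
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(ddf.groupby("borough")["fine"].sum().compute())

    client.close()
```

Scaling out thus becomes a configuration choice rather than a rewrite of the analysis code.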