Data Science with Python and Dask
Jesse C. Daniel
  • MEAP began April 2018
  • Publication in July 2019 (estimated)
  • ISBN 9781617295607
  • 296 pages (estimated)
  • printed in black & white

"A must read... Forget PySpark and start using Dask!"
— Al Krinker
If you’re doing data analysis using Pandas, NumPy, or Scikit-Learn, you know about THE WALL. At some point, you need to introduce parallelism to your system to handle larger-scale data or analytics tasks. The problem with THE WALL is that it can require you to rewrite your code, redesign your system, or start all over using an unfamiliar technology like Spark or Flink.

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets using the tools you already know. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!
Table of Contents

Part 1: The Building Blocks of Scalable Computing

1 Why Scalable Computing Matters

1.1 Why Dask?

1.2 Cooking with DAGs

1.3 Scale out, concurrency, and recovery

1.3.1 Scale Up vs. Scale Out

1.3.2 Concurrency and resource management

1.3.3 Recovering from failures

1.4 Introducing the companion dataset

1.5 Summary

2 Introducing Dask

2.1 Hello Dask!

2.2 Visualizing DAGs

2.3 Lazy Computations

2.3.1 Data Locality

2.4 Summary

Part 2: Working with Large Structured Datasets

3 Introducing Dask DataFrames

3.1 Why Use DataFrames?

3.2 Dask and Pandas

3.2.1 Managing DataFrame Partitioning

3.2.2 What is the Shuffle?

3.3 Limitations of Dask DataFrames

3.4 Summary

4 Loading Data into DataFrames

4.1 Reading Data from Text Files

4.1.1 Using Dask Datatypes

4.1.2 Creating Schemas for Dask DataFrames

4.2 Reading Data from Relational Databases

4.3 Reading Data from HDFS and S3

4.4 Reading Data in Parquet Format

4.5 Summary

5 Cleaning and Transforming DataFrames

5.1 Working with Indexes and Axes

5.2 Dealing with Missing Values

5.3 Recoding Data

5.4 Elementwise Operations

5.5 Filtering and Reindexing DataFrames

5.6 Join and Concatenate DataFrames

5.7 Writing Data to Text Files and Parquet Files

5.7.1 Writing to Delimited Text Files

5.7.2 Writing to Parquet Files

5.8 Summary

6 Summarizing and Analyzing DataFrames

6.1 Descriptive Statistics

6.2 Built-In Aggregate Functions

6.3 Custom Aggregate Functions

6.4 Rolling (Window) Functions

6.5 Summary

7 Visualizing DataFrames with Seaborn

7.1 The Prepare-Reduce-Collect-Plot Pattern

7.2 Visualizing Continuous Relationships with pairplot and regplot

7.3 Visualizing Categorical Relationships with violinplot

7.4 Visualizing Two Categorical Relationships with heatmap

7.5 Summary

8 Visualizing Location Data with Datashader

8.1 What Is Datashader and How Does It Work?

8.2 Plotting Location Data as an Interactive Heatmap

8.3 Summary

Part 3: Extending and Deploying Dask

9 Working with Bags and Arrays

9.1 Reading and Parsing Unstructured Data with Bags

9.2 Transforming, Filtering, and Folding Elements

9.3 Building Arrays and DataFrames from Bags

9.4 Using Bags for Parallel Text Analysis with NLTK

9.5 Summary

10 Machine Learning with Dask-ML

10.1 Building Linear Models with Dask-ML

10.2 Evaluating and Tuning Dask-ML Models

10.3 Persisting Dask-ML Models

10.4 Summary

11 Scaling and Deploying Dask

11.1 Building a Dask Cluster on Amazon AWS with Docker

11.1.1 Create a Security Key

11.1.2 Create the ECS Cluster

11.1.3 Configure the Cluster’s Networking

11.1.4 Create a Shared Data Drive in Elastic File System

11.1.5 Allocate Space for Docker Images in Elastic Container Repository

11.1.6 Build and Deploy Images for Scheduler, Worker, and Notebook

11.1.7 Connect to the Cluster

11.2 Running and Monitoring Dask Jobs on a Cluster

11.3 Cleaning Up the Dask Cluster on AWS

11.4 Summary


Appendix A: Software Installation

A.1 Install Additional Packages with Anaconda

A.2 Install Packages without Anaconda

About the Technology

Dask is a self-contained, easily extensible library designed to query, stream, filter, and consolidate huge datasets. Dask simplifies data parallelism, so you spend less time on low-level system management and more time exploring your data. Dask has built-in schedulers—subsystems that distribute computing tasks—that create parallelism on a single machine using threads or processes. More advanced schedulers allow you to fully exploit clusters or HPC systems.
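As a small sketch of that scheduler idea, the same lazy computation can be run under different single-machine schedulers just by naming one at compute time (the toy sum below is illustrative, not from the book):

```python
import dask.bag as db

# A lazy computation over a partitioned collection: square each
# number, then sum. Nothing runs yet.
numbers = db.from_sequence(range(100), npartitions=4)
total = numbers.map(lambda x: x * x).sum()

# Choose a scheduler per call: a thread pool for parallelism,
# or the synchronous scheduler for easy debugging.
print(total.compute(scheduler="threads"))
print(total.compute(scheduler="synchronous"))
```

Swapping schedulers changes how the task graph is executed, not what it computes, which is why the same code can later move to a distributed cluster scheduler unchanged.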

Large datasets tend to be distributed, non-uniform, and prone to change. Dask simplifies the process of ingesting, filtering, and transforming data, reducing or eliminating the need for a heavyweight framework like Spark. You can start and finish even massive data projects in Python.

About the book

Data Science with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data. You’ll begin with an introduction to the Dask framework, concentrating on how Dask natively scales commonly used Python libraries like NumPy and Pandas. With a particular focus on data analysis, you’ll immediately start exploring the huge amount of data found in the NYC 2013-2017 Parking Ticket database. You’ll be introduced to Dask DataFrames and learn helpful code patterns to streamline your analysis. You’ll also dig into visualization with Seaborn and learn to build machine learning models using Dask-ML.

As you work through Dask’s features, you’ll learn how to prepare and analyze the dataset to discover trends and patterns in NYC’s parking enforcement operations. How do the time of year and weather affect issued citations? Is the number of citations rising or falling? You’ll find out, and you’ll figure out how to discover similar trends in your own data! Along the way, you’ll look deeper into Dask Arrays and Bags, use Datashader to build interactive location-based visualizations, and learn to implement your own algorithms using custom task graphs. Finally, you’ll learn how to scale your Dask apps and build your very own Dask cluster using AWS and Docker.
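The "custom task graphs" mentioned above can be sketched with `dask.delayed`, which turns ordinary functions into graph nodes. The load/clean/total pipeline below is a made-up toy, not an example from the book:

```python
from dask import delayed

# Each @delayed call records a node in a task graph instead of
# running immediately; independent nodes can run in parallel.
@delayed
def load(part):
    return list(range(part * 10, part * 10 + 10))

@delayed
def clean(values):
    return [v for v in values if v % 2 == 0]

@delayed
def total(lists):
    return sum(sum(chunk) for chunk in lists)

# Three independent load->clean branches feed one total node.
graph = total([clean(load(p)) for p in range(3)])
print(graph.compute())
```

Because the three `load`/`clean` branches have no dependencies on each other, the scheduler is free to execute them concurrently before the final reduction.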

What's inside

  • Working with large structured datasets
  • Writing your own DataFrames
  • Cleaning and visualizing your DataFrames
  • Machine learning with Dask-ML
  • Working with Bags and Arrays
  • Building distributed apps with Dask Distributed
  • Packaging and deploying your Dask apps

About the reader

Written for data engineers and scientists with experience using Python. Knowledge of the PyData stack (Pandas, NumPy, and Scikit-learn) will be helpful. No experience with low-level parallelism is required.

About the author

Jesse Daniel has five years of experience writing applications in Python, including three years working in the PyData stack (Pandas, NumPy, SciPy, Scikit-Learn). Jesse joined the faculty of the University of Denver in 2016 as an adjunct professor of business information and analytics, where he currently teaches a Python for Data Science course.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
MEAP combo
$49.99 pBook + eBook + liveBook
MEAP eBook
$39.99 pdf + ePub + kindle + liveBook

