Data Science at Scale with Python and Dask
Jesse C. Daniel
  • MEAP began April 2018
  • Publication in February 2019 (estimated)
  • ISBN 9781617295607
  • 400 pages (estimated)
  • printed in black & white

This is a really good introduction to Dask as an alternative to PySpark.

Jeremy Loscheider

If you’re doing data analysis using Pandas, NumPy, or Scikit-Learn, you know about THE WALL. At some point, you need to introduce parallelism to your system to handle larger-scale data or analytics tasks. The problem with THE WALL is that getting past it can require you to rewrite your code, redesign your system, or start all over using an unfamiliar technology like Spark or Flink.

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch huge datasets using the tools you already know. And Data Science at Scale with Python and Dask is your guide to using Dask for your data projects without changing the way you work!
Table of Contents

Part 1: The Building Blocks of Scalable Computing

1 Why Scalable Computing Matters

1.1 Why Dask?

1.2 Cooking with DAGs

1.3 Scale out, concurrency, and recovery

1.3.1 Scale Up vs. Scale Out

1.3.2 Concurrency and resource management

1.3.3 Recovering from failures

1.4 Introducing the companion dataset

1.5 Summary

2 Introducing Dask

2.1 Hello Dask!

2.2 Visualizing DAGs

2.3 Lazy Computations

2.3.1 Data Locality

2.4 Summary

Part 2: Working with Large Structured Datasets

3 Introducing Dask DataFrames

3.1 Why Use DataFrames?

3.2 Dask and Pandas

3.2.1 Managing DataFrame Partitioning

3.2.2 What is the Shuffle?

3.3 Limitations of Dask DataFrames

3.4 Summary

4 Loading Data into DataFrames

4.1 Reading Data from Text Files

4.1.1 Using Dask Datatypes

4.1.2 Creating Schemas for Dask DataFrames

4.2 Reading Data from Relational Databases

4.3 Reading Data from HDFS and S3

4.4 Reading Data in Parquet Format

4.5 Summary

5 Cleaning and Transforming DataFrames

5.1 Working with Indexes and Axes

5.2 Dealing with Missing Values

5.3 Recoding Data

5.4 Elementwise Operations

5.5 Filtering and Reindexing DataFrames

5.6 Join and Concatenate DataFrames

5.7 Writing Data to Text Files and Parquet Files

5.7.1 Writing to Delimited Text Files

5.7.2 Writing to Parquet Files

5.8 Summary

6 Summarizing and Analyzing DataFrames

6.1 Descriptive Statistics

6.2 Built-In Aggregate Functions

6.3 Custom Aggregate Functions

6.4 Rolling (Window) Functions

6.5 Summary

7 Visualizing DataFrames with Seaborn

7.1 The Prepare-Reduce-Collect-Plot Pattern

7.2 Visualizing Continuous Relationships with pairplot and regplot

7.3 Visualizing Categorical Relationships with violinplot

7.4 Visualizing Two Categorical Relationships with heatmap

7.5 Summary

8 Visualizing Location Data with Datashader

8.1 What Is Datashader and How Does It Work?

8.2 Plotting Location Data as an Interactive Heatmap

8.3 Summary

Part 3: Extending and Deploying Dask

9 Working with Bags and Arrays

10 Machine Learning with Dask-ML

11 Scaling and Deploying Dask


Appendix A: Software Installation

A.1 Install Additional Packages with Anaconda

A.2 Install Packages without Anaconda

About the Technology

Dask is a self-contained, easily extendible library designed to query, stream, filter, and consolidate huge datasets. Dask simplifies data parallelism, so you spend less time on low-level system management and more time exploring your data. Dask has built-in schedulers—subsystems that distribute computing tasks—that create parallelism on a single machine using threads or processes. More advanced schedulers allow you to fully exploit clusters or HPC systems.
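
To make the idea of a scheduler concrete, here is a minimal sketch that uses only Python's standard library, not Dask itself. The task graph format, the run_graph helper, and the load_a, load_b, total, and combine functions are all hypothetical names invented for this illustration, and Dask's real schedulers are considerably more sophisticated. The sketch only shows the general principle: tasks whose dependencies are already satisfied can be handed to a pool of threads and run in parallel on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical tasks for illustration only.
def load_a():
    return [1, 2, 3]


def load_b():
    return [10, 20, 30]


def total(values):
    return sum(values)


def combine(x, y):
    return x + y


# Toy task graph (an assumption, not Dask's format):
# each key maps to (function, list of keys it depends on).
graph = {
    "a": (load_a, []),
    "b": (load_b, []),
    "sum_a": (total, ["a"]),
    "sum_b": (total, ["b"]),
    "result": (combine, ["sum_a", "sum_b"]),
}


def run_graph(graph, target, max_workers=4):
    """Run the graph in waves until `target` has been computed.

    Every task whose dependencies are already available is submitted
    to the thread pool, so independent tasks run in parallel.
    """
    results = {}
    remaining = dict(graph)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            # A task is ready once all of its dependencies have results.
            ready = [
                key
                for key, (_, deps) in remaining.items()
                if all(dep in results for dep in deps)
            ]
            if not ready:
                raise ValueError("graph has a cycle or a missing dependency")
            # Submit the whole wave before waiting on any single task.
            futures = {
                key: pool.submit(
                    remaining[key][0],
                    *(results[dep] for dep in remaining[key][1]),
                )
                for key in ready
            }
            for key, future in futures.items():
                results[key] = future.result()
                del remaining[key]
    return results[target]


answer = run_graph(graph, "result")
print(answer)  # 66
```

In this toy version the two load tasks run side by side, then the two totals, and finally the combine step. Working in waves keeps the code short, but it is not as efficient as a scheduler that starts each task the moment its inputs are ready.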

Large datasets tend to be distributed, non-uniform, and prone to change. Dask simplifies the process of ingesting, filtering, and transforming data, reducing or eliminating the need for a heavyweight framework like Spark. You can start and finish even massive data projects in Python.

About the book

Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data. You’ll begin with an introduction to the Dask framework, concentrating on how Dask natively scales commonly used Python libraries like NumPy and Pandas. With a particular focus on data analysis, you’ll immediately start exploring the huge amount of data found in the NYC 2013-2017 Parking Ticket database. You’ll be introduced to Dask DataFrames and learn helpful code patterns to streamline your analysis. You’ll also dig into visualization with Seaborn and learn to build machine learning models using Dask-ML.

As you work through Dask’s features, you’ll learn how to prepare and analyze the dataset to discover trends and patterns in NYC’s parking enforcement operations. How do the time of year and the weather affect issued citations? Is the number of citations rising or falling? You’ll find out, and you’ll figure out how to discover similar trends in your own data! Along the way, you’ll look deeper into Dask Arrays and Bags, use Datashader to build interactive location-based visualizations, and learn to implement your own algorithms using custom task graphs. Finally, you’ll learn how to scale your Dask apps and build your very own Dask cluster using AWS and Docker.

What's inside

  • Working with large structured datasets
  • Writing your own DataFrames
  • Cleaning and visualizing your DataFrames
  • Machine learning with Dask-ML
  • Working with Bags and Arrays
  • Building distributed apps with Dask Distributed
  • Packaging and deploying your Dask apps

About the reader

Written for data engineers and scientists with experience using Python. Knowledge of the PyData stack (Pandas, NumPy, and Scikit-Learn) will be helpful. No experience with low-level parallelism is required.

About the author

Jesse Daniel has five years of experience writing applications in Python, including three years working within the PyData stack (Pandas, NumPy, SciPy, Scikit-Learn). Jesse joined the faculty of the University of Denver in 2016 as an adjunct professor of business information and analytics, where he currently teaches a Python for Data Science course.

Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.

This is a really great book for anybody to start getting into data sciences.

Julien Pohie

The author does a phenomenal job of highlighting why DASK should be a go-to tool for anyone working in this domain.

Gregory Matuszek