Data Science at Scale with Python and Dask
Jesse C. Daniel
  • MEAP began April 2018
  • Publication in Spring 2019 (estimated)
  • ISBN 9781617295607
  • 375 pages (estimated)
  • printed in black & white

This is a really good introduction to Dask as an alternative to PySpark.

Jeremy Loscheider
If you're doing data analysis using Pandas, NumPy, or Scikit, you know about THE WALL. At some point, you need to introduce parallelism to your system to handle larger-scale data or analytics tasks. The problem with THE WALL is that it can require you to rewrite your code, redesign your system, or start all over using an unfamiliar technology like Spark or Flink.

Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. It's built to help you parallelize your data tasks on a standalone system, a cluster, or even a massive supercomputer without radically changing the way you work.
Table of Contents

Part 1: The Building Blocks of Scalable Computing

1. What is Scalable Computing?

1.1. A brief history of big data

1.2. Why Dask?

1.3. Cooking with DAGs

1.4. Scale out, concurrency, and recovery

1.4.1. Scale Up vs. Scale Out

1.4.2. Concurrency and resource management

1.4.3. Recovering from failures

1.5. Introducing the companion dataset

1.6. Summary

2. Introducing Dask

2.1. Hello Dask!

2.2. Visualizing DAGs

2.3. Lazy Computations

2.3.1. Data Locality

2.3.2. Stages of Computation

2.4. Summary

Part 2: Working with Large Structured Datasets

3. Introducing Dask DataFrames

3.1. Why Use DataFrames?

3.2. Dask and Pandas

3.2.1. Managing DataFrame Partitioning

3.2.2. What is the Shuffle?

3.3. Limitations of Dask DataFrames

3.4. Summary

4. Loading Data into DataFrames

5. Cleaning and Transforming DataFrames

6. Summarizing and Analyzing DataFrames

7. Visualizing DataFrames with Seaborn

8. Creating Interactive Visualizations with Bokeh

Part 3: Extending and Deploying Dask

9. Working with Bags and Arrays

10. Creating Custom Task Graphs with Dask Delayed

11. Machine Learning with Dask-ML

12. Scaling Dask Applications with Dask Distributed

About the Technology

Dask is a self-contained, easily extensible library designed to query, stream, filter, and consolidate huge datasets. Dask simplifies data parallelism, so you spend less time on low-level system management and more time exploring your data. Dask has built-in schedulers—subsystems that distribute computing tasks—that create parallelism on a single machine using threads or processes. More advanced schedulers allow you to fully exploit clusters or HPC systems.
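To make the scheduler idea concrete, here is a short sketch using `dask.delayed`: the decorated calls build a task graph, and `compute()` hands that graph to a scheduler of your choice. The toy functions are invented for illustration; the `dask.delayed` API and the `scheduler` keyword are real.

```python
import dask

# dask.delayed records each call in a task graph
# instead of running it immediately.
@dask.delayed
def increment(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Builds a graph with two independent increments feeding one add;
# nothing has actually executed yet.
total = add(increment(1), increment(2))

# compute() hands the graph to a scheduler. The single-machine
# schedulers can run tasks in a thread pool; scheduler="processes"
# would use a process pool instead.
result = total.compute(scheduler="threads")
```

Because the two `increment` calls do not depend on each other, the scheduler is free to run them in parallel; swapping in the distributed scheduler later requires no change to the graph-building code.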

Large datasets tend to be distributed, non-uniform, and prone to change. Dask simplifies the process of ingesting, filtering, and transforming data, reducing or eliminating the need for a heavyweight framework like Spark. You can start and finish even massive data projects in Python.

About the book

Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data. You'll begin with an introduction to the Dask framework, concentrating on how Dask relates to commonly used Python tools like NumPy and Pandas. With a particular focus on machine learning tasks, you'll immediately start ingesting and exploring the NYC 2013-2017 Parking Ticket database. You'll be introduced to Dask DataFrames, work with Pandas, and read and write your data. You'll also dig into visualization with Seaborn as well as learn several helpful best practices and functions to speed up your work.

Along the way, you'll look deeper into arrays and bags, use Bokeh for interactive visualizations, and learn to create custom task graphs. Finally, you'll learn how to scale your apps as needed, use Dask-ML for easier machine learning, and deploy your app to the world.

What's inside

  • Working with large structured datasets
  • Writing your own DataFrames
  • Cleaning and visualizing your DataFrames
  • Machine learning with Dask-ML
  • Working with Bags and Arrays
  • Building distributed apps with Dask Distributed
  • Packaging and deploying your Dask apps

About the reader

Written for data engineers and scientists with experience using Python. Knowledge of the PyData stack (Pandas, NumPy, and Scikit-learn) will be helpful. No experience with low-level parallelism is required.

About the author

Jesse Daniel has five years of experience writing applications in Python, including three years working with the PyData stack (Pandas, NumPy, SciPy, Scikit-Learn). Jesse joined the faculty of the University of Denver in 2016 as an adjunct professor of business information and analytics, where he currently teaches a Python for Data Science course.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.

MEAP combo $49.99 pBook + eBook
MEAP eBook $39.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks

This is a really great book for anybody to start getting into data science.

Julien Pohie

The author does a phenomenal job of highlighting why Dask should be a go-to tool for anyone working in this domain.

Gregory Matuszek