If you're doing data analysis with Pandas, NumPy, or Scikit-Learn, you know about THE WALL. At some point, you need to introduce parallelism into your system to handle larger-scale data or analytics tasks. The problem with THE WALL is that it can force you to rewrite your code, redesign your system, or start over with an unfamiliar technology like Spark or Flink.
This is a really good introduction to Dask as an alternative to PySpark.
Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you're already using, including Pandas, NumPy, and Scikit-Learn. It's built to help you parallelize your data tasks on a standalone system, a cluster, or even a massive supercomputer without radically changing the way you work.
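To make that concrete, here's a minimal sketch of the Pandas-like workflow Dask exposes. The file pattern and column names are hypothetical placeholders, not anything taken from the book.

```python
import dask.dataframe as dd

# Read a potentially larger-than-memory CSV into a partitioned Dask DataFrame
# (the glob pattern and column names are illustrative stand-ins)
df = dd.read_csv("parking_tickets_*.csv")

# Familiar Pandas idioms build a lazy task graph instead of running immediately
mean_fine = df.groupby("violation_code")["fine_amount"].mean()

# Nothing executes until compute() is called; Dask then runs the graph
# in parallel across the partitions
result = mean_fine.compute()
print(result)
```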
Part 1: The Building Blocks of Scalable Computing
1. What is Scalable Computing?
1.1. A brief history of big data
1.2. Why Dask?
1.3. Cooking with DAGs
1.4. Scale out, concurrency, and recovery
1.4.1. Scale Up vs. Scale Out
1.4.2. Concurrency and resource management
1.4.3. Recovering from failures
1.5. Introducing the companion dataset
2. Introducing Dask
2.1. Hello Dask!
2.2. Visualizing DAGs
2.3. Lazy Computations
2.3.1. Data Locality
2.3.2. Stages of Computation
Part 2: Working with Large Structured Datasets
3. Introducing Dask DataFrames
3.1. Why Use DataFrames?
3.2. Dask and Pandas
3.2.1. Managing DataFrame Partitioning
3.2.2. What is the Shuffle?
3.3. Limitations of Dask DataFrames
4. Loading Data into DataFrames
5. Cleaning and Transforming DataFrames
6. Summarizing and Analyzing DataFrames
7. Visualizing DataFrames with Seaborn
8. Creating Interactive Visualizations with Bokeh
Part 3: Extending and Deploying Dask
9. Working with Bags and Arrays
10. Creating Custom Task Graphs with Dask Delayed
11. Machine Learning with Dask-ML
12. Scaling Dask Applications with Dask Distributed
About the Technology
Dask is a self-contained, easily extendible library designed to query, stream, filter, and consolidate huge datasets. Dask simplifies data parallelism, so you spend less time on low-level system management and more time exploring your data. Dask has built-in schedulers (subsystems that distribute computing tasks) that create parallelism on a single machine using threads or processes. More advanced schedulers allow you to fully exploit clusters or HPC systems.
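As a rough illustration of how scheduler selection works, here's a sketch assuming a toy CSV and column name; the distributed Client shown defaults to a local cluster when created with no arguments.

```python
import dask.dataframe as dd

df = dd.read_csv("data_*.csv")        # hypothetical file pattern
total = df["amount"].sum()            # hypothetical column; builds a lazy graph

# Single-machine schedulers: pick thread- or process-based parallelism
total.compute(scheduler="threads")
total.compute(scheduler="processes")

# For clusters (or to exploit one machine more fully), dask.distributed
# provides a Client; with no arguments it starts a local cluster
from dask.distributed import Client
client = Client()
print(total.compute())                # now runs on the distributed scheduler
```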
Large datasets tend to be distributed, non-uniform, and prone to change. Dask simplifies the process of ingesting, filtering, and transforming data, reducing or eliminating the need for a heavyweight framework like Spark. You can start and finish even massive data projects in Python.
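For the kind of non-uniform, semi-structured data described above, a Dask Bag pipeline might look like the following sketch; the log files and field names are hypothetical.

```python
import json
import dask.bag as db

# Load newline-delimited JSON logs lazily, one partition per file
lines = db.read_text("logs/*.json")          # hypothetical path
records = lines.map(json.loads)

# Filter and reshape without ever holding the full dataset in memory
errors = records.filter(lambda r: r.get("level") == "ERROR")
per_service = errors.pluck("service").frequencies()

print(per_service.compute())
```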
About the book
Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data. You'll begin with an introduction to the Dask framework, concentrating on how Dask relates to commonly used Python tools like NumPy and Pandas. With a particular focus on machine learning tasks, you'll immediately start ingesting and exploring the NYC 2013-2017 Parking Ticket database. You'll be introduced to Dask DataFrames, work with Pandas, and read and write your data. You'll also dig into visualization with Seaborn and learn several helpful best practices and functions to speed up your work.
Along the way, you'll look deeper into arrays and bags, use Bokeh for interactive visualizations, and learn to create custom task graphs. Finally, you'll learn how to scale your apps as needed, use Dask-ML for easier machine learning, and deploy your app to the world.
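As a small taste of the custom task graphs mentioned here, below is a minimal dask.delayed sketch; the load, clean, and summarize functions are made-up placeholders rather than anything from the book.

```python
from dask import delayed

@delayed
def load(path):
    # Placeholder for reading one file
    return [1, 2, 3]

@delayed
def clean(records):
    return [r for r in records if r > 1]

@delayed
def summarize(parts):
    return sum(len(p) for p in parts)

# Building the graph is lazy; nothing runs until compute()
parts = [clean(load(path)) for path in ["a.csv", "b.csv"]]
total = summarize(parts)

print(total.compute())   # executes the whole graph, in parallel where possible
```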
- Working with large structured datasets
- Writing your own DataFrames
- Cleaning and visualizing your DataFrames
- Machine learning with Dask-ML
- Working with Bags and Arrays
- Building distributed apps with Dask Distributed
- Packaging and deploying your Dask apps
About the reader
Written for data engineers and scientists with experience using Python. Knowledge of the PyData stack (Pandas, NumPy, and Scikit-learn) will be helpful. No experience with low-level parallelism is required.
About the author
Jesse Daniel has five years of experience writing applications in Python, including three years working with the PyData stack (Pandas, NumPy, SciPy, Scikit-Learn). Jesse joined the faculty of the University of Denver in 2016 as an adjunct professor of business information and analytics, where he currently teaches a Python for Data Science course.
This is a really great book for anybody starting to get into data science.
The author does a phenomenal job of highlighting why Dask should be a go-to tool for anyone working in this domain.