Algorithms and Data Structures for Massive Datasets
Dzejla Medjedovic, Emin Tahirovic, and Ines Dedovic
  • MEAP began July 2020
  • Publication in Fall 2021 (estimated)
  • ISBN 9781617298035
  • 325 pages (estimated)
  • printed in black & white

Many people know about classical algorithms for common tasks like hashing, sorting, searching. But what if the data to handle can’t fit in memory anymore? This book provides the answers!

Jean-François Morin
Data structures and algorithms that are great for traditional software may quickly slow or fail altogether when applied to huge datasets. Algorithms and Data Structures for Massive Datasets introduces a toolbox of new techniques that are perfect for handling modern big data applications. You'll discover methods for reducing and sketching data so it fits in small memory without losing accuracy, and unlock the algorithms and data structures that form the backbone of a big data system. Filled with fun illustrations and examples from real-world businesses, you'll learn how each of these complex techniques can be practically applied to maximize the accuracy and throughput of big data processing and analytics.

About the Technology

Modern data-intensive applications are outpacing traditional data structures and algorithms. Huge data sets rapidly grow beyond available memory, becoming slow and inefficient, and bottlenecking development. Fortunately, you don't need to blow your budget on expensive upgrades to your computing power! Algorithms and Data Structures for Massive Datasets lays out ways to sketch data in main memory and organize data on disk to make the best use of your available resources. Taken from the latest research papers, these effective techniques apply to any discipline, from finance to text analysis.

About the book

Algorithms and Data Structures for Massive Datasets teaches you to take advantage of data processing and analytics techniques specifically designed for large distributed datasets. And you'll be amazed how easy it is to learn such a challenging topic from this friendly guide! Complex concepts are illustrated with interesting, entertaining graphics and fascinating industry stories that show how these techniques have succeeded in the real world. You'll study examples including Google BigTable, BitCoin, and a smart bed sensor app, learning to build data sketches for processing, querying and exploring large datasets. By the time you're done, you'll be able to identify the perfect algorithm to deliver faster and more reliable results for any data intensive system.
Table of Contents detailed table of contents

1 Introduction

1.1 An example

1.1.1 An example: how to solve it

1.1.2 An example: how to solve it, take two

1.2 The structure of this book

1.3 What makes this book different and whom it is for

1.4 Why is massive data so challenging for today’s systems?

1.4.1 The CPU-memory performance gap

1.4.2 Memory hierarchy

1.4.3 What about distributed systems?

1.5 Summary

Part 1: Hash-Based Sketches

2 Review of Hash Tables and Modern Hashing

2.1 The great idea of hashing

2.2 Usage scenarios in modern systems

2.2.1 Deduplication in backup/storage solutions

2.2.2 Plagiarism detection with MOSS and Rabin-Karp fingerprinting

2.3 O(1) --- what’s the big deal?

2.4 Collision Resolution: theory vs. practice

2.4.1 Cache-efficiency vs. number of probes

2.4.2 Usage scenario: How Python’s dict does it

2.5 Hash Tables for Distributed Systems: Consistent Hashing

2.5.1 Consistent hashing scenario: Chord

2.6 Summary

3 Approximate Membership and Bloom Filter

3.1 How It Works

3.1.1 Insert

3.1.2 Lookup

3.2 Use Cases

3.2.1 Bloom Filters in Networks: Squid

3.2.2 Bitcoin mobile app

3.3 Configuring a Bloom filter for your application

3.3.1 Examples

3.4 A bit of theory

3.4.1 Can we do better?

3.5 Further reading: Bloom filter adaptations and alternatives

3.6 Quotient filter

3.6.1 Quotienting

3.6.2 Resizing

3.7 Summary

4 Frequency Estimation and Count-Min Sketch

4.1 Streaming data

4.2 Count-min sketch: how it works

4.2.1 Update

4.2.2 Estimate

4.2.3 Space and error in count-min sketch

4.3 Use cases

4.3.1 Top-k restless sleepers

4.3.2 Scaling distributional similarity of words

4.4 Range queries with count-min sketch

4.5 Approximate heavy hitters

4.5.1 Majority element

4.5.2 General heavy hitters

4.6 Summary

5 Estimating cardinalty and HyperLogLog

Part 2: Real-time analytics

6 Data streams

7 Sampling from a stream

8 Estimating quantiles and histograms from a stream

Part 3: Data structures for databases and External-memory algorithms

9 External-memory model

10 Data structures for large databases: B-trees and LSM-trees

11 External-memory sorting and batched problems in external memory

What's inside

  • Sketching data structures for practical problems
  • Choosing the right database engine for your application
  • Evaluating and designing efficient on-disk data structures and algorithms
  • Understanding the algorithmic tradeoffs involved in massive-scale systems
  • Deriving basic statistics from streaming data
  • Correctly sampling streaming data
  • Computing percentiles with limited space resources

About the reader

For programmers familiar with fundamental data structures. Examples in R and pseudocode.

About the authors

Dzejla Medjedovic earned her PhD in the Applied Algorithms Lab of the computer science department at Stony Brook University, NY in 2014. She has worked on a number of projects in algorithms for massive data, taught algorithms at various levels and also spent some time at Microsoft. Emin Tahirovic earned his doctorate in biostatistics from UPenn in 2016, and his master's degree in theoretical computer science from Goethe University in Frankfurt in 2008. He has worked for DBahn AG as an IT consultant and he regularly consults on projects for pharma and tech companies. Ines Dedovic earned her PhD at the Institute for Imaging and Computer Vision of the Department of Electrical Engineering at RWTH Aachen University, Germany. She has worked as a researcher at the Research Center Jülich and is currently employed as a software developer for camera systems at Jonas & Redmann, an automation company.

placing your order...

Don't refresh or navigate away from the page.
Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
print book $35.99 $59.99 pBook + eBook + liveBook
Additional shipping charges may apply
Algorithms and Data Structures for Massive Datasets (print book) added to cart
continue shopping
go to cart

eBook $33.59 $47.99 3 formats + liveBook
Algorithms and Data Structures for Massive Datasets (eBook) added to cart
continue shopping
go to cart

Prices displayed in rupees will be charged in USD when you check out.

FREE domestic shipping on three or more pBooks