Hadoop in Practice, Second Edition
Alex Holmes
  • September 2014
  • ISBN 9781617292224
  • 512 pages
  • printed in black & white

Very insightful. A deep dive into the Hadoop world.

Andrea Tarocchi, Red Hat, Inc.

Hadoop in Practice, Second Edition provides over 100 tested, instantly useful techniques that will help you conquer big data using Hadoop. This revised new edition covers changes and new features in the Hadoop core architecture, including MapReduce 2. Brand new chapters cover YARN and integrating Kafka, Impala, and Spark SQL with Hadoop. You'll also get new and updated techniques for Flume, Sqoop, and Mahout, all of which have seen major new versions recently. In short, this is the most practical, up-to-date coverage of Hadoop available anywhere.

Table of Contents

preface

acknowledgments

about this book

about the cover illustration

Part 1 Background and fundamentals

1. Hadoop in a heartbeat

1.1. What is Hadoop?

1.1.1. Core Hadoop components

1.1.2. The Hadoop ecosystem

1.1.3. Hardware requirements

1.1.4. Hadoop distributions

1.1.5. Who’s using Hadoop?

1.1.6. Hadoop limitations

1.2. Getting your hands dirty with MapReduce

1.3. Summary

2. Introduction to YARN

2.1. YARN overview

2.1.1. Why YARN?

2.1.2. YARN concepts and components

2.1.3. YARN configuration

Technique 1 Determining the configuration of your cluster

2.1.4. Interacting with YARN

Technique 2 Running a command on your YARN cluster

Technique 3 Accessing container logs

Technique 4 Aggregating container log files

2.1.5. YARN challenges

2.2. YARN and MapReduce

2.2.1. Dissecting a YARN MapReduce application

2.2.2. Configuration

2.2.3. Backward compatibility

Technique 5 Writing code that works on Hadoop versions 1 and 2

2.2.4. Running a job

Technique 6 Using the command line to run a job

2.2.5. Monitoring running jobs and viewing archived jobs

2.2.6. Uber jobs

Technique 7 Running small MapReduce jobs

2.3. YARN applications

2.3.1. NoSQL

2.3.2. Interactive SQL

2.3.3. Graph processing

2.3.4. Real-time data processing

2.3.5. Bulk synchronous parallel

2.3.6. MPI

2.3.7. In-memory

2.3.8. DAG execution

2.4. Chapter summary

Part 2 Data logistics

3. Data serialization—working with text and beyond

3.1. Understanding inputs and outputs in MapReduce

3.1.1. Data input

3.1.2. Data output

3.2. Processing common serialization formats

3.2.1. XML

Technique 8 MapReduce and XML

3.2.2. JSON

Technique 9 MapReduce and JSON

3.3. Big data serialization formats

3.3.1. Comparing SequenceFile, Protocol Buffers, Thrift, and Avro

3.3.2. SequenceFile

Technique 10 Working with SequenceFiles

Technique 11 Using SequenceFiles to encode Protocol Buffers

3.3.3. Protocol Buffers

3.3.4. Thrift

3.3.5. Avro

Technique 12 Avro’s schema and code generation

Technique 13 Selecting the appropriate way to use Avro in MapReduce

Technique 14 Mixing Avro and non-Avro data in MapReduce

Technique 15 Using Avro records in MapReduce

Technique 16 Using Avro key/value pairs in MapReduce

Technique 17 Controlling how sorting works in MapReduce

Technique 18 Avro and Hive

Technique 19 Avro and Pig

3.4. Columnar storage

3.4.1. Understanding object models and storage formats

3.4.2. Parquet and the Hadoop ecosystem

3.4.3. Parquet block and page sizes

Technique 20 Reading Parquet files via the command line

Technique 21 Reading and writing Avro data in Parquet with Java

Technique 22 Parquet and MapReduce

Technique 23 Parquet and Hive/Impala

Technique 24 Pushdown predicates and projection with Parquet

3.4.4. Parquet limitations

3.5. Custom file formats

3.5.1. Input and output formats

Technique 25 Writing input and output formats for CSV

3.5.2. The importance of output committing

3.6. Chapter summary

4. Organizing and optimizing data in HDFS

4.1. Data organization

4.1.1. Directory and file layout

4.1.2. Data tiers

4.1.3. Partitioning

Technique 26 Using MultipleOutputs to partition your data

Technique 27 Using a custom MapReduce partitioner

4.1.4. Compacting

Technique 28 Using filecrush to compact data

Technique 29 Using Avro to store multiple small binary files

4.1.5. Atomic data movement

4.2. Efficient storage with compression

Technique 30 Picking the right compression codec for your data

Technique 31 Compression with HDFS, MapReduce, Pig, and Hive

Technique 32 Splittable LZOP with MapReduce, Hive, and Pig

4.3. Chapter summary

5. Moving data into and out of Hadoop

5.1. Key elements of data movement

5.2. Moving data into Hadoop

5.2.1. Roll your own ingest

Technique 33 Using the CLI to load files

Technique 34 Using REST to load files

Technique 35 Accessing HDFS from behind a firewall

Technique 36 Mounting Hadoop with NFS

Technique 37 Using DistCp to copy data within and between clusters

Technique 38 Using Java to load files

5.2.2. Continuous movement of log and binary files into HDFS

Technique 39 Pushing system log messages into HDFS with Flume

Technique 40 An automated mechanism to copy files into HDFS

Technique 41 Scheduling regular ingress activities with Oozie

5.2.3. Databases

Technique 42 Using Sqoop to import data from MySQL

5.2.4. HBase

Technique 43 HBase ingress into HDFS

Technique 44 MapReduce with HBase as a data source

5.2.5. Importing data from Kafka

Technique 45 Using Camus to copy Avro data from Kafka into HDFS

5.3. Moving data out of Hadoop

5.3.1. Roll your own egress

Technique 46 Using the CLI to extract files

Technique 47 Using REST to extract files

Technique 48 Reading from HDFS when behind a firewall

Technique 49 Mounting Hadoop with NFS

Technique 50 Using DistCp to copy data out of Hadoop

Technique 51 Using Java to extract files

5.3.2. Automated file egress

Technique 52 An automated mechanism to export files from HDFS

5.3.3. Databases

Technique 53 Using Sqoop to export data to MySQL

5.3.4. NoSQL

5.4. Chapter summary

Part 3 Big data patterns

6. Applying MapReduce patterns to big data

6.1. Joining

Technique 54 Picking the best join strategy for your data

Technique 55 Filters, projections, and pushdowns

6.1.1. Map-side joins

Technique 56 Joining data where one dataset can fit into memory

Technique 57 Performing a semi-join on large datasets

Technique 58 Joining on presorted and prepartitioned data

6.1.2. Reduce-side joins

Technique 59 A basic repartition join

Technique 60 Optimizing the repartition join

Technique 61 Using Bloom filters to cut down on shuffled data

6.1.3. Data skew in reduce-side joins

Technique 62 Joining large datasets with high join-key cardinality

Technique 63 Handling skews generated by the hash partitioner

6.2. Sorting

6.2.1. Secondary sort

Technique 64 Implementing a secondary sort

6.2.2. Total order sorting

Technique 65 Sorting keys across multiple reducers

6.3. Sampling

Technique 66 Writing a reservoir-sampling InputFormat

6.4. Chapter summary

7. Utilizing data structures and algorithms at scale

7.1. Modeling data and solving problems with graphs

7.1.1. Modeling graphs

7.1.2. Shortest-path algorithm

Technique 67 Find the shortest distance between two users

7.1.3. Friends-of-friends algorithm

Technique 68 Calculating FoFs

7.1.4. Using Giraph to calculate PageRank over a web graph

Technique 69 Calculate PageRank over a web graph

7.2. Bloom filters

Technique 70 Parallelized Bloom filter creation in MapReduce

7.3. HyperLogLog

7.3.1. A brief introduction to HyperLogLog

Technique 71 Using HyperLogLog to calculate unique counts

7.4. Chapter summary

8. Tuning, debugging, and testing

8.1. Measure, measure, measure

8.2. Tuning MapReduce

8.2.1. Common inefficiencies in MapReduce jobs

Technique 72 Viewing job statistics

8.2.2. Map optimizations

Technique 73 Data locality

Technique 74 Dealing with a large number of input splits

Technique 75 Generating input splits in the cluster with YARN

8.2.3. Shuffle optimizations

Technique 76 Using the combiner

Technique 77 Blazingly fast sorting with binary comparators

Technique 78 Tuning the shuffle internals

8.2.4. Reducer optimizations

Technique 79 Too few or too many reducers

8.2.5. General tuning tips

Technique 80 Using stack dumps to discover unoptimized user code

Technique 81 Profiling your map and reduce tasks

8.3. Debugging

8.3.1. Accessing container log output

Technique 82 Examining task logs

8.3.2. Accessing container start scripts

Technique 83 Figuring out the container startup command

8.3.3. Debugging OutOfMemory errors

Technique 84 Force container JVMs to generate a heap dump

8.3.4. MapReduce coding guidelines for effective debugging

Technique 85 Augmenting MapReduce code for better debugging

8.4. Testing MapReduce jobs

8.4.1. Essential ingredients for effective unit testing

8.4.2. MRUnit

Technique 86 Using MRUnit to unit-test MapReduce

8.4.3. LocalJobRunner

Technique 87 Heavyweight job testing with the LocalJobRunner

8.4.4. MiniMRYarnCluster

Technique 88 Using MiniMRYarnCluster to test your jobs

8.4.5. Integration and QA testing

8.5. Chapter summary

Part 4 Beyond MapReduce

9. SQL on Hadoop

9.1. Hive

9.1.1. Hive basics

9.1.2. Reading and writing data

Technique 89 Working with text files

Technique 90 Exporting data to local disk

9.1.3. User-defined functions in Hive

Technique 91 Writing UDFs

9.1.4. Hive performance

Technique 92 Partitioning

Technique 93 Tuning Hive joins

9.2. Impala

9.2.1. Impala vs. Hive

9.2.2. Impala basics

Technique 94 Working with text

Technique 95 Working with Parquet

Technique 96 Refreshing metadata

9.2.3. User-defined functions in Impala

Technique 97 Executing Hive UDFs in Impala

9.3. Spark SQL

9.3.1. Spark 101

9.3.2. Spark on Hadoop

9.3.3. SQL with Spark

Technique 98 Calculating stock averages with Spark SQL

Technique 99 Language-integrated queries

Technique 100 Hive and Spark SQL

9.4. Chapter summary

10. Writing a YARN application

10.1. Fundamentals of building a YARN application

10.1.1. Actors

10.1.2. The mechanics of a YARN application

10.2. Building a YARN application to collect cluster statistics

Technique 101 A bare-bones YARN client

Technique 102 A bare-bones ApplicationMaster

Technique 103 Running the application and accessing logs

Technique 104 Debugging using an unmanaged application master

10.3. Additional YARN application capabilities

10.3.1. RPC between components

10.3.2. Service discovery

10.3.3. Checkpointing application progress

10.3.4. Avoiding split-brain

10.3.5. Long-running applications

10.3.6. Security

10.4. YARN programming abstractions

10.4.1. Twill

10.4.2. Spring

10.4.3. REEF

10.4.4. Picking a YARN API abstraction

10.5. Chapter summary

Appendix A: Installing Hadoop and friends

A.1. Code for the book

A.3. Hadoop

A.4. Flume

A.5. Oozie

A.6. Sqoop

A.7. HBase

A.8. Kafka

A.9. Camus

A.10. Avro

A.11. Apache Thrift

A.12. Protocol Buffers

A.13. Snappy

A.14. LZOP

A.15. Elephant Bird

A.16. Hive

A.17. R

A.18. RHadoop

A.19. Mahout

index

bonus chapters available online

11. Integrating R and Hadoop for statistics and more

11.1. Comparing R and MapReduce integrations

11.2. R fundamentals

11.3. R and streaming

11.3.1. Streaming and map-only R

Technique 105 Calculate the daily mean for stocks

11.3.2. Streaming, R, and full MapReduce

Technique 106 Calculate the cumulative moving average for stocks

11.4. RHadoop—a simple integration of client-side R and Hadoop

Technique 107 Calculating CMA with RHadoop

11.5. Chapter summary

12. Predictive analytics with Mahout

12.1. Using recommenders to make product suggestions

12.1.1. Visualizing similarity metrics

12.1.2. The GroupLens dataset

12.1.3. User-based recommenders

12.1.4. Item-based recommenders

Technique 108 Item-based recommenders using movie ratings

12.2. Classification

12.2.1. Writing a homemade naive Bayesian classifier

12.2.2. A scalable spam-detection classification system

Technique 109 Using Mahout to train and test a spam classifier

12.2.3. Additional classification algorithms

12.3. Clustering with K-means

12.3.1. A gentle introduction

12.3.2. Parallel K-means

Technique 110 K-means with a synthetic 2D dataset

12.3.3. K-means and text

12.3.4. Other Mahout clustering algorithms

12.4. Chapter summary

About the book

It's always a good time to upgrade your Hadoop skills! Hadoop in Practice, Second Edition provides a collection of 104 tested, instantly useful techniques for analyzing real-time streams, moving data securely, applying machine learning, managing large-scale clusters, and taming big data using Hadoop. This completely revised edition covers changes and new features in Hadoop core, including MapReduce 2 and YARN. You'll pick up hands-on best practices for integrating Spark, Kafka, and Impala with Hadoop, and get new and updated techniques for the latest versions of Flume, Sqoop, and Mahout. In short, this is the most practical, up-to-date coverage of Hadoop available.

Readers need to know a programming language like Java and have basic familiarity with Hadoop.

What's inside

  • Thoroughly updated for Hadoop 2
  • How to write YARN applications
  • Integrate real-time technologies like Storm, Impala, and Spark
  • Predictive analytics using Mahout and R

About the author

Alex Holmes works on tough big-data problems. He is a software engineer, author, speaker, and blogger specializing in large-scale Hadoop projects.



The most complete material on Hadoop and its ecosystem known to mankind!

Arthur Zubarev, Vital Insights

Clear and concise, full of insights and highly applicable information.

Edward de Oliveira Ribeiro, DataStax, Inc.

Comprehensive up-to-date coverage of Hadoop 2.

Muthusamy Manigandan, OzoneMedia