Hadoop in Practice
Alex Holmes
  • October 2012
  • ISBN 9781617290237
  • 536 pages
  • printed in black & white

Interesting topics that tickle the creative brain.

Mark Kemna, Brillig


Hadoop in Practice, Second Edition is now available. An eBook of this older edition is included at no additional cost when you buy the revised edition!

Hadoop in Practice collects 85 Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you'll face, like querying big data using Pig or writing a log file loader. You'll explore each problem step by step, learning how to build and deploy that specific solution as well as the thinking that went into its design. As you work through the tasks, you'll find yourself growing more comfortable with Hadoop and at home in the world of big data.

Table of Contents



about this book

Part 1 Background and Fundamentals

1. Hadoop in a heartbeat

1.1. What is Hadoop?

1.2. Running Hadoop

1.3. Chapter summary

Part 2 Data Logistics

2. Moving data in and out of Hadoop

2.1. Key elements of ingress and egress

2.2. Moving data into Hadoop

Technique 1 Pushing system log messages into HDFS with Flume

Technique 2 An automated mechanism to copy files into HDFS

Technique 3 Scheduling regular ingress activities with Oozie

Technique 4 Database ingress with MapReduce

Technique 5 Using Sqoop to import data from MySQL

Technique 6 HBase ingress into HDFS

Technique 7 MapReduce with HBase as a data source

2.3. Moving data out of Hadoop

Technique 8 Automated file copying from HDFS

Technique 9 Using Sqoop to export data to MySQL

Technique 10 HDFS egress to HBase

Technique 11 Using HBase as a data sink in MapReduce

2.4. Chapter summary

3. Data serialization—working with text and beyond

3.1. Understanding inputs and outputs in MapReduce

3.2. Processing common serialization formats

Technique 12 MapReduce and XML

Technique 13 MapReduce and JSON

3.3. Big data serialization formats

Technique 14 Working with SequenceFiles

Technique 15 Integrating Protocol Buffers with MapReduce

Technique 16 Working with Thrift

Technique 17 Next-generation data serialization with MapReduce

3.4. Custom file formats

Technique 18 Writing input and output formats for CSV

3.5. Chapter summary

Part 3 Big Data Patterns

4. Applying MapReduce patterns to big data

4.1. Joining

Technique 19 Optimized repartition joins

Technique 20 Implementing a semi-join

4.2. Sorting

Technique 21 Implementing a secondary sort

Technique 22 Sorting keys across multiple reducers

4.3. Sampling

Technique 23 Reservoir sampling

4.4. Chapter summary

5. Streamlining HDFS for big data

5.1. Working with small files

Technique 24 Using Avro to store multiple small files

5.2. Efficient storage with compression

Technique 25 Picking the right compression codec for your data

Technique 26 Compression with HDFS, MapReduce, Pig, and Hive

Technique 27 Splittable LZOP with MapReduce, Hive, and Pig

5.3. Chapter summary

6. Diagnosing and tuning performance problems

6.1. Measuring MapReduce and your environment

6.2. Determining the cause of your performance woes

Technique 28 Investigating spikes in input data

Technique 29 Identifying map-side data skew problems

Technique 30 Determining if map tasks have an overall low throughput

Technique 31 Small files

Technique 32 Unsplittable files

Technique 33 Too few or too many reducers

Technique 34 Identifying reduce-side data skew problems

Technique 35 Determining if reduce tasks have an overall low throughput

Technique 36 Slow shuffle and sort

Technique 37 Competing jobs and scheduler throttling

Technique 38 Using stack dumps to discover unoptimized user code

Technique 39 Discovering hardware failures

Technique 40 CPU contention

Technique 41 Memory swapping

Technique 42 Disk health

Technique 43 Networking

6.3. Visualization

Technique 44 Extracting and visualizing task execution times

6.4. Tuning

Technique 45 Profiling your map and reduce tasks

Technique 46 Avoid the reducer

Technique 47 Filter and project

Technique 48 Using the combiner

Technique 49 Blazingly fast sorting with comparators

Technique 50 Collecting skewed data

Technique 51 Reduce skew mitigation

6.5. Chapter summary

Part 4 Data Science

7. Utilizing data structures and algorithms

7.1. Modeling data and solving problems with graphs

Technique 52 Find the shortest distance between two users

Technique 53 Calculating FoFs

Technique 54 Calculate PageRank over a web graph

7.2. Bloom filters

Technique 55 Parallelized Bloom filter creation in MapReduce

Technique 56 MapReduce semi-join with Bloom filters

7.3. Chapter summary

8. Integrating R and Hadoop for statistics and more

8.1. Comparing R and MapReduce integrations

8.2. R fundamentals

8.3. R and Streaming

Technique 57 Calculate the daily mean for stocks

Technique 58 Calculate the cumulative moving average for stocks

8.4. Rhipe—Client-side R and Hadoop working together

Technique 59 Calculating the CMA using Rhipe

8.5. RHadoop—a simpler integration of client-side R and Hadoop

Technique 60 Calculating CMA with RHadoop

8.6. Chapter summary

9. Predictive analytics with Mahout

9.1. Using recommenders to make product suggestions

Technique 61 Item-based recommenders using movie ratings

9.2. Classification

Technique 62 Using Mahout to train and test a spam classifier

9.3. Clustering with K-means

Technique 63 K-means with a synthetic 2D dataset

9.4. Chapter summary

Part 5 Taming the Elephant

10. Hacking with Hive

10.1. Hive fundamentals

10.2. Data analytics with Hive

Technique 64 Loading log files

Technique 65 Writing UDFs and compressed partitioned tables

Technique 66 Tuning Hive joins

10.3. Chapter summary


11. Programming pipelines with Pig

11.1. Pig fundamentals

11.2. Using Pig to find malicious actors in log data

Technique 67 Schema-rich Apache log loading

Technique 68 Reducing your data with filters and projection

Technique 69 Grouping and counting IP addresses

Technique 70 IP Geolocation using the distributed cache

Technique 71 Combining Pig with your scripts

Technique 72 Combining data in Pig

Technique 73 Sorting tuples

Technique 74 Storing data in SequenceFiles

11.3. Optimizing user workflows with Pig

Technique 75 A four-step process to working rapidly with big data

11.4. Performance

Technique 76 Pig optimizations

11.5. Chapter summary

12. Crunch and other technologies

12.1. What is Crunch?

Technique 77 Crunch log parsing and basic analytics

12.3. Joins

Technique 78 Crunch’s repartition join

12.4. Cascading

12.5. Chapter summary

13. Testing and debugging

13.1. Testing

Technique 79 Unit Testing MapReduce functions, jobs, and pipelines

Technique 80 Heavyweight job testing with the LocalJobRunner

13.2. Debugging user space problems

Technique 81 Examining task logs

Technique 82 Pinpointing a problem Input Split

Technique 83 Figuring out the JVM startup arguments for a task

Technique 84 Debugging and error handling

13.3. MapReduce gotchas

Technique 85 MapReduce anti-patterns

13.4. Chapter summary

Appendix B: Hadoop built-in ingress and egress tools

Appendix C: HDFS dissected

Appendix D: Optimized MapReduce join frameworks


About the Technology

Hadoop is an open source MapReduce platform designed to query and analyze data distributed across large clusters. Especially effective for big data systems, Hadoop powers mission-critical software at Apple, eBay, LinkedIn, Yahoo, and Facebook. It offers developers handy ways to store, manage, and analyze data.

About the book

Hadoop in Practice collects 85 battle-tested examples and presents them in a problem/solution format. It balances conceptual foundations with practical recipes for key problem areas like data ingress and egress, serialization, and LZO compression. You'll explore each technique step by step, learning how to build a specific solution along with the thinking that went into it. As a bonus, the book's examples create a well-structured and understandable codebase you can tweak to meet your own needs.

What's inside

  • Conceptual overview of Hadoop and MapReduce
  • 85 practical, tested techniques
  • Real problems, real solutions
  • How to integrate MapReduce and R

About the reader

This book assumes you've already started exploring Hadoop and want concrete advice on how to use it in production.

About the author

Alex Holmes is a senior software engineer with extensive expertise in solving big data problems using Hadoop. He has presented at JavaOne and Jazoon and is a technical lead at VeriSign.

Ties together the Hadoop ecosystem technologies.

Ayon Sinha, Britely

Comprehensive … high-quality code samples.

Chris Nauroth, The Walt Disney Company

Covers all of the variants of Hadoop, not just the Apache distribution.

Ted Dunning, MapR Technologies

Charts a path to the future.

Alexey Gayduk, Grid Dynamics