Hadoop in Action
Chuck Lam
  • December 2010
  • ISBN 9781935182191
  • 336 pages
  • printed in black & white

A guide for beginners, a source of insight for advanced users.

Philipp K. Janert, Principal Value, LLC

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

Table of Contents

preface

acknowledgments

about this book

author online

about the author

about the cover illustration

Part I Hadoop—A Distributed Programming Framework

1. Introducing Hadoop

1.1. Why "Hadoop in Action"?

1.2. What is Hadoop?

1.3. Understanding distributed systems and Hadoop

1.4. Comparing SQL databases and Hadoop

1.5. Understanding MapReduce

1.6. Counting words with Hadoop—running your first program

1.7. History of Hadoop

1.8. Summary

1.9. Resources

2. Starting Hadoop

2.1. The building blocks of Hadoop

2.2. Setting up SSH for a Hadoop cluster

2.3. Running Hadoop

2.4. Web-based cluster UI

2.5. Summary

3. Components of Hadoop

3.1. Working with files in HDFS

3.2. Anatomy of a MapReduce program

3.3. Reading and writing

3.4. Summary

Part II Hadoop In Action

4. Writing basic MapReduce programs

4.1. Getting the patent data set

4.2. Constructing the basic template of a MapReduce program

4.3. Counting things

4.4. Adapting for Hadoop’s API changes

4.5. Streaming in Hadoop

4.6. Improving performance with combiners

4.7. Exercising what you’ve learned

4.8. Summary

4.9. Further resources

5. Advanced MapReduce

5.1. Chaining MapReduce jobs

5.2. Joining data from different sources

5.3. Creating a Bloom filter

5.4. Exercising what you’ve learned

5.5. Summary

5.6. Further resources

6. Programming Practices

6.1. Developing MapReduce programs

6.2. Monitoring and debugging on a production cluster

6.3. Tuning for performance

6.4. Summary

7. Cookbook

7.1. Passing job-specific parameters to your tasks

7.2. Probing for task-specific information

7.3. Partitioning into multiple output files

7.4. Inputting from and outputting to a database

7.5. Keeping all output in sorted order

7.6. Summary

8. Managing Hadoop

8.1. Setting up parameter values for practical use

8.2. Checking system’s health

8.3. Setting permissions

8.4. Managing quotas

8.5. Enabling trash

8.6. Removing DataNodes

8.7. Adding DataNodes

8.8. Managing NameNode and Secondary NameNode

8.9. Recovering from a failed NameNode

8.10. Designing network layout and rack awareness

8.11. Scheduling jobs from multiple users

8.12. Summary

Part III Hadoop Gone Wild

9. Running Hadoop in the cloud

9.1. Introducing Amazon Web Services

9.2. Setting up AWS

9.3. Setting up Hadoop on EC2

9.4. Running MapReduce programs on EC2

9.5. Cleaning up and shutting down your EC2 instances

9.6. Amazon Elastic MapReduce and other AWS services

9.7. Summary

10. Programming with Pig

10.1. Thinking like a Pig

10.2. Installing Pig

10.3. Running Pig

10.4. Learning Pig Latin through Grunt

10.5. Speaking Pig Latin

10.6. Working with user-defined functions

10.7. Working with scripts

10.8. Seeing Pig in action—example of computing similar patents

10.9. Summary

11. Hive and the Hadoop herd

11.1. Hive

11.3. Summary

12. Case studies

12.1. Converting 11 million image documents from the New York Times archive

12.2. Mining data at China Mobile

12.3. Recommending the best websites at StumbleUpon

12.4. Building analytics for enterprise search—IBM’s Project ES2

Appendix A: HDFS file commands

index

About the Technology

Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL application framework that runs on distributed clusters, which lets it scale to huge datasets. If you need analytic information from your data, Hadoop is the way to go.

What's inside

  • Introduction to MapReduce
  • Examples illustrating ideas in practice
  • Hadoop's Streaming API
  • Other related tools, like Pig and Hive

About the reader

This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples.

About the author

Chuck Lam is a Senior Engineer at RockYou! He has a PhD in pattern recognition from Stanford University.


A nice mix of the what, why, and how of Hadoop.

Paul Stusiak, Falcon Technologies Corp.

Demystifies Hadoop. A great resource!

Rick Wagner, Acxiom Corp.

Covers it all! Plus, gives you sweet extras no one else does.

John S. Griffin, Overstock.com

An excellent introduction to Hadoop and MapReduce.

Kenneth DeLong, BabyCenter, LLC