Hadoop in Action
Chuck Lam
  • November 2010
  • ISBN 9781935182191
  • 336 pages
  • printed in black & white

A guide for beginners, a source of insight for advanced users.

Philipp K. Janert, Principal Value, LLC

Hadoop in Action introduces the subject and teaches you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show Hadoop use in more complex data analysis tasks. Included are best practices and design patterns of MapReduce programming.

About the Technology

Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL applications framework that runs on distributed clusters. This lets it scale to huge datasets. If you need analytic information from your data, Hadoop's the way to go.

Table of Contents

about this book

author online

about the author

about the cover illustration

Part I Hadoop—A Distributed Programming Framework

1. Introducing Hadoop

1.1. Why "Hadoop in Action"?

1.2. What is Hadoop?

1.3. Understanding distributed systems and Hadoop

1.4. Comparing SQL databases and Hadoop

1.5. Understanding MapReduce

1.6. Counting words with Hadoop—running your first program

1.7. History of Hadoop

1.8. Summary

1.9. Resources

2. Starting Hadoop

2.1. The building blocks of Hadoop

2.2. Setting up SSH for a Hadoop cluster

2.3. Running Hadoop

2.4. Web-based cluster UI

2.5. Summary

3. Components of Hadoop

3.1. Working with files in HDFS

3.2. Anatomy of a MapReduce program

3.3. Reading and writing

3.4. Summary

Part II Hadoop in Action

4. Writing basic MapReduce programs

4.1. Getting the patent data set

4.2. Constructing the basic template of a MapReduce program

4.3. Counting things

4.4. Adapting for Hadoop’s API changes

4.5. Streaming in Hadoop

4.6. Improving performance with combiners

4.7. Exercising what you’ve learned

4.8. Summary

4.9. Further resources

5. Advanced MapReduce

5.1. Chaining MapReduce jobs

5.2. Joining data from different sources

5.3. Creating a Bloom filter

5.4. Exercising what you’ve learned

5.5. Summary

5.6. Further resources

6. Programming Practices

6.1. Developing MapReduce programs

6.2. Monitoring and debugging on a production cluster

6.3. Tuning for performance

6.4. Summary

7. Cookbook

7.1. Passing job-specific parameters to your tasks

7.2. Probing for task-specific information

7.3. Partitioning into multiple output files

7.4. Inputting from and outputting to a database

7.5. Keeping all output in sorted order

7.6. Summary

8. Managing Hadoop

8.1. Setting up parameter values for practical use

8.2. Checking system’s health

8.3. Setting permissions

8.4. Managing quotas

8.5. Enabling trash

8.6. Removing DataNodes

8.7. Adding DataNodes

8.8. Managing NameNode and Secondary NameNode

8.9. Recovering from a failed NameNode

8.10. Designing network layout and rack awareness

8.11. Scheduling jobs from multiple users

8.12. Summary

Part III Hadoop Gone Wild

9. Running Hadoop in the cloud

9.1. Introducing Amazon Web Services

9.2. Setting up AWS

9.3. Setting up Hadoop on EC2

9.4. Running MapReduce programs on EC2

9.5. Cleaning up and shutting down your EC2 instances

9.6. Amazon Elastic MapReduce and other AWS services

9.7. Summary

10. Programming with Pig

10.1. Thinking like a Pig

10.2. Installing Pig

10.3. Running Pig

10.4. Learning Pig Latin through Grunt

10.5. Speaking Pig Latin

10.6. Working with user-defined functions

10.7. Working with scripts

10.8. Seeing Pig in action—example of computing similar patents

10.9. Summary

11. Hive and the Hadoop herd

11.1. Hive

11.3. Summary

12. Case studies

12.1. Converting 11 million image documents from the New York Times archive

12.2. Mining data at China Mobile

12.3. Recommending the best websites at StumbleUpon

12.4. Building analytics for enterprise search—IBM’s Project ES2

Appendix A: HDFS file commands


What's inside

  • Introduction to MapReduce
  • Examples illustrating ideas in practice
  • Hadoop's Streaming API
  • Other related tools, like Pig and Hive

About the reader

This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples.

About the author

Chuck Lam is a Senior Engineer at RockYou! He has a PhD in pattern recognition from Stanford University.
