contents
preface
acknowledgments
about this book
Author Online
About the author
About the cover illustration
Part I Hadoop–A Distributed Programming Framework
- 1 Introducing Hadoop
- 1.1 Why “Hadoop in Action”?
- 1.2 What is Hadoop?
- 1.3 Understanding distributed systems and Hadoop
- 1.4 Comparing SQL databases and Hadoop
- 1.5 Understanding MapReduce
- 1.6 Counting words with Hadoop—running your first program
- 1.7 History of Hadoop
- 1.8 Summary
- 1.9 Resources
- 2 Starting Hadoop
- 2.1 The building blocks of Hadoop
- 2.2 Setting up SSH for a Hadoop cluster
- 2.3 Running Hadoop
- 2.4 Web-based cluster UI
- 2.5 Summary
- 3 Components of Hadoop
- 3.1 Working with files in HDFS
- 3.2 Anatomy of a MapReduce program
- 3.3 Reading and writing
- 3.4 Summary
Part II Hadoop In Action
- 4 Writing basic MapReduce programs
- 4.1 Getting the patent data set
- 4.2 Constructing the basic template of a MapReduce program
- 4.3 Counting things
- 4.4 Adapting for Hadoop’s API changes
- 4.5 Streaming in Hadoop
- 4.6 Improving performance with combiners
- 4.7 Exercising what you’ve learned
- 4.8 Summary
- 4.9 Further resources
- 5 Advanced MapReduce
- 5.1 Chaining MapReduce jobs
- 5.2 Joining data from different sources
- 5.3 Creating a Bloom filter
- 5.4 Exercising what you’ve learned
- 5.5 Summary
- 5.6 Further resources
- 6 Programming Practices
- 6.1 Developing MapReduce programs
- 6.2 Monitoring and debugging on a production cluster
- 6.3 Tuning for performance
- 6.4 Summary
- 7 Cookbook
- 7.1 Passing job-specific parameters to your tasks
- 7.2 Probing for task-specific information
- 7.3 Partitioning into multiple output files
- 7.4 Inputting from and outputting to a database
- 7.5 Keeping all output in sorted order
- 7.6 Summary
- 8 Managing Hadoop
- 8.1 Setting up parameter values for practical use
- 8.2 Checking system’s health
- 8.3 Setting permissions
- 8.4 Managing quotas
- 8.5 Enabling trash
- 8.6 Removing DataNodes
- 8.7 Adding DataNodes
- 8.8 Managing NameNode and Secondary NameNode
- 8.9 Recovering from a failed NameNode
- 8.10 Designing network layout and rack awareness
- 8.11 Scheduling jobs from multiple users
- 8.12 Summary
Part III Hadoop Gone Wild
- 9 Running Hadoop in the cloud
- 9.1 Introducing Amazon Web Services
- 9.2 Setting up AWS
- 9.3 Setting up Hadoop on EC2
- 9.4 Running MapReduce programs on EC2
- 9.5 Cleaning up and shutting down your EC2 instances
- 9.6 Amazon Elastic MapReduce and other AWS services
- 9.7 Summary
- 10 Programming with Pig
- 10.1 Thinking like a Pig
- 10.2 Installing Pig
- 10.3 Running Pig
- 10.4 Learning Pig Latin through Grunt
- 10.5 Speaking Pig Latin
- 10.6 Working with user-defined functions
- 10.7 Working with scripts
- 10.8 Seeing Pig in action—example of computing similar patents
- 10.9 Summary
- 11 Hive and the Hadoop herd
- 11.1 Hive
- 11.2 Other Hadoop-related stuff
- 11.3 Summary
- 12 Case studies
- 12.1 Converting 11 million image documents from the New York Times archive
- 12.2 Mining data at China Mobile
- 12.3 Recommending the best websites at StumbleUpon
- 12.4 Building analytics for enterprise search—IBM’s Project ES2
appendix HDFS file commands
index