Hadoop is an open source framework that implements MapReduce, the programming model behind Google’s approach to querying the distributed data sets that constitute the internet. This definition naturally leads to an obvious question: what are maps and why do they need to be reduced? Massive data sets can be extremely difficult to analyze and query using traditional mechanisms, especially when the queries themselves are complicated. In effect, MapReduce splits a large data set into pieces and applies the same processing to each piece in parallel (that’s the mapping); the intermediate results of the map tasks are then aggregated, or reduced, to rapidly produce the final answer.
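To make this concrete, here is a sketch of the canonical word-count job written against Hadoop’s Java MapReduce API (the org.apache.hadoop.mapreduce package): the mapper emits a (word, 1) pair for each word it sees, and the reducer sums the counts for each word. The class names are our own, and the driver boilerplate that configures and submits the job is omitted:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // one (word, 1) pair per occurrence
            }
        }
    }

    // Reduce step: the framework groups pairs by word; sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);  // final (word, total) pair
        }
    }
}

The map tasks run in parallel across the cluster, one per input split; the framework’s shuffle phase then routes all counts for a given word to the same reducer.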
This book teaches readers how to use Hadoop and write MapReduce programs. The intended readers are programmers, architects, and project managers who have to process large amounts of data offline. The book guides the reader from obtaining a copy of Hadoop, through setting it up in a cluster, to writing data analysis programs.
The book begins by making the basic idea of Hadoop and MapReduce easier to grasp by applying the default Hadoop installation to a few easy-to-follow tasks, such as analyzing changes in word frequency across a body of documents. The book continues through the basic concepts of MapReduce applications developed using Hadoop, including a close look at framework components, use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action.
MapReduce is a complex idea both conceptually and in its implementation, and Hadoop users are challenged to learn all the knobs and levers for running Hadoop. This book takes you beyond the mechanics of running Hadoop, teaching you to write meaningful programs in a MapReduce framework.
This book assumes the reader has a basic familiarity with Java, as most code examples are written in Java. Familiarity with basic statistical concepts (e.g., histogram, correlation) will help the reader appreciate the more advanced data processing examples.
The book has 12 chapters divided into three parts.
Part 1 consists of three chapters that introduce the Hadoop framework, covering the basics you’ll need to understand and use Hadoop. These chapters describe the hardware components that make up a Hadoop cluster, as well as the installation and configuration needed to create a working system. Part 1 also covers the MapReduce framework at a high level and gets your first MapReduce program up and running.
Part 2, “Hadoop in action,” consists of five chapters that teach the practical skills required to write and run data processing programs in Hadoop. In these chapters we explore various examples of using Hadoop to analyze a patent data set, including advanced algorithms such as the Bloom filter. We also cover programming and administration techniques that are particularly useful when working with Hadoop in production.
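To give a flavor of one such algorithm: a Bloom filter is a compact bit array that answers set-membership queries with possible false positives but no false negatives, which makes it useful for cheaply filtering a large data set before an expensive join. The following minimal Java sketch is our own generic illustration, not the implementation developed in the book:

import java.util.BitSet;

// A minimal, generic Bloom filter: add() sets k bit positions per key;
// mightContain() returns false only if the key was definitely never added.
public class SimpleBloomFilter {

    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int numHashes;  // k, the number of hash functions

    public SimpleBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Simulate k hash functions via double hashing on the key's hashCode.
    private int hash(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x5bd1e995;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(key, i));
        }
    }

    // True means "probably in the set"; false means "definitely not".
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }
}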
Part 3, “Hadoop gone wild,” consists of the final four chapters, which explore the larger ecosystem around Hadoop. Cloud services provide an alternative to buying and hosting your own hardware for a Hadoop cluster, and add-on packages provide higher-level programming abstractions over MapReduce. Finally, we look at several case studies where Hadoop solves real business problems in practice.
An appendix contains a listing of HDFS commands along with their descriptions and usage.
All source code in listings or in text is in a fixed-width font like this to separate it from ordinary text. Code annotations accompany many of the listings, highlighting important concepts. In some cases, numbered bullets link to explanations that follow the listing.
The code for the examples in this book is available for download from the publisher’s website at www.manning.com/HadoopinAction.