about this book

Doug Cutting, Hadoop’s creator, likes to call Hadoop the kernel for big data, and I’d tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop, to me, provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data, and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisticated uses, such as using Hadoop for data warehousing (exemplified by Facebook) and the field of data science, which studies and makes new discoveries about data.

This book collects a number of intermediate and advanced Hadoop examples and presents them in a problem/solution format. Each of the 85 techniques addresses a specific task you’ll face, like using Flume to move log files into Hadoop or using Mahout for predictive analytics. Each problem is explored step by step and, as you work through them, you’ll find yourself growing more comfortable with Hadoop and at home in the world of big data.

This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.

Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition, by Joshua Bloch (Addison-Wesley, 2008).

Roadmap

This book has 13 chapters divided into five parts.

Part 1 contains a single chapter that’s the introduction to this book. It reviews Hadoop basics and looks at how to get Hadoop up and running on a single host. It wraps up with a walk-through on how to write and execute a MapReduce job.

Part 2, “Data logistics,” consists of two chapters that cover the data fundamentals: getting data in and out of Hadoop, and working with various data formats. Getting data into Hadoop is one of the first roadblocks commonly encountered when working with Hadoop, and chapter 2 is dedicated to looking at a variety of tools that work with common enterprise data sources. Chapter 3 covers how to work with ubiquitous data formats such as XML and JSON in MapReduce, before going on to look at data formats better suited to working with big data.

Part 3, “Big data patterns,” looks at techniques to help you work effectively with large volumes of data. Chapter 4 examines how to optimize MapReduce join and sort operations, and chapter 5 covers working with large numbers of small files, as well as compression. Chapter 6 looks at how to debug MapReduce performance issues, and also covers a number of techniques to help make your jobs run faster.

Part 4 is all about “Data science,” and delves into the tools and methods that help you make sense of your data. Chapter 7 covers how to represent data such as graphs for use with MapReduce, and looks at several algorithms that operate on graph data. Chapter 8 describes how R, a popular statistical and data mining platform, can be integrated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction with MapReduce for massively scalable predictive analytics.

Part 5 is titled “Taming the elephant,” and examines a number of technologies that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig respectively, both of which are MapReduce domain-specific languages (DSLs) geared toward providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers techniques for writing unit tests and debugging MapReduce problems.

The appendixes start with appendix A, which contains instructions for installing both Hadoop and all the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2 leverage. Appendix C looks at how HDFS supports reads and writes, and appendix D covers a couple of MapReduce join frameworks written by the author and utilized in chapter 4.

Code conventions and downloads

All source code in listings or in text is in a fixed-width font like this to separate it from ordinary text. Code annotations accompany many of the listings, highlighting important concepts.

All of the text and examples in this book work with Hadoop 0.20.x (and 1.x), and most of the code is written using the newer org.apache.hadoop.mapreduce MapReduce APIs. The few examples that leverage the older org.apache.hadoop.mapred package are usually the result of working with a third-party library or a utility that only works with the old API.
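To make the distinction concrete, here’s a minimal sketch of a mapper written against the newer API; the class name and logic are my own illustration, not one of the book’s examples:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the newer org.apache.hadoop.mapreduce API you extend the Mapper
// class and emit output through a Context object; the older
// org.apache.hadoop.mapred API instead has you implement an interface
// and write to an OutputCollector.
public class LineLengthMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Emit each input line as the key, with its length as the value.
    context.write(line, new IntWritable(line.getLength()));
  }
}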

All of the code used in this book is available on GitHub at https://github.com/alexholmes/hadoop-book as well as from the publisher’s website at www.manning.com/HadoopinPractice.

Building the code depends on Java 1.6 or newer, git, and Maven 3.0 or newer. Git is a source control management system, and GitHub provides hosted git repository services. Maven is used as the build system.

You can clone (download) my GitHub repository with the following command:

$ git clone git://github.com/alexholmes/hadoop-book.git

After the sources are downloaded you can build the code:

$ cd hadoop-book
$ mvn package

This will create a Java JAR file, target/hadoop-book-1.0.0-SNAPSHOT-jar-with-dependencies.jar. Running the code is equally simple with the included bin/run.sh.

If you’re running on a CDH distribution, the scripts will run configuration-free. If you’re running on any other distribution, you’ll need to set the HADOOP_HOME environment variable to point to your Hadoop installation directory.

The bin/run.sh script takes as the first argument the fully qualified Java class name of the example, followed by any arguments expected by the example class. As an example, to run the inverted index MapReduce code from chapter 1, you’d run the following:

$ hadoop fs -mkdir /tmp
$ hadoop fs -put test-data/ch1/* /tmp/

# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
export HADOOP_HOME=/usr/local/hadoop

$ bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce \
  /tmp/file1.txt /tmp/file2.txt output

The previous code won’t work if you don’t have Hadoop installed. Please refer to chapter 1 for CDH installation instructions, or appendix A for Apache installation instructions.

Third-party libraries

I use a number of third-party libraries for the sake of convenience. They’re included in the Maven-built JAR, so no extra work is required to use them. The following table lists the libraries that are in prevalent use throughout the code examples.

Common third-party libraries

Library: Apache Commons IO
Link: http://commons.apache.org/io/
Details: Helper functions for working with input and output streams in Java. You’ll make frequent use of the IOUtils class to close connections and to read the contents of files into strings.

Library: Apache Commons Lang
Link: http://commons.apache.org/lang/
Details: Helper functions for working with strings, dates, and collections. You’ll make frequent use of the StringUtils class for tokenization.
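As a quick sketch of how these two libraries typically show up in the code (the file path here is hypothetical; substitute any local text file):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;

public class ThirdPartyLibsExample {
  public static void main(String[] args) throws Exception {
    // The path here is hypothetical, not one of the book's data files.
    InputStream in = new FileInputStream("/tmp/file1.txt");
    try {
      // Commons IO: slurp an entire stream into a String.
      String contents = IOUtils.toString(in);
      // Commons Lang: tokenize on whitespace, ignoring repeated separators.
      for (String token : StringUtils.split(contents)) {
        System.out.println(token);
      }
    } finally {
      // Commons IO: close the stream without throwing on failure.
      IOUtils.closeQuietly(in);
    }
  }
}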

Datasets

Throughout this book you’ll work with three datasets to provide some variety for the examples. All the datasets are small to make them easy to work with. Copies of the exact data used are available in the GitHub repository under https://github.com/alexholmes/hadoop-book/tree/master/test-data. Data that’s specific to a single chapter exists in chapter-specific subdirectories under the same GitHub location.

NASDAQ FINANCIAL STOCKS

I downloaded the NASDAQ daily exchange data from Infochimps (see http://mng.bz/xjwc). I filtered this huge dataset down to just five stocks and their start-of-year values from 2000 through 2009. The data used for this book is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/stocks.txt.

The data is in CSV form, and the fields are in the following order:

Symbol,Date,Open,High,Low,Close,Volume,Adj Close
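For example, here’s a minimal sketch of parsing one record in this field order with plain Java; the sample line is invented to match the documented layout, not taken from stocks.txt:

public class StockLineParser {
  public static void main(String[] args) {
    // An invented line in the documented order:
    // Symbol,Date,Open,High,Low,Close,Volume,Adj Close
    String line = "AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84";
    String[] fields = line.split(",");
    String symbol = fields[0];
    String date = fields[1];
    double close = Double.parseDouble(fields[5]);
    long volume = Long.parseLong(fields[6]);
    System.out.println(symbol + " closed at " + close
        + " on " + date + " (volume " + volume + ")");
  }
}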

APACHE LOG DATA

I created a sample log file in Apache Common Log Format (see http://mng.bz/L4S3) with some fake Class E IP addresses and some dummy resources and response codes. The file is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/apachelog.txt.
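As a hedged sketch of what parsing this format looks like, the following regular expression splits a Common Log Format line into its fields; the sample line is invented (using a fake Class E address, like the book’s data), and the pattern is a common idiom rather than code from the book:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommonLogParser {
  // Common Log Format: host identd user [timestamp] "request" status bytes
  private static final Pattern CLF = Pattern.compile(
      "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

  public static void main(String[] args) {
    // An invented line with a fake Class E IP address.
    String line = "240.12.0.2 - - [10/Oct/2000:13:55:36 -0700] "
        + "\"GET /index.html HTTP/1.0\" 200 2326";
    Matcher m = CLF.matcher(line);
    if (m.matches()) {
      System.out.println("host=" + m.group(1)
          + " request=" + m.group(5) + " status=" + m.group(6));
    }
  }
}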

NAMES

Names were retrieved from U.S. government census data (see http://mng.bz/LuFB), and the file is available at https://github.com/alexholmes/hadoop-book/blob/master/test-data/names.txt.

Getting help

You’ll no doubt have questions when working with Hadoop. Luckily, between the wikis and a vibrant user community, your needs should be well covered.

The main wiki is located at http://wiki.apache.org/hadoop/, and contains useful presentations, setup instructions, and troubleshooting instructions.

The Hadoop Common, HDFS, and MapReduce mailing lists can all be found on http://hadoop.apache.org/mailing_lists.html.

Search Hadoop is a useful website that indexes all of Hadoop and its ecosystem projects, and it provides full-text search capabilities: http://search-hadoop.com/.

You’ll also find many useful blogs worth subscribing to in order to keep on top of current events in Hadoop.

There are a plethora of active Hadoop Twitter users who you may want to follow, including Arun Murthy (@acmurthy), Tom White (@tom_e_white), Eric Sammer (@esammer), Doug Cutting (@cutting), and Todd Lipcon (@tlipcon). The Hadoop project itself tweets on @hadoop.

Author Online

Purchase of Hadoop in Practice includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to the forum, point your web browser to www.manning.com/HadoopinPractice or www.manning.com/holmes/. These pages provide information on how to get on the forum after you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It’s not a commitment to any specific amount of participation on the part of the author, whose contribution to the book’s forum remains voluntary (and unpaid). We suggest you try asking him some challenging questions, lest his interest stray!

The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

About the author

Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last four years he has gained expertise in Hadoop, solving big data problems across a number of projects. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.

Alex maintains a Hadoop-related blog at http://grepalex.com, and is on Twitter at https://twitter.com/grep_alex.

About the cover illustration

The figure on the cover of Hadoop in Practice is captioned “A young man from Kistanja, Dalmatia.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained from a helpful librarian at the Ethnographic Museum in Split, itself situated in the Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s retirement palace from around AD 304. The book includes finely colored illustrations of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.

Kistanja is a small town located in Bukovica, a geographical region in Croatia. It is situated in northern Dalmatia, an area rich in Roman and Venetian history. The word momak in Croatian means a bachelor, beau, or suitor—a single young man who is of courting age—and the young man on the cover, looking dapper in a crisp, white linen shirt and a colorful, embroidered vest, is clearly dressed in his finest clothes, which would be worn to church and for festive occasions—or to go calling on a young lady.

Dress codes and lifestyles have changed over the last 200 years, and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone of different hamlets or towns separated by only a few miles. Perhaps we have traded cultural diversity for a more varied personal life, certainly for a more varied and fast-paced technological life.

Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by illustrations from old books and collections like this one.