I first encountered Hadoop in the fall of 2008 when I was working on an internet crawl and analysis project at Verisign. My team was making discoveries similar to those that Doug Cutting and others at Nutch had made several years earlier regarding how to efficiently store and manage terabytes of crawled and analyzed data. At the time, we were getting by with our home-grown distributed system, but the influx of a new data stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system in the required timelines.

After some research we came across the Hadoop project, which seemed to be a perfect fit for our needs: it supported storing large volumes of data and provided a mechanism to combine them. Within a few months we’d built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with our own MapReduce workflow management system onto a small cluster of 18 nodes. It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course we couldn’t anticipate the amount of time that we’d spend debugging and performance-tuning our MapReduce jobs, not to mention the new roles we took on as production administrators. The biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production!

As our experience and comfort level with Hadoop grew, we continued to build more of our functionality using Hadoop to help with our scaling challenges. We also started to evangelize the use of Hadoop within our organization and helped kick-start other projects that were also facing big data challenges.

The greatest challenge we faced when working with Hadoop (and specifically MapReduce) was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, which is quite different from the in-JVM programming that we were accustomed to. The biggest hurdle was the first one: training our brains to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.

After you’re used to thinking in MapReduce, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in and out of HDFS, and effective and efficient ways to work with data in Hadoop. These areas of Hadoop haven’t received much coverage, and that’s what attracted me to the potential of this book: going beyond the fundamental word-count Hadoop usages and covering some of the trickier and dirtier aspects of Hadoop.

As I’m sure many authors have experienced, I went into this project confidently believing that writing this book was just a matter of transferring my experiences onto paper. Boy, did I get a reality check, but not altogether an unpleasant one, because writing introduced me to new approaches and tools that ultimately helped better my own Hadoop abilities. I hope that you get as much out of reading this book as I did writing it.