I’ve been fascinated by data for a long time. When I was an undergrad in electrical engineering, I discovered digital signal processing and gravitated toward it. I found out that music, video, photos, and lots of other stuff could be viewed as data. Computation was creating and enhancing those emotional experiences. I thought that was the coolest thing ever.
Over time, I continued to get excited by new aspects of data. The last few years had exposed me to social and big data. Big data was especially intellectually challenging for me. Previously I had learned to look at data from a statistician’s point of view, and new types of data had “only” asked for new mathematical methods. It wasn’t simple, but at least I had been trained for that, and there was also a wealth of resources to tap into. Big data, on the other hand, was about system-level innovations and new ways of programming. I wasn’t trained for it, and more importantly, I wasn’t alone. Knowledge about handling big data in practice was somewhat of a black art. This was true of many tools and techniques for scaling data processing, including caching (for example, memcached), replication, sharding, and, of course, MapReduce/Hadoop. I had spent the last few years getting up to speed on many of these skills.
Personally, I have found that the biggest hurdle to learning these techniques is in the middle of the learning curve. In the beginning it’s pretty easy to find introductory blogs and presentations teaching you how to do a “Hello World” example. And when you’re sufficiently well-versed, you’ll know how to ask additional questions on the mailing lists, meet experts at meetups or conferences, and even read the source code yourself. But there’s a huge knowledge gap in the middle, when your appetite is whetted but you don’t quite know what questions to ask next. This problem is especially acute for the newest technologies, like Hadoop. An organized exposition that starts with “Hello World” and takes you to the point where you can comfortably apply Hadoop in practice is needed. That’s what I intend this book to be. Fortunately I’ve found Manning’s In Action series to be consistent with this objective, and they have excellent editors who helped me along the way.
I had a fun time writing this book, and I hope this is the beginning of your wonderful journey with Hadoop.