Hadoop in Action, Second Edition
Chuck P. Lam, Mark W. Davis, and Ajit Gaddam
  • MEAP began September 2014
  • Publication in July 2016 (estimated)
  • ISBN 9781617291227
  • 525 pages (estimated)
  • printed in black & white
We regret that Manning Publications will not be publishing this title.

Hadoop in Action, Second Edition, provides a comprehensive introduction to Hadoop and shows you how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show how Hadoop can be used in more complex data analysis tasks. You'll discover how YARN, new in Hadoop 2, simplifies and supercharges resource management to make streaming and real-time applications more feasible. The book also covers best practices and design patterns for MapReduce programming. It expands on the first edition by deepening coverage of important Hadoop 2 concepts and systems, and by adding new chapters on data management and data science that reinforce a practical understanding of Hadoop.
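To give a feel for the "MapReduce style" the book teaches, here is a minimal sketch of its canonical first example, word counting, written as plain Python functions rather than a real Hadoop job. The function names `map_phase` and `reduce_phase` are illustrative only, not Hadoop APIs; on a cluster the framework would shuffle the map output between the two phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) key/value pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Simulate a tiny input split of two lines.
lines = ["hadoop in action", "action speaks"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'hadoop': 1, 'in': 1, 'action': 2, 'speaks': 1}
```

The appeal of the model is that each phase sees only local data, so Hadoop can run many mappers and reducers in parallel across a cluster without the programmer writing any coordination code.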

Table of Contents

Part 1: Hadoop—A Distributed Programming Framework

1. Introducing Hadoop

1.1. Why "Hadoop in Action"?

1.2. What is Hadoop?

1.3. Understanding distributed systems and Hadoop

1.4. Comparing Hadoop with SQL and NoSQL databases

1.4.1. Scale-out instead of scale-up

1.4.2. Key/value pairs instead of relational tables

1.4.3. Functional programming (MapReduce) instead of declarative queries (SQL)

1.4.4. Offline batch processing instead of online transactions

1.4.5. NoSQL versus SQL

1.5. Understanding MapReduce

1.5.1. Scaling a simple program manually

1.5.2. Scaling the same program in MapReduce

1.6. Counting words with Hadoop—running your first program

1.7. History of Hadoop

1.8. The Hadoop Ecosystem

1.8.1. Apache ZooKeeper

1.8.2. YARN: Yet Another Resource Negotiator

1.8.3. Hive

1.8.4. Oozie

1.8.5. Avro

1.8.6. HBase

1.8.7. Pig

1.8.8. Flume

1.8.9. Solr

1.8.10. Impala

1.8.11. Sqoop

1.9. Big Data Workflows

1.10. Summary

1.11. Resources

2. Starting Hadoop

2.1. The building blocks of Hadoop

2.1.1. NameNode

2.1.2. DataNode

2.1.3. Secondary NameNode

2.1.4. ResourceManager

2.1.5. NodeManager

2.1.6. ApplicationMaster

2.2. Changes in Hadoop 2

2.3. Setting up SSH for a Hadoop cluster

2.3.1. Define a common account

2.3.2. Verify SSH installation

2.3.3. Generate SSH key pair

2.3.4. Distribute public key and validate logins

2.4. Running Hadoop

2.4.1. Local (standalone) mode

2.4.2. Pseudo-distributed mode

2.4.3. Fully distributed mode

2.5. Web-based cluster UI

2.6. Running Hadoop in the cloud

2.6.1. Introducing Amazon Web Services

2.6.2. Setting up AWS

2.6.3. Setting up EMR

2.6.4. Running MapReduce jobs on EMR

2.6.5. Shutting down AWS and EMR

2.7. Summary

3. Securing the Hadoop Platform

3.1. Hadoop Security Weaknesses

3.1.1. Top 10 Security and Privacy Challenges in Hadoop

3.1.2. Additional Security Weaknesses

3.2. Hadoop Threat Model

3.2.1. Challenges and Threats in Hadoop Security

3.2.2. Hadoop Threat Modeling

3.3. Hadoop Security Framework

3.3.1. Data Management

3.3.2. Data Discovery

3.3.3. Data Tagging

3.4. Identity & Access Management

3.4.1. Threat Modeling

3.4.2. Authentication

3.4.3. Authorization

3.4.4. Access Control

3.4.5. User Entitlement + Data Metering

3.4.6. RBAC Authorization

3.4.7. HDFS Security

3.4.8. ACLs and Security

3.4.9. LDAP for Hadoop

3.4.10. Kerberos and Hadoop

3.4.11. Getting and Installing Kerberos

3.5. Data Protection & Privacy

3.5.1. Application Level Cryptography (Tokenization, field-level encryption)

3.5.2. Transparent Encryption (disk / HDFS layer)

3.5.3. Data Masking / Data Redaction

3.6. Network Security

3.6.1. Threat Model

3.6.2. Data Protection In-Transit

3.6.3. Network Security Zoning

3.7. Infrastructure Security & Integrity

3.7.1. Threat Model Development

3.7.2. Logging / Audit

3.7.3. Security-Enhanced Linux (SELinux)

3.8. Summary

4. Components of Hadoop

4.1. Working with files in HDFS

4.1.1. Basic file commands

4.1.2. Reading and writing to HDFS programmatically

4.2. Anatomy of a MapReduce program

4.2.1. Hadoop data types

4.2.2. Mapper

4.2.3. Reducer

4.2.4. Partitioner—redirecting output from Mapper

4.2.5. Combiner—local reduce

4.2.6. Word counting with predefined mapper and reducer classes

4.3. Reading and writing

4.3.1. InputFormat

4.3.2. OutputFormat

4.4. Summary

Part 2: Hadoop in Action

5. Writing basic MapReduce programs

5.1. Getting the patent data set

5.1.1. The patent citation data

5.1.2. The patent description data

5.2. Constructing the basic template of a MapReduce program

5.2.1. MapReduce v1 and v2

5.3. Counting things

5.4. Streaming in Hadoop

5.4.1. Streaming with Unix commands

5.4.2. Streaming with scripts

5.4.3. Streaming with key/value pairs

5.4.4. Streaming with the Aggregate package

5.5. Improving performance with combiners

5.6. Exercising what you've learned

5.7. Summary

5.8. Further resources

6. Advanced MapReduce

7. Programming practices

7.1. Developing MapReduce programs

7.1.1. Local mode

7.1.2. Pseudo-distributed or Single Node Cluster mode

7.2. Monitoring and debugging on a production cluster

7.2.1. Counters

7.2.2. Skipping bad records

7.2.3. Rerunning failed tasks with IsolationRunner

7.3. Tuning for performance

7.3.1. Reducing network traffic with combiner

7.3.2. Reducing the amount of input data

7.3.3. Using compression

7.3.4. Reusing the JVM

7.3.5. Running with speculative execution

7.3.6. Refactoring code and rewriting algorithms

7.4. Summary

Part 3: Data Management with Hadoop

8. Data Security for Data Management

8.1. HDFS Security

8.2. ACLs and Security

8.3. LDAP for Hadoop

8.4. Kerberos and Hadoop

8.4.1. Getting and Installing Kerberos

8.5. Apache Knox

8.6. Apache Sentry

8.7. Overview

9. SQL meets Hadoop with Hive

10. Programming with Pig

11. HBase: The Hadoop Analytics Database

Part 4: Hadoop for Data Scientists & Cyber-Security Analytics

12. MapReduce programming for data science

13. Writing and Using YARN Applications

14. Collaborative filtering using Spark

15. Accelerating SQL analytics with Tez

16. Machine Learning on Hadoop

17. Data Exfiltration Analysis for Cyber Security

Appendixes

Appendix A: HDFS file commands

Appendix B: HiveQL

Appendix C: Pig Latin

Appendix D: HBase Commands

About the Technology

The massive datasets required for most modern businesses are too large to safely store and efficiently process on a single server. Hadoop is an open source data processing framework that provides a distributed file system so you can manage data stored across clusters of servers and implements the MapReduce data processing model so you can effectively query and utilize big data. The new Hadoop 2.0 is a stable, enterprise-ready platform supported by a rich ecosystem of tools and related technologies such as Pig, Hive, YARN, Spark, Tez, and many more.

What's inside

  • Introduction to MapReduce
  • Examples illustrating ideas in practice
  • Hadoop's Streaming API
  • Data science using Hadoop
  • Related tools and technologies, such as Pig, Hive, YARN, Spark, and Tez

About the reader

This book requires basic Java skills. Knowing basic statistical concepts can help with the more advanced examples.

About the authors

Chuck Lam and Mark Davis have been working with Hadoop since its earliest days. Chuck is a serial startup veteran and the original author of Hadoop in Action. Mark founded the Hadoop analytics company Kitenga and is now a Distinguished Big Data Analytics Engineer for Dell and the Big Data Lead for the IEEE Cloud Computing Initiative.

Ajit Gaddam is a technologist, serial entrepreneur, and an information security expert. He is a frequent speaker at high-profile conferences and is an active participant in various open source and security architecture standards bodies.


Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.