Mahout in Action
Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman
  • October 2011
  • ISBN 9781935182689
  • 416 pages
  • printed in black & white

A hands-on discussion of machine learning with Mahout.

Isabel Drost, Cofounder Apache Mahout

Mahout in Action is a hands-on introduction to machine learning with Apache Mahout. Following real-world examples, the book presents practical use cases and then illustrates how Mahout can be applied to solve them. Includes a free audio- and video-enhanced ebook.

Table of Contents



about this book

about multimedia extras

about the cover illustration

Part 1 Recommendations

Chapter 2 Introducing recommenders

2.1. Defining recommendation

2.2. Running a first recommender engine

2.3. Evaluating a recommender

2.4. Evaluating precision and recall

2.5. Evaluating the GroupLens data set

2.6. Summary

Chapter 3 Representing recommender data

3.1. Representing preference data

3.2. In-memory DataModels

3.3. Coping without preference values

3.4. Summary

Chapter 4 Making recommendations

4.1. Understanding user-based recommendation

4.2. Exploring the user-based recommender

4.3. Exploring similarity metrics

4.4. Item-based recommendation

4.5. Slope-one recommender

4.6. New and experimental recommenders

4.7. Comparison to other recommenders

4.8. Summary

Chapter 5 Taking recommenders to production

5.1. Analyzing example data from a dating site

5.2. Finding an effective recommender

5.3. Injecting domain-specific information

5.4. Recommending to anonymous users

5.5. Creating a web-enabled recommender

5.6. Updating and monitoring the recommender

5.7. Summary

Chapter 6 Distributing recommendation computations

6.1. Analyzing the Wikipedia data set

6.2. Designing a distributed item-based algorithm

6.3. Implementing a distributed algorithm with MapReduce

6.4. Running MapReduces with Hadoop

6.5. Pseudo-distributing a recommender

6.6. Looking beyond first steps with recommendations

6.7. Summary

Part 2 Clustering

Chapter 7 Introduction to clustering

7.1. Clustering basics

7.2. Measuring the similarity of items

7.3. Hello World: running a simple clustering example

7.4. Exploring distance measures

7.5. Hello World again! Trying out various distance measures

7.6. Summary

Chapter 8 Representing data

8.1. Visualizing vectors

8.2. Representing text documents as vectors

8.3. Generating vectors from documents

8.4. Improving quality of vectors using normalization

8.5. Summary

Chapter 9 Clustering algorithms in Mahout

9.1. K-means clustering

9.2. Beyond k-means: an overview of clustering techniques

9.3. Fuzzy k-means clustering

9.4. Model-based clustering

9.5. Topic modeling using latent Dirichlet allocation (LDA)

9.6. Summary

Chapter 10 Evaluating and improving clustering quality

10.1. Inspecting clustering output

10.2. Analyzing clustering output

10.3. Improving clustering quality

10.4. Summary

Chapter 11 Taking clustering to production

11.1. Quick-start tutorial for running clustering on Hadoop

11.2. Tuning clustering performance

11.3. Batch and online clustering

11.4. Summary

Chapter 12 Real-world applications of clustering

12.1. Finding similar users on Twitter

12.2. Suggesting tags for artists on

12.3. Analyzing the Stack Overflow data set

12.4. Summary

Part 3 Classification

Chapter 13 Introduction to classification

13.1. Why use Mahout for classification?

13.2. The fundamentals of classification systems

13.3. How classification works

13.4. Work flow in a typical classification project

13.5. Step-by-step simple classification example

13.6. Summary

Chapter 14 Training a classifier

14.1. Extracting features to build a Mahout classifier

14.2. Preprocessing raw data into classifiable data

14.3. Converting classifiable data into vectors

14.4. Classifying the 20 newsgroups data set with SGD

14.5. Choosing an algorithm to train the classifier

14.6. Classifying the 20 newsgroups data with naive Bayes

14.7. Summary

Chapter 15 Evaluating and tuning a classifier

15.1. Classifier evaluation in Mahout

15.2. The classifier evaluation API

15.3. When classifiers go bad

15.4. Tuning for better performance

15.5. Summary

Chapter 16 Deploying a classifier

16.1. Process for deployment in huge systems

16.2. Determining scale and speed requirements

16.3. Building a training pipeline for large systems

16.4. Integrating a Mahout classifier

16.5. Example: a Thrift-based classification server

16.6. Summary

Chapter 17 Case study: Shop It To Me

17.1. Why Shop It To Me chose Mahout

17.2. General structure of the email marketing system

17.3. Training the model

17.4. Speeding up classification

17.5. Summary

Appendix A: JVM tuning

Appendix B: Mahout math

Appendix C: Resources


© 2014 Manning Publications Co.

About the Technology

A computer system that learns and adapts as it collects data can be really powerful. Mahout, Apache's open source machine learning project, captures the core algorithms of recommendation systems, classification, and clustering in ready-to-use, scalable libraries. With Mahout, you can immediately apply to your own projects the machine learning techniques that drive Amazon, Netflix, and others.

About the book

This book covers machine learning using Apache Mahout. Based on experience with real-world applications, it introduces practical use cases and illustrates how Mahout can be applied to solve them. It places particular focus on issues of scalability and how to apply these techniques against large data sets using the Apache Hadoop framework.

What's inside

  • Use group data to make individual recommendations
  • Find logical clusters within your data
  • Filter and refine with on-the-fly classification
  • Free audio and video extras

About the reader

This book is written for developers familiar with Java. No prior experience with Mahout is assumed.

About the authors

Sean Owen helped build Google's Mobile Web search and launched the Taste framework, now part of Mahout. Robin Anil contributed the Bayes classifier and frequent pattern mining implementations to Mahout. Ted Dunning contributed to the Mahout clustering, classification, and matrix decomposition algorithms. Ellen Friedman is an experienced writer with a doctorate in biochemistry.

combo $44.99 pBook + eBook
eBook $35.99 pdf + ePub + kindle

The writing makes a complex topic easy to understand.

Rick Wagner, Red Hat

Essential Mahout, authored by the core developer team.

Philipp K. Janert, Author of Gnuplot in Action

Dramatically reduces the learning curve.

David Grossman, Illinois Institute of Technology

Recommendations, clustering, and classification all lucidly explained.

John S. Griffin,