Contents


preface
acknowledgments
about this book
about multimedia extras
about the cover illustration

Chapter 1 Meet Apache Mahout
Mahout’s story
Mahout’s machine learning themes
Tackling large scale with Mahout and Hadoop
Setting up Mahout
Summary

Part 1 Recommendations

Chapter 2 Introducing recommenders
Defining recommendation
Running a first recommender engine
Evaluating a recommender
Evaluating precision and recall
Evaluating the GroupLens data set
Summary
Chapter 3 Representing recommender data
Representing preference data
In-memory DataModels
Coping without preference values
Summary
Chapter 4 Making recommendations
Understanding user-based recommendation
Exploring the user-based recommender
Exploring similarity metrics
Item-based recommendation
Slope-one recommender
New and experimental recommenders
Comparison to other recommenders
Summary
Chapter 5 Taking recommenders to production
Analyzing example data from a dating site
Finding an effective recommender
Injecting domain-specific information
Recommending to anonymous users
Creating a web-enabled recommender
Updating and monitoring the recommender
Summary
Chapter 6 Distributing recommendation computations
Analyzing the Wikipedia data set
Designing a distributed item-based algorithm
Implementing a distributed algorithm with MapReduce
Running MapReduces with Hadoop
Pseudo-distributing a recommender
Looking beyond first steps with recommendations
Summary

Part 2 Clustering

Chapter 7 Introduction to clustering
Clustering basics
Measuring the similarity of items
Hello World: running a simple clustering example
Exploring distance measures
Hello World again! Trying out various distance measures
Summary
Chapter 8 Representing data
Visualizing vectors
Representing text documents as vectors
Generating vectors from documents
Improving quality of vectors using normalization
Summary
Chapter 9 Clustering algorithms in Mahout
K-means clustering
Beyond k-means: an overview of clustering techniques
Fuzzy k-means clustering
Model-based clustering
Topic modeling using latent Dirichlet allocation (LDA)
Summary
Chapter 10 Evaluating and improving clustering quality
Inspecting clustering output
Analyzing clustering output
Improving clustering quality
Summary
Chapter 11 Taking clustering to production
Quick-start tutorial for running clustering on Hadoop
Tuning clustering performance
Batch and online clustering
Summary
Chapter 12 Real-world applications of clustering
Finding similar users on Twitter
Suggesting tags for artists on Last.fm
Analyzing the Stack Overflow data set
Summary

Part 3 Classification

Chapter 13 Introduction to classification
Why use Mahout for classification?
The fundamentals of classification systems
How classification works
Work flow in a typical classification project
Step-by-step simple classification example
Summary
Chapter 14 Training a classifier
Extracting features to build a Mahout classifier
Preprocessing raw data into classifiable data
Converting classifiable data into vectors
Classifying the 20 newsgroups data set with SGD
Choosing an algorithm to train the classifier
Classifying the 20 newsgroups data with naive Bayes
Summary
Chapter 15 Evaluating and tuning a classifier
Classifier evaluation in Mahout
The classifier evaluation API
When classifiers go bad
Tuning for better performance
Summary
Chapter 16 Deploying a classifier
Process for deployment in huge systems
Determining scale and speed requirements
Building a training pipeline for large systems
Integrating a Mahout classifier
Example: a Thrift-based classification server
Summary
Chapter 17 Case study: Shop It To Me
Why Shop It To Me chose Mahout
General structure of the email marketing system
Training the model
Speeding up classification
Summary

appendix A JVM tuning
appendix B Mahout math
appendix C Resources
index