Contents
preface
acknowledgments
about this book
about multimedia extras
about the cover illustration
Chapter 1 Meet Apache Mahout
- Mahout’s story
- Mahout’s machine learning themes
- Tackling large scale with Mahout and Hadoop
- Setting up Mahout
- Summary
Part 1 Recommendations
- Chapter 2 Introducing recommenders
- Defining recommendation
- Running a first recommender engine
- Evaluating a recommender
- Evaluating precision and recall
- Evaluating the GroupLens data set
- Summary
- Chapter 3 Representing recommender data
- Representing preference data
- In-memory DataModels
- Coping without preference values
- Summary
- Chapter 4 Making recommendations
- Understanding user-based recommendation
- Exploring the user-based recommender
- Exploring similarity metrics
- Item-based recommendation
- Slope-one recommender
- New and experimental recommenders
- Comparison to other recommenders
- Summary
- Chapter 5 Taking recommenders to production
- Analyzing example data from a dating site
- Finding an effective recommender
- Injecting domain-specific information
- Recommending to anonymous users
- Creating a web-enabled recommender
- Updating and monitoring the recommender
- Summary
- Chapter 6 Distributing recommendation computations
- Analyzing the Wikipedia data set
- Designing a distributed item-based algorithm
- Implementing a distributed algorithm with MapReduce
- Running MapReduces with Hadoop
- Pseudo-distributing a recommender
- Looking beyond first steps with recommendations
- Summary
Part 2 Clustering
- Chapter 7 Introduction to clustering
- Clustering basics
- Measuring the similarity of items
- Hello World: running a simple clustering example
- Exploring distance measures
- Hello World again! Trying out various distance measures
- Summary
- Chapter 8 Representing data
- Visualizing vectors
- Representing text documents as vectors
- Generating vectors from documents
- Improving quality of vectors using normalization
- Summary
- Chapter 9 Clustering algorithms in Mahout
- K-means clustering
- Beyond k-means: an overview of clustering techniques
- Fuzzy k-means clustering
- Model-based clustering
- Topic modeling using latent Dirichlet allocation (LDA)
- Summary
- Chapter 10 Evaluating and improving clustering quality
- Inspecting clustering output
- Analyzing clustering output
- Improving clustering quality
- Summary
- Chapter 11 Taking clustering to production
- Quick-start tutorial for running clustering on Hadoop
- Tuning clustering performance
- Batch and online clustering
- Summary
- Chapter 12 Real-world applications of clustering
- Finding similar users on Twitter
- Suggesting tags for artists on Last.fm
- Analyzing the Stack Overflow data set
- Summary
Part 3 Classification
- Chapter 13 Introduction to classification
- Why use Mahout for classification?
- The fundamentals of classification systems
- How classification works
- Work flow in a typical classification project
- Step-by-step simple classification example
- Summary
- Chapter 14 Training a classifier
- Extracting features to build a Mahout classifier
- Preprocessing raw data into classifiable data
- Converting classifiable data into vectors
- Classifying the 20 newsgroups data set with SGD
- Choosing an algorithm to train the classifier
- Classifying the 20 newsgroups data with naive Bayes
- Summary
- Chapter 15 Evaluating and tuning a classifier
- Classifier evaluation in Mahout
- The classifier evaluation API
- When classifiers go bad
- Tuning for better performance
- Summary
- Chapter 16 Deploying a classifier
- Process for deployment in huge systems
- Determining scale and speed requirements
- Building a training pipeline for large systems
- Integrating a Mahout classifier
- Example: a Thrift-based classification server
- Summary
- Chapter 17 Case study: Shop It To Me
- Why Shop It To Me chose Mahout
- General structure of the email marketing system
- Training the model
- Speeding up classification
- Summary
appendix A JVM tuning
appendix B Mahout math
appendix C Resources
index