Contents


foreword
preface
acknowledgments
about this book
about the cover illustration
 
Chapter 1 Getting started taming text
Why taming text is important
Preview: A fact-based question answering system
Understanding text is hard
Text, tamed
Text and the intelligent app: search and beyond
Summary
Resources
Chapter 2 Foundations of taming text
Foundations of language
Common tools for text processing
Preprocessing and extracting content from common file formats
Summary
Resources
Chapter 3 Searching
Search and faceting example: Amazon.com
Introduction to search concepts
Introducing the Apache Solr search server
Indexing content with Apache Solr
Searching content with Apache Solr
Understanding search performance factors
Improving search performance
Search alternatives
Summary
Resources
Chapter 4 Fuzzy string matching
Approaches to fuzzy string matching
Finding fuzzy string matches
Building fuzzy string matching applications
Summary
Resources
Chapter 5 Identifying people, places, and things
Approaches to named-entity recognition
Basic entity identification with OpenNLP
In-depth entity identification with OpenNLP
Performance of OpenNLP
Customizing OpenNLP entity identification for a new domain
Summary
Further reading
Chapter 6 Clustering text
Google News document clustering
Clustering foundations
Setting up a simple clustering application
Clustering search results using Carrot2
Clustering document collections with Apache Mahout
Topic modeling using Apache Mahout
Examining clustering performance
Acknowledgments
Summary
References
Chapter 7 Classification, categorization, and tagging
Introduction to classification and categorization
The classification process
Building document categorizers using Apache Lucene
Training a naive Bayes classifier using Apache Mahout
Categorizing documents with OpenNLP
Building a tag recommender using Apache Solr
Summary
References
Chapter 8 Building an example question answering system
Basics of a question answering system
Installing and running the QA code
A sample question answering architecture
Understanding questions and producing answers
Steps to improve the system
Summary
Resources
Chapter 9 Untamed text: exploring the next frontier
Semantics, discourse, and pragmatics: exploring higher levels of NLP
Document and collection summarization
Relationship extraction
Identifying important content and people
Detecting emotions via sentiment analysis
Cross-language information retrieval
Summary
References

index