Taming Text
How to Find, Organize, and Manipulate It
Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris
Foreword by Liz Liddy
  • December 2012
  • ISBN 9781933988382
  • 320 pages
  • printed in black & white

Takes the mystery out of very complex processes.

From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About the book

There is so much text in our lives that we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built.

Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Table of Contents




about this book

about the cover illustration

1. Getting started taming text

1.1. Why taming text is important

1.2. Preview: A fact-based question answering system

1.3. Understanding text is hard

1.4. Text, tamed

1.5. Text and the intelligent app: search and beyond

1.6. Summary

1.7. Resources

2. Foundations of taming text

2.1. Foundations of language

2.2. Common tools for text processing

2.3. Preprocessing and extracting content from common file formats

2.4. Summary

2.5. Resources

3. Searching

3.1. Search and faceting example: Amazon.com

3.2. Introduction to search concepts

3.3. Introducing the Apache Solr search server

3.4. Indexing content with Apache Solr

3.5. Searching content with Apache Solr

3.6. Understanding search performance factors

3.7. Improving search performance

3.8. Search alternatives

3.9. Summary

3.10. Resources

4. Fuzzy string matching

4.1. Approaches to fuzzy string matching

4.2. Finding fuzzy string matches

4.3. Building fuzzy string matching applications

4.4. Summary

4.5. Resources

5. Identifying people, places, and things

5.1. Approaches to named-entity recognition

5.2. Basic entity identification with OpenNLP

5.3. In-depth entity identification with OpenNLP

5.4. Performance of OpenNLP

5.5. Customizing OpenNLP entity identification for a new domain

5.6. Summary

5.7. Further reading

6. Clustering text

6.1. Google News document clustering

6.2. Clustering foundations

6.3. Setting up a simple clustering application

6.4. Clustering search results using Carrot 2

6.5. Clustering document collections with Apache Mahout

6.6. Topic modeling using Apache Mahout

6.7. Examining clustering performance

6.8. Acknowledgments

6.9. Summary

6.10. References

7. Classification, categorization, and tagging

7.1. Introduction to classification and categorization

7.2. The classification process

7.3. Building document categorizers using Apache Lucene

7.4. Training a naive Bayes classifier using Apache Mahout

7.5. Categorizing documents with OpenNLP

7.6. Building a tag recommender using Apache Solr

7.7. Summary

7.8. References

8. Building an example question answering system

8.1. Basics of a question answering system

8.2. Installing and running the QA code

8.3. A sample question answering architecture

8.4. Understanding questions and producing answers

8.5. Steps to improve the system

8.6. Summary

8.7. Resources

9. Untamed text: exploring the next frontier

9.1. Semantics, discourse, and pragmatics: exploring higher levels of NLP

9.2. Document and collection summarization

9.3. Relationship extraction

9.4. Identifying important content and people

9.5. Detecting emotions via sentiment analysis

9.6. Cross-language information retrieval

9.7. Summary



What's inside

  • When to use text-taming techniques
  • Important open-source libraries like Solr and Mahout
  • How to build text-processing applications

About the authors

Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.
