About this Book

Taming Text is about building software applications that derive their core value from using and manipulating content that primarily consists of the written word. This book is not a theoretical treatise on the subjects of search, natural language processing, and machine learning, although we cover all of those topics in a fair amount of detail throughout the book. We strive to avoid jargon and complex math and instead focus on providing the concepts and examples that today’s software engineers, architects, and practitioners need in order to implement intelligent, next-generation, text-driven applications. Taming Text is also firmly grounded in providing real-world examples of the concepts described in the book using freely available, highly popular, open source tools like Apache Solr, Mahout, and OpenNLP.

Who should read this book

Is this book for you? Perhaps. Our target audience is software practitioners who don’t have (much of) a background in search, natural language processing, or machine learning. In fact, our book is aimed at practitioners in a work environment much like what we’ve seen in many companies: a development team is tasked with adding search and other features to a new or existing application, and few, if any, of the developers have any experience working with text. They need a good primer on the concepts without being bogged down by unnecessary detail.

In many cases, we provide references to easily accessible sources like Wikipedia and seminal academic papers, giving the reader a launching pad for exploring an area in greater detail if desired. Additionally, while most of our open source tools and examples are in Java, the concepts and ideas are portable to many other programming languages, so Rubyists, Pythonistas, and others should feel quite comfortable with the book as well.

This book is clearly not for those looking for explanations of the math involved in these systems or for academic rigor on the subject, although we do think students will find the book helpful when they need to implement the concepts described in the classroom and in more academically oriented books.

This book doesn’t target experienced field practitioners who have built many text-based applications in their careers, although they may find some interesting nuggets here and there on using the open source packages described in the book. More than one experienced practitioner has told us that the book is a great way to get team members who are new to the field up to speed on the ideas and code involved in writing a text-based application.

Ultimately, we hope this book is an up-to-date guide for the modern programmer, a guide that we all wish we had when we first started down our career paths in programming text-based applications.

Roadmap

Chapter 1 explains why processing text is important, and what makes it so challenging. We preview a fact-based question answering (QA) system, setting the stage for utilizing open source libraries to tame text.

Chapter 2 introduces the building blocks of text processing: tokenizing, chunking, parsing, and part-of-speech tagging. We follow up with a look at how to extract text from some common file formats using the Apache Tika open source project.

Chapter 3 explores search theory and the basics of the vector space model. We introduce the Apache Solr search server and show how to index content with it. You’ll learn how to evaluate the search performance factors of quantity and quality.

Chapter 4 examines fuzzy string matching with prefixes and n-grams. We look at two character-overlap measures, the Jaccard measure and the Jaro-Winkler distance, and explain how to find candidate matches with Solr and rank them.

Chapter 5 presents the basic concepts behind named-entity recognition. We show how to use OpenNLP to find named entities, and discuss some OpenNLP performance considerations. We also cover how to customize OpenNLP entity identification for a new domain.

Chapter 6 is devoted to clustering text. Here you’ll learn the basic concepts behind common text clustering algorithms, and see examples of how clustering can help improve text applications. We also explain how to cluster whole document collections using Apache Mahout, and how to cluster search results using Carrot2.

Chapter 7 discusses the basic concepts behind classification, categorization, and tagging. We show how categorization is used in text applications, and how to build, train, and evaluate classifiers using open source tools. We also use the Mahout implementation of the naive Bayes algorithm to build a document categorizer.

Chapter 8 is where we bring together all the things learned in the previous chapters to build an example QA system. This simple application uses Wikipedia as its knowledge base, and Solr as a baseline system.

Chapter 9 explores what’s next in search and NLP, and the roles of semantics, discourse, and pragmatics. We discuss searching across multiple languages and detecting emotions in content, as well as emerging tools, applications, and ideas.
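
To give a flavor of the kind of technique Chapter 4 covers, here is a minimal sketch of the Jaccard measure computed over character n-grams. The class name JaccardSketch, the method names, and the choice to treat two strings with no n-grams as identical are all assumptions made for this illustration; this is not the book's own listing, and Chapter 4 may compute the measure differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not taken from the book's source code.
public class JaccardSketch {

    // Collect the distinct character n-grams of a string.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard measure: size of the intersection divided by size of the union
    // of the two n-gram sets.
    static double jaccard(String a, String b, int n) {
        Set<String> gramsA = ngrams(a, n);
        Set<String> gramsB = ngrams(b, n);
        // Assumption: two strings with no n-grams at all count as identical.
        if (gramsA.isEmpty() && gramsB.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(gramsA);
        intersection.retainAll(gramsB);
        Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Compare two words using character bigrams (n = 2).
        System.out.println(jaccard("taming", "naming", 2));
    }
}
```

A score of 1.0 means the two strings share every n-gram and 0.0 means they share none, so a higher score suggests a closer match.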

Code conventions and downloads

This book contains numerous code examples. All the code is in a fixed-width font like this to separate it from ordinary text. Code members such as method names, class names, and so on are also in a fixed-width font.

In many listings, the code is annotated to point out key concepts, and numbered bullets are sometimes used in the text to provide additional information about the code.

Source code examples in this book are fairly close to the samples that you’ll find online. But for brevity’s sake, we may have removed material such as comments from the code to fit it well within the text.

The source code for the examples in the book is available for download from the publisher’s website at www.manning.com/TamingText.

Author Online

The purchase of Taming Text includes free access to a private web forum run by Manning Publications, where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum and subscribe to it, point your web browser at www.manning.com/TamingText. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and authors can take place. It’s not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions, lest their interest stray!

The Author Online forum and archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.