Preface

Life is full of serendipitous moments, few of which stand out for me (Grant) like the one that now defines my career. It was the late 90s, and I was a young software developer working on distributed electromagnetics simulations when I happened on an ad for a developer position at a small company in Syracuse, New York, called TextWise. Reading the description, I barely thought I was qualified for the job, but decided to take a chance anyway and sent in my resume. Somehow, I landed the job, and thus began my career in search and natural language processing. Little did I know that, all these years later, I would still be doing search and NLP, never mind writing a book on those subjects.

My first task back then was to work on a cross-language information retrieval (CLIR) system that allowed users to enter queries in English and find and automatically translate documents in French, Spanish, and Japanese. In retrospect, that first system I worked on touched on all the hard problems I’ve come to love about working with text: search, classification, information extraction, machine translation, and all those peculiar rules about languages that drive every grammar student crazy. After that first project, I’ve worked on a variety of search and NLP systems, ranging from rule-based classifiers to question answering (QA) systems. Then, in 2004, a new job at the Center for Natural Language Processing led me to the use of Apache Lucene, the de facto open source search library (these days, anyway). I once again found myself writing a CLIR system, this time to work with English and Arabic. Needing some Lucene features to complete my task, I started putting up patches for features and bug fixes. Sometime thereafter, I became a committer. From there, the floodgates opened. I got more involved in open source, starting the Apache Mahout machine learning project with Isabel Drost and Karl Wettin, as well as cofounding Lucid Imagination, a company built around search and text analytics with Apache Lucene and Solr.

Coming full circle, I think search and NLP are among the defining areas of computer science, requiring a sophisticated approach to both the data structures and algorithms necessary to solve problems. Add to that the scaling requirements of processing large volumes of user-generated web and social content, and you have a developer’s dream. This book addresses my view that the marketplace was missing (at the time) a book written for engineers by engineers and specifically geared toward using existing, proven, open source libraries to solve hard problems in text processing. I hope this book helps you solve everyday problems in your current job as well as inspires you to see the world of text as a rich opportunity for learning.

Grant Ingersoll

I (Tom) became fascinated with artificial intelligence as a sophomore in high school and as an undergraduate chose to go to graduate school and focus on natural language processing. At the University of Pennsylvania, I learned an incredible amount about text processing, machine learning, and algorithms and data structures in general. I also had the opportunity to work with some of the best minds in natural language processing and learn from them.

In the course of my graduate studies, I worked on a number of NLP systems and participated in numerous DARPA-funded evaluations on coreference, summarization, and question answering. In the course of this work, I became familiar with Lucene and the larger open source movement. I also noticed that there was a gap in open source text processing software that could provide efficient end-to-end processing. Using my thesis work as a basis, I contributed extensively to the OpenNLP project and also continued to learn about NLP systems while working on automated essay and short-answer scoring at Educational Testing Services.

Working in the open source community taught me a lot about working with others and made me a much better software engineer. Today, I work for Comcast Corporation with teams of software engineers that use many of the tools and techniques described in this book. It is my hope that this book will help bridge the gap between the hard work of researchers like the ones I learned from in graduate school and software engineers everywhere whose aim is to use text processing to solve real problems for real people.

Thomas Morton

Like Grant, I (Drew) was first introduced to the field of information retrieval and natural language processing by Dr. Elizabeth Liddy, Woojin Paik, and all of the others doing research at TextWise in the mid 90s. I started working with the group as I was finishing my master’s at the School of Information Studies (iSchool) at Syracuse University. At that time, TextWise was transitioning from a research group to a startup business developing applications based on the results of our text processing research. I stayed with the company for many years, constantly learning, discovering new things, and working with many outstanding people who came to tackle the challenges of teaching machines to understand language from many different perspectives.

Personally, I approach the subject of text analytics first from the perspective of a software developer. I’ve had the privilege of working with brilliant researchers and transforming their ideas from experiments to functioning prototypes to massively scalable systems. In the process, I’ve had the opportunity to do a great deal of what has recently become known as data science and discovered a deep love of exploring and understanding massive datasets and the tools and techniques for learning from them.

I cannot overstate the impact that open source software has had on my career. Readily available source code as a companion to research is an immensely effective way to learn new techniques and approaches to text analytics and software development in general. I salute everyone who has made the effort to share their knowledge and experience with others who have the passion to collaborate and learn. I specifically want to acknowledge the good folks at the Apache Software Foundation who continue to grow a vibrant ecosystem dedicated to the development of open source software and the people, process, and community that support it.

The tools and techniques presented in this book have strong roots in the open source software community. Lucene, Solr, Mahout, and OpenNLP all fall under the Apache umbrella. In this book, we only scratch the surface of what can be done with these tools. Our goal is to provide an understanding of the core concepts surrounding text processing and provide a solid foundation for future explorations of this domain.

Happy coding!

Drew Farris