About this Book

Lucene in Action delivers details, best practices, caveats, tips, and tricks for using the best open-source Java search engine available.

This book assumes the reader is familiar with basic Java programming. Lucene itself is a single Java Archive (JAR) file and integrates into the simplest Java stand-alone console program as well as the most sophisticated enterprise application.

Roadmap

We organized part 1 of this book to cover the core Lucene Application Programming Interface (API) in the order you’re likely to encounter it as you integrate Lucene into your applications.

Part 2 goes beyond Lucene’s built-in facilities and shows you what can be done around and above Lucene.

Who should read this book?

Developers who need powerful search capabilities embedded in their applications should read this book. Lucene in Action is also suitable for developers who are curious about Lucene or indexing and search techniques, but who may not have an immediate need to use it. Adding Lucene know-how to your toolbox is valuable for future projects—search is a hot topic and will continue to be in the future.

This book primarily uses the Java version of Lucene (from Apache Jakarta), and the majority of the code examples use the Java language. Readers familiar with Java will be right at home. Java expertise is helpful but not required: Lucene has been ported to a number of other languages, including C++, C#, Python, and Perl, and the concepts, techniques, and even the API itself are comparable between the Java and other language versions of Lucene.

Code examples

The source code for this book is available from Manning’s website at http://www.manning.com/hatcher2. Instructions for using this code are provided in the README file included with the source-code package.

The majority of the code shown in this book was written by us and is included in the source-code package. Some code (particularly the case-study code) isn’t provided in our source-code package; the code snippets shown there are owned by the contributors and are donated as is. In a couple of cases, we have included a small snippet of code from Lucene’s codebase, which is licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

Code examples don’t include package and import statements, to conserve space; refer to the actual source code for these details.

Why JUnit?

We believe code examples in books should be top-notch quality and real-world applicable. The typical “hello world” examples often insult our intelligence and generally do little to help readers see how to adapt the code to their own environment.

We’ve taken a unique approach to the code examples in Lucene in Action. Many of our examples are actual JUnit test cases (http://www.junit.org). JUnit, the de facto Java unit-testing framework, easily allows code to assert that a particular assumption works as expected in a repeatable fashion. Automating JUnit test cases through an IDE or Ant allows one-step (or no steps with continuous integration) confidence building. We chose to use JUnit in this book because we use it daily in our other projects and want you to see how we really code. Test Driven Development (TDD) is a development practice we strongly espouse.

If you’re unfamiliar with JUnit, please read the following primer. We also suggest that you read Pragmatic Unit Testing in Java with JUnit by Dave Thomas and Andy Hunt, followed by Manning’s JUnit in Action by Vincent Massol and Ted Husted.

JUnit primer

This section is a quick and admittedly incomplete introduction to JUnit. We’ll provide the basics needed to understand our code examples. First, our JUnit test cases extend junit.framework.TestCase, many of them indirectly through our custom LiaTestCase base class. Our concrete test classes adhere to a naming convention: we suffix class names with Test. For example, our QueryParser tests are in QueryParserTest.java.

JUnit runners automatically execute all methods with the signature public void testXXX(), where XXX is an arbitrary but meaningful name. JUnit test methods should be concise and clear, keeping good software design in mind (such as not repeating yourself, creating reusable functionality, and so on).
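To make the testXXX convention concrete, here is a minimal sketch of how a runner could discover and execute such methods through reflection. It is a simplified illustration, not JUnit's actual implementation; MiniRunner and SampleTest are hypothetical names we made up for this sketch, and it deliberately ignores setUp, tearDown, and failure reporting.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sample class: two test methods and one helper that must be skipped.
class SampleTest {
  static int executed = 0;

  public void testAddition() { executed++; }

  public void testConcatenation() { executed++; }

  public void helper() { executed += 100; }  // name doesn't start with "test"
}

// Simplified stand-in for a JUnit runner (an assumption, not JUnit source code).
class MiniRunner {

  // Runs every public, no-argument, void method whose name starts with "test"
  // and returns the names of the methods it executed.
  static List<String> run(Class<?> testClass) {
    List<String> names = new ArrayList<String>();
    try {
      for (Method m : testClass.getMethods()) {
        boolean isTest = m.getName().startsWith("test")
            && Modifier.isPublic(m.getModifiers())
            && m.getReturnType() == void.class
            && m.getParameterTypes().length == 0;
        if (isTest) {
          // A fresh instance per test keeps test methods independent.
          Object instance = testClass.getDeclaredConstructor().newInstance();
          m.invoke(instance);
          names.add(m.getName());
        }
      }
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
    return names;
  }
}
```

Calling MiniRunner.run(SampleTest.class) executes testAddition and testConcatenation (in no guaranteed order) and leaves helper untouched, which is the behavior the naming convention relies on.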

Assertions

JUnit is built around a set of assert statements, freeing you to code tests clearly and letting the JUnit framework handle failed assumptions and reporting the details. The most frequently used assert statement is assertEquals; there are a number of overloaded variants of the assertEquals method signature for various data types. An example test method looks like this:

public void testExample() {
 SomeObject obj = new SomeObject();
 assertEquals(10, obj.someMethod());
}

The assert methods throw an AssertionFailedError, which the JUnit framework catches and reports as a test failure, if the expected value (10, in this example) isn’t equal to the actual value (the result of calling someMethod on obj, in this example). Besides assertEquals, there are several other assert methods for convenience. We also use assertTrue(expression), assertFalse(expression), and assertNull(expression) statements. These test whether the expression is true, false, and null, respectively.

The assert statements have overloaded signatures that take an additional String parameter as the first argument. This String argument is used entirely for reporting purposes, giving the developer more information when a test fails. We use this String message argument to be more descriptive (or sometimes comical).
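The following sketch shows the mechanism behind an assert method and its message-carrying overload. MiniAssert is a hypothetical stand-in that we wrote so the example compiles without the JUnit JAR; it is not JUnit's source code, and it throws the JDK's AssertionError rather than JUnit's own error class. The failure text imitates JUnit's "expected:<…> but was:<…>" style.

```java
// Simplified stand-in for JUnit's assert methods (hypothetical, not JUnit source code).
final class MiniAssert {

  private MiniAssert() {}

  // Message-first overload: the message is used only in the failure report.
  static void assertEquals(String message, int expected, int actual) {
    if (expected != actual) {
      String prefix = (message == null) ? "" : message + " ";
      throw new AssertionError(
          prefix + "expected:<" + expected + "> but was:<" + actual + ">");
    }
  }

  // Overload without a message delegates to the message-first variant.
  static void assertEquals(int expected, int actual) {
    assertEquals(null, expected, actual);
  }

  // Fails with the given message when the condition is false.
  static void assertTrue(String message, boolean condition) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }
}
```

With this sketch, MiniAssert.assertEquals("item count", 10, 7) fails with the message "item count expected:<10> but was:<7>", whereas MiniAssert.assertEquals(10, 10) returns silently.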

By coding our assumptions and expectations in JUnit test cases in this manner, we free ourselves from the complexity of the large systems we build and can focus on fewer details at a time. With a critical mass of test cases in place, we can remain confident and agile. This confidence comes from knowing that changing code, such as optimizing algorithms, won’t break other parts of the system, because if it did, our automated test suite would let us know long before the code made it to production. Agility comes from being able to keep the codebase clean through refactoring. Refactoring is the art (or is it a science?) of changing the internal structure of the code so that it accommodates evolving requirements without affecting the external interface of a system.

JUnit in context

Let’s take what we’ve said so far about JUnit and frame it within the context of this book. JUnit test cases ultimately extend from junit.framework.TestCase, and test methods have the public void testXXX() signature. One of our test cases (from chapter 3) is shown here:

public class BasicSearchingTest extends LiaTestCase {

 public void testTerm() throws Exception {
  // directory is inherited from LiaTestCase
  IndexSearcher searcher = new IndexSearcher(directory);
  Term t = new Term("subject", "ant");
  Query query = new TermQuery(t);
  Hits hits = searcher.search(query);
  // the first argument is the message reported if the assert fails
  assertEquals("JDwA", 1, hits.length());
  t = new Term("subject", "junit");
  hits = searcher.search(new TermQuery(t));
  assertEquals(2, hits.length());
  searcher.close();
 }
}

Of course, we’ll explain the Lucene API used in this test case later. Here we’ll focus on the JUnit details. A variable used in testTerm, directory, isn’t defined in this class. JUnit provides an initialization hook that executes prior to every test method; this hook is the setUp method, which TestCase declares as protected void setUp(). Our LiaTestCase base class implements setUp in this manner:

public abstract class LiaTestCase extends TestCase {
 private String indexDir = System.getProperty("index.dir");
 protected Directory directory;
 protected void setUp() throws Exception {
  directory = FSDirectory.getDirectory(indexDir, false);
 }
}

If our first assert in testTerm fails, we see an exception like this:

junit.framework.AssertionFailedError: JDwA expected:<1> but was:<0>
    at lia.searching.BasicSearchingTest.testTerm(BasicSearchingTest.java:20)

This failure indicates that our test data differs from what we expect.

Testing Lucene

The majority of the tests in this book test Lucene itself. In practice, is this realistic? Isn’t the idea to write test cases that test our own code, not the libraries themselves? There is an interesting twist to Test Driven Development used for learning an API: Test Driven Learning. It’s immensely helpful to write tests directly to a new API in order to learn how it works and what you can expect from it. This is precisely what we’ve done in most of our code examples, so that tests are testing Lucene itself. Don’t throw these learning tests away, though. Keep them around to ensure your expectations of the API hold true when you upgrade to a new version of the API, and refactor them when the inevitable API change is made.
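As a small illustration of a learning test, the sketch below pins down a behavior of an API we did not write. The learning tests in this book target Lucene; to keep this sketch self-contained, we assume a different subject, the JDK's String.split method, and we use a tiny check helper in place of JUnit's assert methods. The class and method names are hypothetical.

```java
// Learning tests for java.lang.String.split (hypothetical example class).
// In a real project this class would extend TestCase and use assertEquals.
class StringSplitLearningTest {

  // What we learned: with no limit argument, trailing empty strings are dropped.
  public void testTrailingEmptyStringsAreDropped() {
    String[] parts = "a,b,,".split(",");
    check(parts.length == 2, "default split should drop trailing empty strings");
    check(parts[0].equals("a") && parts[1].equals("b"), "unexpected tokens");
  }

  // What we learned: a negative limit keeps the trailing empty strings.
  public void testNegativeLimitKeepsTrailingEmptyStrings() {
    String[] parts = "a,b,,".split(",", -1);
    check(parts.length == 4, "negative limit should keep trailing empty strings");
    check(parts[2].equals("") && parts[3].equals(""), "expected empty tokens");
  }

  // Minimal stand-in for a JUnit assert: fail loudly with a message.
  private static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }
}
```

If a later version of the API changed either behavior, rerunning these tests would flag it immediately, which is the reason to keep learning tests around.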

Mock objects

In a couple of cases, we use mock objects for testing purposes. Mock objects are used as probes sent into real business logic in order to assert that the business logic is working properly. For example, in chapter 4, we have a SynonymEngine interface (see section 4.6). The real business logic that uses this interface is an analyzer. When we want to test the analyzer itself, it’s unimportant what type of SynonymEngine is used, but we want to use one that has well-defined and predictable behavior. We created a MockSynonymEngine, allowing us to reliably and predictably test our analyzer. Mock objects help simplify test cases so that they test only a single facet of a system at a time, rather than having intertwined dependencies that make it hard to work out what really went wrong when a test fails. A nice side effect of using mock objects is the design changes they lead us to, such as separation of concerns and designing to interfaces instead of direct concrete implementations.
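The mock-object idea can be sketched in a few lines. Everything below is an assumption made for illustration: the getSynonyms signature is our guess at the shape of such an interface, the mock's canned data is invented, and SynonymExpander is a hypothetical stand-in for the analyzer so that the sketch needs no Lucene classes. The real code in chapter 4 may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shape of the interface; the signature is illustrative only.
interface SynonymEngine {
  String[] getSynonyms(String word);
}

// Mock implementation: fixed, predictable answers and no external resources.
class MockSynonymEngine implements SynonymEngine {
  private final Map<String, String[]> entries = new HashMap<String, String[]>();

  MockSynonymEngine() {
    entries.put("quick", new String[] {"fast", "speedy"});
  }

  public String[] getSynonyms(String word) {
    String[] found = entries.get(word);
    return (found == null) ? new String[0] : found;
  }
}

// Hypothetical stand-in for the analyzer: the business logic under test.
class SynonymExpander {
  private final SynonymEngine engine;

  SynonymExpander(SynonymEngine engine) {
    this.engine = engine;
  }

  // Emits each word followed by whatever synonyms the engine supplies.
  List<String> expand(String text) {
    List<String> tokens = new ArrayList<String>();
    for (String word : text.split(" ")) {
      tokens.add(word);
      for (String synonym : engine.getSynonyms(word)) {
        tokens.add(synonym);
      }
    }
    return tokens;
  }
}
```

Because SynonymExpander depends only on the interface, a test can pass in the mock and assert that expand("quick fox") yields quick, fast, speedy, fox without caring how a production engine finds its synonyms.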

Our test data

Most of our book revolves around a common set of example data to provide consistency and avoid having to grok an entirely new set of data for each section. This example data consists of book details. Table 1 shows the data so that you can reference it and make sense of our examples.

Table 1 Sample data used throughout this book

| Title / Author | Category | Subject |
|---|---|---|
| A Modern Art of Education / Rudolf Steiner | /education/pedagogy | education philosophy psychology practice Waldorf |
| Imperial Secrets of Health and Longevity / Bob Flaws | /health/alternative/Chinese | diet chinese medicine qi gong health herbs |
| Tao Te Ching / Stephen Mitchell | /philosophy/eastern | taoism |
| Gödel, Escher, Bach: an Eternal Golden Braid / Douglas Hofstadter | /technology/computers/ai | artificial intelligence number theory mathematics music |
| Mindstorms / Seymour Papert | /technology/computers/programming/education | children computers powerful ideas LOGO education |
| Java Development with Ant / Erik Hatcher, Steve Loughran | /technology/computers/programming | apache jakarta ant build tool junit java development |
| JUnit in Action / Vincent Massol, Ted Husted | /technology/computers/programming | junit unit testing mock objects |
| Lucene in Action / Otis Gospodnetić, Erik Hatcher | /technology/computers/programming | lucene search |
| Extreme Programming Explained / Kent Beck | /technology/computers/programming/methodology | extreme programming agile test driven development methodology |
| Tapestry in Action / Howard Lewis-Ship | /technology/computers/programming | tapestry web user interface components |
| The Pragmatic Programmer / Dave Thomas, Andy Hunt | /technology/computers/programming | pragmatic agile methodology developer tools |

The data, besides the fields shown in the table, includes fields for ISBN, URL, and publication month. The category and subject fields hold our own subjective values, but the other information about the books is factual.

Code conventions and downloads

Source code in listings or in text is in a fixed width font to separate it from ordinary text. Java method names, within text, generally won’t include the full method signature.

In order to accommodate the available page space, code has been formatted with a limited width, including line continuation markers where appropriate.

We don’t include import statements and rarely refer to fully qualified class names; they get in the way and take up valuable space. Refer to Lucene’s Javadocs for this information. All decent IDEs have excellent support for automatically adding import statements; Erik blissfully codes without knowing fully qualified class names using IntelliJ IDEA, and Otis does the same with XEmacs. Add the Lucene JAR to your project’s classpath, and you’re all set. Also on the classpath issue (which is a notorious nuisance), we assume that the Lucene JAR and any other necessary JARs are available in the classpath and don’t show it explicitly.

We’ve created a lot of examples for this book that are freely available to you. A .zip file of all the code is available from Manning’s web site for Lucene in Action: http://www.manning.com/hatcher2. Detailed instructions on running the sample code are provided in the main directory of the expanded archive as a README file.

Author online

The purchase of Lucene in Action includes free access to a private web forum run by Manning Publications, where you can discuss the book with the authors and other readers. To access the forum and subscribe to it, point your web browser to http://www.manning.com/hatcher2. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

About the authors

Erik Hatcher codes, writes, and speaks on technical topics that he finds fun and challenging. He has written software for a number of diverse industries using many different technologies and languages. Erik coauthored Java Development with Ant (Manning, 2002) with Steve Loughran, a book that has received wonderful industry acclaim. Since the release of Erik’s first book, he has spoken at numerous venues including the No Fluff, Just Stuff symposium circuit, JavaOne, O’Reilly’s Open Source Convention, the Open Source Content Management Conference, and many Java User Group meetings. As an Apache Software Foundation member, he is an active contributor and committer on several Apache projects including Lucene, Ant, and Tapestry. Erik currently works at the University of Virginia's Humanities department supporting Applied Research in Patacriticism. He lives in Charlottesville, Virginia with his beautiful wife, Carole, and two astounding sons, Ethan and Jakob.

Otis Gospodnetić has been an active Lucene developer for four years and maintains the jGuru Lucene FAQ. He is a Software Engineer at Wireless Generation, a company that develops technology solutions for educational assessments of students and teachers. In his spare time, he develops Simpy, a Personal Web service that uses Lucene, which he created out of his passion for knowledge, information retrieval, and management. Previous technical publications include several articles about Lucene, published by O’Reilly Network and IBM developerWorks. Otis also wrote To Choose and Be Chosen: Pursuing Education in America, a guidebook for foreigners wishing to study in the United States; it’s based on his own experience. Otis is from Croatia and currently lives in New York City.

About the title

By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.

Although no one at Manning is a cognitive scientist, we are convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, re-telling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action guide is that it is example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.

There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.

About the cover illustration

The figure on the cover of Lucene in Action is “An inhabitant of the coast of Syria.” The illustration is taken from a collection of costumes of the Ottoman Empire published on January 1, 1802, by William Miller of Old Bond Street, London. The title page is missing from the collection and we have been unable to track it down to date. The book’s table of contents identifies the figures in both English and French, and each illustration bears the names of two artists who worked on it, both of whom would no doubt be surprised to find their art gracing the front cover of a computer programming book two hundred years later.

The collection was purchased by a Manning editor at an antiquarian flea market in the “Garage” on West 26th Street in Manhattan. The seller was an American based in Ankara, Turkey, and the transaction took place just as he was packing up his stand for the day. The Manning editor did not have on his person the substantial amount of cash that was required for the purchase and a credit card and check were both politely turned down.

With the seller flying back to Ankara that evening the situation was getting hopeless. What was the solution? It turned out to be nothing more than an old-fashioned verbal agreement sealed with a handshake. The seller simply proposed that the money be transferred to him by wire and the editor walked out with the seller’s bank information on a piece of paper and the portfolio of images under his arm. Needless to say, we transferred the funds the next day, and we remain grateful and impressed by this unknown person’s trust in one of us. It recalls something that might have happened a long time ago.

The pictures from the Ottoman collection, like the other illustrations that appear on our covers, bring to life the richness and variety of dress customs of two centuries ago. They recall the sense of isolation and distance of that period—and of every other historic period except our own hyperkinetic present.

Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.

We at Manning celebrate the inventiveness, the initiative, and, yes, the fun of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by the pictures from this collection.