About this book

Lucene in Action, Second Edition delivers details, best practices, caveats, tips, and tricks for using the best open-source search engine available.

This book assumes the reader is familiar with basic Java programming. Lucene’s core is a single Java Archive (JAR) file, less than 1 MB in size and with no dependencies; it integrates into the simplest stand-alone Java console program as well as the most sophisticated enterprise application.

Roadmap

We organized part 1 of this book to cover the core Lucene Application Programming Interface (API) in the order you’re likely to encounter it as you integrate Lucene into your applications.

Part 2 goes beyond Lucene’s built-in facilities and shows you what can be done around and above Lucene.

Part 3 (chapters 12, 13, and 14) brings all the technical details of Lucene back into focus with case studies contributed by those who have built interesting, fast, and scalable applications with Lucene at their core.

What’s new in the second edition?

Much has changed in Lucene in the 5 years since this book was originally published. As is often the case with a successful open-source project with a strong technical architecture, a robust community of users and developers has thrived over time, and from all that energy has emerged a number of amazing improvements. Here’s a sampling of the changes:

Entirely new case studies have been added in chapters 12, 13, and 14. A new chapter (11) has been added to cover the administrative aspects of Lucene. Chapter 7, which previously described a custom framework for parsing different document types, has been rewritten entirely based on Tika. In addition, all code samples have been updated to Lucene’s 3.0.1 APIs. And of course, lots of great feedback from our readers has been folded in (thank you, and please keep it coming!).

Who should read this book?

Developers who need powerful search capabilities embedded in their applications should read this book. Lucene in Action, Second Edition is also suitable for developers who are curious about Lucene or indexing and search techniques, but who may not have an immediate need to use it. Adding Lucene know-how to your toolbox is valuable for future projects—search is a hot topic and will continue to be in the future.

This book primarily uses the Java version of Lucene (from Apache), and the majority of the code examples use the Java language. Readers familiar with Java will be right at home. Java expertise will be helpful; however, Lucene has been ported to a number of other languages including C++, C#, Python, and Perl. The concepts, techniques, and even the API itself are comparable between the Java and other language versions of Lucene.

Code examples

The source code for this book is available from Manning’s website at http://www.manning.com/LuceneinActionSecondEdition or http://www.manning.com/hatcher3. Instructions for using this code are provided in the README file included with the source-code package.

The majority of the code shown in this book was written by us and is included in the source-code package, licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0). Some code (particularly the case-study code, and the examples from Lucene’s ports to other programming languages) isn’t provided in our source-code package; the code snippets shown there are owned by the contributors and are donated as is. In a couple of cases, we have included a small snippet of code from Lucene’s codebase, which is also licensed under the Apache License, Version 2.0.

Code examples don’t include package and import statements, to conserve space; refer to the actual source code for these details. Likewise, in the name of brevity and keeping examples focused on Lucene’s code, there are numerous places where we simply declare throws Exception, while for production code you should declare and catch only specific exceptions and implement proper handling when exceptions occur. In some cases there are fragments of code, inlined in the text, that are not full standalone examples; these cases are included in source files named Fragments.java, under each subdirectory.
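To make the throws Exception caveat concrete, here is a minimal sketch that contrasts the two styles. It uses only the JDK (no Lucene classes), and the names ExceptionStyles, countLinesBrief, and countLinesCareful are hypothetical; they don't appear in the book's source-code package. The policy of treating a missing file as zero lines is likewise just an assumption for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration; not part of the book's source-code package.
final class ExceptionStyles {

  // Book style: a blanket "throws Exception" keeps the listing short.
  static int countLinesBrief(Path file) throws Exception {
    return Files.readAllLines(file, StandardCharsets.UTF_8).size();
  }

  // Production style: catch only specific exceptions and handle each one.
  static int countLinesCareful(Path file) {
    try {
      return Files.readAllLines(file, StandardCharsets.UTF_8).size();
    } catch (NoSuchFileException e) {
      // Assumed policy: a missing file counts as "no lines".
      return 0;
    } catch (IOException e) {
      // Other I/O failures are reported with context rather than swallowed.
      throw new UncheckedIOException("Could not read " + file, e);
    }
  }
}
```

The first method is what the listings in this book tend to look like; the second is closer to what you would want in production, where each failure mode gets a deliberate response.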

Why JUnit?

We believe code examples in books should be top-notch quality and real-world applicable. The typical “hello world” examples often insult our intelligence and generally do little to help readers see how to really adapt to their environment.

We’ve taken a unique approach to the code examples in Lucene in Action, Second Edition. Many of our examples are actual JUnit test cases (http://www.junit.org), version 4.1. JUnit, the de facto Java unit-testing framework, easily allows code to assert that a particular assumption works as expected in a repeatable fashion. It also cleanly separates what we are trying to accomplish, by showing the small test case up front, from how we accomplish it, by showing the source code behind the APIs invoked by the test case. Automating JUnit test cases through an IDE or Ant allows one-step (or no steps with continuous integration) confidence building. We chose to use JUnit in this book because we use it daily in our other projects and want you to see how we really code. Test Driven Development (TDD) is a development practice we strongly espouse.

If you’re unfamiliar with JUnit, please read the JUnit primer section. We also suggest that you read Pragmatic Unit Testing in Java with JUnit by Dave Thomas and Andy Hunt, followed by Manning’s JUnit in Action by Vincent Massol and Ted Husted, a second edition of which is in the works by Petar Tahchiev, Felipe Leme, Vincent Massol, and Gary Gregory.

Code conventions and downloads

Source code in listings or in text is in a fixed-width font to separate it from ordinary text. Java method names, within text, generally won’t include the full method signature.

In order to accommodate the available page space, code has been formatted with a limited width, including line continuation markers where appropriate.

We don’t include import statements and rarely refer to fully qualified class names, because they get in the way and take up valuable space. Refer to Lucene’s Javadocs for this information. All decent IDEs have excellent support for automatically adding import statements; Erik blissfully codes without knowing fully qualified class names using IntelliJ IDEA, and Otis and Mike both use XEmacs. Add the Lucene JAR to your project’s classpath, and you’re all set. Also on the classpath issue (which is a notorious nuisance), we assume that the Lucene JAR and any other necessary JARs are available in the classpath and don’t show it explicitly. The lib directory, with the source code, includes JARs that the source code uses. When you run the Ant targets, these JARs are placed on the classpath for you.

We’ve created a lot of examples for this book that are freely available to you. A .zip file of all the code is available from Manning’s website for Lucene in Action: http://www.manning.com/hatcher3. Detailed instructions on running the sample code are provided in the main directory of the expanded archive as a README file.

Our test data

Most of our book revolves around a common set of example data to provide consistency and avoid having to grok an entirely new set of data for each section. This example data consists of book details. Table 1 shows the data so that you can reference it and make sense of our examples.

The data, besides the fields shown in the table, includes fields for ISBN, URL, and publication month. When you unzip the source code available for download at www.manning.com/hatcher3, the books are represented as *.properties files under the data subdirectory, and the command-line tool at src/lia/common/CreateTestIndex.java is used to create the test index used throughout the book. The fields for category and subject are our own subjective values, but the other information is objectively factual about the books.

Title | Author | Category | Subject
A Modern Art of Education | Rudolf Steiner | /education/pedagogy | education philosophy psychology practice Waldorf
Lipitor, Thief of Memory | Duane Graveline, Kilmer S. McCully, Jay S. Cohen | /health | cholesterol,statin,lipitor
Nudge: Improving Decisions About Health, Wealth, and Happiness | Richard H. Thaler, Cass R. Sunstein | /health | information architecture,decisions,choices
Imperial Secrets of Health and Longevity | Bob Flaws | /health/alternative/Chinese | diet chinese medicine qi gong health herbs
Tao Te Ching 道德經 | Stephen Mitchell | /philosophy/eastern | taoism
Gödel, Escher, Bach: an Eternal Golden Braid | Douglas Hofstadter | /technology/computers/ai | artificial intelligence number theory mathematics music
Mindstorms: Children, Computers, And Powerful Ideas | Seymour Papert | /technology/computers/programming/education | children computers powerful ideas LOGO education
Ant in Action | Steve Loughran, Erik Hatcher | /technology/computers/programming | apache ant build tool junit java development
JUnit in Action, Second Edition | Petar Tahchiev, Felipe Leme, Vincent Massol, Gary Gregory | /technology/computers/programming | junit unit testing mock objects
Lucene in Action, Second Edition | Michael McCandless, Erik Hatcher, Otis Gospodnetić | /technology/computers/programming | lucene search java
Extreme Programming Explained | Kent Beck | /technology/computers/programming/methodology | extreme programming agile test driven development methodology
Tapestry in Action | Howard Lewis-Ship | /technology/computers/programming | tapestry web user interface components
The Pragmatic Programmer | Dave Thomas, Andy Hunt | /technology/computers/programming | pragmatic agile methodology developer tools
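
As a rough illustration of the *.properties representation mentioned above, the following sketch reads one book record with the JDK's java.util.Properties class. The key names (title, author, category, subject) are assumptions chosen to mirror the table columns, and the class BookPropertiesSketch is hypothetical; check the files in the data directory of the source-code package for the actual keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch; the key names are assumptions, not taken from the book's data files.
final class BookPropertiesSketch {

  // Parses .properties-formatted text (key=value lines) into a Properties object.
  static Properties parse(String text) throws IOException {
    Properties props = new Properties();
    props.load(new StringReader(text));
    return props;
  }

  // Builds a one-line summary from the assumed "title" and "category" keys.
  static String describe(Properties book) {
    return book.getProperty("title") + " [" + book.getProperty("category") + "]";
  }
}
```

Given text containing the lines title=Ant in Action and category=/technology/computers/programming, calling describe on the parsed result would yield a summary built from those two values. Reading a real file works the same way, with a file reader passed to load instead of a StringReader.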

Author online

The purchase of Lucene in Action, Second Edition includes free access to a web forum run by Manning Publications, where you can discuss the book with the authors and other readers. To access the forum and subscribe to it, point your web browser to http://www.manning.com/LuceneinActionSecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.

About the title

By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.

Although no one at Manning is a cognitive scientist, we are convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, re-telling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action guide is that it is example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.

There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.