Lucene in Action
Erik Hatcher and Otis Gospodnetic
  • December 2004
  • ISBN 9781932394283
  • 456 pages
  • printed in black & white
This title is out of print and no longer for sale.

...packed with examples and advice on how to effectively use this incredibly powerful tool.

Brian Goetz, Quiotix Corporation


Lucene in Action, Second Edition is now available. An eBook of this older edition is included at no additional cost when you buy the revised edition!

Lucene is a gem in the open-source world: a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including formats you definitely need to know, such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.

Table of Contents




about this book

Part 1 Core Lucene

1. Meet Lucene

1.1. Evolution of information organization and access

1.2. Understanding Lucene

1.3. Indexing and searching

1.4. Lucene in action: a sample application

1.5. Understanding the core indexing classes

1.6. Understanding the core searching classes

1.7. Review of alternate search products

1.8. Summary

2. Indexing

2.1. Understanding the indexing process

2.2. Basic index operations

2.3. Boosting Documents and Fields

2.4. Indexing dates

2.5. Indexing numbers

2.6. Indexing Fields used for sorting

2.7. Controlling the indexing process

2.8. Optimizing an index

2.9. Concurrency, thread-safety, and locking issues

2.10. Debugging indexing

2.11. Summary

3. Adding search to your application

3.1. Implementing a simple search feature

3.2. Using IndexSearcher

3.3. Understanding Lucene scoring

3.4. Creating queries programmatically

3.5. Parsing query expressions: QueryParser

3.6. Summary

4. Analysis

4.1. Using analyzers

4.2. Analyzing the analyzer

4.3. Using the built-in analyzers

4.4. Dealing with keyword fields

4.5. "Sounds like" querying

4.6. Synonyms, aliases, and words that mean the same

4.7. Stemming analysis

4.8. Language analysis issues

4.9. Nutch analysis

4.10. Summary

5. Advanced search techniques

5.1. Sorting search results

5.2. Using PhrasePrefixQuery

5.3. Querying on multiple fields at once

5.4. Span queries: Lucene’s new hidden gem

5.6. Searching across multiple Lucene indexes

5.7. Leveraging term vectors

5.8. Summary

6. Extending search

6.1. Using a custom sort method

6.2. Developing a custom HitCollector

6.3. Extending QueryParser

6.4. Using a custom filter

6.5. Performance testing

6.6. Summary

Part 2 Applied Lucene

7. Parsing common document formats

7.1. Handling rich-text documents

7.2. Indexing XML

7.3. Indexing a PDF document

7.4. Indexing an HTML document

7.5. Indexing a Microsoft Word document

7.6. Indexing an RTF document

7.7. Indexing a plain-text document

7.8. Creating a document-handling framework

7.9. Other text-extraction tools

7.10. Summary

8. Tools and extensions

8.1. Playing in Lucene’s Sandbox

8.2. Interacting with an index

8.3. Analyzers, tokenizers, and TokenFilters, oh my

8.4. Java Development with Ant and Lucene

8.5. JavaScript browser utilities

8.6. Synonyms from WordNet

8.7. Highlighting query terms

8.8. Chaining filters

8.9. Storing an index in Berkeley DB

8.10. Building the Sandbox

8.11. Summary

9. Lucene ports

9.1. Ports' relation to Lucene

9.2. CLucene

9.3. dotLucene

9.4. Plucene

9.5. Lupy

9.6. PyLucene

9.7. Summary

10. Case studies

10.1. Nutch: "The NPR of search engines"

10.2. Using Lucene at jGuru

10.3. Using Lucene in SearchBlox

10.4. Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™

10.5. Alias-i: orthographic variation with Lucene

10.6. Artful searching at

10.7. I love Lucene: TheServerSide

10.8. Conclusion

Appendix A: Installing Lucene

Appendix B: Lucene index format

Appendix C: Resources


About the Technology

Lucene powers search in surprising places: in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, and in the Nutch web search engine, which scales to billions of pages. It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.

About the book

Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine: take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.

What's inside

  • How to integrate Lucene into your applications
  • Ready-to-use framework for rich document handling
  • Case studies including Nutch, TheServerSide, jGuru, etc.
  • Lucene ports to Perl, Python, C#/.Net, and C++
  • Sorting, filtering, term vectors, and multiple and remote index searching
  • The new SpanQuery family, extending query parser, hit collecting
  • Performance testing and tuning
  • Lucene add-ons (hit highlighting, synonym lookup, and others)
  • Foreword by Doug Cutting, the inventor of Lucene

About the authors

A committer on the Ant, Lucene, and Tapestry open-source projects, Erik Hatcher is coauthor of Manning's award-winning Java Development with Ant. Otis Gospodnetic is a Lucene committer, a member of the Apache Jakarta Project Management Committee, and maintainer of jGuru's Lucene FAQ. Both authors have published numerous technical articles, including several on Lucene.