Lucene in Action, Second Edition
Michael McCandless, Erik Hatcher, and Otis Gospodnetić
  • July 2010
  • ISBN 9781933988177
  • 532 pages
  • printed in black & white

... brings you up to speed.

Doug Cutting, Founder of Lucene, Nutch, and Hadoop

When Lucene first appeared, this superfast search engine was nothing short of amazing. Today, Lucene still delivers. Its high-performance, easy-to-use API, features like numeric fields, payloads, near-real-time search, and huge increases in indexing and searching speed make it the leading search tool.

And with clear writing, reusable examples, and unmatched advice, Lucene in Action, Second Edition is still the definitive guide to effectively integrating search into your applications. This totally revised book shows you how to index your documents, including formats such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, and filtering, and covers the numerous improvements to Lucene since the first edition. The accompanying source code targets Lucene 3.0.1.

Table of Contents

foreword

preface

preface to the first edition

acknowledgments

about this book

about the authors

JUnit primer

Part 1 Core Lucene

1. Meet Lucene

1.1. Dealing with information explosion

1.2. What is Lucene?

1.3. Lucene and the components of a search application

1.4. Lucene in action: a sample application

1.5. Understanding the core indexing classes

1.6. Understanding the core searching classes

1.7. Summary

2. Building a search index

2.1. How Lucene models content

2.2. Understanding the indexing process

2.3. Basic index operations

2.4. Field options

2.5. Boosting documents and fields

2.6. Indexing numbers, dates, and times

2.7. Field truncation

2.8. Near-real-time search

2.9. Optimizing an index

2.10. Other directory implementations

2.11. Concurrency, thread safety, and locking issues

2.12. Debugging indexing

2.13. Advanced indexing concepts

2.14. Summary

3. Adding search to your application

3.1. Implementing a simple search feature

3.2. Using IndexSearcher

3.3. Understanding Lucene scoring

3.4. Lucene’s diverse queries

3.5. Parsing query expressions: QueryParser

3.6. Summary

4. Lucene’s analysis process

4.1. Using analyzers

4.2. What’s inside an analyzer?

4.3. Using the built-in analyzers

4.4. Sounds-like querying

4.5. Synonyms, aliases, and words that mean the same

4.6. Stemming analysis

4.7. Field variations

4.8. Language analysis issues

4.9. Nutch analysis

4.10. Summary

5. Advanced search techniques

5.1. Lucene’s field cache

5.2. Sorting search results

5.3. Using MultiPhraseQuery

5.4. Querying on multiple fields at once

5.5. Span queries

5.6. Filtering a search

5.7. Custom scoring using function queries

5.8. Searching across multiple Lucene indexes

5.9. Leveraging term vectors

5.10. Loading fields with FieldSelector

5.11. Stopping a slow search

5.12. Summary

6. Extending search

6.1. Using a custom sort method

6.2. Developing a custom Collector

6.3. Extending QueryParser

6.4. Custom filters

6.5. Payloads

6.6. Summary

Part 2 Applied Lucene

7. Extracting text with Tika

7.1. What is Tika?

7.2. Tika’s logical design and API

7.3. Installing Tika

7.4. Tika’s built-in text extraction tool

7.5. Extracting text programmatically

7.6. Tika’s limitations

7.7. Indexing custom XML

7.8. Alternatives

7.9. Summary

8. Essential Lucene extensions

8.1. Luke, the Lucene Index Toolbox

8.2. Analyzers, tokenizers, and TokenFilters

8.3. Highlighting query terms

8.4. FastVectorHighlighter

8.5. Spell checking

8.6. Fun and interesting Query extensions

8.7. Building contrib modules

8.8. Summary

9. Further Lucene extensions

9.1. Chaining filters

9.2. Storing an index in Berkeley DB

9.3. Synonyms from WordNet

9.4. Fast memory-based indices

9.5. XML QueryParser: Beyond "one box" search interfaces

9.6. Surround query language

9.7. Spatial Lucene

9.8. Searching multiple indexes remotely

9.9. Flexible QueryParser

9.10. Odds and ends

9.11. Summary

10. Using Lucene from other programming languages

10.1. Ports primer

10.2. CLucene (C++)

10.3. Lucene.Net (C# and other .NET languages)

10.4. KinoSearch and Lucy (Perl)

10.5. Ferret (Ruby)

10.6. PHP

10.7. PyLucene (Python)

10.8. Solr (many programming languages)

10.9. Summary

11. Lucene administration and performance tuning

11.1. Performance tuning

11.2. Threads and concurrency

11.3. Managing resource consumption

11.4. Hot backups of the index

11.5. Common errors

11.6. Summary

Part 3 Case studies

12. Case study 1: Krugle

Krugle: Searching source code

12.1. Introducing Krugle

12.2. Appliance architecture

12.3. Search performance

12.4. Parsing source code

12.5. Substring searching

12.7. Future improvements

12.8. Summary

13. Case study 2: SIREn

Searching semistructured documents with SIREn

13.1. Introducing SIREn

13.2. SIREn’s benefits

13.3. Indexing entities with SIREn

13.4. Searching entities with SIREn

13.5. Integrating SIREn in Solr

13.6. Benchmark

13.7. Summary

14. Case study 3: LinkedIn

Adding facets and real-time search with Bobo Browse and Zoie

14.1. Faceted search with Bobo Browse

14.2. Real-time search with Zoie

14.3. Summary

Appendix A: Installing Lucene

Appendix B: Lucene index format

Appendix C: Lucene/contrib benchmark

Appendix D: Resources

index

What's inside

  • Performing hot backups
  • Using numeric fields
  • Tuning for indexing or searching speed
  • Boosting matches with payloads
  • Creating reusable analyzers
  • Adding concurrency with threads
  • Four new case studies
  • Much more!

About the authors

Michael McCandless is a Lucene PMC member and committer with more than a decade of experience building search engines. Erik Hatcher and Otis Gospodnetić are the authors of the first edition of Lucene in Action and long-time contributors to Lucene, Solr, Mahout, and other Lucene-based projects.


This new edition has it all.

Chad Davis, Blackdog Software, Author of Struts 2 in Action

Very readable, full of expert tips.

Rick Wagner, Acxiom Corp.

Elegant, and easy to read - just like Lucene itself.

Shai Erera, IBM Haifa Research Labs

For a Lucene developer, it's required reading.

Stuart Caborn, Thoughtworks