Table of Contents
foreword xvii
preface xix
acknowledgments xxii
about this book xxv
Part 1 Core Lucene 1
- 1 Meet Lucene 3
- 1.1 Evolution of information organization and access 4
- 1.2 Understanding Lucene 6
- What Lucene is 7
- What Lucene can do for you 7
- History of Lucene 9
- Who uses Lucene 10
- Lucene ports: Perl, Python, C++, .NET, Ruby 10
- 1.3 Indexing and searching 10
- What is indexing, and why is it important? 10
- What is searching? 11
- 1.4 Lucene in action: a sample application 11
- Creating an index 12
- Searching an index 15
- 1.5 Understanding the core indexing classes 18
- IndexWriter 19
- Directory 19
- Analyzer 19
- Document 20
- Field 20
- 1.6 Understanding the core searching classes 22
- IndexSearcher 23
- Term 23
- Query 23
- TermQuery 24
- Hits 24
- 1.7 Review of alternate search products 24
- IR libraries 24
- Indexing and searching applications 26
- Online resources 27
- 1.8 Summary 27
- 2 Indexing 28
- 2.1 Understanding the indexing process 29
- Conversion to text 29
- Analysis 30
- Index writing 31
- 2.2 Basic index operations 31
- Adding documents to an index 31
- Removing Documents from an index 33
- Undeleting Documents 36
- Updating Documents in an index 36
- 2.3 Boosting Documents and Fields 38
- 2.4 Indexing dates 39
- 2.5 Indexing numbers 40
- 2.6 Indexing Fields used for sorting 41
- 2.7 Controlling the indexing process 42
- Tuning indexing performance 42
- In-memory indexing: RAMDirectory 48
- Limiting Field sizes: maxFieldLength 54
- 2.8 Optimizing an index 56
- 2.9 Concurrency, thread-safety, and locking issues 59
- Concurrency rules 59
- Thread-safety 60
- Index locking 62
- Disabling index locking 66
- 2.10 Debugging indexing 66
- 2.11 Summary 67
- 3 Adding search to your application 68
- 3.1 Implementing a simple search feature 69
- Searching for a specific term 70
- Parsing a user-entered query expression: QueryParser 72
- 3.2 Using IndexSearcher 75
- Working with Hits 76
- Paging through Hits 77
- Reading indexes into memory 77
- 3.3 Understanding Lucene scoring 78
- Lucene, you got a lot of 'splainin' to do! 80
- 3.4 Creating queries programmatically 81
- Searching by term: TermQuery 82
- Searching within a range: RangeQuery 83
- Searching on a string: PrefixQuery 84
- Combining queries: BooleanQuery 85
- Searching by phrase: PhraseQuery 87
- Searching by wildcard: WildcardQuery 90
- Searching for similar terms: FuzzyQuery 92
- 3.5 Parsing query expressions: QueryParser 93
- Query.toString 94
- Boolean operators 94
- Grouping 95
- Field selection 95
- Range searches 96
- Phrase queries 98
- Wildcard and prefix queries 99
- Fuzzy queries 99
- Boosting queries 99
- To QueryParse or not to QueryParse? 100
- 3.6 Summary 100
- 4 Analysis 102
- 4.1 Using analyzers 104
- Indexing analysis 105
- QueryParser analysis 106
- Parsing versus analysis: when an analyzer isn't appropriate 107
- 4.2 Analyzing the analyzer 107
- What's in a token? 108
- TokenStreams uncensored 109
- Visualizing analyzers 112
- Filtering order can be important 116
- 4.3 Using the built-in analyzers 119
- StopAnalyzer 119
- StandardAnalyzer 120
- 4.4 Dealing with keyword fields 121
- Alternate keyword analyzer 125
- 4.5 "Sounds like" querying 125
- 4.6 Synonyms, aliases, and words that mean the same 128
- Visualizing token positions 134
- 4.7 Stemming analysis 136
- Leaving holes 136
- Putting it together 137
- Hole lot of trouble 138
- 4.8 Language analysis issues 140
- Unicode and encodings 140
- Analyzing non-English languages 141
- Analyzing Asian languages 142
- Zaijian 145
- 4.9 Nutch analysis 145
- 4.10 Summary 147
- 5 Advanced search techniques 149
- 5.1 Sorting search results 150
- Using a sort 150
- Sorting by relevance 152
- Sorting by index order 153
- Sorting by a field 154
- Reversing sort order 154
- Sorting by multiple fields 155
- Selecting a sorting field type 156
- Using a nondefault locale for sorting 157
- Performance effect of sorting 157
- 5.2 Using PhrasePrefixQuery 157
- 5.3 Querying on multiple fields at once 159
- 5.4 Span queries: Lucene's new hidden gem 161
- Building block of spanning, SpanTermQuery 163
- Finding spans at the beginning of a field 165
- Spans near one another 166
- Excluding span overlap from matches 168
- Spanning the globe 169
- SpanQuery and QueryParser 170
- 5.5 Filtering a search 171
- Using DateFilter 171
- Using QueryFilter 173
- Security filters 174
- A QueryFilter alternative 176
- Caching filter results 177
- Beyond the built-in filters 177
- 5.6 Searching across multiple Lucene indexes 178
- Using MultiSearcher 178
- Multithreaded searching using ParallelMultiSearcher 180
- 5.7 Leveraging term vectors 185
- Books like this 186
- What category? 189
- 5.8 Summary 193
- 6 Extending search 194
- 6.1 Using a custom sort method 195
- Accessing values used in custom sorting 200
- 6.2 Developing a custom HitCollector 201
- About BookLinkCollector 202
- Using BookLinkCollector 202
- 6.3 Extending QueryParser 203
- Customizing QueryParser's behavior 203
- Prohibiting fuzzy and wildcard queries 204
- Handling numeric field-range queries 205
- Allowing ordered phrase queries 208
- 6.4 Using a custom filter 209
- Using a filtered query 212
- 6.5 Performance testing 213
- Testing the speed of a search 213
- Load testing 217
- QueryParser again! 218
- Morals of performance testing 220
- 6.6 Summary 220
Part 2 Applied Lucene 221
- 7 Parsing common document formats 223
- 7.1 Handling rich-text documents 224
- Creating a common DocumentHandler interface 225
- 7.2 Indexing XML 226
- Parsing and indexing using SAX 227
- Parsing and indexing using Digester 230
- 7.3 Indexing a PDF document 235
- Extracting text and indexing using PDFBox 236
- Built-in Lucene support 239
- 7.4 Indexing an HTML document 241
- Getting the HTML source data 242
- Using JTidy 242
- Using NekoHTML 245
- 7.5 Indexing a Microsoft Word document 248
- Using POI 249
- Using TextMining.org's API 250
- 7.6 Indexing an RTF document 252
- 7.7 Indexing a plain-text document 253
- 7.8 Creating a document-handling framework 254
- FileHandler interface 255
- ExtensionFileHandler 257
- FileIndexer application 260
- Using FileIndexer 262
- FileIndexer drawbacks, and how to extend the framework 263
- 7.9 Other text-extraction tools 264
- Document-management systems and services 264
- 7.10 Summary 265
- 8 Tools and extensions 267
- 8.1 Playing in Lucene's Sandbox 268
- 8.2 Interacting with an index 269
- lucli: a command-line interface 269
- Luke: the Lucene Index Toolbox 271
- LIMO: Lucene Index Monitor 279
- 8.3 Analyzers, tokenizers, and TokenFilters, oh my 282
- SnowballAnalyzer 283
- Obtaining the Sandbox analyzers 284
- 8.4 Java Development with Ant and Lucene 284
- Using the <index> task 285
- Creating a custom document handler 286
- Installation 290
- 8.5 JavaScript browser utilities 290
- JavaScript query construction and validation 291
- Escaping special characters 292
- Using JavaScript support 292
- 8.6 Synonyms from WordNet 292
- Building the synonym index 294
- Tying WordNet synonyms into an analyzer 296
- Calling on Lucene 297
- 8.7 Highlighting query terms 300
- Highlighting with CSS 301
- Highlighting Hits 303
- 8.8 Chaining filters 304
- 8.9 Storing an index in Berkeley DB 307
- Coding to DbDirectory 308
- Installing DbDirectory 309
- 8.10 Building the Sandbox 309
- Check it out 310
- Ant in the Sandbox 310
- 8.11 Summary 311
- 9 Lucene ports 312
- 9.1 Ports' relation to Lucene 313
- 9.2 CLucene 314
- Supported platforms 314
- API compatibility 314
- Unicode support 316
- Performance 317
- Users 317
- 9.3 dotLucene 317
- API compatibility 317
- Index compatibility 318
- Performance 318
- Users 318
- 9.4 Plucene 318
- API compatibility 319
- Index compatibility 320
- Performance 320
- Users 320
- 9.5 Lupy 320
- API compatibility 320
- Index compatibility 322
- Performance 322
- Users 322
- 9.6 PyLucene 322
- API compatibility 323
- Index compatibility 323
- Performance 323
- Users 323
- 9.7 Summary 324
- 10 Case studies 325
- 10.1 Nutch: The NPR of search engines 326
- More in depth 327
- Other Nutch features 328
- 10.2 Using Lucene at jGuru 329
- Topic lexicons and document categorization 330
- Search database structure 331
- Index fields 332
- Indexing and content preparation 333
- Queries 335
- JGuruMultiSearcher 339
- Miscellaneous 340
- 10.3 Using Lucene in SearchBlox 341
- Why choose Lucene? 341
- SearchBlox architecture 342
- Search results 343
- Language support 343
- Reporting Engine 344
- Summary 344
- 10.4 Competitive intelligence with Lucene in XtraMind's XM-InformationMinder™ 344
- The system architecture 347
- How Lucene has helped us 350
- 10.5 Alias-i: orthographic variation with Lucene 351
- Alias-i application architecture 352
- Orthographic variation 354
- The noisy channel model of spelling correction 355
- The vector comparison model of spelling variation 356
- A subword Lucene analyzer 357
- Accuracy, efficiency, and other applications 360
- Mixing in context 360
- References 361
- 10.6 Artful searching at Michaels.com 361
- Indexing content 362
- Searching content 367
- Search statistics 370
- Summary 371
- 10.7 I love Lucene: TheServerSide 371
- Building better search capability 371
- High-level infrastructure 373
- Building the index 374
- Searching the index 377
- Configuration: one place to rule them all 379
- Web tier: TheSeeeeeeeeeeeerverSide? 383
- Summary 385
- 10.8 Conclusion 385
appendix A Installing Lucene 387
appendix B Lucene index format 393
appendix C Resources 408
index 415