Index


 
A

abbreviation, handling 355
accuracy 360
Ackley, Ryan 250
Adobe Systems 235
agent, distributed 349
AliasAnalyzer 364
Alias-i 361
Almaer, Dion 371
alternative spellings 354
analysis 103
during indexing 105
field-specific 108
foreign languages 140
in Nutch 145
position gaps 136
positional gap issues 138
versus parsing 107
with QueryParser 106
Analyzers 19
additional 282
Brazilian 282
buffering 130
building blocks 110
built-in 104, 119
Chinese 282
choosing 103
CJK 282
Dutch 282
field types 105
for highlighting 300
French 282
injecting synonyms 129, 296
SimpleAnalyzer 108
Snowball 283
StandardAnalyzer 120
StopAnalyzer 119
subword 357
using WordNet 296
visualizing 112
WhitespaceAnalyzer 104
with QueryParser 72
Ant
building Lucene 391
building Sandbox 310
indexing a fileset 284
Antiword 264
ANTLR 100, 336
Apache Jakarta 7, 9
Apache Software Foundation 9
Apache Software License 7
Arabic 359
architecture
field design 374
TheServerSide configuration 379
ASCII 142
Asian language analysis 142

 
B

Bakhtiar, Amir 320
Beagle 318
Bell, Timothy C. 26
Berkeley DB, storing index 307
Bialecki, Andrzej 271
biomedical, use of Lucene 352
BooleanQuery 85
from QueryParser 72, 87
n-gram extension 358
TooManyClauses exception 215
used with PhraseQuery 158
boosting 79
documents 377
documents and fields 38–39
BrazilianAnalyzer 282

 
C

C++ 10
CachingWrapperFilter 171, 177
caching DateFilter 173
Cafarella, Michael 326
Carpenter, Bob 351
cell phone, T9 WordNet interface 297
ChainedFilter 177, 304
Chandler 307, 322
charades 125
Chinese analysis 142–143, 282
CJK (Chinese Japanese Korean) 142
CJKAnalyzer 143, 145, 282
Clark, Andy 245
Clark, Mike 214
CLucene 314, 317
supported platforms 314
Unicode support 316
color
distance formula 366
indexing 365
command-line interface 269
compound index
creating 400
format 341
converting native files to ASCII 142
coordination, query term 79
Cozens, Simon 318
CPAN 318
crawler 372
in SearchBlox 342
with XM-InformationMinder 347
crawling alternatives 330
CSS in highlighting 301
Cutting, Doug 9
relevant work 9
CVS
obtaining Lucene’s source code 391
Sandbox 268
CyberNeko. See NekoHTML
CzechAnalyzer 282

 
D

database 8
indexing 362
primary key 362
searching 362
storing index inside Berkeley DB 307
date, indexing 216
DateField 39
alternatives 218
issue 216
min and max constants 173
range queries 96
used with DateFilter 173
DateFilter 171–173
caching 177
open-ended ranges 172
with caching 177
within ChainedFilter 306
DbDirectory 308
debugging, queries 94
DefaultSimilarity 79
deleting documents 375
Digester
configuration 379
Directory 19
FSDirectory 19
RAMDirectory 19
directory in Berkeley DB 308
DMOZ 27
DNA 354
Docco 265
DocSearcher 264
Document 20, 71
copy/paste from Luke 274
editing with Luke 275
heterogeneous fields 33
document boosting 377
document frequency
seen with Luke 273
document handler
customizing for Ant 286
indexing with Ant 285
document type handling
in SearchBlox 342
documentation 388
dotLucene 317–318
downloading Lucene 388
Dutch 354
DutchAnalyzer 282

 
E

Egothor 24
encoding
ISO-8859-1 142
UTF-8 140
Etymon PJ 264
Explanation 80

 
F

Field 20–22
appending to 33
keyword, analysis 121
storing term vectors 185
file handle
issue 340
Filter 76
caching 177
ChainedFilter 304
custom 209
using HitCollector 203
within a Query 212
FilteredQuery 178, 212
filtering
search space 171–178
token. See TokenFilter
foreign language analysis 140
Formatter 300
Fragmenter 300
FrenchAnalyzer 282
fuzzy string similarity 351
FuzzyEnum 350
FuzzyQuery 92
from QueryParser 93
issues 350
performance issue 213
prohibiting 204

 
G

GCJ 308
German analysis 141
Giustina, Fabrizio 242
Glimpse 26
GNOME 318
Google 6, 27
alternative word suggestions 128
analysis 103
API 352
definitions 292
expense 372
term highlighting 300
government intelligence, use of Lucene 352

 
H

Harvest 26
Harvest-NG 26
Harwood, Mark 300
highlighting, query terms 300–303, 343
Hindi 354
HitCollector 76, 201–203
customizing 350
priority-queue idea 360
used by Filters 203
Hits 24, 70–71, 76
highlighting 303
ht://Dig 26
TheServerSide usage 371
HTML 8
cookie 77
highlighting 301
<meta> tag 140
parsing 107, 329, 352
HTMLParser 264
HTTP
crawler. See Nutch
session 77
HTTP request
content-type 140

 
I

I18N. See internationalization
index optimization 56–59
disk space requirements 56
performance effect 56
when to do it 58
why do it 57
index structure
converting 400–401
performance comparison 402
IndexFiles 389
IndexHTML 390
indexing
adding documents 31–33
analysis during 105
Ant task 285
at TheServerSide 373
browsing tool 271
buffering 42
colors 365
compound format 341
compound index 399–400
concurrency rules 59–60
creation of 12
data structures 11
dates 39–40, 216
debugging 66
directory structure 395
disabling locking 66
file format 404
file view with Luke 277
.fnm file 405
for sorting 41
format 393
framework 225–226, 254–263
HTML 241, 248
incremental 396
index files 397
jGuru design 332
limiting field length 54–55
locking 62–66
logical view 394
maxFieldLength 54–55
maxMergeDocs 42–47
mergeFactor 42–47
merging indexes 52
Microsoft Word documents 248–251
minMergeDocs 42, 47
multifile index structure 395
numbers 40–41
open files 47–48
parallelization 52–54
PDF 235–241
performance 42–47
plain-text documents 253–254
removing documents 33–36
rich-text documents 224
RTF documents 252–253
scheduling 367
segments 396–397
status with LIMO 279
steps 29–31
storing in Berkeley DB 307
term dictionary 406
term frequency 406
term positions 406
thread-safety 60–62
tools 269
undeleting documents 36
updating documents 36
batching 37
using RAMDirectory 48–52
XML 226–235
IndexReader 199
deleting documents 375
retrieving term vectors 186
IndexSearcher 23, 70, 78
n-gram extension 358
paging through results 77
using 75
IndexWriter 19
addDocument 106
analyzer 123
information overload 6
Information Retrieval (IR) 7
libraries 24–26
Installing Lucene 387–392
intelligent agent 6
internationalization 141
inverse document frequency 79
inverted index 404
IR. See Information Retrieval (IR)
ISO-8859-1 142

 
J

Jakarta Commons Digester 230–235
Jakarta POI 249–250
Japanese analysis 142
Java Message Service 352
in XM-InformationMinder 347
Java, keyword 331
JavaCC 100
building Lucene 392
JavaScript
character escaping 292
query construction 291
query validation 291
JDOM 264
jGuru 341
JGuruMultiSearcher 339
Jones, Tim 150
JPedal 264
jSearch 7
JTidy 242–245
indexing HTML with Ant 285
JUnitPerf 213
JWordNet 297

 
K

keyword analyzer 124
Konrad, Karsten 344
Korean analysis 142

 
L

language
handling 354
support 343
LARM 7, 372
Levenshtein distance algorithm 92
lexicon, definition 331
LIMO 279
LingPipe 353
linguistics 353
Litchfield, Ben 236
Lookout 6, 318
Lucene
building from source 391
community 10
demonstration applications 389–391
developers 10
documentation 388
downloading 388
history of 9
index 11
integration of 8
ports 10
sample application 11
Sandbox 268
understanding 6
users of 10
what it is 7
Lucene ports 312–324
summary 313
Lucene Wiki 7
Lucene.Net 6
lucli 269
Luke 271, 391
plug-ins 278
Lupy 308, 320–322

 
M

Managing Gigabytes 26
Matalon, Dror 269
Metaphone 125
MG4J 26
Michaels.com 361–371
Microsoft 6, 318
Microsoft Index Server 26
Microsoft Outlook 6, 318
Microsoft Windows 14
Microsoft Word 8
parsing 107
Miller, George 292
and WordNet 292
misspellings 354
matching 363
mock object 131, 211
Moffat, Alistair 26
morphological variation 355
Movable Type 320
MSN 6
MultiFieldQueryParser 160
multifile index, creating 398
multiple indexes 331
MultiSearcher 178–185
alternative 339
multithreaded searching. See ParallelMultiSearcher
Multivalent 264

 
N

Namazu 26
native2ascii 142
natural language with XM-InformationMinder 345
NekoHTML 245–248, 329, 352
.NET 10
n-gram TokenStream 357
NGramQuery 358
NGramSearcher 358
Nioche, Julien 279
noisy-channel model 355
normalization
field length 79
query 79
numeric
padding 206
range queries 205
Nutch 7, 9, 329
Explanation 81

 
O

OLE 2 Compound Document format 249
open files formula 401
OpenOffice SDK 264
optimize 340
orthographic variation 354
Overture 6

 
P

paging
at jGuru 336
TheServerSide search results 383
through Hits 77
ParallelMultiSearcher 180
Parr, Terence 329
ParseException 204, 379
parsing 73
query expressions. See QueryParser
QueryParser method 73
stripping plurals 334
versus analysis 107
partitioning indexes 180
PDF 8
See also indexing PDF
PDF Text Stream 264
PDFBox 236–241
built-in Lucene support 239
PerFieldAnalyzerWrapper
for Keyword fields 123
performance
issues with WildcardQuery 91
iterating Hits warning 369
load testing 217
of sorting 157
SearchBlox case study 341
statistics 370
testing 213, 220
Perl 10
pharmaceutical, uses of Lucene 347
PhrasePrefixQuery 157–159
handling synonyms alternative 134
PhraseQuery 87
compared to PhrasePrefixQuery 158
forcing term order 208
from QueryParser 90
in contrast to SpanNearQuery 166
multiple terms 89
position increment issue 138
scoring 90
slop factor 139
with synonyms 132
Piccolo 264
Plucene 318–320
POI 264
Porter stemming algorithm 136
Porter, Dr. Martin 25, 136, 283
position, increment offset in SpanQuery 161
precision 11, 360
PrefixQuery 84
from QueryParser 85
optimized WildcardQuery 92
Properties file, encoding 142
PyLucene 308, 322–323
Python 10

 
Q

Query 23, 70, 72
creating programmatically 81
preprocessing at jGuru 335
starts with 84
statistics 337
toString 94
See also QueryParser
query expression, parsing. See QueryParser
QueryFilter 171, 173, 209
alternative using BooleanQuery 176
as security filter 174
within ChainedFilter 305
QueryHandler 328
querying 70
QueryParser 70, 72–74, 93
analysis 106
analysis issues 134
analyzer choice 107
and SpanQuery 170
boosting queries 99
combining with another Query 82
combining with programmatic queries 100
creating BooleanQuery 87
creating FuzzyQuery 93, 99
creating PhraseQuery 90, 98
creating PrefixQuery 85, 99
creating RangeQuery 84
creating SpanNearQuery 208
creating TermQuery 83
creating WildcardQuery 91, 99
custom date parsing 218
date parsing locale 97
date ranges 96
default operator 94
escape characters 93
expression syntax 74
extending 203–209
field selection 95
grouping expressions 95
handling numeric ranges 205
issues 100, 107
Keyword fields 122
lowercasing wildcard and prefix queries 99
overriding for synonym injection 134
PhraseQuery issue 138
prohibiting expensive queries 204
range queries 96
TheServerSide custom implementation 378
Quick, Andy 242

 
R

Raggett, Dave 242
RAM, loading indexes into 77
RAMDirectory, loading file index into 77
RangeQuery 83
from QueryParser 84
handling numeric data 205
spanning multiple indexes 179
raw score 78
recall 11, 360
regular expressions. See WildcardQuery
relational database. See database
relevance 76
remote searching 180
RemoteSearchable 180
RGB indexing 366
RMI, searching via 180
Ruby 10
Russian analysis 141

 
S

Sandbox 268
analyzers 284
building components 309
ChainedFilter 177
Highlighter 300
SAX 352
scalability with SearchBlox 341
score 70, 77–78
normalization 78
ScoreDocComparator 198
Scorer 300
scoring 78
affected by HitCollector 203
formula 78
scrolling. See paging
search 68
products 26
resources 27
search engine 7
See also Nutch; SearchBlox
SearchBlox 7, 265–344
SearchClient 182
SearchFiles 389
searching 10
API 70
filtering results 171–178
for similar documents 186
indexes in parallel 180
multiple indexes 178
on multiple fields 159
TheServerSide 373
using HitCollector 201
with Luke 275
SearchServer 180
Searchtools 27
security filtering 174
Selvaraj, Robert 341
Short, Allen 320
similar term query. See FuzzyQuery
similarity 80
between documents. See term vectors
customizing 350
with XM-InformationMinder 345
SimpleAnalyzer 108, 119
example 104
SimpleHTMLFormatter 301
Simpy 265
slop
with PhrasePrefixQuery 159
with SpanNearQuery 166
Snowball 25
SnowballAnalyzer 282
SortComparatorSource 195, 198
SortField 200–201
sorting
accessing custom value 200
alphabetically 154
by a field 154
by geographic distance 195
by index order 153
by multiple fields 155
by relevance 152
custom method 195–201
example 150
field type 156
performance 157
reversing 154
search results 150–157
specifying locale 157
Soundex. See Metaphone
source code, Sandbox 268, 309
SpanFirstQuery 162, 165
Spanish 354
SpanNearQuery 99, 162, 166, 203, 208
SpanNotQuery 162, 168
SpanOrQuery 162, 169
SpanQuery 161–170
aggregating 169
and QueryParser 170
visualization utility 164
SpanTermQuery 162–165
spelling correction 354
Spencer, Dave 293
spidering alternatives 330
SQL 362
similarities with QueryParser 72
StandardAnalyzer 119–120
example 104–105
with Asian languages 143
with CJK characters 142, 145
statistics
at jGuru 337
Michaels.com 370
Steinbach, Ralf 344
stemming alternative 359
stemming analyzer 283
Stenzhorn, Holger 344
stop words 20, 103
at jGuru 335
StopAnalyzer 119
example 104
StringTemplate 330
SubWordAnalyzer 357
SWIG 308
SWISH 26
SWISH++ 26
SWISH-E 26
SynonymEngine 131
mock 132
synonyms
analyzer injection 129
indexing 363
injecting with PhrasePrefixQuery 159
with PhraseQuery 133
See also WordNet

 
T

T9, cell phone interface 297
Tan, Kelvin 291, 304
Term 23
term
definition 103
navigation with Luke 273
term frequency 79, 331
weighting 359
term vectors 185–193
aggregating 191
browsing with Luke 275
computing angles 192
computing archetype document 189
TermEnum 198
TermFreqVector 186
TermQuery 24, 71, 82
contrasted with SpanTermQuery 161
from QueryParser 83
with synonyms 132
TextMining.org 250–251
TheServerSide 385
Tidy. See JTidy
Token 108
TokenFilter 109
additional 282
ordering 116
tokenization
definition 103
tokenization. See analysis
Tokenizer 109
additional 282
n-gram 357
tokens
meta-data 109
offsets 116
position increment 109
position increment in Nutch 146
type 116, 127
visualizing positions 134
TokenStream 107
architecture 110
for highlighting 300
Tomcat
demo application 390
tool
command-line interface 269
Lucene Index Monitor 279
Luke 271
TopDocs 200
TopFieldDocs 200
transliteration 355, 359
troubleshooting 392

 
U

UbiCrawler 26
Unicode 140
UNIX 17
user interface 6
UTF-8 140

 
V

Vajda, Andi 308, 322
van Klinken, Ben 314
vector. See term vectors
Verity 26
visualization
with XM-InformationMinder 346

 
W

Walls, Craig 361
web application
CSS highlighting 301
demo 390
JavaScript 290
LIMO 279
Michaels.com 367
TheServerSide example 383
web crawler 7
alternatives 330
See also crawler
Webglimpse 26
WebStart, Lucene Index Toolbox 272
weighting, n-grams 360
WhitespaceAnalyzer 119
example 104
WildcardQuery 90
from QueryParser 91
performance issue 213
prohibiting 204
Witten, Ian H. 26
WordNet 292–300
WordNetSynonymEngine 297

 
X

Xapian 25
Omega 25
xargs 17
Xerces 227–230
Xerces Native Interface (XNI) 245
XM-InformationMinder 344–350
XML
configuration 380
encoding 140
parsing 107
search results 343
Xpdf 264
XSL
transforming search results 343

 
Y

Yahoo! 6

 
Z

Zilverline 7