Index
A
abbreviation, handling 355
accuracy 360
Ackley, Ryan 250
Adobe Systems 235
agent, distributed 349
AliasAnalyzer 364
Alias-i 361
Almaer, Dion 371
alternative spellings 354
analysis 103
during indexing 105
field-specific 108
foreign languages 140
in Nutch 145
position gaps 136
positional gap issues 138
versus parsing 107
with QueryParser 106
Analyzers 19
additional 282
Brazilian 282
buffering 130
building blocks 110
built-in 104, 119
Chinese 282
choosing 103
CJK 282
Dutch 282
field types 105
for highlighting 300
French 282
injecting synonyms 129, 296
SimpleAnalyzer 108
Snowball 283
StandardAnalyzer 120
StopAnalyzer 119
subword 357
using WordNet 296
visualizing 112
WhitespaceAnalyzer 104
with QueryParser 72
Ant
building Lucene 391
building Sandbox 310
indexing a fileset 284
Antiword 264
ANTLR 100, 336
Apache Jakarta 7, 9
Apache Software Foundation 9
Apache Software License 7
Arabic 359
architecture
field design 374
TheServerSide configuration 379
ASCII 142
Asian language analysis 142
B
Bakhtiar, Amir 320
Beagle 318
Bell, Timothy C. 26
Berkeley DB, storing index 307
Bialecki, Andrzej 271
biomedical, use of Lucene 352
BooleanQuery 85
from QueryParser 72, 87
n-gram extension 358
TooManyClauses exception 215
used with PhraseQuery 158
boosting 79
documents 377
documents and fields 38–39
BrazilianAnalyzer 282
C
C++ 10
CachingWrapperFilter 171, 177
caching DateFilter 173
Cafarella, Michael 326
Carpenter, Bob 351
cell phone, T9 WordNet interface 297
ChainedFilter 177, 304
Chandler 307, 322
charades 125
Chinese analysis 142–143, 282
CJK (Chinese Japanese Korean) 142
CJKAnalyzer 143, 145, 282
Clark, Andy 245
Clark, Mike 214
CLucene 314, 317
supported platforms 314
Unicode support 316
color
distance formula 366
indexing 365
command-line interface 269
compound index
creating 400
format 341
converting native files to ASCII 142
coordination, query term 79
Cozens, Simon 318
CPAN 318
crawler 372
in SearchBlox 342
with XM-InformationMinder 347
crawling alternatives 330
CSS in highlighting 301
Cutting, Doug 9
relevant work 9
CVS
obtaining Lucene's source code 391
Sandbox 268
CyberNeko. See NekoHTML
CzechAnalyzer 282
D
database 8
indexing 362
primary key 362
searching 362
storing index inside Berkeley DB 307
date, indexing 216
DateField 39
alternatives 218
issue 216
min and max constants 173
range queries 96
used with DateFilter 173
DateFilter 171–173
caching 177
open-ended ranges 172
with caching 177
within ChainedFilter 306
DbDirectory 308
debugging, queries 94
DefaultSimilarity 79
deleting documents 375
Digester
configuration 379
Directory 19
FSDirectory 19
RAMDirectory 19
directory in Berkeley DB 308
DMOZ 27
DNA 354
Docco 265
DocSearcher 264
Document 20, 71
copy/paste from Luke 274
editing with Luke 275
heterogeneous fields 33
document boosting 377
document frequency
seen with Luke 273
document handler
customizing for Ant 286
indexing with Ant 285
document type handling
in SearchBlox 342
documentation 388
dotLucene 317–318
downloading Lucene 388
Dutch 354
DutchAnalyzer 282
E
Egothor 24
encoding
ISO-8859-1 142
UTF-8 140
Etymon PJ 264
Explanation 80
F
Field 20–22
appending to 33
keyword, analysis 121
storing term vectors 185
file handle
issue 340
Filter 76
caching 177
ChainedFilter 304
custom 209
using HitCollector 203
within a Query 212
FilteredQuery 178, 212
filtering
search space 171–178
token. See TokenFilter
foreign language analysis 140
Formatter 300
Fragmenter 300
FrenchAnalyzer 282
fuzzy string similarity 351
FuzzyEnum 350
FuzzyQuery 92
from QueryParser 93
issues 350
performance issue 213
prohibiting 204
G
GCJ 308
German analysis 141
Giustina, Fabrizio 242
Glimpse 26
GNOME 318
Google 6, 27
alternative word suggestions 128
analysis 103
API 352
definitions 292
expense 372
term highlighting 300
government intelligence, use of Lucene 352
H
Harvest 26
Harvest-NG 26
Harwood, Mark 300
highlighting, query terms 300–303, 343
Hindi 354
HitCollector 76, 201–203
customizing 350
priority-queue idea 360
used by Filters 203
Hits 24, 70–71, 76
highlighting 303
ht://Dig 26
TheServerSide usage 371
HTML 8
cookie 77
highlighting 301
<meta> tag 140
parsing 107, 329, 352
HTMLParser 264
HTTP
crawler. See Nutch
session 77
HTTP request
content-type 140
I
I18N. See internationalization
index optimization 56–59
disk space requirements 56
performance effect 56
when to do it 58
why do it 57
index structure
converting 400–401
performance comparison 402
IndexFiles 389
IndexHTML 390
indexing
adding documents 31–33
analysis during 105
Ant task 285
at TheServerSide 373
browsing tool 271
buffering 42
colors 365
compound format 341
compound index 399–400
concurrency rules 59–60
creation of 12
data structures 11
dates 39–40, 216
debugging 66
directory structure 395
disabling locking 66
file format 404
file view with Luke 277
.fnm file 405
for sorting 41
format 393
framework 225–226, 254–263
HTML 241, 248
incremental 396
index files 397
jGuru design 332
limiting field length 54–55
locking 62–66
logical view 394
maxFieldLength 54–55
maxMergeDocs 42–47
mergeFactor 42–47
merging indexes 52
Microsoft Word documents 248–251
minMergeDocs 42, 47
multifile index structure 395
numbers 40–41
open files 47–48
parallelization 52–54
PDF 235–241
performance 42–47
plain-text documents 253–254
removing documents 33–36
rich-text documents 224
RTF documents 252–253
scheduling 367
segments 396–397
status with LIMO 279
steps 29–31
storing in Berkeley DB 307
term dictionary 406
term frequency 406
term positions 406
thread-safety 60–62
tools 269
undeleting documents 36
updating documents 36
batching 37
using RAMDirectory 48–52
XML 226–235
IndexReader 199
deleting documents 375
retrieving term vectors 186
IndexSearcher 23, 70, 78
n-gram extension 358
paging through results 77
using 75
IndexWriter 19
addDocument 106
analyzer 123
information overload 6
Information Retrieval (IR) 7
libraries 24–26
Installing Lucene 387–392
intelligent agent 6
internationalization 141
inverse document frequency 79
inverted index 404
IR. See Information Retrieval (IR)
ISO-8859-1 142
J
Jakarta Commons Digester 230–235
Jakarta POI 249–250
Japanese analysis 142
Java Messaging Service 352
in XM-InformationMinder 347
Java, keyword 331
JavaCC 100
building Lucene 392
JavaScript
character escaping 292
query construction 291
query validation 291
JDOM 264
jGuru 341
JGuruMultiSearcher 339
Jones, Tim 150
JPedal 264
jSearch 7
JTidy 242–245
indexing HTML with Ant 285
JUnitPerf 213
JWordNet 297
K
keyword analyzer 124
Konrad, Karsten 344
Korean analysis 142
L
language
handling 354
support 343
LARM 7, 372
Levenshtein distance algorithm 92
lexicon, definition 331
LIMO 279
LingPipe 353
linguistics 353
Litchfield, Ben 236
Lookout 6, 318
Lucene
building from source 391
community 10
demonstration applications 389–391
developers 10
documentation 388
downloading 388
history of 9
index 11
integration of 8
ports 10
sample application 11
Sandbox 268
understanding 6
users of 10
what it is 7
Lucene ports 312–324
summary 313
Lucene Wiki 7
Lucene.Net 6
lucli 269
Luke 271, 391
plug-ins 278
Lupy 308, 320–322
M
Managing Gigabytes 26
Matalon, Dror 269
Metaphone 125
MG4J 26
Michaels.com 361–371
Microsoft 6, 318
Microsoft Index Server 26
Microsoft Outlook 6, 318
Microsoft Windows 14
Microsoft Word 8
parsing 107
Miller, George 292
and WordNet 292
misspellings 354
matching 363
mock object 131, 211
Moffat, Alistair 26
morphological variation 355
Movable Type 320
MSN 6
MultiFieldQueryParser 160
multifile index, creating 398
multiple indexes 331
MultiSearcher 178–185
alternative 339
multithreaded searching. See ParallelMultiSearcher
Multivalent 264
N
Namazu 26
native2ascii 142
natural language with XM-InformationMinder 345
NekoHTML 245–248, 329, 352
.NET 10
n-gram TokenStream 357
NGramQuery 358
NGramSearcher 358
Nioche, Julien 279
noisy-channel model 355
normalization
field length 79
query 79
numeric
padding 206
range queries 205
Nutch 7, 9, 329
Explanation 81
O
OLE 2 Compound Document format 249
open files formula 401
OpenOffice SDK 264
optimize 340
orthographic variation 354
Overture 6
P
paging
at jGuru 336
TheServerSide search results 383
through Hits 77
ParallelMultiSearcher 180
Parr, Terence 329
ParseException 204, 379
parsing 73
query expressions. See QueryParser
QueryParser method 73
stripping plurals 334
versus analysis 107
partitioning indexes 180
PDF 8
See also indexing PDF
PDF Text Stream 264
PDFBox 236–241
built-in Lucene support 239
PerFieldAnalyzerWrapper
for Keyword fields 123
performance
issues with WildcardQuery 91
iterating Hits warning 369
load testing 217
of sorting 157
SearchBlox case study 341
statistics 370
testing 213, 220
Perl 10
pharmaceutical, uses of Lucene 347
PhrasePrefixQuery 157–159
handling synonyms alternative 134
PhraseQuery 87
compared to PhrasePrefixQuery 158
forcing term order 208
from QueryParser 90
in contrast to SpanNearQuery 166
multiple terms 89
position increment issue 138
scoring 90
slop factor 139
with synonyms 132
Piccolo 264
Plucene 318–320
POI 264
Porter stemming algorithm 136
Porter, Dr. Martin 25, 136, 283
position, increment offset in SpanQuery 161
precision 11, 360
PrefixQuery 84
from QueryParser 85
optimized WildcardQuery 92
Properties file, encoding 142
PyLucene 308, 322–323
Python 10
Q
Query 23, 70, 72
creating programmatically 81
preprocessing at jGuru 335
starts with 84
statistics 337
toString 94