Contents


foreword
preface
acknowledgments
about this book
about the authors
about the cover illustration

Part 1 Getting started

Chapter 1 The case for the digital Babel
Understanding digital documents
What is Apache Tika?
Summary
Chapter 2 Getting started with Tika
Working with Tika source code
The Tika application
Tika as an embedded library
Summary
Chapter 3 The information landscape
Measuring information overload
I’m feeling lucky—searching the information landscape
Beyond lucky: machine learning
Summary

Part 2 Tika in detail

Chapter 4 Document type detection
Internet media types
Media types in Tika
File format diagnostics
Tika, the type inspector
Summary
Chapter 5 Content extraction
Full-text extraction
The Parser interface
Document input stream
Structured XHTML output
Context-sensitive parsing
Summary
Chapter 6 Understanding metadata
The standards of metadata
Metadata quality
Metadata in Tika
Practical uses of metadata
Summary
Chapter 7 Language detection
The most translated document in the world
Sounds Greek to me—theory of language detection
Language detection in Tika
Summary
Chapter 8 What’s in a file?
Types of content
How Tika extracts content
Summary

Part 3 Integration and advanced use

Chapter 9 The big picture
Tika in search
Managing and mining information
Buzzword compliance
Summary
Chapter 10 Tika and the Lucene search stack
Load-bearing walls
The steel frame
The finishing touches
Summary
Chapter 11 Extending Tika
Adding type information
Custom type detection
Customized parsing
Summary

Part 4 Case studies

Chapter 12 Powering NASA science data systems
NASA’s Planetary Data System
NASA’s Earth Science Enterprise
Summary
Chapter 13 Content management with Apache Jackrabbit
Introducing Apache Jackrabbit
The text extraction pool
Content-aware WebDAV
Summary
Chapter 14 Curating cancer research data with Tika
The NCI Early Detection Research Network
Integrating Tika
Summary
Chapter 15 The classic search engine example
The Public Terabyte Dataset Project
The Bixo web crawler
Summary

appendix A: Tika quick reference
appendix B: Supported metadata keys
index