Contents
foreword
preface
acknowledgments
about this book
about the authors
about the cover illustration
Part 1 Getting started
- Chapter 1 The case for the digital Babel
  - Understanding digital documents
  - What is Apache Tika?
  - Summary
- Chapter 2 Getting started with Tika
  - Working with Tika source code
  - The Tika application
  - Tika as an embedded library
  - Summary
- Chapter 3 The information landscape
  - Measuring information overload
  - I’m feeling lucky—searching the information landscape
  - Beyond lucky: machine learning
  - Summary
Part 2 Tika in detail
- Chapter 4 Document type detection
  - Internet media types
  - Media types in Tika
  - File format diagnostics
  - Tika, the type inspector
  - Summary
- Chapter 5 Content extraction
  - Full-text extraction
  - The Parser interface
  - Document input stream
  - Structured XHTML output
  - Context-sensitive parsing
  - Summary
- Chapter 6 Understanding metadata
  - The standards of metadata
  - Metadata quality
  - Metadata in Tika
  - Practical uses of metadata
  - Summary
- Chapter 7 Language detection
  - The most translated document in the world
  - Sounds Greek to me—theory of language detection
  - Language detection in Tika
  - Summary
- Chapter 8 What’s in a file?
  - Types of content
  - How Tika extracts content
  - Summary
Part 3 Integration and advanced use
- Chapter 9 The big picture
  - Tika in search
  - Managing and mining information
  - Buzzword compliance
  - Summary
- Chapter 10 Tika and the Lucene search stack
  - Load-bearing walls
  - The steel frame
  - The finishing touches
  - Summary
- Chapter 11 Extending Tika
  - Adding type information
  - Custom type detection
  - Customized parsing
  - Summary
Part 4 Case studies
- Chapter 12 Powering NASA science data systems
  - NASA’s Planetary Data System
  - NASA’s Earth Science Enterprise
  - Summary
- Chapter 13 Content management with Apache Jackrabbit
  - Introducing Apache Jackrabbit
  - The text extraction pool
  - Content-aware WebDAV
  - Summary
- Chapter 14 Curating cancer research data with Tika
  - The NCI Early Detection Research Network
  - Integrating Tika
  - Summary
- Chapter 15 The classic search engine example
  - The Public Terabyte Dataset Project
  - The Bixo web crawler
  - Summary
 
appendix A: Tika quick reference
appendix B: Supported metadata keys
index