By Tika's two main creators and maintainers.
Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies offer real-world experience from domains ranging from search engines to digital asset management and scientific data processing.
about this book
about the authors
about the cover illustration
Part 1 Getting started
1. Chapter 1 The case for the digital Babel
1.1. Understanding digital documents
1.2. What is Apache Tika?
2. Chapter 2 Getting started with Tika
2.1. Working with Tika source code
2.2. The Tika application
2.3. Tika as an embedded library
3. Chapter 3 The information landscape
3.1. Measuring information overload
3.2. I’m feeling lucky—searching the information landscape
3.3. Beyond lucky: machine learning
Part 2 Tika in detail
4. Chapter 4 Document type detection
4.1. Internet media types
4.2. Media types in Tika
4.3. File format diagnostics
4.4. Tika, the type inspector
5. Chapter 5 Content extraction
5.1. Full-text extraction
5.2. The Parser interface
5.3. Document input stream
5.4. Structured XHTML output
5.5. Context-sensitive parsing
6. Chapter 6 Understanding metadata
6.1. The standards of metadata
6.2. Metadata quality
6.3. Metadata in Tika
6.4. Practical uses of metadata
7. Chapter 7 Language detection
7.1. The most translated document in the world
7.2. Sounds Greek to me—theory of language detection
7.3. Language detection in Tika
8. Chapter 8 What’s in a file?
8.1. Types of content
8.2. How Tika extracts content
Part 3 Integration and advanced use
9. Chapter 9 The big picture
9.1. Tika in search
9.2. Managing and mining information
9.3. Buzzword compliance
10. Chapter 10 Tika and the Lucene search stack
10.1. Load-bearing walls
10.2. The steel frame
10.3. The finishing touches
11. Chapter 11 Extending Tika
11.1. Adding type information
11.2. Custom type detection
11.3. Customized parsing
Part 4 Case studies
12. Chapter 12 Powering NASA science data systems
12.1. NASA’s Planetary Data System
12.2. NASA’s Earth Science Enterprise
13. Chapter 13 Content management with Apache Jackrabbit
13.1. Introducing Apache Jackrabbit
13.2. The text extraction pool
13.3. Content-aware WebDAV
14. Chapter 14 Curating cancer research data with Tika
14.1. The NCI Early Detection Research Network
14.2. Integrating Tika
15. Chapter 15 The classic search engine example
15.1. The Public Terabyte Dataset Project
15.2. The Bixo web crawler
Appendix A: Tika quick reference
Appendix B: Supported metadata keys
© 2014 Manning Publications Co.
About the Technology
Tika is an Apache toolkit with everything you and your app need to know about file formats built in. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.
About the book
Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.
- Crack MS Word, PDF, HTML, and ZIP
- Integrate with search engines, CMS, and other data sources
- Learn through experimentation
- Many examples
About the reader
This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.
Easily the most definitive guide to this great new text analysis toolkit.
An easy-to-read guide--plenty of technical content.
There's not a single page of 'inaction' in the entire book!
Complete, practical, accurate.