Tika in Action
Chris A. Mattmann and Jukka L. Zitting
  • December 2011
  • ISBN 9781935182856
  • 256 pages
  • printed in black & white

By Tika's two main creators and maintainers.

Jérôme Charron, WebPulse

Tika in Action is a hands-on guide to content mining with Apache Tika. The book's many examples and case studies draw on real-world experience in domains ranging from search engines to digital asset management and scientific data processing.

Table of Contents




about this book

about the authors

about the cover illustration

Part 1 Getting started

1. Chapter 1 The case for the digital Babel

1.1. Understanding digital documents

1.2. What is Apache Tika?

1.3. Summary

2. Chapter 2 Getting started with Tika

2.1. Working with Tika source code

2.2. The Tika application

2.3. Tika as an embedded library

2.4. Summary

3. Chapter 3 The information landscape

3.1. Measuring information overload

3.2. I’m feeling lucky—searching the information landscape

3.3. Beyond lucky: machine learning

3.4. Summary

Part 2 Tika in detail

4. Chapter 4 Document type detection

4.1. Internet media types

4.2. Media types in Tika

4.3. File format diagnostics

4.4. Tika, the type inspector

4.5. Summary

5. Chapter 5 Content extraction

5.1. Full-text extraction

5.2. The Parser interface

5.3. Document input stream

5.4. Structured XHTML output

5.5. Context-sensitive parsing

5.6. Summary

6. Chapter 6 Understanding metadata

6.1. The standards of metadata

6.2. Metadata quality

6.3. Metadata in Tika

6.4. Practical uses of metadata

6.5. Summary

7. Chapter 7 Language detection

7.1. The most translated document in the world

7.2. Sounds Greek to me—theory of language detection

7.3. Language detection in Tika

7.4. Summary

8. Chapter 8 What’s in a file?

8.1. Types of content

8.2. How Tika extracts content

8.3. Summary

Part 3 Integration and advanced use

9. Chapter 9 The big picture

9.2. Managing and mining information

9.3. Buzzword compliance

9.4. Summary

10. Chapter 10 Tika and the Lucene search stack

10.1. Load-bearing walls

10.2. The steel frame

10.3. The finishing touches

10.4. Summary

11. Chapter 11 Extending Tika

11.1. Adding type information

11.2. Custom type detection

11.3. Customized parsing

11.4. Summary

Part 4 Case studies

12. Chapter 12 Powering NASA science data systems

12.1. NASA’s Planetary Data System

12.2. NASA’s Earth Science Enterprise

12.3. Summary

13. Chapter 13 Content management with Apache Jackrabbit

13.1. Introducing Apache Jackrabbit

13.2. The text extraction pool

13.3. Content-aware WebDAV

13.4. Summary

14. Chapter 14 Curating cancer research data with Tika

14.1. The NCI Early Detection Research Network

14.2. Integrating Tika

14.3. Summary

15. Chapter 15 The classic search engine example

15.1. The Public Terabyte Dataset Project

15.2. The Bixo web crawler

15.3. Summary

Appendix A: Tika quick reference

Appendix B: Supported metadata keys


© 2014 Manning Publications Co.

About the Technology

Tika is an Apache toolkit with built-in knowledge of the file formats you and your applications need to handle. Using Tika, your applications can discover and extract content from digital documents in almost any format, including exotic ones.

About the book

Tika in Action is the ultimate guide to content mining using Apache Tika. You'll learn how to pull usable information from otherwise inaccessible sources, including internet media and file archives. This example-rich book teaches you to build and extend applications based on real-world experience with search engines, digital asset management, and scientific data processing. In addition to architectural overviews, you'll find detailed chapters on features like metadata extraction, automatic language detection, and custom parser development.

What's inside

  • Crack MS Word, PDF, HTML, and ZIP
  • Integrate with search engines, CMS, and other data sources
  • Learn through experimentation
  • Many examples

About the reader

This book requires no previous knowledge of Tika or text mining techniques. It assumes a working knowledge of Java.

About the authors

Chris Mattmann is an information architect experienced in the construction of large data-intensive systems. Jukka Zitting is a core Tika developer, a member of the JCR expert group, and chairman of the Apache Jackrabbit project.

combo $44.99 pBook + eBook
eBook $35.99 pdf + ePub + kindle


Easily the most definitive guide to this great new text analysis toolkit.

John Guthrie, SAP

An easy-to-read guide: plenty of technical content.

Rick Wagner, Red Hat

There's not a single page of 'inaction' in the entire book!

Sean Kelly, Technologist, NASA

Complete, practical, accurate

Julien Nioche, DigitalPebble Ltd