Relevant Search
With applications for Solr and Elasticsearch
Doug Turnbull and John Berryman
Foreword by Trey Grainger
  • June 2016
  • ISBN 9781617292774
  • 360 pages
  • printed in black & white

One of the best and most engaging technical books I’ve ever read.

From the Foreword by Trey Grainger, Author of "Solr in Action"

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines.

Table of Contents

1. The search relevance problem

1.1. Your Goal: Gaining The Skills of A Relevance Engineer

1.2. Why is Search Relevance So Hard?

1.2.1. What's a "relevant" search result?

1.2.2. Search: There's No Silver Bullet!

1.3. Gaining insight from relevance research

1.3.1. Information retrieval

1.3.2. Can we use Information Retrieval to solve relevance?

1.4. How do you solve relevance?

1.5. More than technology: curation, collaboration, & feedback

1.6. Summary

2. Search — Under The Hood

2.1. Search 101

2.1.1. What's a Search Document?

2.1.2. Searching the content

2.1.4. Getting content into the search engine

2.2. Search Engine Data Structures

2.2.1. The Inverted Index

2.2.2. Other Pieces of the Inverted Index

2.3. Indexing Content: Extraction, Enrichment, Analysis, and Indexing

2.3.1. Extracting Content Into Documents

2.3.2. Enriching Documents to Clean, Augment, and Merge Data

2.3.3. Performing Analysis

2.3.4. Indexing

2.4. Document Search and Retrieval

2.4.1. Boolean Matching: AND/OR/NOT

2.4.2. Boolean Queries in Lucene-Based Search (MUST/MUST_NOT/SHOULD)

2.4.3. Positional and Phrase Matching

2.4.4. Enabling Exploration: Filtering, Facets, and Aggregations

2.4.5. Sorting, Ranked Results, and Relevance

2.5. Summary

3. Debugging your first relevance problem

3.1. Applications to Solr & Elasticsearch: Examples in Elasticsearch

3.2. Our Most Prominent Data Set: TMDB

3.3. Examples Programmed in Python

3.4. Our First Search Application

3.4.1. Our first searches of the TMDB Elasticsearch Index

3.5. Debugging Query Matching

3.5.1. Examining The Underlying Query Strategy

3.5.2. Taking Apart Query Parsing

3.5.3. Debugging Analysis To Solve Matching Issues

3.5.4. Our Query Vs The Inverted Index

3.5.5. Fixing Our Matching By Changing Analyzers

3.6. Debugging Ranking

3.6.1. Decomposing Relevance Score With Lucene's Explain

3.6.2. The Vector-Space Model, The Relevance Explain, and You!

3.6.3. Practical Caveats to the vector space model

3.6.4. Scoring matches to measure relevance

3.6.5. Computing Weights with TF*IDF

3.6.6. Lies, Damned Lies, and Similarity

3.6.7. Factoring in the Search Term’s Importance

3.6.8. Fixing Space Jam vs Alien Ranking

3.7. Solved? Our Work is Never Over!

3.8. Summary

4. Taming Tokens

4.1. Tokens as Document Features

4.1.1. The Matching Process

4.1.2. Tokens, More Than Just Words

4.2. Controlling Precision and Recall

4.2.1. Precision and Recall by Example

4.2.2. Analysis for Precision or Recall

4.2.3. Taking Recall to Extremes

4.3. Precision AND Recall — Have Your Cake and Eat it Too

4.3.1. Scoring strength of a feature in a single field

4.3.2. Scoring beyond TF*IDF: multiple search terms and multiple fields

4.4. Analysis Strategies

4.4.1. Dealing with Delimiters

4.4.2. Capturing Meaning with Synonyms

4.4.4. Modeling Specificity with Synonyms

4.4.5. Modeling Specificity with Paths

4.4.6. Tokenize the World!

4.4.7. Tokenizing Integers

4.4.8. Tokenizing Geographic Data

4.4.9. Tokenizing Melodies

4.5. Summary

5. Basic Multifield Search

5.1. Signals and Signal Modeling

5.1.1. What is a Signal?

5.1.2. Starting With The Source Data Model

5.1.3. Implementing a Signal

5.1.4. Programming Relevance via Data Modeling

5.2. TMDB — Search, The Final Frontier!

5.2.1. Violating The Prime Directive

5.2.2. Flattening Nested Docs

5.3.1. Starting Out With Best Fields

5.3.2. Controlling Field Preference In Search Results

5.3.3. Better Best Fields With More Precise Signals?

5.3.4. Letting Losers Share The Glory: Calibrating Best Fields

5.3.5. Counting Multiple Signals using Most Fields

5.3.6. Boosting in Most-Fields

5.3.7. When Additional Matches Don't Matter

5.3.8. Does Most Fields Count The Right Signals?

5.4. Summary

6. Term-Centric Search

6.2.1. Hunting for Albino Elephants

6.2.2. Albino Elephant in Star Trek Example

6.2.3. Signal Discordance

6.2.4. The Mechanics of Signal Discordance

6.3. Your First Term-Centric Searches

6.3.1. The Term-Centric Ranking Function

6.3.2. Running a Term-Centric Query Parser (Into The Ground)

6.3.3. Understanding Field Synchronicity

6.3.4. Field Synchronicity and Signal Modeling

6.3.5. Query Parsers and Signal Discordance

6.4.1. Combining Fields into Custom All Fields

6.4.2. Solving Signal Discordance With Cross Fields

6.5. Combining Field-Centric and Term-Centric Strategies: Having Your Cake and Eating It Too

6.5.1. Grouping "Like Fields" Together

6.5.2. Limits of Like Fields

6.5.3. Combining Greedy Naïve Search and Conservative Amplifiers

6.5.4. Term-Centric vs Field-Centric and Precision vs Recall

6.5.5. Considering Filtering, Boosting, and Reranking

6.6. Summary

7. Shaping the Relevance Function

7.1. What Do We Mean By Score Shaping?

7.2. Boosting: Shaping by Promoting Results

7.2.1. Boosting: The Final Frontier

7.2.2. When Boosting — Add or Multiply? Boolean or Function Query?

7.2.3. You Chose Door A: Additive Boosting with Boolean Queries

7.2.4. You Chose Door B: Introducing Function Queries: Ranking with Math

7.2.5. Hands on with Function Queries: Simple Multiplicative Boosting

7.2.6. Boosting basics: Signals, Signals Everywhere

7.3. Filtering: Shaping by Excluding Results

7.4. Score Shaping Strategies For Satisfying Business Needs

7.4.1. Search ALL THE MOVIES!

7.4.2. Modeling Your Boosting Signals

7.4.3. Building the Ranking Function: Adding High Value Tiers

7.4.4. High Value Tier Scored with A Function Query

7.4.5. Ignoring TF*IDF

7.4.6. Capturing General-Quality Metrics

7.4.7. Achieving Users' Recency Goals

7.4.8. Combining The Function Queries

7.4.9. Putting It All Together!

7.5. Summary

8. Providing relevance feedback

8.1. Relevance Feedback at the Search Box

8.1.1. Immediate Results with Search-as-You-Type

8.1.2. Help Users Find the Best Query with Search Completion

8.1.3. Correcting Typos and Misspellings with Search Suggest

8.2. Relevance Feedback while Browsing

8.2.1. Building Faceted Browsing

8.2.2. Breadcrumb Navigation

8.2.3. Selecting Alternative Result Ordering

8.3. Relevance Feedback in the Search Results Listing

8.3.1. What Information Should be Presented in Listing Items?

8.3.2. Relevance Feedback through Snippets and Highlighting

8.3.3. Grouping together similar documents

8.3.4. Helping the User When There are no Results

8.4. Summary

9. Designing a Relevance-Focused Search Application

9.1. Yowl! The Awesome New Startup!

9.2. Gather Information and Requirements

9.2.1. Understand Users and Their Information Needs

9.2.2. Understand Business Needs

9.2.3. Identifying Required and Available Information

9.3. Design the Search Application

9.3.1. Visualize the User's Experience

9.3.2. Define and Model Signals

9.3.3. Combine and Balance Signals

9.4. Deploy, Monitor, Improve

9.4.1. Monitor

9.4.2. Identify Problems and Fix them!

9.5. Knowing When Good is Good Enough

9.6. Summary

10. The Relevance Centered Enterprise

10.1. Feedback: the bedrock of the relevance centered enterprise

10.2. Why user-focused culture before data-driven culture?

10.3. Flying relevance blind

10.4. Relevance Feedback Awakenings: Domain Experts and Expert Users

10.5. Relevance Feedback Maturing: Content Curation

10.5.1. The Role Of The Content Curator

10.5.2. The Risk Of Miscommunication With The Content Curator

10.6. Relevance Streamlined: Engineer/Curator Pairing

10.7. Relevance Accelerated: Test-Driven Relevance

10.7.1. Understanding Test-Driven Relevance

10.7.2. Using Test-Driven Relevance with User Behavioral Data

10.8. Beyond Test-Driven Relevance: Learning to Rank

10.9. Summary

11. Semantic And Personalized Search

11.1. Personalizing search based upon user profiles

11.1.1. Gathering user profile information

11.1.2. Tying profile information back to the search index

11.2. Personalizing search based upon user behavior

11.2.1. Introducing Collaborative Filtering

11.2.2. Basic collaborative filtering using co-occurrence counting

11.2.3. Tying user behavior information back to the search index

11.3.1. Building concept signals

11.3.2. Augmenting content with synonyms

11.4. Building concept search using machine learning

11.5. The personalized search — conceptual search connection

11.6.1. Replacing Search with Recommendation

11.7. Best wishes on your search relevance journey

11.8. Summary

Appendixes

Appendix A: Indexing directly from TMDB

A.1. Set TMDB Key & Load IPython Notebook

A.2. Setting up for the TMDB API

A.3. Crawling the TMDB API

A.4. Indexing TMDB Movies to Elasticsearch

Appendix B: Solr Reader's Companion

B.1. Chapter 4: Taming Solr's Terms

B.1.1. Summary of Solr Analysis and Mappings Features

B.1.2. Building Custom Analyzers in Solr

B.1.3. Field Mappings in Solr

B.2. Chapters 5 and 6: Multifield Search in Solr

B.2.1. Summary of Query Feature Mappings

B.2.2. Understanding Query Differences Between Solr and Elasticsearch

B.2.3. Querying Solr: The Ergonomics

B.2.4. Term-Centric and Field-Centric Search with the edismax Query Parser

B.3. Chapter 7: Shaping Solr’s Relevance Function

B.3.1. Summary of Boosting Feature Mappings

B.3.2. Solr's Boolean Boosting

B.3.3. Solr's Function Queries

B.3.4. Multiplicative Boosting in Solr

B.4. Chapter 8: Relevance Feedback

B.4.1. Summary of Relevance Feedback Feature Mappings

B.4.2. Solr Autocomplete: Match Phrase Prefix

B.4.3. Faceted Browsing in Solr (aka "Solr Facets", not "Elasticsearch Aggregations")

B.4.4. Field Collapsing

B.4.5. Suggest and Highlight Components

About the Technology

Users are accustomed to instant, relevant search results and expect the same from your application. Delivering that experience means mastering the search engine, yet for many developers relevance ranking remains mysterious or confusing.

About the book

Relevant Search demystifies the subject and shows you that a search engine is a programmable relevance framework. You'll learn how to apply Elasticsearch or Solr to your business's unique ranking problems. The book demonstrates how to program relevance and how to incorporate secondary data sources, taxonomies, text analytics, and personalization. In practice, a relevance framework requires softer skills as well, such as collaborating with stakeholders to discover the right relevance requirements for your business. By the end, you’ll be able to achieve a virtuous cycle of provable, measurable relevance improvements over a search product’s lifetime.
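To make "programming relevance" concrete, here is a minimal sketch of the kind of experiment the book walks you through, in its Python-plus-Elasticsearch style. It is not the book's own code: the index name (tmdb), the field names, and the title boost are illustrative assumptions, and it uses the official elasticsearch Python client against a local node.

    # A minimal sketch of programming relevance via the Elasticsearch
    # query DSL. The index name ("tmdb"), field names, and boost value
    # are illustrative assumptions, not the book's exact code.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # multi_match queries several fields at once; "title^10" multiplies
    # the score of title matches by 10, a basic relevance lever.
    query = {
        "multi_match": {
            "query": "basketball with cartoon aliens",
            "fields": ["title^10", "overview"],
        }
    }

    resp = es.search(index="tmdb", query=query, size=5)
    for hit in resp["hits"]["hits"]:
        print(f'{hit["_score"]:7.2f}  {hit["_source"]["title"]}')

Raising or lowering that boost and rerunning the query is the simplest form of the measure-and-tune loop the book teaches.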

What's inside

  • Techniques for debugging relevance

  • Applying search engine features to real problems

  • Using the user interface to guide searchers

  • A systematic approach to relevance

  • A business culture focused on improving search

About the reader

For developers trying to build smarter search with Elasticsearch or Solr.

About the authors

Doug Turnbull is lead relevance consultant at OpenSource Connections, where he frequently speaks and blogs. John Berryman is a data engineer at Eventbrite, where he specializes in recommendations and search.


Buy
  • combo $44.99 pBook + eBook
  • eBook $35.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks

Will help you solve real-world search relevance problems for Lucene-based search engines.

Dimitrios Kouzis-Loukas, Bloomberg L.P.

An inspiring book revealing the essence and mechanics of relevant search.

Ursin Stauss, Swiss Post

Arms you with invaluable knowledge to temper the relevancy of search results and harness the powerful features provided by modern search engines.

Russ Cam, Elastic