Overview

2 Generating synonyms

Synonyms are a practical way to boost recall in search by allowing different phrasings of the same intent to match the same documents. The chapter shows how synonym expansion can be applied at indexing or query time, and explains the trade-offs: expanding at index time increases index size but speeds up queries, while expanding at query time keeps the index lean but adds runtime cost. It first walks through a vocabulary-based approach using Lucene analyzers and token filters, then improves maintainability by loading synonyms from files and large resources like WordNet. However, static dictionaries are limited: they may not fit domain data, lag behind slang, acronyms, or evolving usage, and ignore context.
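The vocabulary-based approach typically starts from a plain synonyms file fed to the analyzer. A minimal sketch in the Solr/Lucene synonyms file format (the entries here are illustrative, not taken from the book's dataset):

```text
# comma-separated entries are treated as equivalent terms
plane, aeroplane, aircraft
# '=>' maps the left-hand terms to the replacements on the right
i-pod, i pod => ipod
```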

To overcome these limits, the chapter introduces learning synonyms from data via word2vec, grounded in the distributional hypothesis that words in similar contexts tend to share meaning. After a primer on feed-forward neural networks and backpropagation, it details the two word2vec architectures—CBOW and Skip-gram—and how they produce word embeddings whose proximity (e.g., cosine similarity) reveals semantic relatedness. The chapter demonstrates that higher-dimensional embeddings trained on larger, domain-relevant corpora yield much better neighbors, and shows a practical setup with Deeplearning4j to train vectors from song lyrics and query nearest words to act as synonyms.
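Proximity between embeddings is usually measured with cosine similarity. A minimal, self-contained Java sketch; the toy 3-dimensional vectors are made up for illustration (real word2vec embeddings have 100+ dimensions):

```java
public class CosineDemo {
    // cosine(a, b) = dot(a, b) / (|a| * |b|); 1.0 = same direction, 0.0 = orthogonal
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors: "plane" and "aeroplane" point in similar directions
        double[] plane = {0.9, 0.1, 0.2};
        double[] aeroplane = {0.85, 0.15, 0.25};
        double[] guitar = {0.1, 0.9, 0.05};
        System.out.printf("plane ~ aeroplane: %.3f%n", cosine(plane, aeroplane));
        System.out.printf("plane ~ guitar:    %.3f%n", cosine(plane, guitar));
    }
}
```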

Finally, it integrates learned synonyms into a Lucene pipeline by building a token filter that augments tokens with top, high-confidence neighbors from word2vec, while controlling index growth and noise via thresholds, part-of-speech constraints, document/term informativeness, and limits on the number of expansions. It underscores evaluation through precision, recall, zero-result rates, and model selection via cross-validation and parameter tuning (dimensions, window size, model choice). For production, it recommends retraining embeddings directly from the indexed corpus—using a custom SentenceIterator over stored field values—so synonym generation stays current as data evolves, and revisits the operational choice between index-time and query-time expansion based on performance needs.
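The threshold and top-k controls can be sketched independently of Lucene. In this illustrative Java snippet the neighbor scores are hypothetical stand-ins for word2vec's nearest-neighbor output:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SynonymSelector {
    // Keep at most maxExpansions neighbors whose similarity meets minSimilarity,
    // highest-scoring first. In the real pipeline these scores come from
    // word2vec's nearest-neighbor lookup.
    static List<String> select(Map<String, Double> neighbors,
                               double minSimilarity, int maxExpansions) {
        return neighbors.entrySet().stream()
                .filter(e -> e.getValue() >= minSimilarity)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(maxExpansions)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> neighbors = new HashMap<>();
        neighbors.put("aeroplane", 0.92);
        neighbors.put("aircraft", 0.81);
        neighbors.put("sky", 0.45); // below threshold: dropped
        System.out.println(select(neighbors, 0.7, 2)); // [aeroplane, aircraft]
    }
}
```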

Figure 2.1. Synonym expansion at search time, with a neural network
Figure 2.2. Synonym expansion graph
Figure 2.3. Split portions of the text depending on the type of data
Figure 2.4. Predicting price with a feed-forward neural network with 3 inputs, 5 hidden units, and 1 output unit
Figure 2.5. Propagating the signal through the network
Figure 2.6. Backpropagating the signal from output to hidden layer
Figure 2.7. Geometric interpretation of backpropagation with gradient descent
Figure 2.8. Plotted word vectors for "Aeroplane"
Figure 2.9. Feeding word2vec (skip-gram model) with text fragments
Figure 2.10. Continuous Bag of Words model
Figure 2.11. Continuous Skip-gram model
Figure 2.12. Highlights of word2vec vectors over the Hot 100 Billboard dataset
Figure 2.13. Token stream after word2vec synonym expansion

Summary

  • Synonym expansion can be a handy technique for improving recall and making the users of our search engine happier
  • Common synonym expansion techniques are based on static dictionaries or vocabularies, which require manual maintenance and are often far removed from the data they are applied to
  • Word2vec is a neural network-based algorithm that learns vector representations of words. It finds words with similar meanings (or at least words that appear in similar contexts), which makes it a reasonable basis for synonym expansion as well
  • Word2vec can provide good results, but we need to manage word senses and parts of speech when using it for synonyms

FAQ

What is synonym expansion and why does it help search?
Synonym expansion adds alternative terms with the same or very similar meaning to the query or to the indexed text at the same position. This increases the chances that differently worded queries match the same documents, boosting recall and reducing zero-result queries. Example: a query with “plane” can match a lyric indexed with “aeroplane.”
Should I expand synonyms at index time or at search time?
Both are valid with trade-offs. Index-time expansion makes indexing slower and the index larger, but queries run faster (no expansion at search). Search-time expansion keeps the index smaller and ingestion faster, but each query pays the expansion cost. Choose based on workload, latency, and throughput constraints.
How do vocabulary-based synonym lists work, and what are their limitations?
They map terms to synonyms via a maintained dictionary (e.g., a file or WordNet). Pros: predictable, easy to plug into Lucene’s SynonymGraphFilter. Cons: upkeep burden, language coverage gaps, and no awareness of corpus-specific usage or context (slang, acronyms, domain terms).
How does word2vec generate synonyms from my data?
Word2vec learns vector representations (embeddings) so that words used in similar contexts are close in vector space (distributional hypothesis). By finding nearest neighbors (e.g., by cosine similarity), you can surface corpus-specific synonym-like terms. It’s language-agnostic and context-driven, unlike grammar-based lexicons.
CBOW vs Skip-gram: which word2vec model should I use?
Both learn embeddings from windows of text. CBOW predicts a target word from its context; Skip-gram predicts context words from a target. In practice, Skip-gram often performs better for infrequent words and is a common default in libraries like Deeplearning4j.
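The difference in training targets is easiest to see in the (target, context) pairs a skip-gram model consumes. A small sketch of window-based pair extraction in plain Java, with no word2vec library involved:

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramPairs {
    // For each position i, pair the target token with every token within
    // +/- window positions. Skip-gram trains on exactly these pairs,
    // learning to predict the context word from the target word.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    out.add(new String[]{tokens[i], tokens[j]});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "I like pizza" with window=1 yields:
        // (I,like) (like,I) (like,pizza) (pizza,like)
        for (String[] p : pairs(new String[]{"I", "like", "pizza"}, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

CBOW would consume the same windows but invert the roles, using the surrounding context to predict the target word.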
What word2vec parameters matter most for synonym quality?
Key settings include: embedding size (e.g., 100+ dimensions, higher for larger corpora), window size (context width, e.g., 5), training epochs, and model choice (CBOW vs Skip-gram). Too few dimensions or too little data yields poor neighbors; larger corpora and sensible windows improve results.
How do I integrate word2vec synonyms into a Lucene pipeline?
Train word2vec (e.g., with DL4J), then implement a TokenFilter that, for each token, looks up the top-k nearest words above a similarity threshold and emits them at the same position. Use SynonymGraphFilter-compatible behavior so phrase queries and positions still align.
How can I prevent index bloat or noisy expansions with word2vec?
Use safeguards: limit top-k synonyms, enforce a minimum similarity score, restrict to certain parts of speech (e.g., nouns and verbs), prioritize short or low-recall documents, and skip low-weight or very common terms. These controls keep expansions precise and manageable.
How do I evaluate if synonym expansion improved my search?
Compare metrics before/after enabling it: recall, precision, and zero-result rates. Use held-out sets or A/B tests. For model tuning, apply cross-validation: train on a subset, select parameters on a validation set, and report final effectiveness on a separate test set.
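Precision and recall are straightforward to compute once you have a set of retrieved results and a set of judged-relevant results. A minimal sketch; the document IDs are illustrative:

```java
import java.util.Set;

public class Metrics {
    // precision = relevant results retrieved / all results retrieved
    static double precision(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    // recall = relevant results retrieved / all relevant results
    static double recall(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("doc1", "doc2", "doc3");
        Set<String> relevant = Set.of("doc1", "doc2", "doc4", "doc5");
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(retrieved, relevant), recall(retrieved, relevant));
    }
}
```

Synonym expansion typically raises recall (more relevant documents found) while risking a drop in precision if expansions are noisy, which is why both metrics should be tracked together.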
How do I keep synonyms up to date in production as the corpus changes?
Retrain embeddings periodically from the indexed data itself (e.g., iterate stored field values via a custom SentenceIterator over the Lucene index). This avoids needing original source files, adapts to evolving language/terms, and lets you refresh the synonym filter on a schedule.
