Table of contents

Part 1 Foundations

1 LLMs and the need for RAG

1.1 Curse of the LLMs and the idea of RAG

1.1.1 LLMs are not trained for facts

1.1.2 What is RAG?

1.2 The novelty of RAG

1.2.1 The RAG discovery

1.2.2 How does RAG help?

1.3 Popular RAG use cases

1.3.1 Search Engine Experience

1.3.2 Personalized marketing content generation

1.3.3 Real-time event commentary

1.3.4 Conversational agents

1.3.5 Document question answering systems

1.3.6 Virtual assistants

1.3.7 AI-powered research

1.3.8 Social media monitoring and sentiment analysis

1.3.9 News generation and content curation

Summary

2 RAG systems and their design

2.1 What does a RAG system look like?

2.2 Design of RAG systems

2.3 Indexing pipeline

2.4 Generation pipeline

2.5 Evaluation and monitoring

2.6 The RAGOps Stack

2.7 Caching, guardrails, security, and other layers

Summary

Part 2 Creating RAG systems

3 Indexing pipeline: Creating a knowledge base for RAG

3.1 Data loading

3.2 Data splitting (chunking)

3.2.1 Advantages of chunking

3.2.2 Chunking process

3.2.3 Chunking methods

3.2.4 Choosing a chunking strategy

3.3 Data conversion (embeddings)

3.3.1 What are embeddings?

3.3.2 Common pre-trained embeddings models

3.3.3 Embeddings use cases

3.3.4 How to choose embeddings?

3.4 Storage (vector databases)

3.4.1 What are vector databases?

3.4.2 Types of vector databases

3.4.3 Choosing a vector database

3.4.1 Data loading

3.4.2 Data conversion

4 Generation pipeline: Generating contextual LLM responses

4.1 Generation pipeline overview

4.2 Retrieval

4.2.1 Progression of retrieval methods

4.2.2 Popular retrievers

4.2.3 A simple retriever implementation

4.3 Augmentation

4.3.1 RAG prompt engineering techniques

4.3.2 A simple augmentation prompt creation

4.4 Generation

4.4.1 Categorization of LLMs and suitability for RAG

4.4.2 Completing the RAG pipeline: Generation using LLMs

4.4.1 Retrieval

4.4.2 Augmentation

4.4.3 Generation

5 RAG evaluation: Accuracy, relevance, and faithfulness

5.1 Key aspects of RAG evaluation

5.1.1 Quality scores

5.1.2 Required abilities

5.2 Evaluation metrics

5.2.1 Retrieval metrics

5.2.2 RAG-specific metrics

5.3 Frameworks

5.3.1 RAGAs

5.3.2 Automated RAG evaluation system

5.4 Benchmarks

5.4.1 RGB

5.5 Limitations and best practices

5.5.1 RAG evaluation fundamentals

5.5.2 Evaluation metrics

5.5.3 Evaluation frameworks

5.5.4 Benchmarks

5.5.5 Limitations and best practices

Part 3 RAG in production

6 Progression of RAG systems: Naïve, advanced, and modular RAG

6.1 Limitations of naïve RAG

6.2 Advanced RAG techniques

6.3 Pre-retrieval techniques

6.3.1 Index optimization

6.3.2 Query optimization

6.4 Retrieval strategies

6.4.1 Hybrid retrieval

6.4.2 Iterative retrieval

6.4.3 Recursive retrieval

6.4.4 Adaptive retrieval

6.5 Post-retrieval techniques

6.5.1 Compression

6.6 Modular RAG

6.6.1 Core modules

6.6.2 New modules

6.6.1 Limitations of naïve RAG

6.6.2 Advanced RAG techniques

6.6.3 Modular RAG framework

6.6.4 Tradeoffs and best practices

7 Evolving RAGOps stack

7.1 The evolving RAGOps stack

7.1.1 Critical layers

7.1.2 Essential layers

7.1.3 Enhancement layers

7.2 Production best practices

7.2.1 Critical layers

7.2.2 Essential layers

7.2.3 Enhancement layers

7.2.4 Production best practices

Part 4 Additional considerations

8 Graph, multimodal, agentic, and other RAG variants

8.1 What are RAG variants, and why do we need them?

8.2 Multimodal RAG

8.2.1 Data modality

8.2.2 Multimodal RAG use cases

8.2.3 Multimodal RAG pipelines

8.2.4 Challenges and best practices

8.3 Knowledge graph RAG

8.3.1 Knowledge graphs

8.3.2 Knowledge graph RAG use cases

8.3.3 Graph RAG approaches

8.3.4 Graph RAG pipelines

8.3.5 Challenges and best practices

8.4 Agentic RAG

8.4.1 LLM agents

8.4.2 Agentic RAG capabilities

8.4.3 Agentic RAG pipelines

8.4.4 Challenges and best practices

8.5 Other RAG variants

8.5.1 Corrective RAG

8.5.2 Speculative RAG

8.5.3 Self-reflective (self RAG)

8.5.4 RAPTOR

8.5.1 Introducing RAG variants

8.5.2 Multimodal RAG

8.5.3 Knowledge graph RAG

8.5.4 Agentic RAG

8.5.5 Other RAG variants

9 RAG development framework and further exploration

9.1 RAG development framework

9.1.1 Initiation stage: Defining and scoping the RAG system

9.2 Design stage: Layering the RAGOps stack

9.2.1 Indexing pipeline design

9.2.2 Generation pipeline design

9.2.3 Other design considerations

9.2.4 Development stage: Building modular RAG pipelines

9.2.5 Evaluation stage: Validating and optimizing the RAG system

9.2.6 Deployment stage: Launching and scaling the RAG system

9.2.7 Maintenance stage: Ensuring reliability and adaptability

9.3 Ideas for further exploration

9.3.1 Fine-tuning within RAG

9.3.2 Long-context windows in LLMs

9.3.3 Managed solutions

9.3.4 Difficult queries

9.3.1 RAG development framework

9.3.2 RAG development framework stages

9.3.3 Best practices in RAG development

9.3.4 Ideas for further exploration

Overview

6 Progression of RAG Systems: Naïve to Advanced, and Modular RAG

This chapter motivates the progression from a simple, naïve Retrieval-Augmented Generation pipeline to advanced and modular RAG suitable for production. It diagnoses where naïve RAG breaks down—poor precision and recall in retrieval, redundant and disjointed augmentation constrained by context windows, and generation issues such as hallucination, bias, and over-reliance on retrieved snippets. Framing RAG as two pipelines (indexing and generation), the chapter sets out to improve relevance, faithfulness, and robustness by introducing targeted interventions before, during, and after retrieval.

Advanced RAG replaces “retrieve-then-read” with a “rewrite–retrieve–rerank–read” flow and layers techniques across stages. Pre-retrieval emphasizes index optimization (tuning chunk sizes, context-enriched chunking, fetching surrounding chunks, metadata filtering/enrichment, hierarchical and graph-based index structures, and domain-tuned embeddings) and query optimization (multi-/sub-/step-back expansions, rewriting and HyDE-style transformations, and routing via intent, metadata, or semantic similarity). Retrieval improves via hybrid combinations of sparse, dense, and graph search, as well as iterative, recursive, and adaptive strategies that refine or decide retrieval dynamically. Post-retrieval focuses on compression to reduce noise and fit context windows, and reranking to prioritize the most relevant evidence. Throughout, the chapter stresses trade-offs: these gains typically add compute, latency, and system complexity, so choices should be driven by use-case evaluation.

Modular RAG generalizes these ideas into a composable architecture where core components—Indexing, Retrieval, Generation, plus Pre- and Post-retrieval—are interchangeable, and new modules augment capability: Search (source-specific access), Fusion (multi-query expansion and result merging), Memory (judicious use of the model’s parametric knowledge), Routing (path selection across tools and data), and Task Adapters (lightly tailoring for downstream tasks). The chapter positions naïve RAG as a subset of advanced RAG, which itself is a subset of modular RAG, encouraging incremental adoption. While modularity enables rapid experimentation, scalability, and maintainability, it also demands clear interfaces, orchestration, compatibility testing, and careful performance–cost–latency balancing to deliver production-grade systems.

Naïve RAG is a sequential “Retrieve then Read” process.

Drawbacks of Naïve RAG at each stage of the process

Advanced RAG is a Rewrite-Retrieve-Rerank-Read process, in contrast to the Retrieve-Read process of Naïve RAG.

An illustration of an index optimized knowledge base

Hybrid retriever employs multiple querying techniques and combines the results

Iterative, Recursive, and Adaptive retrieval incorporate repeated retrieval cycles. (Source: adapted from Retrieval-Augmented Generation for Large Language Models: A Survey, Gao et al.)

Illustrative example of advanced generation pipeline.

Naïve, Advanced, and Modular approaches to RAG are progressive in nature. Naïve RAG is a sub-component of Advanced RAG, which is a sub-component of Modular RAG.

FAQ

Why is the Naïve “Retrieve-then-Read” RAG approach inadequate for production?

Retrieval: Low precision (irrelevant chunks) and low recall (misses relevant info).
Augmentation: Redundant or overlapping chunks, disjointed context from multiple sources, and limits due to LLM context windows.
Generation: Hallucinations, bias/toxicity risks, difficulty reconciling conflicting info, and excessive reliance on retrieved context at the expense of model knowledge.

How does Advanced RAG improve on Naïve RAG?

Advanced RAG shifts from “Retrieve-then-Read” to “Rewrite → Retrieve → Rerank → Read.” It optimizes the query (rewrite/expand/transform), enhances retrieval (hybrid/iterative/recursive/adaptive), reranks results to prioritize relevance, and then generates, leading to higher faithfulness and relevance.

What pre-retrieval index optimization techniques strengthen retrieval?

Chunk Optimization: Tune chunk size to balance context vs. noise; consider fetching surrounding chunks to preserve flow in long-form content.
Context-Enriched Chunking: Attach a document or section summary to each chunk for richer context with modest overhead.
Operational Trade‑offs: Better accuracy vs. higher compute, storage, and latency; reassess as data and use cases change.
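
The chunking ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the names `Chunk`, `chunk_words`, `build_chunks`, and `with_neighbors` are hypothetical, chunks are measured in words rather than tokens, and the summary is assumed to be supplied by the caller (for example, by an LLM call that is not shown).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    index: int     # position of the chunk within its document
    text: str      # raw chunk text, returned to the LLM
    enriched: str  # summary + text: the string that would be embedded


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop early enough that the last window is not a pure repeat of the overlap.
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def build_chunks(text: str, summary: str, size: int, overlap: int = 0) -> List[Chunk]:
    """Context-enriched chunking: prepend a document summary to every chunk."""
    pieces = chunk_words(text, size, overlap)
    return [Chunk(i, piece, f"{summary}\n\n{piece}") for i, piece in enumerate(pieces)]


def with_neighbors(chunks: List[Chunk], hit_index: int, window: int = 1) -> str:
    """Fetch surrounding chunks: return the hit plus `window` chunks on each side.

    Assumes the chunks were built with overlap=0; otherwise the joined text
    repeats the overlapping words.
    """
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return " ".join(c.text for c in chunks[lo:hi])
```

Tuning `size` is the chunk-optimization knob, while `enriched` and `with_neighbors` trade extra storage and context tokens for better continuity.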

How do metadata enhancements and embedding fine-tuning boost searchability?

Metadata Filtering: Use attributes (timestamp, author, category) to pre-filter before similarity search, reducing noise and avoiding stale content.
Metadata Enrichment: Add summaries, tags, or synthetic queries (e.g., Reverse HyDE-style prompts) to improve matching.
Fine-tuning Embeddings: For domain-specific language, fine-tune embedding models to improve semantic similarity and retrieval quality.
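
As a rough sketch of metadata filtering, the snippet below narrows the candidate set by category and recency before ranking by cosine similarity. The record layout, the field names (`category`, `updated`, `vector`), and the two-dimensional toy vectors are assumptions made for illustration; a real vector database would typically apply such filters inside its own query API.

```python
import math
from datetime import date
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: Sequence[float],
    records: List[Dict],
    category: Optional[str] = None,
    not_before: Optional[date] = None,
    top_k: int = 3,
) -> List[Dict]:
    """Pre-filter on metadata, then rank the survivors by similarity."""
    candidates = [
        r for r in records
        if (category is None or r["category"] == category)
        and (not_before is None or r["updated"] >= not_before)
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


# Hypothetical records with toy embeddings.
RECORDS = [
    {"id": "billing-2021", "category": "billing", "updated": date(2021, 3, 1), "vector": [0.9, 0.1]},
    {"id": "billing-2024", "category": "billing", "updated": date(2024, 6, 1), "vector": [0.8, 0.2]},
    {"id": "tech-2024", "category": "technical", "updated": date(2024, 5, 1), "vector": [0.95, 0.05]},
]
```

Without filters, the stale `billing-2021` record outranks `billing-2024` on similarity alone; adding `category="billing"` and `not_before=date(2023, 1, 1)` leaves only the current billing record.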

Which index structures can improve retrieval, and when should I use them?

Parent–Child Structure: Retrieve precise child chunks and refer to parent for broader context; helpful when documents are hierarchical.
Knowledge Graph Index (GraphRAG): Store entities and relations to improve context, reasoning, and explainability; best for large, complex domains despite higher build/maintenance cost.
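
A parent-child index can be sketched as follows: small child chunks are matched against the query, but the larger parent text is what gets passed on as context. The data, the function names, and the word-overlap score (a stand-in for embedding similarity) are all illustrative assumptions.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


# Hypothetical parent documents keyed by id.
PARENTS: Dict[str, str] = {
    "refunds": "Refund policy: refunds are issued within 14 days and go to the original payment method.",
    "shipping": "Shipping policy: orders ship within 2 business days.",
}

# Child chunks as (parent_id, text); only these are searched.
CHILDREN: List[Tuple[str, str]] = [
    ("refunds", "refunds are issued within 14 days"),
    ("refunds", "refunds go to the original payment method"),
    ("shipping", "orders ship within 2 business days"),
]


def retrieve_parents(query: str, top_k: int = 2) -> List[str]:
    """Rank child chunks, then return their de-duplicated parent texts."""
    q = tokens(query)
    ranked = sorted(CHILDREN, key=lambda c: len(q & tokens(c[1])), reverse=True)
    parent_ids: List[str] = []
    for parent_id, _ in ranked[:top_k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [PARENTS[p] for p in parent_ids]
```

Because both top-ranked children for a refund question share one parent, the LLM receives a single, complete policy passage instead of two fragments.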

How can I optimize user queries before retrieval?

Query Expansion: Generate multiple variants (multi-query), decompose into sub-queries, or “step back” to a higher-level abstraction to boost recall.
Query Transformation: Rewrite vague inputs into retrieval-suitable queries; use HyDE to create a hypothetical answer and embed it for similarity search.
Caveats: Avoid over-expansion and drift; ensure intent is preserved.
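
The sketch below shows the control flow of multi-query expansion and HyDE with the LLM, embedder, and search backend passed in as plain callables. All function names and the stub data are hypothetical; in practice `expand` and `generate` would call an LLM, and `embed` and `search` would use a real embedding model and vector store.

```python
from typing import Callable, Dict, FrozenSet, List


def multi_query_retrieve(
    query: str,
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    top_k: int = 5,
) -> List[str]:
    """Retrieve for the original query and each variant, merging without duplicates."""
    variants = [query] + [v for v in expand(query) if v != query]
    merged: List[str] = []
    for variant in variants:
        for doc_id in retrieve(variant):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged[:top_k]


def hyde_retrieve(query: str, generate, embed, search) -> List[str]:
    """HyDE: embed a hypothetical answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical))


# --- Stubs standing in for an LLM, an embedder, and a vector store ---
INDEX: Dict[str, List[str]] = {
    "how do I reset my password": ["doc1"],
    "reset password steps": ["doc1", "doc2"],
    "forgot login credentials": ["doc3"],
}
DOCS: Dict[str, str] = {
    "doc1": "open settings then choose reset password",
    "doc2": "billing invoices are emailed monthly",
}


def fake_expand(query: str) -> List[str]:
    return ["reset password steps", "forgot login credentials"]


def fake_retrieve(query: str) -> List[str]:
    return INDEX.get(query, [])


def fake_generate(prompt: str) -> str:
    return "Open settings and choose reset password"


def fake_embed(text: str) -> FrozenSet[str]:
    return frozenset(text.lower().split())


def fake_search(vec: FrozenSet[str]) -> List[str]:
    ranked = sorted(DOCS, key=lambda d: len(vec & fake_embed(DOCS[d])), reverse=True)
    return ranked[:1]
```

Capping the merged list with `top_k` is one simple guard against over-expansion; checking that each variant still matches the user's intent would need an additional step that is not shown here.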

What is query routing and when is it essential?

Routing selects the best retrieval workflow per query based on intent, domain, language, or complexity.

Intent Classification: Use a classifier or LLM prompts to pick a retrieval path.
Metadata Routing: Extract keywords/tags from the query to filter by chunk metadata.
Semantic Routing: Match the query to representative exemplars for each method and choose the closest.

Use routing when sources and query types vary widely (e.g., support bots handling technical vs. billing queries).
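
Semantic routing can be illustrated with a small example that compares the query to a few exemplars per route and picks the closest. This is a sketch under stated assumptions: the route names, the exemplars, and the 0.3 threshold are invented, and bag-of-words cosine similarity stands in for an embedding model.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical routes, each described by representative queries.
ROUTES: Dict[str, List[str]] = {
    "billing": ["why was I charged twice", "update my payment card", "refund my invoice"],
    "technical": ["the app crashes on startup", "error when uploading a file", "cannot connect to the server"],
}


def route(query: str, default: str = "general", threshold: float = 0.3) -> str:
    """Pick the route whose closest exemplar is most similar to the query."""
    q = bow(query)
    best_route, best_score = default, 0.0
    for name, exemplars in ROUTES.items():
        score = max(cosine(q, bow(e)) for e in exemplars)
        if score > best_score:
            best_route, best_score = name, score
    # Fall back to the default path when nothing is similar enough.
    return best_route if best_score >= threshold else default
```

The threshold keeps queries that only share filler words with an exemplar from being forced down a specialized path.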

Which retrieval strategies should I consider beyond basic similarity search?

Hybrid Retrieval: Combine keyword (e.g., BM25), dense vectors, and graphs; union/intersection with weights for precision/recall control.
Iterative Retrieval: Alternate retrieval and generation to refine searches for multi-hop questions.
Recursive Retrieval: Transform queries from retrieved evidence to uncover scattered info (e.g., IRCoT).
Adaptive Retrieval: Let an LLM decide when/what to retrieve during generation (e.g., Self-RAG, FLARE); aligns with agentic AI.

All add compute and latency; tune for your cost/accuracy target.
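
One possible way to combine sparse and dense results is weighted score fusion, sketched below. The scores are assumed to come from separate retrievers (for example, BM25 and a vector index) that are not shown; `hybrid_rank`, `alpha`, and `mode` are hypothetical names. Min-max normalization is one choice among several; rank-based schemes such as reciprocal rank fusion are a common alternative.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
    mode: str = "union",
) -> List[Tuple[str, float]]:
    """Fuse two score maps: alpha weights sparse, (1 - alpha) weights dense.

    mode="union" favors recall; mode="intersection" favors precision.
    """
    s, d = min_max(sparse), min_max(dense)
    ids = (s.keys() | d.keys()) if mode == "union" else (s.keys() & d.keys())
    fused = {i: alpha * s.get(i, 0.0) + (1 - alpha) * d.get(i, 0.0) for i in ids}
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores: keyword-style scores are unbounded, dense scores are similarities.
SPARSE = {"a": 12.0, "b": 7.0, "c": 2.0}
DENSE = {"b": 0.91, "c": 0.88, "d": 0.60}
```

With these toy scores, document `b` ranks first under the union because both retrievers support it, while the intersection drops `a` and `d`, which only one retriever found.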

What post-retrieval techniques help the LLM use context effectively?

Compression: Remove irrelevant tokens or compress to context embeddings (e.g., COCOM, xRAG) to reduce noise and fit context windows.
Reranking: Reorder retrieved chunks so the most relevant, task-aligned evidence is prioritized (e.g., LTR/BERT-based or API rerankers).

Balance compression against potential information loss; reranking adds overhead but boosts answer quality.
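
The two post-retrieval steps can be sketched with simple lexical stand-ins. This is not COCOM, xRAG, or a learned reranker: `overlap_score` is an invented scoring function that a cross-encoder or reranking API would replace, and `compress` is a plain extractive filter that drops sentences unrelated to the query.

```python
import re
from typing import Callable, List, Set


def terms(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query terms that appear in the passage."""
    q = terms(query)
    return len(q & terms(passage)) / len(q) if q else 0.0


def rerank(
    query: str,
    passages: List[str],
    score: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> List[str]:
    """Reorder passages by relevance score and keep the best top_n."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_n]


def compress(query: str, passage: str, min_score: float = 0.2) -> str:
    """Extractive compression: keep only sentences that relate to the query."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    kept = [s for s in sentences if overlap_score(query, s) >= min_score]
    return " ".join(kept)


QUERY = "refund processing time"
PASSAGES = [
    "Our office is closed on public holidays.",
    "Refund processing usually takes five days. Our team also sells gift cards.",
    "Processing time depends on the carrier.",
]
```

Raising `min_score` shortens the context further but increases the risk of discarding a sentence the answer depends on, which is the information-loss trade-off noted above.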

What is Modular RAG, and what modules does it include?

Modular RAG decomposes RAG into swappable components for flexibility, scalability, and rapid experimentation. Naïve RAG is a subset of Advanced RAG, which is a subset of Modular RAG.

Core Modules: Indexing (embeddings/vector stores/chunkers), Retrieval (dense/keyword/graph), Generation (LLM choice and augmentation), plus Pre- and Post-retrieval modules.
New Modules: Search (multi-source), Fusion (RAG-Fusion: multi-query + merge/rerank), Memory (use LLM parametric knowledge, reflection tokens), Routing (pick optimal path), Task Adapter (tailor to tasks like summarization/translation).

Trade-offs: Greater compatibility, interface, and orchestration complexity; potential added latency/cost; requires robust testing of modules individually and in combination.
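
To make the idea of swappable modules concrete, here is a minimal sketch in which each stage is a plain callable behind a fixed signature. The class name `RagPipeline`, the toy corpus, and the stub retriever and generator are assumptions for illustration; a production system would also need orchestration, error handling, and per-module tests.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

Rewriter = Callable[[str], str]
Retriever = Callable[[str], List[str]]
Reranker = Callable[[str, List[str]], List[str]]
Generator = Callable[[str, List[str]], str]


def identity_rewrite(query: str) -> str:
    return query


def keep_order(query: str, docs: List[str]) -> List[str]:
    return docs


@dataclass
class RagPipeline:
    """Rewrite -> Retrieve -> Rerank -> Read, with optional stages defaulting to no-ops."""
    retrieve: Retriever
    generate: Generator
    rewrite: Rewriter = identity_rewrite
    rerank: Reranker = keep_order

    def run(self, query: str) -> str:
        rewritten = self.rewrite(query)
        docs = self.rerank(rewritten, self.retrieve(rewritten))
        return self.generate(query, docs)


# --- Toy modules (hypothetical stand-ins for real components) ---
CORPUS = [
    "rag combines retrieval with generation",
    "bm25 is a keyword ranking function",
    "rerankers reorder retrieved passages",
]


def keyword_retrieve(query: str) -> List[str]:
    """Case-sensitive keyword match, so an unnormalized query can miss documents."""
    words = set(re.findall(r"\w+", query))
    return [d for d in CORPUS if words & set(d.split())]


def overlap_rerank(query: str, docs: List[str]) -> List[str]:
    words = set(query.split())
    return sorted(docs, key=lambda d: len(words & set(d.split())), reverse=True)


def template_generate(query: str, docs: List[str]) -> str:
    return f"Answer to '{query}' using: {'; '.join(docs)}"


naive = RagPipeline(retrieve=keyword_retrieve, generate=template_generate)
advanced = RagPipeline(
    retrieve=keyword_retrieve,
    generate=template_generate,
    rewrite=str.lower,
    rerank=overlap_rerank,
)
```

Here `naive` is the same pipeline as `advanced` with the optional stages left as no-ops, which mirrors the subset relationship described above; replacing any one callable does not require changes to the others.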
