Table of contents

Part 1 Foundations

1 LLMs and the need for RAG

1.1 Curse of the LLMs and the idea of RAG

1.1.1 LLMs are not trained for facts

1.1.2 What is RAG?

1.2 The novelty of RAG

1.2.1 The RAG discovery

1.2.2 How does RAG help?

1.3 Popular RAG use cases

1.3.1 Search Engine Experience

1.3.2 Personalized marketing content generation

1.3.3 Real-time event commentary

1.3.4 Conversational agents

1.3.5 Document question answering systems

1.3.6 Virtual assistants

1.3.7 AI-powered research

1.3.8 Social media monitoring and sentiment analysis

1.3.9 News generation and content curation

Summary

2 RAG systems and their design

2.1 What does a RAG system look like?

2.2 Design of RAG systems

2.3 Indexing pipeline

2.4 Generation pipeline

2.5 Evaluation and monitoring

2.6 The RAGOps Stack

2.7 Caching, guardrails, security, and other layers

Summary

Part 2 Creating RAG systems

3 Indexing pipeline: Creating a knowledge base for RAG

3.1 Data loading

3.2 Data splitting (chunking)

3.2.1 Advantages of chunking

3.2.2 Chunking process

3.2.3 Chunking methods

3.2.4 Choosing a chunking strategy

3.3 Data conversion (embeddings)

3.3.1 What are embeddings?

3.3.2 Common pre-trained embeddings models

3.3.3 Embeddings use cases

3.3.4 How to choose embeddings?

3.4 Storage (vector databases)

3.4.1 What are vector databases?

3.4.2 Types of vector databases

3.4.3 Choosing a vector database

3.4.1 Data loading

3.4.2 Data conversion

4 Generation pipeline: Generating contextual LLM responses

4.1 Generation pipeline overview

4.2 Retrieval

4.2.1 Progression of retrieval methods

4.2.2 Popular retrievers

4.2.3 A simple retriever implementation

4.3 Augmentation

4.3.1 RAG prompt engineering techniques

4.3.2 A simple augmentation prompt creation

4.4 Generation

4.4.1 Categorization of LLMs and suitability for RAG

4.4.2 Completing the RAG pipeline: Generation using LLMs

4.4.1 Retrieval

4.4.2 Augmentation

4.4.3 Generation

5 RAG evaluation: Accuracy, relevance, and faithfulness

5.1 Key aspects of RAG evaluation

5.1.1 Quality scores

5.1.2 Required abilities

5.2 Evaluation metrics

5.2.1 Retrieval metrics

5.2.2 RAG-specific metrics

5.3 Frameworks

5.3.1 RAGAs

5.3.2 Automated RAG evaluation system

5.4 Benchmarks

5.4.1 RGB

5.5 Limitations and best practices

5.5.1 RAG evaluation fundamentals

5.5.2 Evaluation metrics

5.5.3 Evaluation frameworks

5.5.4 Benchmarks

5.5.5 Limitations and best practices

Part 3 RAG in production

6 Progression of RAG systems: Naïve, advanced, and modular RAG

6.1 Limitations of naïve RAG

6.2 Advanced RAG techniques

6.3 Pre-retrieval techniques

6.3.1 Index optimization

6.3.2 Query optimization

6.4 Retrieval strategies

6.4.1 Hybrid retrieval

6.4.2 Iterative retrieval

6.4.3 Recursive retrieval

6.4.4 Adaptive retrieval

6.5 Post-retrieval techniques

6.5.1 Compression

6.6 Modular RAG

6.6.1 Core modules

6.6.2 New modules

6.6.1 Limitations of naïve RAG

6.6.2 Advanced RAG techniques

6.6.3 Modular RAG framework

6.6.4 Tradeoffs and best practices

7 Evolving RAGOps stack

7.1 The evolving RAGOps stack

7.1.1 Critical layers

7.1.2 Essential layers

7.1.3 Enhancement layers

7.2 Production best practices

7.2.1 Critical layers

7.2.2 Essential layers

7.2.3 Enhancement layers

7.2.4 Production best practices

Part 4 Additional considerations

8 Graph, multimodal, agentic, and other RAG variants

8.1 What are RAG variants, and why do we need them?

8.2 Multimodal RAG

8.2.1 Data modality

8.2.2 Multimodal RAG use cases

8.2.3 Multimodal RAG pipelines

8.2.4 Challenges and best practices

8.3 Knowledge graph RAG

8.3.1 Knowledge graphs

8.3.2 Knowledge graph RAG use cases

8.3.3 Graph RAG approaches

8.3.4 Graph RAG pipelines

8.3.5 Challenges and best practices

8.4 Agentic RAG

8.4.1 LLM agents

8.4.2 Agentic RAG capabilities

8.4.3 Agentic RAG pipelines

8.4.4 Challenges and best practices

8.5 Other RAG variants

8.5.1 Corrective RAG

8.5.2 Speculative RAG

8.5.3 Self-reflective RAG (Self-RAG)

8.5.4 RAPTOR

8.5.1 Introducing RAG variants

8.5.2 Multimodal RAG

8.5.3 Knowledge graph RAG

8.5.4 Agentic RAG

8.5.5 Other RAG variants

9 RAG development framework and further exploration

9.1 RAG development framework

9.1.1 Initiation stage: Defining and scoping the RAG system

9.2 Design stage: Layering the RAGOps stack

9.2.1 Indexing pipeline design

9.2.2 Generation pipeline design

9.2.3 Other design considerations

9.2.4 Development stage: Building modular RAG pipelines

9.2.5 Evaluation stage: Validating and optimizing the RAG system

9.2.6 Deployment stage: Launching and scaling the RAG system

9.2.7 Maintenance stage: Ensuring reliability and adaptability

9.3 Ideas for further exploration

9.3.1 Fine-tuning within RAG

9.3.2 Long-context windows in LLMs

9.3.3 Managed solutions

9.3.4 Difficult queries

9.3.1 RAG development framework

9.3.2 RAG development framework stages

9.3.3 Best practices in RAG development

9.3.4 Ideas for further exploration

Overview

8 Graph, multimodal, agentic, and other RAG variants

This chapter situates RAG variants as pragmatic adaptations of the standard indexing–retrieval–augmentation–generation loop, created to meet real-world demands such as multimodal inputs, deeper relational reasoning, better accuracy, and lower latency/cost. As RAG moves from simple text search to production settings across domains like healthcare, finance, and software engineering, systems must handle images, audio, video, graphs, and external tools while remaining precise and efficient. The chapter lays out why these variants emerged, what problems they target, how their pipelines change, and the trade-offs they introduce, aiming to equip readers with both conceptual grounding and implementation-oriented guidance.

The three principal variants are Multimodal RAG, Knowledge Graph RAG, and Agentic RAG.

Multimodal RAG extends beyond text by introducing modality-aware indexing (specialized loaders and chunkers) and a choice of embedding strategies: shared multimodal spaces, paired modality models (e.g., image–text, audio–text), or text conversion plus summaries. Retrieval strategies mirror the chosen embedding route, and generation uses multimodal LLMs. Its benefits come with higher complexity, latency, and potential information loss.

Knowledge Graph RAG injects structure and relationships through nodes, edges, and triples stored in graph databases, enabling multi-hop reasoning via approaches like hierarchical structure awareness, graph-enhanced vector search, and community detection with summaries. Its pipelines add entity–relation extraction and graph traversal (e.g., Cypher), but demand careful scoping, cost control, and maintenance.

Agentic RAG introduces LLM agents with a core model, memory, planning, and tools to route queries across sources, invoke APIs, adapt retrieval, and iterate retrieval–generation. Agents can also enrich indexing (goal-aware chunking, metadata, embedding selection) and generation (dynamic prompting), while requiring safeguards, error containment, and resource budgeting.

Beyond these, the chapter surveys variants that target specific bottlenecks. Corrective RAG (CRAG) evaluates retrieved content, supplements via web search, and refines knowledge to boost factuality. Speculative RAG clusters documents and has small models draft answers in parallel, with a larger verifier selecting the best, cutting latency while preserving quality. Self-RAG trains the model to generate reflection tokens that decide when to retrieve, assess relevance and support, and critique outputs in real time. RAPTOR builds recursive, tree-structured summaries to capture both granular details and overarching themes for stronger thematic and multi-hop queries. Taken together, these patterns broaden RAG’s applicability: choose based on use-case needs, weigh accuracy against latency and cost, and design for evaluation, governance, and maintainability.

Examples of different data modalities

Images, text, video, and audio plotted in the same embedding space. The word "dog", the sound of a bark, and an image of a dog sit close to each other.

CLIP uses multimodal pre-training to convert classification into a retrieval task, which enables pre-trained models to tackle zero-shot recognition.

Multimodal Indexing pipeline presents three options

For each of the three approaches, the generation pipeline also adapts

Knowledge Graph representation of customer activity where nodes (circles) represent entities, edges (arrows) represent relationships and attributes (rectangles) are the properties

While search in a hierarchical index structure happens at the lowest level, retrieved documents are more contextually complete from a higher level of hierarchy

Entities and relationships extracted from the chunks play a crucial role. While chunks that are similar to the user query are retrieved, the chunks that have entities related to the entities of similar chunks are also retrieved.

Communities group entities under a consistent theme and summarize the information at the group level. Since the summaries are created from a large number of thematically related chunks, they can answer broad queries.

Indexing pipeline for graph RAG. Chunks can be stored directly for simple structure-aware indexing, and community summaries can be created and stored with the graph.

An LLM agent’s four components break down the user’s query, recall the history of interaction with the user, and leverage external tools to accomplish tasks and respond to the user.

An LLM agent responds to a simple user query about a flight schedule by using its planning, memory, and tools modules.

Agentic embellishment to the indexing pipeline enhances the quality of the knowledge base

CRAG corrects the knowledge at the most granular level and hence the name Corrective RAG. Source: Corrective Retrieval Augmented Generation (https://arxiv.org/abs/2401.15884)

FAQ

What are RAG variants and why do we need them?

RAG variants are adaptations of the standard RAG pipeline that tailor indexing, retrieval, augmentation, and generation to specific needs. They emerged to handle multimodal data (beyond text), improve relational reasoning across documents, enable adaptive decision-making, and meet production constraints like accuracy, latency, and cost. Key variants in this chapter are Multimodal RAG, Knowledge Graph RAG, and Agentic RAG, plus notable others like CRAG, Speculative RAG, Self RAG, and RAPTOR.

How does Multimodal RAG differ from text-only RAG in indexing and generation?

- Indexing: Adds loaders for images/audio/video/tables; uses specialty chunking (e.g., VAD for audio, scene detection for video); introduces multimodal or modality-specific embeddings; stores both vectors and raw files (and mappings if summaries are used).
- Generation: Retrieval adapts to embedding strategy (shared-space similarity vs. multi-vector retrieval vs. text-summary retrieval). Prompts include raw files when needed. Uses multimodal LLMs (e.g., GPT‑4o/4o mini, Google Gemini, Llama 3.2, Pixtral) instead of text-only LLMs.

What embedding strategies can I use for multimodal data, and what are the trade-offs?

- Shared (multimodal) embeddings: One vector space for all modalities; enables cross-modal search; simpler ops; can miss fine-grained content (e.g., charts).
- Modality-specific (e.g., CLIP image-text, CLAP audio-text): Separate spaces per modality; requires multi-vector retrieval and post-retrieval reranking; better control per modality.
- Convert non-text to text: Transcribe/describe with a multimodal LLM and embed as text; simplest retrieval path but risks information loss. Hybrid variation: retrieve via text summaries, also pass original media to the multimodal LLM at generation time.
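As a rough illustration of the shared-embedding option, the sketch below runs a cross-modal similarity search over one vector space. The three-dimensional vectors are hand-made stand-ins for the output of a real multimodal embedding model, and the file names and the `search` helper are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy index: every modality lives in the SAME vector space.
# Vectors are invented for illustration, not produced by a model.
INDEX = [
    {"id": "dog.jpg", "modality": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "bark.wav", "modality": "audio", "vector": [0.8, 0.2, 0.1]},
    {"id": "invoice.pdf", "modality": "text", "vector": [0.0, 0.1, 0.9]},
    {"id": "dog_breeds.txt", "modality": "text", "vector": [0.85, 0.15, 0.05]},
]

def search(query_vector, index, k=3):
    """Return the k items most similar to the query, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:k]

# Pretend this is the embedding of the text query "dog".
query = [1.0, 0.0, 0.0]
hits = search(query, INDEX, k=3)
print([(h["id"], h["modality"]) for h in hits])
```

With modality-specific embeddings, the same search would instead run once per vector space, and the merged hits would need a separate reranking step because scores from different spaces are not directly comparable.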

What are typical Multimodal RAG use cases?

- Medical diagnosis assistants combining clinical text, tabular labs, and diagnostic images.
- Investment analysis across filings, charts, and statements.
- E-commerce buying assistants using product images, specs (tables), and reviews.
- Coding assistants mixing docs and code snippets.
- Equipment maintenance using inspection images/video, sensor data, and reports.

What is Knowledge Graph RAG and when does it help most?

Graph RAG augments vector search with knowledge graphs (nodes, edges, attributes) to capture relationships and support multi-hop reasoning. It helps when answers require connecting information across documents, summarizing themes, disambiguating entities, or traversing complex networks (e.g., treatment interactions, contract dependencies, customer journeys). It’s typically implemented as a hybrid of vectors and graphs, not a replacement.

What practical approaches exist to integrate graphs into RAG?

- Structure-aware indexing: Store parent-child (and deeper) hierarchies as a graph; retrieve fine-grained chunks and pull parent context for completeness.
- Graph-enhanced vector search: Do a standard similarity search; then traverse the graph to fetch related entities/chunks around initial hits; rerank before generation.
- Community detection and summaries: Detect densely connected subgraphs (e.g., Leiden/Louvain), summarize communities with an LLM, and retrieve at the community level for broad thematic queries.
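The graph-enhanced vector search approach can be sketched in a few lines, assuming the similarity search has already returned some seed chunks. The chunk IDs, entities, and the `expand_with_graph` helper are hypothetical, plain dictionaries stand in for a graph database, and the reranking step is left out.

```python
# Which entities each chunk mentions (would come from entity extraction).
CHUNK_ENTITIES = {
    "c1": {"Drug A"},
    "c2": {"Drug B"},
    "c3": {"Drug A", "Drug B"},
    "c4": {"Invoice"},
}

# Entity-to-entity relationships (e.g. an INTERACTS_WITH edge).
ENTITY_EDGES = {
    "Drug A": {"Drug B"},
    "Drug B": {"Drug A"},
    "Invoice": set(),
}

def expand_with_graph(seed_chunks, chunk_entities, entity_edges):
    """Add chunks whose entities are one hop away from the seed chunks' entities."""
    seed_entities = set()
    for chunk_id in seed_chunks:
        seed_entities |= chunk_entities[chunk_id]

    # One-hop traversal: the seed entities plus their direct neighbours.
    related = set(seed_entities)
    for entity in seed_entities:
        related |= entity_edges.get(entity, set())

    expanded = list(seed_chunks)
    for chunk_id, entities in chunk_entities.items():
        if chunk_id not in seed_chunks and entities & related:
            expanded.append(chunk_id)
    return expanded

# Suppose the vector search returned only "c1" for the user query.
print(expand_with_graph(["c1"], CHUNK_ENTITIES, ENTITY_EDGES))
```

In practice the expanded set would be reranked against the query before it is placed in the prompt, since graph expansion can pull in loosely related chunks.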

How do I build a Graph RAG pipeline (tools, storage, and querying)?

- Indexing: Chunk documents; extract entities/relations/attributes with an LLM (e.g., LangChain’s LLMGraphTransformer); iteratively store in a graph DB (e.g., Neo4j via Neo4jGraph). Optionally build community summaries (e.g., graphrag) and store them as vectors for hybrid search.
- Retrieval/Generation: Translate natural-language queries into graph queries (Cypher) via templates or LLM-generated queries (e.g., GraphCypherQAChain). Combine graph traversal results with vector hits; augment the prompt with graph-derived text (and community summaries if used).
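For the template route to graph querying, a minimal sketch might look like the following. The `Entity` label, the `name` property, and both helper functions are assumptions about a hypothetical schema, not part of any library; the query is only built here, and a real pipeline would pass it to a graph database driver.

```python
# Parameterised Cypher template: values are passed separately from the query
# text, so user input is never spliced into the query string.
CYPHER_TEMPLATE = (
    "MATCH (e:Entity {name: $name})-[r]->(neighbor) "
    "RETURN e.name AS source, type(r) AS relation, neighbor.name AS target "
    "LIMIT $limit"
)

def build_graph_query(entity_name, limit=10):
    """Return (query, parameters) for fetching an entity's outgoing relations."""
    name = entity_name.strip()
    if not name:
        raise ValueError("entity_name must not be empty")
    return CYPHER_TEMPLATE, {"name": name, "limit": int(limit)}

def rows_to_context(rows):
    """Turn result rows into plain-text triples for the augmentation prompt."""
    return "\n".join(f"{r['source']} -[{r['relation']}]-> {r['target']}" for r in rows)

query, params = build_graph_query(" Aspirin ", limit=5)

# Rows shaped like the RETURN clause above; invented for illustration.
sample_rows = [{"source": "Aspirin", "relation": "INTERACTS_WITH", "target": "Warfarin"}]
print(query)
print(params)
print(rows_to_context(sample_rows))
```

Templates are predictable and easy to audit, while LLM-generated queries cover more question shapes at the cost of needing validation before execution.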

What is Agentic RAG and what capabilities do agents add?

Agentic RAG embeds LLM-based agents into the pipeline to make autonomous decisions. Capabilities include: query understanding and routing to the most relevant knowledge source; tool usage (web search, SQL, external APIs); adaptive retrieval (iterative reformulation and re-retrieval); dynamic prompting and iterative retrieval-generation (e.g., review and refine answers). Agents can also enhance indexing via smarter parsing, metadata extraction, task-driven chunking, and embedding/storage choices.
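Query routing, the first capability listed above, can be illustrated with a deliberately simple sketch. A keyword lookup stands in for the LLM that would make the routing decision in an agentic system, and the route names and keyword lists are hypothetical.

```python
import re

# Hypothetical knowledge sources and the cue words that select them.
ROUTES = {
    "sql": ("revenue", "orders", "count"),
    "web_search": ("latest", "today", "news"),
}
DEFAULT_ROUTE = "vector_store"

def route_query(query):
    """Pick a knowledge source for the query; fall back to the vector store."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for route, keywords in ROUTES.items():
        if words & set(keywords):
            return route
    return DEFAULT_ROUTE

for q in ("What is the latest news?", "How many orders last month", "Explain chunking"):
    print(q, "->", route_query(q))
```

Keeping the set of routes small is in line with the advice to control tool counts and decision scope, since every extra route is another place for the router to choose wrongly.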

What are key challenges and best practices for Multimodal, Graph, and Agentic RAG?

- Multimodal: Higher latency/cost; ensure alignment across modalities; include only value-adding modalities; consider text-conversion for simplicity when acceptable.
- Graph: Expensive to build/maintain; start narrow and expand; evaluate retrieval accuracy carefully; expect deployment-specific schemas and updates.
- Agentic: Control tool counts and decision scope; add failsafes and guardrails; monitor compounded error rates in multi-agent setups; match autonomy to required accuracy.
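The point about compounded error rates can be made concrete with simple arithmetic. The sketch below assumes each agent step succeeds independently with the same probability, which is a simplification; the function name and the numbers are illustrative only.

```python
def chain_success_rate(step_success_rate, steps):
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_success_rate ** steps

# Five chained steps that are each 95% reliable.
rate = chain_success_rate(0.95, 5)
print(round(rate, 3))
```

Under these assumptions, five steps at 95% reliability succeed end to end only about 77% of the time, which is why autonomy should be matched to the accuracy the use case requires.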

Which other RAG variants should I know, and what problems do they target?

- Corrective RAG (CRAG): Evaluates retrieved docs; if weak, triggers web search and knowledge refinement; boosts factual accuracy (adds latency, depends on evaluator quality).
- Speculative RAG: Cluster docs; small models draft multiple answers in parallel; a larger model verifies/selects; reduces latency with verification overhead.
- Self RAG: Uses reflection tokens to decide when to retrieve, assess relevance/support, and critique outputs; improves accuracy with extra compute/training needs.
- RAPTOR: Builds a tree of recursive summaries (bottom-up); supports thematic/multi-hop queries with targeted retrieval; computationally heavier and clustering-sensitive.
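The control flow described for Corrective RAG can be sketched as follows. This is a simplified reading with a single relevance threshold, not the method from the CRAG paper; the word-overlap evaluator and the injected `retrieve` and `web_search` callables are hypothetical placeholders for a trained evaluator, a vector store, and a search API.

```python
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, doc):
    """Toy relevance evaluator: share of query words that appear in the doc."""
    q = _tokens(query)
    return len(q & _tokens(doc)) / len(q) if q else 0.0

def corrective_retrieve(query, retrieve, evaluate, web_search, threshold=0.5):
    """Keep retrieved docs that pass evaluation; fall back to web search if none do."""
    docs = retrieve(query)
    kept = [d for d in docs if evaluate(query, d) >= threshold]
    if not kept:
        # Retrieval judged weak: supplement from an external source.
        kept = web_search(query)
    return kept

good = corrective_retrieve(
    "capital of france",
    retrieve=lambda q: ["Paris is the capital of France", "Bananas are yellow"],
    evaluate=overlap_score,
    web_search=lambda q: ["web result"],
)
weak = corrective_retrieve(
    "capital of france",
    retrieve=lambda q: ["Bananas are yellow"],
    evaluate=overlap_score,
    web_search=lambda q: ["web result"],
)
print(good)
print(weak)
```

The evaluation call on every retrieved document is where the added latency comes from, and the quality of the final answer depends on how well the evaluator separates relevant from irrelevant documents.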
