Table of contents

Part 1 Foundations

1 LLMs and the need for RAG

1.1 Curse of the LLMs and the idea of RAG

1.1.1 LLMs are not trained for facts

1.1.2 What is RAG?

1.2 The novelty of RAG

1.2.1 The RAG discovery

1.2.2 How does RAG help?

1.3 Popular RAG use cases

1.3.1 Search Engine Experience

1.3.2 Personalized marketing content generation

1.3.3 Real-time event commentary

1.3.4 Conversational agents

1.3.5 Document question answering systems

1.3.6 Virtual assistants

1.3.7 AI-powered research

1.3.8 Social media monitoring and sentiment analysis

1.3.9 News generation and content curation

Summary

2 RAG systems and their design

2.1 What does a RAG system look like?

2.2 Design of RAG systems

2.3 Indexing pipeline

2.4 Generation pipeline

2.5 Evaluation and monitoring

2.6 The RAGOps Stack

2.7 Caching, guardrails, security, and other layers

Summary

Part 2 Creating RAG systems

3 Indexing pipeline: Creating a knowledge base for RAG

3.1 Data loading

3.2 Data splitting (chunking)

3.2.1 Advantages of chunking

3.2.2 Chunking process

3.2.3 Chunking methods

3.2.4 Choosing a chunking strategy

3.3 Data conversion (embeddings)

3.3.1 What are embeddings?

3.3.2 Common pre-trained embeddings models

3.3.3 Embeddings use cases

3.3.4 How to choose embeddings?

3.4 Storage (vector databases)

3.4.1 What are vector databases?

3.4.2 Types of vector databases

3.4.3 Choosing a vector database

3.4.1 Data loading

3.4.2 Data conversion

4 Generation pipeline: Generating contextual LLM responses

4.1 Generation pipeline overview

4.2 Retrieval

4.2.1 Progression of retrieval methods

4.2.2 Popular retrievers

4.2.3 A simple retriever implementation

4.3 Augmentation

4.3.1 RAG prompt engineering techniques

4.3.2 A simple augmentation prompt creation

4.4 Generation

4.4.1 Categorization of LLMs and suitability for RAG

4.4.2 Completing the RAG pipeline: Generation using LLMs

4.4.1 Retrieval

4.4.2 Augmentation

4.4.3 Generation

5 RAG evaluation: Accuracy, relevance, and faithfulness

5.1 Key aspects of RAG evaluation

5.1.1 Quality scores

5.1.2 Required abilities

5.2 Evaluation metrics

5.2.1 Retrieval metrics

5.2.2 RAG-specific metrics

5.3 Frameworks

5.3.1 RAGAs

5.3.2 Automated RAG evaluation system

5.4 Benchmarks

5.4.1 RGB

5.5 Limitations and best practices

5.5.1 RAG evaluation fundamentals

5.5.2 Evaluation metrics

5.5.3 Evaluation frameworks

5.5.4 Benchmarks

5.5.5 Limitations and best practices

Part 3 RAG in production

6 Progression of RAG systems: Naïve, advanced, and modular RAG

6.1 Limitations of naïve RAG

6.2 Advanced RAG techniques

6.3 Pre-retrieval techniques

6.3.1 Index optimization

6.3.2 Query optimization

6.4 Retrieval strategies

6.4.1 Hybrid retrieval

6.4.2 Iterative retrieval

6.4.3 Recursive retrieval

6.4.4 Adaptive retrieval

6.5 Post-retrieval techniques

6.5.1 Compression

6.6 Modular RAG

6.6.1 Core modules

6.6.2 New modules

6.6.1 Limitations of naïve RAG

6.6.2 Advanced RAG techniques

6.6.3 Modular RAG framework

6.6.4 Tradeoffs and best practices

7 Evolving RAGOps stack

7.1 The evolving RAGOps stack

7.1.1 Critical layers

7.1.2 Essential layers

7.1.3 Enhancement layers

7.2 Production best practices

7.2.1 Critical layers

7.2.2 Essential layers

7.2.3 Enhancement layers

7.2.4 Production best practices

Part 4 Additional considerations

8 Graph, multimodal, agentic, and other RAG variants

8.1 What are RAG variants, and why do we need them?

8.2 Multimodal RAG

8.2.1 Data modality

8.2.2 Multimodal RAG use cases

8.2.3 Multimodal RAG pipelines

8.2.4 Challenges and best practices

8.3 Knowledge graph RAG

8.3.1 Knowledge graphs

8.3.2 Knowledge graph RAG use cases

8.3.3 Graph RAG approaches

8.3.4 Graph RAG pipelines

8.3.5 Challenges and best practices

8.4 Agentic RAG

8.4.1 LLM agents

8.4.2 Agentic RAG capabilities

8.4.3 Agentic RAG pipelines

8.4.4 Challenges and best practices

8.5 Other RAG variants

8.5.1 Corrective RAG

8.5.2 Speculative RAG

8.5.3 Self-reflective (self RAG)

8.5.4 RAPTOR

8.5.1 Introducing RAG variants

8.5.2 Multimodal RAG

8.5.3 Knowledge graph RAG

8.5.4 Agentic RAG

8.5.5 Other RAG variants

9 RAG development framework and further exploration

9.1 RAG development framework

9.1.1 Initiation stage: Defining and scoping the RAG system

9.2 Design stage: Layering the RAGOps stack

9.2.1 Indexing pipeline design

9.2.2 Generation pipeline design

9.2.3 Other design considerations

9.2.4 Development stage: Building modular RAG pipelines

9.2.5 Evaluation stage: Validating and optimizing the RAG system

9.2.6 Deployment stage: Launching and scaling the RAG system

9.2.7 Maintenance stage: Ensuring reliability and adaptability

9.3 Ideas for further exploration

9.3.1 Fine-tuning within RAG

9.3.2 Long-context windows in LLMs

9.3.3 Managed solutions

9.3.4 Difficult queries

9.3.1 RAG development framework

9.3.2 RAG development framework stages

9.3.3 Best practices in RAG development

9.3.4 Ideas for further exploration

Overview

5 RAG Evaluation: Accuracy, Relevance, Faithfulness

This chapter explains why rigorous evaluation is essential for Retrieval-Augmented Generation: it verifies that retrieved context is relevant and that generated answers are grounded and useful. It frames quality along three core dimensions—context relevance, answer faithfulness (groundedness), and answer relevance—and highlights system abilities that matter in practice: noise robustness, negative rejection, information integration, and counterfactual robustness. Beyond accuracy, it urges attention to latency, robustness across query types, and ethical considerations such as bias and toxicity. Because RAG serves diverse applications, the chapter stresses designing use‑case‑specific criteria alongside general measures.

The chapter groups metrics into retrieval metrics (precision, recall, F1, MRR, MAP, nDCG) and RAG‑specific metrics that directly assess the three quality scores. It discusses how human judgments and ground truth datasets anchor evaluations, and how synthetic data can scale this process. Frameworks help automate end‑to‑end assessment: RAGAs offers fast, practical evaluation and synthetic test generation using an LLM-as-judge, while ARES trains classifiers and reports confidence intervals, generating both positive and negative examples. Interpreting metric trends guides improvements—e.g., precision/recall trade‑offs for retrievers, ranking issues indicated by MRR/nDCG, or prompt/LLM adjustments for faithfulness and answer relevance.

For comparison across systems, the chapter reviews benchmarks: classic QA sets (such as SQuAD, Natural Questions, HotpotQA) and BEIR emphasize retrieval, while newer RAG-focused suites—RGB (robustness and rejection), Multihop RAG (multi-document reasoning and null queries), and CRAG (diverse domains with four-class grading)—provide more holistic coverage. It also outlines current limitations: inconsistent metric definitions, reliance on LLMs as judges (and potential self-reference), static benchmarks that miss evolving knowledge, and scalability/cost concerns. Recommended practices include combining multiple frameworks and metrics, adding human review, tailoring evaluations to domain and task, tracking latency and safety signals, using different judge models, and regularly updating datasets and methods as the field evolves.

Precision and Recall

F1-score balances precision and recall. Moderate values of both precision and recall yield a higher F1-score than one very high value paired with one very low value.
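A minimal sketch of the point made in this caption, using the standard definition of F1 as the harmonic mean of precision and recall. The two precision/recall pairs are illustrative values chosen for this example, not figures from the chapter.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values (assumed for this example only).
balanced = f1_score(0.5, 0.5)  # moderate precision and moderate recall
lopsided = f1_score(0.9, 0.1)  # very high precision, very low recall

print(round(balanced, 2), round(lopsided, 2))  # 0.5 0.18
```

Both pairs have the same arithmetic mean (0.5), but the harmonic mean is pulled toward the smaller value and so penalizes the lopsided pair.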

MRR considers ranking, but only the position of the first relevant document rather than all retrieved documents

MAP considers all the retrieved documents and gives a higher score for better ranking

nDCG addresses degrees of relevance in documents and penalizes incorrect ranking

Context relevance evaluates the degree to which the retrieved information is relevant to the query

Answer Faithfulness evaluates the closeness of the generated response to the retrieved context

Answer relevance is calculated as the mean cosine similarity between the original question and synthetic questions generated from the answer.
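The calculation in this caption can be sketched as follows. The sketch assumes the embeddings already exist: in practice an LLM would generate the synthetic questions from the answer and an embedding model would turn each question into a vector. The function names and the two-dimensional toy vectors are hypothetical stand-ins, not part of any framework's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(original_question_vec, synthetic_question_vecs):
    """Mean cosine similarity between the original question and each synthetic question."""
    if not synthetic_question_vecs:
        raise ValueError("at least one synthetic question embedding is required")
    sims = [cosine_similarity(original_question_vec, v) for v in synthetic_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed): one synthetic question matches the original, one is unrelated.
score = answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)  # 0.5
```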

Synthetic ground-truth data generation using RAGAs

Synthetic test data generated using RAGAs

BEIR – 9 tasks and 18 (of 19) datasets (Source: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models https://arxiv.org/pdf/2104.08663v4)

Four abilities required of RAG systems (Source: Benchmarking Large Language Models in Retrieval-Augmented Generation, Chen et al., https://arxiv.org/pdf/2309.0143)

8 question types in CRAG.

FAQ

Why evaluate a RAG pipeline at all?

To verify two things: (1) the retriever returns context that is relevant to the query, and (2) the generator produces answers grounded in that context. Evaluation establishes a performance baseline, guides improvements, and enables fair comparison with other systems via standardized metrics, frameworks, and benchmarks.

What are the three core RAG quality scores?

- Context Relevance: How well the retrieved passages align with the user query.
- Answer Faithfulness (groundedness): Whether the answer’s claims are supported by the retrieved context (higher faithfulness means fewer hallucinations).
- Answer Relevance: How directly and completely the answer addresses the original query (not the same as truthfulness).
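As a rough illustration of how answer faithfulness can be turned into a number, the sketch below assumes the answer has already been split into individual claims and that a judge (an LLM or a human) has labeled each claim as supported or unsupported by the retrieved context. The score is the fraction of supported claims. The function name and the verdict list are hypothetical; the claim extraction and judging steps are not shown.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of answer claims judged as supported by the retrieved context.

    claim_verdicts: list of booleans, one per claim (True = supported).
    Returns 0.0 for an answer with no extractable claims (an assumed convention).
    """
    if not claim_verdicts:
        return 0.0
    supported = sum(1 for verdict in claim_verdicts if verdict)
    return supported / len(claim_verdicts)


# Assumed judge output: four claims, one of them not found in the context.
print(faithfulness_score([True, True, False, True]))  # 0.75
```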

Which abilities should a robust RAG system demonstrate?

- Noise Robustness: Filter out related-but-useless passages.
- Negative Rejection: Say “I don’t know” when the KB lacks relevant info.
- Information Integration: Combine evidence across multiple documents.
- Counterfactual Robustness: Detect and reject incorrect or conflicting context.

Which retrieval metrics should I track, and what do they capture?

- Precision / Recall / F1: Basic quality and coverage (don’t consider ranking).
- Precision@k: Quality of the top-k results that will feed augmentation.
- MRR: How early the first relevant result appears.
- MAP: Precision across recall levels with ranking sensitivity.
- nDCG: Ranking quality with graded (multi-level) relevance.
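The ranking-aware metrics in this list can be sketched for a single query as shown below. The document IDs, relevance labels, and graded gains are made up for illustration. MRR and MAP are the means of the per-query reciprocal rank and average precision over a set of queries. The nDCG here uses the common linear-gain, log2-discount form; other variants exist.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values taken at each rank where a relevant result appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking (graded relevance)."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc, 0) for doc in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical single-query example.
ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best first
relevant = {"d1", "d2"}             # binary ground truth
gains = {"d1": 3, "d2": 1}          # graded ground truth for nDCG

print(precision_at_k(ranked, relevant, 2))    # 0.5
print(reciprocal_rank(ranked, relevant))      # 0.5
print(average_precision(ranked, relevant))    # 0.5
print(round(ndcg_at_k(ranked, gains, 4), 3))
```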

How do RAG-specific metrics differ from common NLG metrics like BLEU/ROUGE?

BLEU/ROUGE focus on surface overlap and fluency; they don’t assess grounding in retrieved context. RAG-specific metrics (context relevance, answer faithfulness, answer relevance) explicitly measure whether the system retrieved the right evidence, used it faithfully, and answered the user’s question.

How does the “LLM as a judge” approach work, and what are its risks?

An LLM scores relevance/faithfulness by classifying claims or comparing questions/answers. Risks include dependence on the judge model’s quality, potential bias, and “self-reference” if the same model generates and judges. Mitigations: use a different judge LLM, sample and manually spot-check, or ensemble multiple judges.

What is RAGAs and when should I use it?

RAGAs is an evaluation framework that: (1) synthetically generates test Q-C-A data from your corpus, (2) computes RAG-specific and related metrics (e.g., context precision/recall, faithfulness, answer relevancy), and (3) supports production monitoring. Use it for quick, practical evaluations without heavy human annotation.

What is ARES and how does it differ from RAGAs?

ARES (Stanford/Databricks) also uses LLM-as-judge but: (1) trains a classifier instead of relying solely on prompt heuristics, (2) reports confidence intervals via prediction-powered inference, and (3) builds positive/negative synthetic triples with a human-preference validation set. Use it for deeper, statistically grounded analysis (with more setup).
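To give a feel for what a prediction-powered confidence interval looks like, here is a simplified sketch of the prediction-powered estimate of a mean score. It is not ARES's code, and ARES's own implementation may differ. The idea: average the judge's scores on a large unlabeled set, then correct that average with the judge's measured error on a small human-labeled set. The function name and all input values are hypothetical, and a normal approximation is assumed for the interval.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_interval(judge_unlabeled, judge_labeled, human_labeled, confidence=0.95):
    """Prediction-powered estimate of a mean score with a normal-approximation interval.

    judge_unlabeled: judge scores on examples without human labels
    judge_labeled:   judge scores on the human-labeled examples
    human_labeled:   human scores on those same examples (same order)
    Returns (lower, estimate, upper).
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must align")
    n, big_n = len(human_labeled), len(judge_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two examples in each set")

    # Rectifier: how far the judge is off, on average, where humans checked it.
    rectifier = [h - j for h, j in zip(human_labeled, judge_labeled)]
    estimate = mean(judge_unlabeled) + mean(rectifier)

    std_err = math.sqrt(variance(judge_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_err, estimate, estimate + z * std_err


# Hypothetical binary faithfulness verdicts (1.0 = faithful).
low, est, high = ppi_mean_interval(
    judge_unlabeled=[1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    judge_labeled=[1.0, 1.0, 0.0, 1.0],
    human_labeled=[1.0, 0.0, 0.0, 1.0],
)
print(est)  # 0.5
```

In this toy data the judge alone reports 0.75, but it overrated one of the four human-checked examples, so the corrected estimate drops to 0.5. With so few labeled examples the interval is very wide; real use would need far larger sets.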

Which benchmarks are popular for RAG and what do they test?

- Retrieval-focused: SQuAD, Natural Questions, HotpotQA, and BEIR (ranking via nDCG); these mainly test retrieval quality.
- RAG-oriented: RGB (noise/negative rejection/integration/counterfactual), Multihop RAG (multi-doc, inference/comparison/temporal/null), CRAG (8 question types, graded scoring). Choose based on the abilities and domains you need.

What are key limitations and best practices in RAG evaluation?

Limitations: no single standard for RAG metrics, reliance on judge LLMs, static benchmarks, cost/latency blind spots. Best practices: combine multiple frameworks/metrics, tailor evaluations to your use case, use a different judge LLM (or ensembles), include human review, track latency/bias/toxicity, and use domain-relevant and regularly updated datasets.
