Table of contents

Part 1 Foundations

1 LLMs and the need for RAG

1.1 Curse of the LLMs and the idea of RAG

1.1.1 LLMs are not trained for facts

1.1.2 What is RAG?

1.2 The novelty of RAG

1.2.1 The RAG discovery

1.2.2 How does RAG help?

1.3 Popular RAG use cases

1.3.1 Search Engine Experience

1.3.2 Personalized marketing content generation

1.3.3 Real-time event commentary

1.3.4 Conversational agents

1.3.5 Document question answering systems

1.3.6 Virtual assistants

1.3.7 AI-powered research

1.3.8 Social media monitoring and sentiment analysis

1.3.9 News generation and content curation

Summary

2 RAG systems and their design

2.1 What does a RAG system look like?

2.2 Design of RAG systems

2.3 Indexing pipeline

2.4 Generation pipeline

2.5 Evaluation and monitoring

2.6 The RAGOps Stack

2.7 Caching, guardrails, security, and other layers

Summary

Part 2 Creating RAG systems

3 Indexing pipeline: Creating a knowledge base for RAG

3.1 Data loading

3.2 Data splitting (chunking)

3.2.1 Advantages of chunking

3.2.2 Chunking process

3.2.3 Chunking methods

3.2.4 Choosing a chunking strategy

3.3 Data conversion (embeddings)

3.3.1 What are embeddings?

3.3.2 Common pre-trained embeddings models

3.3.3 Embeddings use cases

3.3.4 How to choose embeddings?

3.4 Storage (vector databases)

3.4.1 What are vector databases?

3.4.2 Types of vector databases

3.4.3 Choosing a vector database

3.4.1 Data loading

3.4.2 Data conversion

4 Generation pipeline: Generating contextual LLM responses

4.1 Generation pipeline overview

4.2 Retrieval

4.2.1 Progression of retrieval methods

4.2.2 Popular retrievers

4.2.3 A simple retriever implementation

4.3 Augmentation

4.3.1 RAG prompt engineering techniques

4.3.2 A simple augmentation prompt creation

4.4 Generation

4.4.1 Categorization of LLMs and suitability for RAG

4.4.2 Completing the RAG pipeline: Generation using LLMs

4.4.1 Retrieval

4.4.2 Augmentation

4.4.3 Generation

5 RAG evaluation: Accuracy, relevance, and faithfulness

5.1 Key aspects of RAG evaluation

5.1.1 Quality scores

5.1.2 Required abilities

5.2 Evaluation metrics

5.2.1 Retrieval metrics

5.2.2 RAG-specific metrics

5.3 Frameworks

5.3.1 RAGAs

5.3.2 Automated RAG evaluation system

5.4 Benchmarks

5.4.1 RGB

5.5 Limitations and best practices

5.5.1 RAG evaluation fundamentals

5.5.2 Evaluation metrics

5.5.3 Evaluation frameworks

5.5.4 Benchmarks

5.5.5 Limitations and best practices

Part 3 RAG in production

6 Progression of RAG systems: Naïve, advanced, and modular RAG

6.1 Limitations of naïve RAG

6.2 Advanced RAG techniques

6.3 Pre-retrieval techniques

6.3.1 Index optimization

6.3.2 Query optimization

6.4 Retrieval strategies

6.4.1 Hybrid retrieval

6.4.2 Iterative retrieval

6.4.3 Recursive retrieval

6.4.4 Adaptive retrieval

6.5 Post-retrieval techniques

6.5.1 Compression

6.6 Modular RAG

6.6.1 Core modules

6.6.2 New modules

6.6.1 Limitations of naïve RAG

6.6.2 Advanced RAG techniques

6.6.3 Modular RAG framework

6.6.4 Tradeoffs and best practices

7 Evolving RAGOps stack

7.1 The evolving RAGOps stack

7.1.1 Critical layers

7.1.2 Essential layers

7.1.3 Enhancement layers

7.2 Production best practices

7.2.1 Critical layers

7.2.2 Essential layers

7.2.3 Enhancement layers

7.2.4 Production best practices

Part 4 Additional considerations

8 Graph, multimodal, agentic, and other RAG variants

8.1 What are RAG variants, and why do we need them?

8.2 Multimodal RAG

8.2.1 Data modality

8.2.2 Multimodal RAG use cases

8.2.3 Multimodal RAG pipelines

8.2.4 Challenges and best practices

8.3 Knowledge graph RAG

8.3.1 Knowledge graphs

8.3.2 Knowledge graph RAG use cases

8.3.3 Graph RAG approaches

8.3.4 Graph RAG pipelines

8.3.5 Challenges and best practices

8.4 Agentic RAG

8.4.1 LLM agents

8.4.2 Agentic RAG capabilities

8.4.3 Agentic RAG pipelines

8.4.4 Challenges and best practices

8.5 Other RAG variants

8.5.1 Corrective RAG

8.5.2 Speculative RAG

8.5.3 Self-reflective (self RAG)

8.5.4 RAPTOR

8.5.1 Introducing RAG variants

8.5.2 Multimodal RAG

8.5.3 Knowledge graph RAG

8.5.4 Agentic RAG

8.5.5 Other RAG variants

9 RAG development framework and further exploration

9.1 RAG development framework

9.1.1 Initiation stage: Defining and scoping the RAG system

9.2 Design stage: Layering the RAGOps stack

9.2.1 Indexing pipeline design

9.2.2 Generation pipeline design

9.2.3 Other design considerations

9.2.4 Development stage: Building modular RAG pipelines

9.2.5 Evaluation stage: Validating and optimizing the RAG system

9.2.6 Deployment stage: Launching and scaling the RAG system

9.2.7 Maintenance stage: Ensuring reliability and adaptability

9.3 Ideas for further exploration

9.3.1 Fine-tuning within RAG

9.3.2 Long-context windows in LLMs

9.3.3 Managed solutions

9.3.4 Difficult queries

9.3.1 RAG development framework

9.3.2 RAG development framework stages

9.3.3 Best practices in RAG development

9.3.4 Ideas for further exploration

Overview

4 Generation Pipeline: Generating Contextual LLM Responses

This chapter explains how a Retrieval Augmented Generation (RAG) system turns a user query into a grounded answer by orchestrating three steps—retrieval, augmentation, and generation—on top of a knowledge base created in the indexing pipeline. It frames the generation pipeline as the bridge between stored non‑parametric knowledge and the LLM, emphasizing that the quality of each step compounds in the final response. Readers are guided from concepts to practice with a compact, end‑to‑end Python walkthrough so they can stand up a basic RAG system and understand the design choices that affect accuracy, cost, and reliability.

The chapter first focuses on retrieval, defining retrievers as components that search a vectorized knowledge base and return the most relevant documents to a query. It traces the evolution from classical IR (Boolean, Bag‑of‑Words, TF‑IDF) to stronger ranking with BM25, and then to embeddings-based search, contrasting static (Word2Vec, GloVe) with contextual embeddings (e.g., transformer-based). It highlights cosine similarity ranking, the tight coupling of indexing and retrieval, and practical options such as vector stores (FAISS, Pinecone, Milvus, Weaviate), cloud search services, and domain sources (e.g., Wikipedia, ArXiv). A simple retriever is implemented using OpenAI embeddings with FAISS similarity search, underscoring that retriever quality largely determines downstream answer quality and that hybrid and reranking strategies are common in production.
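The step from TF-IDF to BM25 mentioned above can be illustrated with a small pure-Python scorer. This is only a sketch under stated assumptions: whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, a smoothed IDF variant, and three made-up documents. It is not the chapter's own implementation.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use proper analyzers.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF, which stays non-negative even for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus: the first two documents mention "cat" once each,
# but the second is much longer.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat while the dog slept on the rug in the hall",
    "stock markets fell sharply today",
]
scores = bm25_scores("cat", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Both matching documents contain the query term once, yet the shorter one scores higher, which is the length effect that distinguishes BM25 from plain TF-IDF.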

Next, augmentation is presented as prompt engineering that fuses the user query with retrieved context, starting with contextual prompting and controlled generation (instructing the model to say “I don’t know” when context is insufficient). It then introduces few-shot examples for format and style and Chain-of-Thought for reasoning, and points to advanced techniques such as self-consistency, Tree-of-Thoughts, and tool-assisted prompting. The generation section frames model choice for RAG across foundation versus fine-tuned models, open-source versus proprietary options, and small versus large model trade-offs (customization, cost, deployment, reasoning, and context length). The chapter closes by completing the pipeline, feeding the augmented prompt to an LLM (e.g., a GPT-4-class model), and reiterates that, with the indexing and generation pipelines in place, readers can build a working RAG system, while recognizing that rigorous evaluation and advanced strategies further improve robustness and fidelity.

Generation Pipeline Overview with its three components: retrieval, augmentation, and generation

A Retriever searches through the knowledge base and returns the most relevant documents

Calculating TF-IDF to rank documents based on search terms

BM25 also considers the length of the documents

Static Embeddings vs Contextual Embeddings

Similarity calculation and results ranking in embeddings-based retrieval technique

Simple augmentation combines the user query with retrieved documents to send to the LLM

Information augmented to the original question with an added instruction

Example of Few Shot Prompting in the context of RAG

Chain-of-Thought (CoT) prompting for reasoning tasks

Supervised fine-tuning is a classification model training process

FAQ

What is the Generation Pipeline in a RAG system?

The Generation Pipeline has three steps: Retrieval (find relevant documents from the knowledge base), Augmentation (merge the user’s query with the retrieved context into a prompt), and Generation (use an LLM to produce the final answer based on that augmented prompt).

Why are indexing and retrieval tightly coupled?

Data is indexed in a specific representation (e.g., embeddings). To retrieve it effectively, you must query using the same representation and similarity metric. In practice, if you index with embeddings, you also retrieve with embeddings and cosine similarity (or a related metric).
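A minimal sketch of this coupling follows. It assumes a hypothetical `embed` function (simple word counts over a toy vocabulary) standing in for a real embedding model. What matters is that the same `embed` function is applied at indexing time and at query time, and that ranking uses cosine similarity.

```python
import math

# Hypothetical toy vocabulary; a real system would call an embedding model instead.
VOCAB = ["rag", "retrieval", "llm", "prompt", "vector", "index"]

def embed(text):
    # Stand-in embedding: one dimension per vocabulary word, valued by its count.
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

chunks = [
    "a vector index stores embeddings",
    "the llm reads the prompt",
    "retrieval feeds the rag prompt",
]
# Indexing: every chunk is stored alongside its embedding.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    query_vec = embed(query)  # same representation as at indexing time
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("how does retrieval work in rag", k=1)
```

If the query were embedded with a different model or vocabulary than the chunks, the two vectors would live in different spaces and the similarity scores would be meaningless.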

Which retrieval methods are commonly used in RAG, and what are their trade-offs?

Popular methods include TF-IDF (simple, fast, limited semantics), BM25 (strong keyword retrieval with length normalization), static embeddings (capture some semantics but not context), and contextual embeddings (rich semantics, handle polysemy, higher compute). Advanced options include dense, sparse, and hybrid retrieval, cross-encoder rerankers, and graph-based approaches.
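As a rough illustration of the simplest method in that list, the sketch below ranks documents by the summed TF-IDF weight of the query terms. It assumes whitespace tokenization, length-normalized term frequency, and an unsmoothed logarithmic IDF; the three documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf_rank(query, docs):
    """Return (doc_index, score) pairs sorted by summed TF-IDF of the query terms."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    results = []
    for i, toks in enumerate(tokenized):
        if not toks:
            results.append((i, 0.0))
            continue
        counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term never appears in the corpus
            tf = counts[term] / len(toks)      # term frequency, normalized by length
            idf = math.log(n_docs / df[term])  # rarer terms weigh more
            score += tf * idf
        results.append((i, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy corpus.
docs = [
    "vector search with embeddings",
    "keyword search with inverted indexes",
    "cooking pasta at home",
]
ranked = tf_idf_rank("vector search", docs)
```

The first document wins because it contains the rarer term "vector"; the scoring has no notion that "embeddings" and "vector" are related, which is the "limited semantics" trade-off.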

When should I use contextual embeddings over TF-IDF/BM25?

Use contextual embeddings when you need semantic matching, disambiguation (e.g., “apple” fruit vs company), and robustness to paraphrasing. They’re preferred for most RAG use cases, though they cost more to compute than sparse keyword methods.

What is hybrid retrieval and why is reranking helpful?

Hybrid retrieval combines sparse (e.g., BM25) and dense (embeddings) methods to balance recall and semantic precision. Often a fast sparse or dense retriever gets candidates, then a stronger model (e.g., cross-encoder reranker) refines ranking for higher answer quality.
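The text does not say how the sparse and dense result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the book's method. The two input rankings are hard-coded, hypothetical retriever outputs, and `k=60` is a conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). k=60 is a conventional default, not a tuned value.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

sparse_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical output of a BM25 retriever
dense_ranking = ["doc_b", "doc_d", "doc_a"]   # hypothetical output of an embeddings retriever
candidates = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that rank well in both lists rise to the top. In a two-stage setup, the fused candidate list would then be passed to a reranker such as a cross-encoder for the final ordering.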

What are popular retriever tools and integrations I can use?

Vector stores (FAISS, Pinecone, Milvus, Weaviate) provide similarity search and hybrid search. Cloud services include Amazon Kendra, Azure AI Search, and Google Vertex AI Search. Web sources like Wikipedia and ArXiv have dedicated connectors. LangChain abstracts many of these as plug-and-play retrievers.

What does augmentation mean in RAG, and which prompt techniques work best?

Augmentation builds a prompt by combining the user question with retrieved context and clear instructions. Effective techniques include Contextual Prompting (use only provided context), Controlled Generation (say “I don’t know” if context is insufficient), Few-Shot (examples for format/style), and Chain-of-Thought (step-by-step reasoning). Advanced techniques include Self-Consistency, Tree of Thoughts, ReAct, and more.
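A small sketch of such a prompt builder is shown below. The function name, the instruction wording, and the bracketed chunk numbering are illustrative assumptions; the structure (instructions, optional few-shot examples, context, question) follows the techniques named above.

```python
def build_augmented_prompt(question, retrieved_chunks, examples=None):
    """Combine question, retrieved context, and instructions into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    parts = [
        # Contextual prompting: restrict the model to the supplied context.
        "Answer the question using only the context below.",
        # Controlled generation: give the model an explicit way out.
        "If the context is insufficient, reply with: I don't know.",
    ]
    if examples:
        # Few-shot prompting: (question, answer) pairs that show format and style.
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
        parts.append("Examples:\n" + shots)
    parts.append("Context:\n" + context)
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_augmented_prompt(
    "Which retriever accounts for document length?",
    ["BM25 also considers the length of the documents."],
    examples=[("What does TF-IDF rank by?", "Term frequency and inverse document frequency.")],
)
```

Chain-of-Thought could be added in the same way, as one more instruction line asking the model to reason step by step before answering.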

How can I reduce hallucinations in RAG answers?

Use Controlled Generation instructions (answer only from the given context; otherwise state uncertainty), improve retrieval quality (better embeddings, hybrid search, reranking), and ensure the augmented prompt is explicit about grounding responses in the supplied documents.

How do I choose an LLM for my RAG system?

Consider: Foundation vs Fine-Tuned (speed vs domain/task optimization), Open-Source vs Proprietary (customization/deployment control vs ease, support, managed services), and Model Size (large for reasoning and long context; small for speed and edge deployment). Also weigh domain specificity, cost, privacy, and integration needs.

What does a minimal end-to-end RAG implementation look like?

Index documents as embeddings in a vector store (e.g., FAISS). At query time, retrieve the most similar chunks (similarity search). Build an augmented prompt that includes the question, retrieved context, and guardrails (e.g., “use only the context; say if unknown”). Send that prompt to an LLM (e.g., GPT-4o) and return the grounded answer.
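The wiring of these steps can be sketched without committing to any particular library. In the sketch below, `overlap_retrieve` and `stub_llm` are hypothetical stand-ins for a vector-store similarity search and a real LLM call; only the retrieve, augment, generate control flow is the point.

```python
from typing import Callable, List, Sequence

def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], Sequence[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    chunks = retrieve(question, k)                 # 1. retrieval
    context = "\n".join(f"- {c}" for c in chunks)
    prompt = (                                     # 2. augmentation
        "Use only the context to answer. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                        # 3. generation

# Hypothetical knowledge base and stand-in components for illustration only.
KNOWLEDGE_BASE = [
    "FAISS is a library for vector similarity search.",
    "BM25 ranks documents by keyword relevance.",
]

def overlap_retrieve(question: str, k: int) -> List[str]:
    # Stand-in for a vector-store search: rank chunks by shared words.
    q_words = set(question.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

received_prompts: List[str] = []

def stub_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; it records the prompt and returns a fixed string.
    received_prompts.append(prompt)
    return "stub answer"

answer = answer_with_rag("what is faiss", overlap_retrieve, stub_llm, k=1)
```

Passing the retriever and the generator in as callables keeps the pipeline testable and makes it straightforward to swap the stand-ins for a real vector store and LLM client later.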

