Overview

1 Large Language Models and the Need for Retrieval Augmented Generation

This chapter introduces the promise and pitfalls of modern Large Language Models and motivates Retrieval Augmented Generation as a practical remedy. It sets the stage by explaining why LLMs have become central to language tasks while noting their limits in accuracy, recency, and access to proprietary knowledge. The chapter’s goal is to build a foundation for designing, implementing, and evaluating RAG systems so readers can confidently apply them to real-world problems.

It explains LLMs as next-token predictors trained on massive text corpora using transformer architectures, available as powerful foundation models or smaller task-specific variants. Readers learn how to work with LLMs through prompts and inference, and how prompt engineering (roles, examples, clear instructions) can improve results. Key operational ideas such as context windows, temperature, few-shot prompting, and in-context learning are introduced, alongside a quick survey of common applications like writing, summarization, translation, coding, classification, information extraction, and conversational interfaces.
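
To make prompts, completions, and inference concrete, here is a minimal sketch of a single inference call. It assumes the openai Python package and an OPENAI_API_KEY environment variable; the model name, role instruction, and temperature setting are illustrative assumptions, not recommendations from the chapter.

# Minimal sketch of LLM inference: send a prompt, receive a completion.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # Assigning a role is a basic prompt-engineering technique.
        {"role": "system", "content": "You are a concise cricket historian."},
        {"role": "user", "content": "Summarize the 2011 Cricket World Cup final in two sentences."},
    ],
    temperature=0.2,  # lower values make output less random and more repeatable
)

print(response.choices[0].message.content)  # the completion

Raising the temperature toward 1.0 makes sampled completions more varied, which suits creative writing better than factual question answering.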

The chapter then details why RAG is needed: knowledge cutoffs, hallucinations, and lack of non-public context limit LLM reliability. RAG addresses these by retrieving relevant external information (non-parametric memory), augmenting the prompt, and letting the model generate grounded, up-to-date, and context-aware answers—often with source attribution—without costly retraining. It frames RAG as combining parametric and non-parametric memory, highlights the resulting gains in factuality and trust, and surveys prominent uses including next-gen search experiences, personalized content, real-time commentary, support agents, document Q&A, virtual assistants, and AI-assisted research.
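
The retrieve-augment-generate loop can be sketched in a few lines of Python. The snippet below is a toy illustration under stated assumptions (the openai package, an API key, an invented three-document corpus); it scores documents by naive keyword overlap, whereas real systems typically use embedding-based vector search.

# Toy end-to-end RAG sketch: retrieve, augment, generate.
from openai import OpenAI

# Non-parametric memory: an external knowledge source (here, three strings).
corpus = [
    "Australia won the 2023 Cricket World Cup, beating India in the final.",
    "The 2023 Cricket World Cup final was played in Ahmedabad on 19 November 2023.",
    "England won the 2019 Cricket World Cup.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

question = "Who won the 2023 cricket world cup?"
context = "\n".join(retrieve(question, corpus))

# Augment: prepend the retrieved passages to the user's question.
augmented_prompt = (
    f"Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

# Generate: the LLM answers grounded in the retrieved, up-to-date context.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(response.choices[0].message.content)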

Figures

  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 1). Source: screenshot by the author of his account on https://chat.openai.com
  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 2). Source: screenshot by the author of his account on https://chat.openai.com
  • Wikipedia article on the 2023 Cricket World Cup. Source: https://en.wikipedia.org/wiki/2023_Cricket_World_Cup
  • ChatGPT response to the question, augmented with external context. Source: screenshot by the author of his account on https://chat.openai.com
  • Retrieval Augmented Generation: a simple definition
  • Google Trends for “Generative AI” and “Large Language Models”, Nov ’22 to Nov ’23
  • Two token-prediction techniques: causal language model and masked language model
  • Illustrative probability distribution of words following “The Teacher”
  • Transformer architecture. Source: “Attention Is All You Need,” Vaswani et al.
  • Popular proprietary and open-source LLMs as of April 2024 (non-exhaustive list)
  • Prompt, completion, and inference
  • RAG enhances the parametric memory of an LLM by creating access to non-parametric memory

Summary

  • RAG enhances the memory of LLMs by creating access to external information.
  • LLMs are next-word (or next-token) prediction models trained on massive amounts of text data to generate human-like text.
  • Interaction with LLMs is carried out using natural-language prompts, and prompt engineering is an important discipline.
  • LLMs have a knowledge cut-off date and are trained only on publicly available data. They are also prone to generating factually incorrect information (hallucinations).
  • RAG overcomes these limitations by incorporating non-parametric memory, increasing the context awareness and reliability of responses.
  • Popular use cases of RAG include search engines, document question answering systems, conversational agents, personalized content generation, and virtual assistants, among others.

FAQ

What is Retrieval Augmented Generation (RAG)?
RAG is a technique that improves Large Language Model (LLM) answers by fetching relevant information from external sources, adding that context to the user’s prompt, and then letting the LLM generate a response. By grounding the model in up-to-date, task-specific knowledge, RAG makes outputs more accurate and trustworthy.

Why do Large Language Models need RAG?
LLMs have inherent limitations: they can be outdated due to a knowledge cut-off, may hallucinate confident but incorrect facts, and typically lack access to proprietary or non-public data. RAG addresses these by supplying fresh, verified, and domain-specific context at inference time.

How does RAG work at a high level?
RAG follows three steps: 1) Retrieve: find relevant passages from an external knowledge source (e.g., documents, wikis, APIs). 2) Augment: attach the retrieved passages to the prompt as context. 3) Generate: have the LLM produce an answer grounded in that context.

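The "Augment" step is ordinary prompt construction. Here is a minimal sketch of that step alone; the template wording and passage-numbering scheme are assumptions for illustration, and production systems tune such templates carefully.

# Sketch of the "Augment" step: retrieved passages become prompt context.
def build_augmented_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model can cite them in its answer.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the numbered context passages, "
        "and cite the passage numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_augmented_prompt(
    "Who won the 2023 cricket world cup?",
    ["Australia won the 2023 Cricket World Cup, beating India in the final."],
))
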
What are Large Language Models and how do they generate text?
LLMs are transformer-based models trained on massive text corpora to predict the next token in a sequence. By learning statistical patterns of language, they can write, summarize, translate, and converse. During inference, they sample likely next tokens to produce fluent, human-like text.

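That sampling step can be sketched with a temperature-scaled softmax. The vocabulary and scores below are invented to echo the chapter's "The Teacher" illustration; real models compute scores over vocabularies of tens of thousands of tokens.

# Toy next-token sampling: softmax over made-up scores, then weighted choice.
import math
import random

# Hypothetical model scores (logits) for the word after "The Teacher".
logits = {"explains": 2.0, "teaches": 1.5, "said": 1.0, "banana": -2.0}

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    # Temperature rescales scores: <1.0 sharpens the distribution, >1.0 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / z for tok, s in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample_next_token(logits, temperature=0.7))
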
What’s the difference between parametric and non-parametric memory?
Parametric memory is the knowledge stored in an LLM’s learned parameters during training. It is fixed after training and limited by training data and model size. Non-parametric memory is external knowledge (documents, databases, the web) retrieved at runtime. RAG combines both, augmenting limited parametric memory with flexible, updatable non-parametric memory.

How do I interact with an LLM, and what is prompt engineering?
Users provide a prompt (input) and receive a completion (output) during inference. Prompt engineering improves results by specifying roles, providing examples (few-shot), and giving clear instructions. Key controls include the context window (max tokens for input plus output) and temperature (randomness of generation).

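For instance, few-shot prompting bakes worked examples directly into the prompt so the model can learn the pattern in context; the classification task and labels below are illustrative assumptions.

# Sketch of a few-shot prompt: a role, two worked examples, then the new task.
few_shot_prompt = """You are a support-ticket classifier. Label each ticket
as BILLING, TECHNICAL, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The app crashes when I upload a file."
Label: TECHNICAL

Ticket: "Where can I download my invoice history?"
Label:"""

# Sent as a prompt, the completion is expected to continue the pattern
# with a single label, without any change to the model's parameters.
print(few_shot_prompt)
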
When should I use RAG vs. fine-tuning or training from scratch?
Use RAG when you need up-to-date facts, citations, or access to proprietary data with minimal cost and quick iteration. Choose supervised fine-tuning when you need the model to consistently follow new formats or domain styles beyond what prompts can achieve. Training from scratch is reserved for highly specialized domains with unique vocabularies and abundant domain data, but it is costly and time-consuming.

How does RAG reduce hallucinations and increase reliability?
By grounding the model in retrieved, sourceable context, RAG constrains generation to relevant facts, lowers the chance of fabricated details, and enables citation of sources so users can verify answers.

What are popular use cases for RAG?

  • LLM-first search experiences with natural-language answers and citations
  • Personalized marketing content grounded in brand and audience data
  • Real-time event commentary via live data feeds
  • Conversational support agents using manuals, policies, and FAQs
  • Document question answering over proprietary corpora
  • Virtual assistants personalized with user and context data
  • Research assistants in law, finance, and ESG analysis

What key terms should I know before building a RAG system?

  • Context window: token limit for prompt plus output
  • Temperature: controls randomness; higher = more diverse
  • Few-shot prompting: include examples in the prompt
  • In-context learning: the model uses provided context without changing its parameters
  • SFT (Supervised Fine-Tuning): further trains the model on labeled examples
  • SLMs: smaller, faster models for narrower tasks
  • Bias/toxicity: risk from training data that must be mitigated
