Overview

1 Large Language Models and the Need for Retrieval Augmented Generation

This chapter introduces the promise and pitfalls of modern Large Language Models and motivates Retrieval Augmented Generation as a practical remedy. It sets the stage by explaining why LLMs have become central to language tasks while noting their limits in accuracy, recency, and access to proprietary knowledge. The chapter’s goal is to build a foundation for designing, implementing, and evaluating RAG systems so readers can confidently apply them to real-world problems.

It explains LLMs as next-token predictors trained on massive text corpora using transformer architectures, available as powerful foundation models or smaller task-specific variants. Readers learn how to work with LLMs through prompts and inference, and how prompt engineering (roles, examples, clear instructions) can improve results. Key operational ideas such as context windows, temperature, few-shot prompting, and in-context learning are introduced, alongside a quick survey of common applications like writing, summarization, translation, coding, classification, information extraction, and conversational interfaces.
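
To make prompts, completions, and inference concrete, here is a minimal sketch of a single inference call. It assumes the openai Python package and an OPENAI_API_KEY environment variable; the model name, role instruction, and temperature setting are illustrative assumptions, not recommendations from the chapter.

# Minimal sketch of LLM inference: send a prompt, receive a completion.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        # Assigning a role is a basic prompt-engineering technique.
        {"role": "system", "content": "You are a concise cricket historian."},
        {"role": "user", "content": "Summarize the 2011 Cricket World Cup final in two sentences."},
    ],
    temperature=0.2,  # lower values make output less random and more repeatable
)

print(response.choices[0].message.content)  # the completion

Raising the temperature toward 1.0 makes sampled completions more varied, which suits creative writing better than factual question answering.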

The chapter then details why RAG is needed: knowledge cutoffs, hallucinations, and lack of non-public context limit LLM reliability. RAG addresses these by retrieving relevant external information (non-parametric memory), augmenting the prompt, and letting the model generate grounded, up-to-date, and context-aware answers—often with source attribution—without costly retraining. It frames RAG as combining parametric and non-parametric memory, highlights the resulting gains in factuality and trust, and surveys prominent uses including next-gen search experiences, personalized content, real-time commentary, support agents, document Q&A, virtual assistants, and AI-assisted research.
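
The retrieve-augment-generate loop can be sketched in a few lines of Python. The snippet below is a toy illustration under stated assumptions (the openai package, an API key, an invented three-document corpus); it scores documents by naive keyword overlap, whereas real systems typically use embedding-based vector search.

# Toy end-to-end RAG sketch: retrieve, augment, generate.
from openai import OpenAI

# Non-parametric memory: an external knowledge source (here, three strings).
corpus = [
    "Australia won the 2023 Cricket World Cup, beating India in the final.",
    "The 2023 Cricket World Cup final was played in Ahmedabad on 19 November 2023.",
    "England won the 2019 Cricket World Cup.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

question = "Who won the 2023 cricket world cup?"
context = "\n".join(retrieve(question, corpus))

# Augment: prepend the retrieved passages to the user's question.
augmented_prompt = (
    f"Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

# Generate: the LLM answers grounded in the retrieved, up-to-date context.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(response.choices[0].message.content)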

Figures

  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 1). Source: screenshot by the author of his account on https://chat.openai.com
  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 2). Source: screenshot by the author of his account on https://chat.openai.com
  • Wikipedia article on the 2023 Cricket World Cup. Source: https://en.wikipedia.org/wiki/2023_Cricket_World_Cup
  • ChatGPT response to the question, augmented with external context. Source: screenshot by the author of his account on https://chat.openai.com
  • Retrieval Augmented Generation: a simple definition
  • Google Trends for “Generative AI” and “Large Language Models”, Nov ’22 to Nov ’23
  • Two token-prediction techniques: causal language model and masked language model
  • Illustrative probability distribution of words following “The Teacher”
  • Transformer architecture. Source: “Attention Is All You Need,” Vaswani et al.
  • Popular proprietary and open-source LLMs as of April 2024 (non-exhaustive list)
  • Prompt, completion, and inference
  • RAG enhances the parametric memory of an LLM by creating access to non-parametric memory

Summary

  • RAG enhances the memory of LLMs by creating access to external information.
  • LLMs are next-word (or next-token) prediction models trained on massive amounts of text data to generate human-like text.
  • Interaction with LLMs is carried out using natural-language prompts, and prompt engineering is an important discipline.
  • LLMs have a knowledge cut-off date and are trained only on publicly available data. They are also prone to generating factually incorrect information (hallucinations).
  • RAG overcomes these limitations by incorporating non-parametric memory, increasing the context awareness and reliability of responses.
  • Popular use cases of RAG include search engines, document question answering systems, conversational agents, personalized content generation, and virtual assistants, among others.

FAQ

What is Retrieval Augmented Generation (RAG)?
RAG is a technique that improves Large Language Model (LLM) answers by fetching relevant information from external sources, adding that context to the user’s prompt, and then letting the LLM generate a response. By grounding the model in up-to-date, task-specific knowledge, RAG makes outputs more accurate and trustworthy.

Why do Large Language Models need RAG?
LLMs have inherent limitations: they can be outdated due to a knowledge cut-off, may hallucinate confident but incorrect facts, and typically lack access to proprietary or non-public data. RAG addresses these by supplying fresh, verified, and domain-specific context at inference time.

How does RAG work at a high level?
RAG follows three steps: 1) Retrieve: find relevant passages from an external knowledge source (e.g., documents, wikis, APIs). 2) Augment: attach the retrieved passages to the prompt as context. 3) Generate: have the LLM produce an answer grounded in that context.

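The "Augment" step is ordinary prompt construction. Here is a minimal sketch of that step alone; the template wording and passage-numbering scheme are assumptions for illustration, and production systems tune such templates carefully.

# Sketch of the "Augment" step: retrieved passages become prompt context.
def build_augmented_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model can cite them in its answer.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the numbered context passages, "
        "and cite the passage numbers you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_augmented_prompt(
    "Who won the 2023 cricket world cup?",
    ["Australia won the 2023 Cricket World Cup, beating India in the final."],
))
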
What are Large Language Models and how do they generate text?
LLMs are transformer-based models trained on massive text corpora to predict the next token in a sequence. By learning statistical patterns of language, they can write, summarize, translate, and converse. During inference, they sample likely next tokens to produce fluent, human-like text.

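That sampling step can be sketched with a temperature-scaled softmax. The vocabulary and scores below are invented to echo the chapter's "The Teacher" illustration; real models compute scores over vocabularies of tens of thousands of tokens.

# Toy next-token sampling: softmax over made-up scores, then weighted choice.
import math
import random

# Hypothetical model scores (logits) for the word after "The Teacher".
logits = {"explains": 2.0, "teaches": 1.5, "said": 1.0, "banana": -2.0}

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    # Temperature rescales scores: <1.0 sharpens the distribution, >1.0 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / z for tok, s in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample_next_token(logits, temperature=0.7))
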
What’s the difference between parametric and non-parametric memory?
Parametric memory is the knowledge stored in an LLM’s learned parameters during training. It is fixed after training and limited by training data and model size. Non-parametric memory is external knowledge (documents, databases, the web) retrieved at runtime. RAG combines both, augmenting limited parametric memory with flexible, updatable non-parametric memory.

How do I interact with an LLM, and what is prompt engineering?
Users provide a prompt (input) and receive a completion (output) during inference. Prompt engineering improves results by specifying roles, providing examples (few-shot), and giving clear instructions. Key controls include the context window (max tokens for input plus output) and temperature (randomness of generation).

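For instance, few-shot prompting bakes worked examples directly into the prompt so the model can learn the pattern in context; the classification task and labels below are illustrative assumptions.

# Sketch of a few-shot prompt: a role, two worked examples, then the new task.
few_shot_prompt = """You are a support-ticket classifier. Label each ticket
as BILLING, TECHNICAL, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The app crashes when I upload a file."
Label: TECHNICAL

Ticket: "Where can I download my invoice history?"
Label:"""

# Sent as a prompt, the completion is expected to continue the pattern
# with a single label, without any change to the model's parameters.
print(few_shot_prompt)
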
When should I use RAG vs. fine-tuning or training from scratch?
Use RAG when you need up-to-date facts, citations, or access to proprietary data with minimal cost and quick iteration. Choose supervised fine-tuning when you need the model to consistently follow new formats or domain styles beyond what prompts can achieve. Training from scratch is reserved for highly specialized domains with unique vocabularies and abundant domain data, but it is costly and time-consuming.

How does RAG reduce hallucinations and increase reliability?
By grounding the model in retrieved, sourceable context, RAG constrains generation to relevant facts, lowers the chance of fabricated details, and enables citation of sources so users can verify answers.

What are popular use cases for RAG?

  • LLM-first search experiences with natural-language answers and citations
  • Personalized marketing content grounded in brand and audience data
  • Real-time event commentary via live data feeds
  • Conversational support agents using manuals, policies, and FAQs
  • Document question answering over proprietary corpora
  • Virtual assistants personalized with user and context data
  • Research assistants in law, finance, and ESG analysis

What key terms should I know before building a RAG system?

  • Context window: token limit for prompt plus output
  • Temperature: controls randomness; higher = more diverse
  • Few-shot prompting: include examples in the prompt
  • In-context learning: the model uses provided context without changing its parameters
  • SFT (Supervised Fine-Tuning): further trains the model on labeled examples
  • SLMs: smaller, faster models for narrower tasks
  • Bias/toxicity: risk from training data that must be mitigated
