Overview

1 Large Language Models and the Need for Retrieval Augmented Generation

Large Language Models have rapidly become central to modern AI applications, yet their impressive fluency masks practical limits that affect reliability. This chapter introduces Retrieval Augmented Generation as a pragmatic way to strengthen LLMs by supplying relevant external information at query time. It sets expectations for the book: defining RAG, explaining why it is needed, outlining how LLMs work and are used, and previewing how RAG-enabled systems are designed, so that readers gain the foundation to explore RAG's components in depth.

LLMs are next‑token predictors trained on vast text corpora, typically with transformer architectures, and are often consumed as foundation models. Users interact through prompts to obtain completions during inference, with prompt engineering (roles, examples/few‑shot, structured reasoning strategies) improving results. Key operational concepts include context windows, temperature, in‑context learning versus supervised fine‑tuning, and the trade‑offs of small versus large models. These systems already power diverse tasks such as writing, summarization, translation, code generation, information extraction, classification, and conversational interfaces.
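The next-token prediction and temperature ideas above can be sketched with a toy distribution, echoing the "words after 'The teacher'" illustration. This is a minimal sketch with made-up scores, not a real model: `sample_next_token` applies a temperature-scaled softmax and samples one token.

```python
import math
import random

def sample_next_token(scores, temperature=1.0):
    """Softmax over token scores, scaled by temperature, then sample one token.
    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more random)."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    max_s = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    r = random.random()
    cumulative = 0.0
    for tok, p in probs.items():
        cumulative += p
        if r < cumulative:
            return tok
    return tok  # fallback for floating-point rounding

# Illustrative (invented) scores for tokens that might follow "The teacher"
scores = {"taught": 3.0, "said": 2.0, "smiled": 1.0}
print(sample_next_token(scores, temperature=0.7))
```

Repeating this step, appending each sampled token to the context, is how a completion is generated; at a very low temperature the highest-scoring token is chosen almost every time.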

Despite their breadth, LLMs suffer from knowledge cutoffs, hallucinations, and lack of access to proprietary or up‑to‑date information. RAG addresses these gaps by retrieving pertinent content from external sources, augmenting the prompt, and then generating grounded answers—effectively extending an LLM’s parametric memory with a flexible non‑parametric memory. This reduces hallucinations, enables source citation, and makes responses more contextual and trustworthy, often at far lower cost than continual pretraining or fine‑tuning. The chapter highlights prominent RAG applications, including modern search experiences, personalized content, real‑time event commentary, support chatbots, document question answering, virtual assistants, and AI‑assisted research, underscoring RAG’s role in making LLMs practical and dependable.
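The retrieve-augment-generate loop can be sketched in a few lines. This is a deliberately simplified illustration: `retrieve` ranks documents by naive word overlap (production systems use vector similarity over embeddings), and `generate` is a placeholder standing in for any LLM call, not a real API.

```python
def retrieve(query, documents, k=1):
    """Rank documents by naive word overlap with the query and return the top k.
    Real RAG systems use embedding-based vector similarity instead."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def augment(query, context):
    """Prepend the retrieved context to the user's question."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Placeholder for an LLM call (e.g. a chat-completion API client)."""
    return f"[LLM completion for prompt of {len(prompt)} chars]"

documents = [
    "Australia won the 2023 Cricket World Cup, beating India in the final.",
    "The transformer architecture underlies most modern LLMs.",
]
query = "Who won the 2023 cricket world cup?"
context = "\n".join(retrieve(query, documents))
print(generate(augment(query, context)))
```

Even this toy pipeline shows the key property: the answer is grounded in retrieved, citable text rather than in whatever the model memorized before its knowledge cutoff.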

Figures

  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 1). Source: screenshot by the author of his account on https://chat.openai.com
  • ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 2). Source: screenshot by the author of his account on https://chat.openai.com
  • Wikipedia article on the 2023 Cricket World Cup. Source: https://en.wikipedia.org/wiki/2023_Cricket_World_Cup
  • ChatGPT response to the question, augmented with external context. Source: screenshot by the author of his account on https://chat.openai.com
  • Retrieval Augmented Generation: a simple definition
  • Google Trends of “Generative AI” and “Large Language Models” from Nov ’22 to Nov ’23
  • Two token prediction techniques: Causal Language Model and Masked Language Model
  • Illustrative probability distribution of words after “The teacher”
  • Transformer architecture. Source: “Attention Is All You Need,” Vaswani et al.
  • Popular proprietary and open-source LLMs as of April 2024 (non-exhaustive list)
  • Prompt, completion, and inference
  • RAG enhances the parametric memory of an LLM by creating access to non-parametric memory

Summary

  • RAG enhances the memory of LLMs by creating access to external information.
  • LLMs are next-word (or next-token) prediction models that have been trained on massive amounts of text data to generate human-like text.
  • Interaction with LLMs is carried out using natural language prompts, and prompt engineering is an important discipline.
  • LLMs face the challenges of having a knowledge cut-off date and being trained only on public data. They are also prone to generating factually incorrect information (hallucinations).
  • RAG overcomes these limitations by incorporating non-parametric memory, increasing the context awareness and reliability of responses.
  • Popular use cases of RAG include search engines, document question answering systems, conversational agents, personalized content generation, and virtual assistants, among others.

FAQ

What is Retrieval Augmented Generation (RAG)?
RAG is a technique that retrieves relevant information from external sources, augments the user’s prompt with that information, and then uses an LLM to generate a more accurate, contextual, and trustworthy response.

Why do Large Language Models (LLMs) need RAG?
LLMs have limitations: knowledge cut-off dates, a tendency to hallucinate, and no access to proprietary or non-public data. RAG addresses these by supplying up-to-date, verifiable, and domain-specific context at inference time.

How does a basic RAG workflow operate?
In three steps: (1) Retrieve relevant context from an external knowledge source, (2) Augment the prompt with that context, and (3) Generate the answer with the LLM. In production, this is automated and scalable across sources and queries.

What do “parametric” and “non-parametric” memory mean in RAG?
Parametric memory is the knowledge encoded in an LLM’s learned parameters (weights). Non-parametric memory is external knowledge (documents, databases, web) fetched at inference. RAG combines both to ground responses.

What are Large Language Models and how do they generate text?
LLMs are next-token prediction models trained on massive text corpora (often using transformer architectures). They generate text by choosing the most probable next token given the context, repeating this step to form complete outputs.

What are prompts, completions, and inference?
A prompt is the input you give an LLM. The model’s output is the completion. The process of producing a completion from a prompt is inference.

What is prompt engineering and which settings matter?
Prompt engineering is crafting inputs to elicit better outputs (e.g., assigning a role, giving examples/few-shot, and clear instructions). Key settings include context window (max tokens the model can process) and temperature (controls randomness).

When should I pre-train or fine-tune a model instead of (or alongside) RAG?
Pre-train or fine-tune when you need deep domain adaptation (e.g., highly specialized medical or legal language) and consistent behavior. Use RAG for cheaper, dynamic updates and access to proprietary or real-time data; many systems combine both.

How does RAG reduce hallucinations and increase trust?
By grounding the model with retrieved, sourceable context, RAG makes outputs more factual, enables source citation, and reduces the model’s tendency to “confidently guess.”

What are common use cases for RAG?
  • LLM-first search experiences with cited answers
  • Personalized content generation
  • Real-time event commentary
  • Conversational support agents
  • Document question-answering over proprietary data
  • Virtual assistants with user/context awareness
  • AI-powered research in domains like law and finance
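The prompt-engineering techniques mentioned above (assigning a role, giving few-shot examples) can be sketched as a simple prompt-assembly helper. The template below is illustrative, not tied to any specific library or provider.

```python
def build_prompt(role, examples, question):
    """Assemble a few-shot prompt: a role instruction, worked Q/A examples,
    then the new question left open for the model to complete."""
    lines = [f"You are {role}."]
    for q, a in examples:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

prompt = build_prompt(
    role="a concise sports historian",
    examples=[("Who won the 2019 Cricket World Cup?", "England")],
    question="Who won the 2023 Cricket World Cup?",
)
print(prompt)
```

The role steers tone and persona, while the worked example demonstrates the expected answer format, both of which typically improve completion quality without any change to the model itself.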
