1 Large Language Models and the Need for Retrieval Augmented Generation
This chapter introduces the promise and pitfalls of modern Large Language Models and motivates Retrieval Augmented Generation as a practical remedy. It sets the stage by explaining why LLMs have become central to language tasks while noting their limits in accuracy, recency, and access to proprietary knowledge. The chapter’s goal is to build a foundation for designing, implementing, and evaluating RAG systems so readers can confidently apply them to real-world problems.
It explains LLMs as next-token predictors trained on massive text corpora using transformer architectures, available as powerful foundation models or smaller task-specific variants. Readers learn how to work with LLMs through prompts and inference, and how prompt engineering (roles, examples, clear instructions) can improve results. Key operational ideas such as context windows, temperature, few-shot prompting, and in-context learning are introduced, alongside a quick survey of common applications like writing, summarization, translation, coding, classification, information extraction, and conversational interfaces.
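To make these prompting ideas concrete, the sketch below combines a role instruction, few-shot examples, and a low temperature setting in a single request. It assumes the OpenAI Python SDK and an illustrative model name; any chat-capable LLM would work similarly.

```python
# A minimal sketch of prompt engineering, assuming the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY in the environment.
# The model name and few-shot examples are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; substitute any available chat model
    temperature=0.2,      # low temperature -> more deterministic completions
    messages=[
        # Role instruction: tells the model how to behave.
        {"role": "system", "content": "You classify movie reviews as positive or negative."},
        # Few-shot examples: in-context demonstrations inside the context window.
        {"role": "user", "content": "Review: 'A joyless, plodding mess.'"},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: 'Sharp writing and a stellar cast.'"},
        {"role": "assistant", "content": "positive"},
        # The actual prompt: the model completes it by predicting the next tokens.
        {"role": "user", "content": "Review: 'Every scene landed; I could not look away.'"},
    ],
)
print(response.choices[0].message.content)  # expected: "positive"
```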
The chapter then details why RAG is needed: knowledge cutoffs, hallucinations, and lack of non-public context limit LLM reliability. RAG addresses these by retrieving relevant external information (non-parametric memory), augmenting the prompt, and letting the model generate grounded, up-to-date, and context-aware answers—often with source attribution—without costly retraining. It frames RAG as combining parametric and non-parametric memory, highlights the resulting gains in factuality and trust, and surveys prominent uses including next-gen search experiences, personalized content, real-time commentary, support agents, document Q&A, virtual assistants, and AI-assisted research.
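As a concrete illustration of the retrieve-augment-generate loop, the sketch below pairs a toy keyword-overlap retriever (a stand-in for a real embedding-based vector store) with an LLM call. The document snippets, model name, and retriever are illustrative assumptions, not a production design.

```python
# A minimal sketch of RAG: retrieve relevant text (non-parametric memory),
# augment the prompt with it, then let the LLM generate a grounded answer.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

documents = [
    "Australia won the 2023 Cricket World Cup, beating India in the final in Ahmedabad.",
    "The 2023 tournament was hosted by India from October to November 2023.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query.
    A real system would rank by embedding similarity instead."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

query = "Who won the 2023 cricket world cup?"
context = retrieve(query, documents)

# Augment: ground the prompt in the retrieved context before generation.
prompt = (
    "Answer using only the context below.\n\n"
    f"Context: {context}\n\nQuestion: {query}"
)

client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```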
ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 1). Source: screenshot by the author of his account on https://chat.openai.com
ChatGPT response to the question, “Who won the 2023 cricket world cup?” (Variation 2). Source: screenshot by the author of his account on https://chat.openai.com
Wikipedia article on the 2023 Cricket World Cup. Source: https://en.wikipedia.org/wiki/2023_Cricket_World_Cup
ChatGPT response to the question, augmented with external context. Source: screenshot by the author of his account on https://chat.openai.com
Retrieval Augmented Generation: A Simple Definition
Google Trends for “Generative AI” and “Large Language Models” from Nov ’22 to Nov ’23
Two token prediction techniques – Causal Language Model & Masked Language Model
Illustrative probability distribution of words after “The Teacher”
Transformer Architecture, Source: Attention Is All You Need, Vaswani et al.
Popular proprietary and open source LLMs as of April 2024 (non-exhaustive list)
Prompt, Completion, and Inference
RAG enhances the parametric memory of an LLM by creating access to non-parametric memory
Summary
- RAG enhances the memory of LLMs by creating access to external information.
- LLMs are next-word (or next-token) prediction models trained on massive amounts of text data to generate human-like text.
- Interaction with LLMs is carried out using natural language prompts, and prompt engineering is an important discipline.
- LLMs are limited by a knowledge cut-off date and by being trained only on public data. They are also prone to generating factually incorrect information (hallucinations).
- RAG overcomes these limitations by incorporating non-parametric memory, increasing the context awareness and reliability of the responses.
- Popular use cases of RAG include search engines, document question answering systems, conversational agents, personalized content generation, and virtual assistants, among others.