Overview

1 The World of Large Language Models

This chapter opens by tracing how our uniquely human capacity for language led to the field of natural language processing and, eventually, to deep learning breakthroughs that made contemporary large language models possible. It contrasts early, narrowly scoped voice assistants with today’s models that can sustain open-ended dialogue, summarize and reason across diverse domains, and feel far more conversational. Rather than dwelling on mathematical detail, the chapter frames LLMs as practical building blocks inside broader machine-learning systems and sets the book’s goal: guiding readers through real-world uses of these models and how to build effective applications around them.

At their core, LLMs learn probabilistic patterns of language from vast text corpora and use that knowledge to predict and generate coherent, context-aware text. The chapter explains pretraining (learning general language patterns) and fine-tuning (specializing to domains), and highlights the immense data and compute required—often distributed training on GPUs or TPUs. Beyond text-only systems, it notes the rise of multimodal models that integrate text, images, and audio for more human-like perception and response. A tour of applications spans conversational assistants, text and code generation, retrieval and classification, recommendations, content editing, and autonomous, agent-like task execution. Special attention is given to Retrieval-Augmented Generation, which couples targeted document lookup with generation to produce more grounded, up-to-date answers from curated knowledge sources.
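The next-word prediction at the heart of this can be illustrated with a deliberately tiny sketch: a bigram model that estimates the probability of each word given the one before it. The toy corpus and helper names below are invented for illustration; real LLMs use Transformer networks trained over billions of tokens, not frequency counts.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the "vast text corpora" used in pretraining.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count bigram frequencies: how often each word follows another.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word and its estimated probability."""
    counts = bigrams[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("the"))  # most likely successor of "the" in this corpus
```

Sampling from such conditional distributions, one token at a time, is conceptually what generation means; the leap to LLMs is in conditioning on long contexts with learned representations rather than raw counts.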

The chapter also surveys the costs and constraints that come with scale—training time, infrastructure needs, and deployment considerations—alongside core risks such as data bias, ethical misuse, limited interpretability, and hallucinations, all of which necessitate careful validation and governance. It outlines the practical “anatomy” of an LLM application, from defining use cases and data pipelines to selecting hardware, tuning strategies, and orchestration. Finally, it sketches the startup landscape catalyzed by LLMs: lightweight wrappers, infrastructure providers (e.g., vector databases and LLM frameworks), and capital-intensive model labs competing at the frontier. The throughline is pragmatic: the book will focus on building robust, context-aware applications—especially with techniques like RAG—so readers can translate LLM capabilities into reliable, real-world solutions.

Figure: An example output for a given prompt using ChatGPT.
Figure: Rube Goldberg's famous self-operating napkin. Constructing an LLM application demands a thoughtful orchestration of resources, from computational power to application definition, echoing the complexity of Rube Goldberg's contraptions.
Figure: A Python code snippet demonstrating how to use the Ares API to retrieve information about taco spots in San Francisco from the internet. Instead of just returning URLs, the API returns actual answers with web URLs as sources.
Retrieval-Augmented Generation (RAG) is used to enhance the capabilities of LLMs, especially in generating relevant and contextually appropriate responses. The approach adds an initial retrieval step before generating a response, so the model can draw on information from a knowledge base.

Summary

  • Large language models (LLMs) are the latest breakthrough in natural language processing after statistical models and deep learning. LLMs stand on the shoulders of this prior research but take language understanding to new heights through scale.
  • Pretrained on massive text corpora, LLMs like GPT-3 capture broad knowledge about language in their model parameters. This allows them to achieve state-of-the-art performance on language tasks.
  • Applications powered by LLMs include text generation, classification, translation, and semantic search to name a few.
  • LLMs utilize multi-billion parameter Transformer architectures. Training such gigantic models requires massive computational resources only recently made possible through advances in AI hardware.
  • Bias and safety are key challenges with large models. Extensive testing is required to prevent unintended model behavior across diverse demographics.
  • Numerous startups are offering LLM model APIs, democratizing access and allowing innovation in the realm of Generative AI.
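The semantic search mentioned among the applications above reduces, at its core, to ranking documents by vector similarity. The sketch below uses hand-made three-dimensional vectors in place of real embeddings; the document titles and numeric values are invented for illustration only.

```python
import math

# Hand-rolled 3-d "embeddings" standing in for vectors a real embedding
# model would produce; the values here are made up for the sketch.
doc_vectors = {
    "intro to transformers": [0.9, 0.1, 0.0],
    "taco recipes":          [0.0, 0.2, 0.9],
    "attention mechanisms":  [0.8, 0.3, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_search(query_vec, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]

print(semantic_search([0.85, 0.2, 0.05]))  # transformer-related docs rank first
```

Vector databases such as Pinecone or Qdrant, mentioned below, exist to perform this same similarity lookup efficiently over millions of embeddings.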

FAQ

What is a Large Language Model (LLM) and how does it generate text?
An LLM is a deep learning model trained on vast amounts of text to predict the next word in a sequence. By learning statistical patterns, context, and nuances in language, it can generate coherent, human-like text across many topics.
How are LLMs different from early virtual assistants like Siri or Alexa?
Early assistants primarily followed predefined commands and intent schemas within narrow domains. LLMs, by contrast, proactively generate rich, context-aware responses, anticipate conversational turns, and handle open-ended dialogue that often feels more natural and human-like.
What are the main real-world applications of LLMs?
  • Conversational assistants and chatbots (including RAG-powered systems)
  • Text and code generation (summarization, translation, creative writing, programming help)
  • Information retrieval and organization
  • Language understanding (sentiment, intent, named entity recognition, tutoring)
  • Recommendation systems
  • Content creation and editing (clarity, coherence, grammar)
  • Agent-based task fulfillment (autonomous assistants executing multi-step tasks)
What is Retrieval-Augmented Generation (RAG), and when should I use it?
RAG augments an LLM with a retrieval step that pulls relevant information from a targeted corpus before generating an answer. It’s ideal for specialized domains or time-sensitive topics where accuracy and recency matter, though it works best on focused, high-quality collections rather than the entire internet.
How does a typical RAG pipeline work?
  • Retrieval: Search a selected knowledge base for relevant passages based on the user query.
  • Candidate selection: Choose the most pertinent snippets or documents.
  • Context integration: Feed those snippets into the model alongside the original query.
  • Response generation: Produce a final answer grounded in both the retrieved context and the model’s prior knowledge.
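The steps above can be sketched in a few lines of Python. The knowledge base, the word-overlap scoring, and the stubbed generate() function are all assumptions for illustration; a production pipeline would use embedding-based retrieval over a vector database and a real LLM API.

```python
# A minimal, illustrative RAG loop with a toy knowledge base.
knowledge_base = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "Shipping is free on orders over $50.",
]

def retrieve(query, k=1):
    """Retrieval + candidate selection: rank passages by word overlap."""
    q = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt):
    """Stub for an LLM call; a real app would invoke a model API here."""
    return f"[LLM answer grounded in prompt: {prompt[:60]}...]"

def rag_answer(query):
    context = "\n".join(retrieve(query))                      # context integration
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                                   # response generation

print(rag_answer("refund policy"))
```

Even in this toy form, the structure is the same as in real systems: the model answers from retrieved context rather than from its parameters alone, which is what makes RAG answers more grounded and up to date.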
What does the “scale” of LLMs mean, and why does it matter?
Scale refers to the enormous amount of training data and parameters these models use. Larger scale enables nuanced, contextually accurate language generation but introduces challenges like high compute requirements, long training times, and operational costs.
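For a concrete sense of what "parameters" means at this scale, a common back-of-envelope estimate for a Transformer's parameter count is roughly 12 × layers × d_model², ignoring embeddings and biases. With GPT-3-like settings (96 layers, hidden size 12,288), this lands near 175 billion:

```python
# Rough transformer parameter-count estimate (a standard back-of-envelope
# formula ignoring embedding tables and bias terms).
def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

# GPT-3-scale configuration: 96 layers, hidden size 12288.
print(f"{approx_params(96, 12288):.2e}")  # ≈ 1.74e+11, i.e. ~175B parameters
```

At 2 bytes per parameter (fp16), that is roughly 350 GB of weights before any optimizer state, which is why training and even serving such models requires fleets of GPUs or TPUs.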
How are LLMs trained and fine-tuned, and what resources are required?
Training exposes the model to large text corpora to learn next-word prediction by adjusting weights and biases over many iterations, often across distributed GPUs/TPUs for weeks or months. Fine-tuning adapts a pre-trained model to a specific domain (e.g., legal or medical) using targeted data. Broad web-scale datasets (e.g., corpora like Common Crawl) and significant compute are typically required.
Why are multimodal models important, and how do they differ from text-only LLMs?
Multimodal models can understand and combine text with other modalities like images and audio, enabling tasks such as visual question answering or captioning. This more closely mirrors human perception and broadens AI’s applicability; an example highlighted is Google’s Gemini.
What are the key challenges and limitations of LLMs?
Common issues include data bias, ethical risks (e.g., misleading or harmful content), limited interpretability (black-box behavior), and hallucinations (confident but incorrect answers). Mitigation involves careful data curation, governance, retrieval grounding, and validation/fact-checking pipelines.
How has the rise of LLMs shaped the startup ecosystem?
The field spans: (1) application “wrappers” over LLMs (e.g., presentation tools), (2) infrastructure providers like vector databases (Pinecone, Qdrant) and LLM frameworks (LangChain, LlamaIndex), and (3) “GPU-rich” companies training frontier models. Funding varies widely, with infrastructure startups often raising sizable rounds and top-tier model builders securing billions and large GPU fleets (e.g., H100s).
