Overview

1 Intro to enterprise RAG

This chapter introduces Retrieval Augmented Generation (RAG) as a practical way to get precise, conversational answers from company data, like having a tireless digital assistant who knows where everything is. RAG pairs a language model with search to understand natural-language questions, retrieve the most relevant information from diverse sources (databases, documents, apps), and compose clear answers in seconds. Beyond simple lookups, it adapts to user intent and language, turning fragmented, hard-to-reach information into immediate, useful responses.

The chapter contrasts Naive RAG—embedding a query and doing a basic vector search—with Enterprise RAG built for real business constraints. While Naive RAG can work for simple tasks, it often misretrieves, hallucinates, and struggles at scale. Enterprise RAG adds a robust pipeline: input validation, question triage, query rewriting, hybrid (keyword + vector) search across multiple sources, asynchronous agents, relevance ranking and filtering, and a writer step to deliver consistent, grounded answers. It also addresses operational needs like multilingual support, data freshness, access control, guardrails, reliability, and cost management, driving both higher accuracy and faster time to answer.
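The hybrid (keyword + vector) search mentioned above can be sketched in a few lines. The scoring functions below are illustrative stand-ins, not a production ranker: a real system would use a search engine for the keyword side and a learned embedding model for the vectors.

```python
# Minimal sketch of hybrid scoring: blend keyword overlap with
# vector similarity. All names and the blending weight are illustrative.

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def vector_score(q_vec: list[float], d_vec: list[float]) -> float:
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm_q = sum(a * a for a in q_vec) ** 0.5
    norm_d = sum(b * b for b in d_vec) ** 0.5
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def hybrid_score(query: str, doc: str, q_vec: list[float],
                 d_vec: list[float], alpha: float = 0.5) -> float:
    """Blend the two signals; alpha weights the vector side."""
    return alpha * vector_score(q_vec, d_vec) + (1 - alpha) * keyword_score(query, doc)
```

Blending both signals is what lets a query like an exact product code (strong keyword match) and a vague paraphrase (strong vector match) both retrieve the right passage.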

Why it matters for business: Enterprise RAG accelerates decisions, improves customer service, streamlines collaboration, and works across organizations of any size. The chapter illustrates use cases from small shops to global enterprises, including inventory and reordering, competitive intelligence, healthcare insights, finance summaries, and academic research support. It also previews how to build such a system: ingesting and chunking content with metadata, embedding and indexing, optimizing retrieval with query rewriting and agents, and generating polished, trustworthy responses. By the end of the book, readers will be able to implement a scalable RAG solution that makes organizational knowledge instantly usable.

Figure captions

  • The left column shows the multiple steps and complexity of manually searching a SQL database for records. Compare this with the relative ease and simplicity of asking the question of a RAG chatbot instead, shown in the right column.
  • In a RAG system, the user question, the prompt, and the retrieved data are combined and sent to an LLM, which generates an answer using all of that input.
  • Traditional manual workflow for retrieving answers, requiring database queries, corrections, and manual review. This process is time-consuming and labor-intensive.
  • Basic RAG process with embedding, vector search, and a large language model. This simple approach is efficient but prone to errors and lacks context handling.
  • Enterprise RAG pipeline improves speed, accuracy, and scalability by incorporating validation, query rewriting, and asynchronous agents, reducing response times to around 30 seconds.
  • A naive RAG pipeline with limited steps for retrieving answers. Suitable for simple queries but insufficient for complex or large-scale enterprise needs.
  • Key questions for designing enterprise RAG systems, addressing user input limits, database performance, context accuracy, and feedback management for better scalability and reliability.
  • Enterprise RAG system architecture showing ingestion, retrieval, and generation steps. Raw data is preprocessed, embedded, and searched to deliver accurate, context-aware answers.

Summary

  • Retrieval Augmented Generation (RAG) is an advanced AI technology that combines conversational skills with real-time data retrieval, like an efficient assistant.
  • RAG allows users to ask questions in plain language and receive detailed, specific information tailored to their needs, accessing data from databases, documents, and applications like Slack.
  • Naive RAG, while easy to set up, often falls short in business environments due to misunderstandings of context, retrieving incorrect data, or providing inaccurate ("hallucinated") answers.
  • Enterprise RAG is designed to handle complex business scenarios, accurately processing diverse questions in different languages and grasping user intent.
  • Implementing Enterprise RAG leads to streamlined operations, faster decision-making, improved collaboration, and enhanced customer service by resolving issues quickly.
  • The book will guide readers step-by-step in building their own Enterprise RAG system, empowering them to harness the full potential of AI-driven data retrieval.

FAQ

What is Retrieval Augmented Generation (RAG)?
RAG is an AI approach that pairs a conversational language model with live retrieval from your data sources. You ask a question in natural language, it searches databases and documents for relevant facts, and then the model writes a clear, tailored answer—typically in seconds.
How does a RAG system produce an answer?
The system combines three ingredients: your question, a system prompt, and retrieved passages from your data. These are sent to a large language model (LLM), which uses the retrieved context to generate a grounded, human-readable response.
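The "combine three ingredients" step can be sketched as a simple prompt builder. The prompt layout below is one common convention, not a fixed format; the actual LLM call is deliberately left out.

```python
# Sketch of assembling a grounded prompt: system instructions,
# numbered retrieved passages, and the user question in one string.

def build_prompt(question: str, passages: list[str]) -> str:
    system = (
        "Answer the user's question using ONLY the context below. "
        "If the context does not contain the answer, say so."
    )
    # Number the passages so the model (and the reader) can cite them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Instructing the model to use only the supplied context, and to admit when the context is insufficient, is the basic grounding move that keeps answers tied to retrieved data.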
What is “Naive RAG,” and why does it often fail in businesses?
Naive RAG embeds the question, runs a simple vector similarity search over pre-embedded chunks, and lets the LLM answer from the closest matches. In practice it often retrieves the wrong passages, struggles with large or complex datasets, and can hallucinate. Many implementations stall at this stage and fail to meet enterprise requirements.
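The naive retrieval step itself fits in a few lines: rank pre-embedded chunks by cosine similarity to the question vector and keep the top k. The vectors below are toy stand-ins for real embeddings; this is exactly the part that misfires when the closest vectors are not the right passages.

```python
# Toy illustration of Naive RAG retrieval over pre-embedded chunks.

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def top_k(q_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """chunks: (text, vector) pairs prepared at ingestion time."""
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```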
What makes “Enterprise RAG” different from Naive RAG?
Enterprise RAG adds structure and safeguards: input validation, question triage, query rewriting, asynchronous agents, hybrid search (keyword + vector), result ordering/filtering, and a writer agent. The result is faster (often 10–30 seconds), more accurate, scalable retrieval that works across multiple data sources and messy real-world queries.
What are the key steps in an Enterprise RAG pipeline?
Typical stages include: Input Validation, Question Triage, Query Rewriting, Asynchronous Agents with a high-quality LLM, Enterprise Search using hybrid indexing, Order and Filter Results, and a Writer Agent to compose a clear, consistent final answer. If nothing relevant is found, the system asks for clarification.
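The stages above can be wired together as a plain function pipeline. Every stage here is a stub with illustrative logic: a real system would back rewriting with an LLM, search with a hybrid index, and answer composition with a writer agent.

```python
# Skeleton of the enterprise pipeline stages, wired as plain functions.

def validate(question: str) -> str:
    """Input validation: reject empty input before any expensive work."""
    if not question.strip():
        raise ValueError("empty question")
    return question.strip()

def rewrite(question: str) -> str:
    """Query rewriting stub; a real system would sharpen this with an LLM."""
    return question

def search(query: str, index: dict[str, str]) -> list[str]:
    """Placeholder for hybrid enterprise search: naive keyword containment."""
    return [text for key, text in index.items() if key in query.lower()]

def write_answer(question: str, results: list[str]) -> str:
    """Writer step: compose a final answer, or ask for clarification."""
    if not results:
        return "No relevant data found. Could you clarify your question?"
    return f"Based on {len(results)} source(s): " + "; ".join(results)

def answer(question: str, index: dict[str, str]) -> str:
    q = rewrite(validate(question))
    return write_answer(q, search(q, index))
```

The point of the skeleton is the shape, not the stubs: each stage has one job, so stages can be upgraded, parallelized with asynchronous agents, or swapped independently.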
Why chunk documents and add metadata during ingestion?
Chunking turns long documents into smaller, meaningful sections so retrieval is precise and token usage is lower. Metadata (for example, product name, topic, page) speeds filtering, improves accuracy, and makes it easy to cite or link back to sources.
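A minimal version of ingestion-time chunking with metadata might look like the following: fixed-size word windows with a small overlap so sentences cut at a boundary still appear whole in one chunk. The window sizes and metadata field names are illustrative choices, not a standard.

```python
# Sketch of chunking a document into overlapping word windows,
# attaching metadata for filtering and source citation.

def chunk(text: str, source: str, size: int = 50, overlap: int = 10) -> list[dict]:
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        piece = " ".join(words[start:start + size])
        chunks.append({
            "text": piece,
            "source": source,    # where to link back or cite
            "position": start,   # word offset, useful for ordering
        })
    return chunks
```

Production systems usually split on semantic boundaries (sections, paragraphs) rather than raw word counts, but the overlap-and-metadata pattern is the same.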
What kinds of data sources can Enterprise RAG use?
It can search both structured and unstructured content—SQL databases, PDFs and other documents, as well as apps like Slack. The architecture is designed to scale as new sources are added.
How does Enterprise RAG reduce hallucinations and manage risk?
It grounds answers in retrieved data, applies guardrails to avoid inappropriate output, and enforces access controls to protect sensitive information. Consistency checks, clear capability limits, and escalation when no answer exists help reduce legal and reputational risk.
What business value does Enterprise RAG deliver?
It shortens time-to-answer, boosts employee productivity, and improves customer support (organizations report sizable reductions in resolution time). Teams collaborate more effectively when answers are a question away, and use cases span small shops to large enterprises, healthcare, finance, and education.
What costs and skills are involved in implementing Enterprise RAG?
Expect spend on compute, search infrastructure, and LLM usage, plus ongoing development and maintenance. You’ll need AI/ML and data engineering expertise for ingestion, retrieval, evaluation, and guardrails—though managed services (for example, Azure AI Search) can reduce build complexity.
