Overview

1 The World of Large Language Models

This chapter introduces the rapid rise of large language models by first grounding the reader in the evolution of NLP: from early rule-based and statistical approaches to deep learning with neural networks that learn patterns at scale. LLMs are framed as next-word probabilistic predictors trained on vast, diverse corpora, enabling fluent dialogue, summarization, and reasoning that surpass the narrow, pre-scripted behavior of early voice assistants. The narrative emphasizes a practical, application-first lens—treating LLMs as building blocks in a broader ML ecosystem—while also noting the emerging frontier of multimodal models that integrate text with images and audio.

The chapter surveys where LLMs shine: conversational agents, text and code generation, information retrieval, language understanding tasks, recommendation, content creation and editing, and agentic task fulfillment. It outlines the ingredients and workflow behind LLM-powered apps—defining the use case, securing compute, training and fine-tuning—then highlights the role of scale: massive datasets (e.g., web-scale crawls), distributed training on GPUs/TPUs, and cost models tied to tokens. A focal example is retrieval-augmented generation (RAG), which retrieves relevant context from a targeted corpus and fuses it into generation to improve accuracy and topical freshness, while acknowledging that reliability depends on the quality and scope of the underlying knowledge base.

Balancing promise with realism, the chapter details key limitations—bias, ethical risks, opacity in decision-making, and hallucinations—underscoring the need for safeguards, validation, and responsible deployment. It closes with a tour of the startup ecosystem catalyzed by LLMs: lightweight “wrapper” apps, infrastructure players building vector databases and orchestration frameworks, and capital-intensive model developers competing at the frontier with massive GPU fleets. The book’s scope centers on building effective LLM applications—practical patterns, tooling, and deployment—setting the stage for a deeper look at the architectures that make these systems so capable.

Figures referenced in this chapter:
  • An output for a given prompt using ChatGPT.
  • Rube Goldberg’s famous self-operating napkin machine: constructing an LLM application demands a thoughtful orchestration of resources, from computational power to application definition, echoing the complexity of Rube Goldberg’s contraptions.
  • A Python code snippet demonstrating how to use the Ares API to retrieve information about taco spots in San Francisco from the internet; instead of just showing URLs, the API returns actual answers with web URLs as sources.
  • Retrieval-Augmented Generation (RAG), used to enhance the capabilities of LLMs, especially in generating relevant and contextually appropriate responses; the approach incorporates a retrieval step over a knowledge base before generating a response.

Summary

  • Large language models (LLMs) are the latest breakthrough in natural language processing after statistical models and deep learning. LLMs stand on the shoulders of this prior research but take language understanding to new heights through scale.
  • Pretrained on massive text corpora, LLMs like GPT-3 capture broad knowledge about language in their model parameters. This allows them to achieve state-of-the-art performance on language tasks.
  • Applications powered by LLMs include text generation, classification, translation, and semantic search to name a few.
  • LLMs utilize multi-billion-parameter Transformer architectures. Training such gigantic models requires massive computational resources only recently made possible through advances in AI hardware.
  • Bias and safety are key challenges with large models. Extensive testing is required to prevent unintended model behavior across diverse demographics.
  • Numerous startups are offering LLM model APIs, democratizing access and allowing innovation in the realm of Generative AI.

FAQ

What is a Large Language Model (LLM)?
An LLM is an AI system trained on massive text datasets to predict the next word (token) in context. At scale, this probabilistic next-word prediction captures patterns of grammar, semantics, and discourse, enabling the model to generate coherent, contextually relevant, human-like text. “Large” refers to both the volume of data and the number of parameters used during training.
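As a toy illustration of that next-token prediction (not how any production model is implemented), the sketch below converts a made-up set of scores over a four-word vocabulary into a probability distribution via softmax, then decodes greedily and by sampling. The vocabulary and logits are invented for illustration.

```python
import math
import random

# Made-up logits a model might assign to candidate next tokens
# after the prompt "The cat sat on the" (values are invented).
vocab  = ["mat", "roof", "keyboard", "moon"]
logits = [4.1, 2.3, 1.7, -0.5]

# Softmax turns raw scores into a probability distribution.
exps  = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

for token, p in zip(vocab, probs):
    print(f"{token:>9}: {p:.3f}")

# Greedy decoding picks the single most likely token; sampling
# draws from the distribution, which is what gives generation
# its variety. Real models repeat this one token at a time,
# appending each choice to the context before predicting again.
print("greedy :", vocab[probs.index(max(probs))])
print("sampled:", random.choices(vocab, weights=probs, k=1)[0])
```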
How did NLP evolve into today’s LLMs?
NLP began with early rule-based and statistical methods (e.g., simple translation systems in the 1940s). With the internet’s explosion and advances in deep learning, multi-layer neural networks learned complex language patterns from vast datasets. Early assistants like Siri and Alexa offered narrow, predefined capabilities; modern LLMs leverage greater compute and data to craft paragraphs, hold rich conversations, and generalize across topics.
What does it take to build an LLM application?
Successful apps require: defining the use case and evaluation goals; securing adequate compute (typically GPUs) and choosing the right model; assembling data pipelines for training, fine-tuning, or retrieval; integrating techniques like RAG for up-to-date or domain-grounded answers; orchestrating prompts, tools, and workflows; and deploying/monitoring for quality, cost, and reliability. The pieces must be thoughtfully coordinated.
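To make the orchestration piece concrete, here is a minimal, hypothetical skeleton of the glue code at the heart of many LLM apps: a prompt template filled with user input and passed to a model behind a generic call_llm function. The template and function names are assumptions for illustration; any provider’s SDK could sit behind the placeholder.

```python
# Hypothetical orchestration skeleton for an LLM-powered app.
PROMPT_TEMPLATE = """You are a helpful assistant for {domain}.
Answer the user's question concisely.

Question: {question}
Answer:"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real provider API call here.
    return f"[model response to a {len(prompt)}-character prompt]"

def answer(question: str, domain: str = "travel") -> str:
    prompt = PROMPT_TEMPLATE.format(domain=domain, question=question)
    return call_llm(prompt)

print(answer("What is the best month to visit Lisbon?"))
```

In a real deployment, this same seam is where retrieval, tool calls, and logging for quality and cost monitoring would plug in.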
How are LLMs trained and fine-tuned?
Training exposes a model to huge corpora (e.g., Common Crawl) to learn to predict the next token by iteratively adjusting weights and biases. This is compute-intensive and often runs on distributed GPUs/TPUs for weeks. Fine-tuning adapts a pre-trained model to a specific domain or task (e.g., legal, medical) with targeted data, improving performance without retraining from scratch. Providers typically recover costs via per-token API pricing or subscriptions.
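Because providers typically bill per token, a back-of-the-envelope cost estimate is easy to script. The sketch below uses hypothetical prices; actual rates vary widely by provider and model, so check the current rate card.

```python
# Hypothetical prices in USD per 1,000 tokens -- placeholders
# only; real providers publish their own rate cards.
PRICE_PER_1K_INPUT  = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single API request."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)

# e.g., one million requests averaging 800 prompt tokens and
# 200 completion tokens each:
monthly = 1_000_000 * estimate_cost(800, 200)
print(f"estimated monthly bill: ${monthly:,.2f}")  # $700.00
```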
Why do LLMs need vast datasets?
  • Learn general language patterns usable across contexts
  • Capture rich semantics from diverse topics and styles
  • Develop strong contextual awareness for coherent responses
  • Improve robustness and adaptability to varied inputs
  • Handle ambiguity by learning disambiguating cues
  • Avoid overfitting and generalize to unseen data
What are the main applications of LLMs?
  • Conversational assistants and chatbots (often with retrieval augmentation)
  • Text and code generation: drafting, translation, summarization, creative writing
  • Information retrieval and organization: better query understanding and ranking
  • Language understanding: sentiment, intent, NER, tutoring
  • Recommendation systems: personalized suggestions and adaptation
  • Content creation and editing: style, clarity, grammar, restructuring
  • Agent-based task fulfillment: multi-step actions via tools and services
What is Retrieval-Augmented Generation (RAG) and how does it work?
RAG improves answers by grounding generation in external knowledge. It searches a curated corpus (not the whole internet) for relevant passages, integrates them with the user’s query, and then generates a response. Typical flow:
  • Retrieval: find relevant documents/snippets
  • Candidate selection: prepare useful context
  • Context integration: combine retrieved info with the prompt
  • Response generation: produce an answer grounded in the retrieved evidence
It’s most useful for specialized or time-sensitive topics; however, retrieved content isn’t guaranteed reliable, so validation is important.
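The flow above fits in a few lines of Python. This toy sketch scores documents by word overlap instead of the embedding-based vector search a production system would use, and generate_answer is a placeholder for a real LLM call; both simplifications are assumptions made for illustration.

```python
# Toy RAG pipeline: retrieve -> select -> integrate -> generate.
CORPUS = [
    "Acme's refund window is 30 days from the delivery date.",
    "Acme ships internationally to over 40 countries.",
    "Support is available by chat from 9am to 5pm EST.",
]

def retrieve(query: str, k: int = 2) -> list:
    """Rank documents by word overlap with the query -- a crude
    stand-in for similarity search over a vector database."""
    q = set(query.lower().split())
    ranked = sorted(CORPUS,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate_answer(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"[answer grounded in a {len(prompt)}-character prompt]"

def rag(query: str) -> str:
    context = "\n".join(retrieve(query))       # retrieval + selection
    prompt = ("Use only the context below to answer.\n"
              f"Context:\n{context}\n\n"
              f"Question: {query}\nAnswer:")   # context integration
    return generate_answer(prompt)             # grounded generation

print(rag("What is the refund window for a return?"))
```

Swapping retrieve for real embedding search and generate_answer for an actual model call turns this skeleton into a working pipeline.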
What are multimodal models and how do they differ from text-only LLMs?
Multimodal models jointly process multiple data types—text, images, audio, etc. Unlike text-only LLMs, they can interpret an image or audio clip and produce relevant text, or combine modalities for tasks like visual question answering and mixed-media content creation. This more closely mirrors human perception. An example highlighted in the chapter is Google’s Gemini.
What challenges and limitations do LLMs face?
  • Data bias: training data can encode stereotypes and unfairness
  • Ethical risks: potential for misleading, harmful, or unsafe content
  • Interpretability: models act as “black boxes,” complicating trust
  • Hallucinations: confident but incorrect or nonsensical outputs
Mitigation requires careful data curation, monitoring, grounding (e.g., RAG), and human oversight.
What does the LLM startup landscape look like?
Three broad groups: (1) application “wrappers” atop frontier models (e.g., presentation and content tools), (2) infrastructure/tooling (vector databases like Pinecone/Qdrant; frameworks like LangChain/LlamaIndex) that help enterprises build solutions, and (3) “GPU‑rich” companies training frontier models (e.g., Google, Mistral, Meta, Microsoft). Funding and defensibility vary by group; this book focuses on building practical LLM applications rather than training new frontier models.
