Overview

1 Understanding Large Language Models

Large language models are deep neural networks trained on vast text corpora to produce coherent, context-aware language. They mark a shift from earlier, task-specific NLP systems by generalizing across translation, summarization, question answering, sentiment analysis, and more. Their apparent “understanding” reflects statistical pattern mastery rather than human comprehension, and their gains stem from scaling data and parameters alongside advances in deep learning, especially the transformer architecture.

The transformer enables models to capture long-range dependencies via self-attention. Its original encoder–decoder design inspired two influential families: BERT-like models that excel at understanding and classification through masked word prediction, and GPT-like models that generate text autoregressively via next-word prediction. Despite this seemingly simple pretraining objective, GPT models exhibit zero-shot and few-shot abilities and show emergent behaviors such as translation, enabled by large-scale training, tokenization, and diverse datasets.
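To make the self-attention idea concrete, the short sketch below computes scaled dot-product attention over a toy sequence of embeddings in PyTorch. It is a minimal illustration with assumed shapes and random weights, not the implementation developed later in the book.

```python
import torch

# Toy sequence: 4 tokens, each represented by an 8-dimensional embedding
torch.manual_seed(0)
x = torch.randn(4, 8)          # (seq_len, embed_dim)

# Projection matrices for queries, keys, and values (random here for illustration)
W_q = torch.randn(8, 8)
W_k = torch.randn(8, 8)
W_v = torch.randn(8, 8)

queries, keys, values = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: every token attends to every token in the sequence
scores = queries @ keys.T / keys.shape[-1] ** 0.5   # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)             # each row sums to 1
context = weights @ values                          # (seq_len, embed_dim)

print(weights.shape, context.shape)  # torch.Size([4, 4]) torch.Size([4, 8])
```

The attention weights make explicit how much each token draws on every other token when its context vector is computed, which is what lets the model capture long-range dependencies.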

Building and deploying LLMs typically follows a two-stage process: pretraining on raw text to create a general foundation model, followed by finetuning on labeled data for instruction following or classification. Custom models can outperform general ones in domain tasks and offer benefits in privacy, on-device deployment, latency, cost control, and autonomy. While pretraining is resource-intensive, many open-source models and pretrained weights make practical development accessible. This book implements the end-to-end pipeline step by step: data preparation and attention, educational-scale pretraining and evaluation, and finetuning a foundation model into an assistant or classifier.
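To make the two-stage idea tangible, the sketch below contrasts the two objectives using a toy stand-in for the LLM backbone: pretraining uses the next token in the raw text as the "label," while finetuning reuses the pretrained backbone with a small classification head and real labels. All layer sizes, data, and the pooling choice here are illustrative assumptions, not the book's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, num_classes = 100, 32, 2

# Stage 1 (pretraining): a toy "language model" trained with next-token prediction
backbone = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                         nn.Linear(embed_dim, embed_dim))
lm_head = nn.Linear(embed_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 9))      # unlabeled token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # "labels" come from the text itself
logits = lm_head(backbone(inputs))
pretrain_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Stage 2 (finetuning): reuse the pretrained backbone, swap the head, train on labeled data
clf_head = nn.Linear(embed_dim, num_classes)
labels = torch.tensor([1])                         # e.g., spam vs. not spam
features = backbone(inputs).mean(dim=1)            # pool token features into one vector
finetune_loss = F.cross_entropy(clf_head(features), labels)

print(pretrain_loss.item(), finetune_loss.item())
```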

As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, leveraging their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multi-layer neural networks. Machine learning and deep learning, in turn, are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence.
LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user's specifications.
Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be finetuned using a smaller labeled dataset.
A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts: an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text, and a decoder that uses this representation to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process, where the decoder has to generate only the final word ("Beispiel"), given the original input text ("This is an example") and a partially translated sentence ("Das ist ein"), to complete the translation.
A visual representation of the transformer's encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.
In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, finetuning, or task-specific model architecture changes. Sometimes it is helpful to provide examples of the target task within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without any specific example, which is called a zero-shot setting.
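As a simple illustration of the difference, the hypothetical prompts below show the same translation request in a zero-shot form (no examples) and a few-shot form (a handful of in-prompt examples); the task and wording are arbitrary.

```python
# Hypothetical prompts; the task, phrasing, and examples are arbitrary choices.

zero_shot_prompt = (
    "Translate the following sentence into German:\n"
    "This is an example ->"
)

few_shot_prompt = (
    "Translate English to German:\n"
    "thank you -> danke\n"
    "good morning -> guten Morgen\n"
    "This is an example ->"
)
```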
In the next-word pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks.
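A small sketch of how such training pairs can be formed from a tokenized text: each prefix of the sequence serves as the input and the following token as the prediction target. The token IDs here are arbitrary placeholder values.

```python
# Toy token IDs standing in for a tokenized sentence (values are arbitrary)
tokens = [40, 367, 2885, 1464, 1807]

# Each position predicts the next token: inputs and targets are shifted by one
pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
for context, target in pairs:
    print(context, "->", target)
# [40] -> 367
# [40, 367] -> 2885
# ...
```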
The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, which makes it well suited to next-word prediction and to generating text iteratively, one word at a time.
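The iterative, one-word-at-a-time loop can be sketched as follows; the toy logits function stands in for a real GPT forward pass, and greedy selection is just one possible decoding strategy.

```python
import torch

torch.manual_seed(123)
vocab_size = 10

def toy_next_token_logits(token_ids):
    # Stand-in for a GPT forward pass: returns random logits over the vocabulary
    return torch.randn(vocab_size)

generated = [0]                           # start with an arbitrary "begin" token ID
for _ in range(5):                        # generate five tokens, one at a time
    logits = toy_next_token_logits(torch.tensor(generated))
    next_id = int(torch.argmax(logits))   # greedy choice: most likely next token
    generated.append(next_id)             # feed the new token back in on the next step

print(generated)
```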
The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier.

Summary

  • LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.
  • Modern LLMs are trained in two main steps.
    • First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a "label."
    • Then, they are finetuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.
  • LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time (a masked variant of this idea is sketched in the code after this list).
  • The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.
  • LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.
  • Large datasets consisting of billions of words are essential for pretraining LLMs. In this book, we will implement and train LLMs on small datasets for educational purposes but also see how we can load openly available model weights.
  • While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit "emergent" properties such as capabilities to classify, translate, or summarize texts.
  • Once an LLM is pretrained, the resulting foundation model can be finetuned more efficiently for various downstream tasks.
  • LLMs finetuned on custom datasets can outperform general LLMs on specific tasks.
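As referenced in the attention bullet above, a decoder that generates one word at a time uses a causal (masked) form of attention: each position may attend only to itself and earlier positions. A minimal sketch under assumed toy shapes:

```python
import torch

torch.manual_seed(0)
seq_len, dim = 4, 8
queries = torch.randn(seq_len, dim)
keys = torch.randn(seq_len, dim)

scores = queries @ keys.T / dim ** 0.5

# Causal mask: positions above the diagonal (future tokens) are hidden from each query
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)   # lower-triangular: each row attends only to current and earlier tokens
```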

FAQ

What is a large language model (LLM)?
An LLM is a deep neural network trained on massive text corpora to generate, understand, and respond with human-like text. It is a form of generative AI, producing text token by token based on context learned during training.

What does "large" mean in LLMs, and why are they trained with next-word prediction?
"Large" refers to both the number of model parameters (often billions) and the scale and diversity of the training data. Next-word prediction leverages the sequential nature of language; despite its simplicity, it teaches the model rich context, structure, and relationships, enabling broad capabilities.

Do LLMs truly "understand" language?
LLMs do not possess human-like consciousness or comprehension. "Understanding" means they produce coherent, contextually relevant text by modeling statistical patterns learned from data.

How do LLMs differ from earlier NLP approaches?
Earlier NLP systems were narrow and often relied on manual features or task-specific architectures. LLMs generalize across many tasks, handling complex generation and interpretation without handcrafted rules.

What is the transformer architecture and why is self-attention important?
The transformer is a neural architecture whose encoder and decoder modules are built around self-attention. Self-attention lets the model weigh relationships among tokens across long ranges, capturing context that leads to coherent, relevant outputs.

How do GPT and BERT differ?
GPT is a decoder-only, autoregressive model trained for next-word prediction, optimized for text generation. BERT uses the encoder side with masked-word prediction, excelling at understanding tasks like classification; both build on transformers but target different objectives.

What are zero-shot and few-shot capabilities in GPT-like models?
Zero-shot means solving a task without seeing task-specific examples at inference time. Few-shot means the model adapts from a handful of examples provided in the prompt; both arise from broad pretraining.

What are pretraining and finetuning, and what finetuning types are common?
Pretraining teaches a base (foundation) model general language patterns using large unlabeled text via next-word prediction. Finetuning specializes the model on labeled data; common types include instruction-finetuning (instruction–response pairs) and classification finetuning (texts with labels).
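For illustration, the records below show what a single training example might look like for each finetuning type; the field names and contents are hypothetical, not a prescribed schema.

```python
# Hypothetical instruction-finetuning record: an instruction paired with the desired response
instruction_example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}

# Hypothetical classification-finetuning record: a text paired with a class label
classification_example = {
    "text": "Congratulations, you have won a free prize! Reply now.",
    "label": "spam",
}
```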
Why build or customize your own LLM?
Building one deepens understanding and enables tailoring to domain-specific tasks, which can outperform general models. Custom LLMs also improve data privacy, enable on-device deployment, reduce latency and costs, and give full control over updates.

What stages will this book follow to build an LLM from scratch?
1) Implement data preprocessing and core components like attention. 2) Pretrain a GPT-like model (on small, educational datasets) and learn evaluation. 3) Finetune the pretrained model for instruction following or classification.
