Overview

5 Pretraining on Unlabeled Data

This chapter shifts from building a GPT-style architecture to actually pretraining it on unlabeled text. It begins by setting up text generation and, crucially, introducing numerical evaluation so training can be guided by measurable objectives. Cross-entropy loss (and its companion metric, perplexity) is used to quantify next-token prediction quality on both training and validation splits prepared from tokenized text, providing a practical yardstick for progress and a foundation for later optimization.

With evaluation in place, the chapter implements a compact yet complete PyTorch training workflow: batching, forward and loss computation, backpropagation, and weight updates with AdamW. Utility routines compute average losses over data loaders, periodically assess train/validation performance, and print sample generations to monitor qualitative gains. Because the educational dataset is intentionally small, the chapter discusses overfitting and demonstrates decoding controls—temperature scaling and top-k sampling—to trade off determinism and diversity, reduce verbatim memorization, and produce more varied outputs via a refined text generation function.

Finally, the chapter covers persistence and reuse. It shows how to save and reload model parameters and optimizer state for continued training, then loads publicly released GPT-2 weights into the custom GPT implementation by aligning configurations and carefully mapping parameters to corresponding layers. Successful loading is validated through coherent text generation, providing a stronger starting point for future tasks such as finetuning for classification and instruction following in subsequent chapters.

A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights.
An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.
Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, which are detokenized into a text representation.
For each of the three input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. The token IDs associated with the highest probability scores are selected and mapped back into text, which represents the text generated by the model.
We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training.
Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized.
Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6.
After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training.
When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training.
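A minimal sketch of this split-tokenize-chunk-batch procedure (the token IDs and the make_chunks helper are illustrative, not the chapter's exact dataset class):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_chunks(token_ids, max_length):
    # Slide over the tokenized text in steps of max_length, pairing each
    # input chunk with its target: the same chunk shifted right by one token.
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, max_length):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return torch.tensor(inputs), torch.tensor(targets)

# Hypothetical token IDs standing in for the tokenized text.
token_ids = list(range(100))
split = int(len(token_ids) * 0.9)  # 90/10 train/validation split
train_x, train_y = make_chunks(token_ids[:split], max_length=6)

# Shuffle the rows and organize the chunks into batches of size 2.
train_loader = DataLoader(TensorDataset(train_x, train_y),
                          batch_size=2, shuffle=True)
```

Each target row is simply the input row shifted by one position, which is exactly the next-token-prediction setup used throughout the chapter.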
We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will move on to the training function and pretrain the LLM.
A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.
At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it's overfitting to the training set past epoch 2.
Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.
A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here "forward") will have an even higher probability score. Conversely, increasing the temperature to 5 makes the distribution more uniform.
Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value of 0 assigned to all non-top-k tokens.
After training and inspecting the model, it is often helpful to save the model so that we can use or continue training it later, which is the topic of this section before we load the pretrained model weights from OpenAI in the final section of this chapter.
GPT-2 LLMs come in several different model sizes, ranging from 124 million to 1,558 million parameters. The core architecture is the same, with the only difference being the embedding sizes and the number of times individual components like the attention heads and transformer blocks are repeated.

Summary

  • When LLMs generate text, they output one token at a time.
  • By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as "greedy decoding."
  • Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
  • Training and validation set losses can be used to gauge the quality of text generated by the LLM during training.
  • Pretraining an LLM involves changing its weights to minimize the training loss.
  • The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer.
  • Pretraining an LLM on a large text corpus is time- and resource-intensive, so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.

FAQ

How is cross-entropy loss computed for next-token prediction in this chapter?
It measures how well the model predicts the correct next token. Practically:
  • The model outputs logits with shape [batch, tokens, vocab].
  • The targets are token IDs with shape [batch, tokens] (the inputs shifted by one position).
  • Flatten these to [batch*tokens, vocab] and [batch*tokens].
  • Apply torch.nn.functional.cross_entropy(logits_flat, targets_flat), which internally applies the softmax, selects the target probabilities, takes the negative log, and averages.
Lower loss means better predictions.
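A minimal sketch of this computation with made-up shapes (random logits and targets stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 2 sequences, 3 tokens each, vocab of 5.
batch, tokens, vocab = 2, 3, 5
logits = torch.randn(batch, tokens, vocab)          # model outputs
targets = torch.randint(0, vocab, (batch, tokens))  # next-token IDs

# Flatten so each row is one next-token prediction, then average the
# negative log-probabilities of the correct tokens.
loss = F.cross_entropy(
    logits.flatten(0, 1),  # shape [batch*tokens, vocab]
    targets.flatten(),     # shape [batch*tokens]
)
```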
What is perplexity and how do I interpret it?
Perplexity = exp(cross-entropy loss). It is a more interpretable metric: lower is better. Roughly, it reflects the model's effective uncertainty about the next token (e.g., an untrained model with a 50k-token vocabulary can show very high perplexity, while a trained model's perplexity drops as it becomes more confident and accurate).
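For example (the loss value 4.5 is hypothetical):

```python
import torch

# Perplexity is simply the exponential of the cross-entropy loss.
loss = torch.tensor(4.5)       # hypothetical cross-entropy value
ppl = torch.exp(loss)          # roughly 90: "as uncertain as 90 tokens"

# A uniform guess over a 50,257-token vocabulary gives
# loss = ln(50257), i.e. perplexity equal to the vocabulary size.
uniform_loss = torch.log(torch.tensor(50257.0))
uniform_ppl = torch.exp(uniform_loss)
```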
How are training and validation losses computed over the dataset?
  • calc_loss_batch: moves a batch to the device, runs the model, and computes the cross-entropy loss.
  • calc_loss_loader: iterates over all (or a limited number of) batches, then sums and averages the losses.
  • For evaluation: set model.eval() and wrap the computation in torch.no_grad() to disable dropout and gradient tracking, then restore model.train() afterward.
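A sketch of these utilities under the function names used in the chapter; the evaluate_model wrapper and the exact signatures are illustrative:

```python
import torch
import torch.nn.functional as F

def calc_loss_batch(input_batch, target_batch, model, device):
    # Move one batch to the device and compute its cross-entropy loss.
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    return F.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

def calc_loss_loader(data_loader, model, device, num_batches=None):
    # Average the loss over all (or the first num_batches) batches.
    if len(data_loader) == 0:
        return float("nan")
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    total_loss = 0.0
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= num_batches:
            break
        total_loss += calc_loss_batch(inputs, targets, model, device).item()
    return total_loss / num_batches

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    # Disable dropout and gradient tracking during evaluation,
    # then restore training mode.
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, eval_iter)
    model.train()
    return train_loss, val_loss
```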
What does the training loop do step by step?
In each epoch: call model.train(); then, for each batch: optimizer.zero_grad(), compute the loss, loss.backward(), optimizer.step(). The loop also tracks tokens seen and global steps, periodically evaluates the train/validation losses, and prints a generated sample to monitor quality. This simple loop is implemented in train_model_simple and can be extended with learning-rate warmup, cosine decay, gradient clipping, and more.
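A condensed sketch of such a loop under the chapter's function name; periodic evaluation and sample printing are omitted for brevity, and the signature is illustrative:

```python
import torch
import torch.nn.functional as F

def train_model_simple(model, train_loader, optimizer, device, num_epochs):
    # Minimal training loop: forward pass, loss, backprop, weight update.
    tokens_seen, global_step = 0, 0
    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)
            optimizer.zero_grad()            # reset accumulated gradients
            logits = model(input_batch)
            loss = F.cross_entropy(logits.flatten(0, 1),
                                   target_batch.flatten())
            loss.backward()                  # backpropagate the loss
            optimizer.step()                 # update the model weights
            tokens_seen += input_batch.numel()
            global_step += 1
    return tokens_seen, global_step
```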
Why does the example overfit, and how can I tell?
The dataset is tiny (~5k tokens) and the model trains for multiple epochs. Signs:
  • The training loss keeps dropping while the validation loss stagnates or diverges.
  • Generated text contains verbatim passages from the training text.
Mitigations: train on much larger data (preferred), reduce the number of epochs, tune weight decay/dropout, use early stopping, or adjust the evaluation protocol. Decoding tricks (temperature/top-k) can reduce verbatim outputs at inference time but don't fix overfitting.
How do temperature scaling and top-k sampling change text generation?
  • Temperature: divide the logits by T before the softmax. T < 1 makes the distribution sharper (more deterministic); T > 1 makes it flatter (more diverse but riskier).
  • Top-k: keep only the k highest-logit tokens and mask the rest before the softmax; sampling then draws from this focused subset.
Use both to balance coherence and novelty. For deterministic output, use greedy decoding (argmax) or set the temperature to 0 (or very low) with no sampling.
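A sketch combining both controls in one hypothetical helper (the function name and example logits are illustrative):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: 1-D tensor of raw scores over the vocabulary.
    if top_k is not None:
        # Keep the k highest logits; set all others to -inf so the
        # softmax assigns them probability 0.
        top_logits, _ = torch.topk(logits, top_k)
        logits = torch.where(
            logits < top_logits[-1],
            torch.tensor(float("-inf")),
            logits,
        )
    if temperature > 0.0:
        # Divide by the temperature before the softmax, then sample.
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    # Temperature 0 falls back to greedy decoding (argmax).
    return torch.argmax(logits, dim=-1, keepdim=True)

# Example: a vocabulary of 5 tokens.
logits = torch.tensor([1.0, 2.0, 3.0, 0.5, 0.1])
next_token = sample_next_token(logits, temperature=0.8, top_k=3)
```

With top_k=3, only the three highest-scoring tokens can ever be drawn; with temperature 0, the helper degenerates to greedy decoding.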
How do I save and load models (and resume training) in PyTorch?
  • Save model weights: torch.save(model.state_dict(), "model.pth").
  • Load weights: model.load_state_dict(torch.load("model.pth")), then call model.eval() for inference.
  • To resume training, also save the optimizer state: torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, "checkpoint.pth"), then load both and call model.train().
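A self-contained sketch of the resume-training path; the small Linear model stands in for the GPT model, and the checkpoint path is illustrative:

```python
import os
import tempfile
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

# Save model and optimizer state together so training can be resumed.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    ckpt_path,
)

# Reload both states and switch back to training mode.
checkpoint = torch.load(ckpt_path, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()
```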
How are OpenAI's GPT-2 pretrained weights loaded into the custom GPTModel?
  • Download the GPT-2 weights and settings (e.g., the 124M model) via a helper script.
  • Update the configuration to match GPT-2: n_ctx=1024, the proper emb_dim/n_layers/n_heads, and qkv_bias=True.
  • Initialize a fresh GPTModel with this configuration.
  • Map the TensorFlow-style parameter tensors into the model using a load_weights_into_gpt function that assigns each corresponding module (embeddings, attention projections, MLP, layer norms, and output head).
  • Verify the result by generating coherent text.
Why was the context length reduced to 256 during training, and how can it be changed later?
Attention cost scales roughly quadratically with the context length, so a length of 256 reduces compute and memory, enabling training on a laptop. When using OpenAI's pretrained GPT-2 weights, update the model configuration to context_length=1024 to match the original training setup.
Why use AdamW instead of Adam in the training loop?
AdamW decouples weight decay from the gradient update, leading to more effective regularization and better generalization in practice for LLMs. In the chapter, AdamW with a small learning rate and weight_decay=0.1 is used as a sensible default.
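For instance (the learning rate here is illustrative; weight_decay=0.1 matches the value mentioned above):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the GPT model
# AdamW applies weight decay directly to the weights during the update,
# instead of folding an L2 penalty into the gradients as classic Adam does.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,           # illustrative "small" learning rate
    weight_decay=0.1,  # decoupled weight decay
)
```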
