Overview

5 Pretraining on Unlabeled Data

This chapter shifts from building a GPT-style architecture to actually pretraining it on unlabeled text. It begins by setting up text generation and, crucially, introducing numerical evaluation so training can be guided by measurable objectives. Cross-entropy loss (and its companion metric, perplexity) is used to quantify next-token prediction quality on both training and validation splits prepared from tokenized text, providing a practical yardstick for progress and a foundation for later optimization.

With evaluation in place, the chapter implements a compact yet complete PyTorch training workflow: batching, forward and loss computation, backpropagation, and weight updates with AdamW. Utility routines compute average losses over data loaders, periodically assess train/validation performance, and print sample generations to monitor qualitative gains. Because the educational dataset is intentionally small, the chapter discusses overfitting and demonstrates decoding controls—temperature scaling and top-k sampling—to trade off determinism and diversity, reduce verbatim memorization, and produce more varied outputs via a refined text generation function.

Finally, the chapter covers persistence and reuse. It shows how to save and reload model parameters and optimizer state for continued training, then loads publicly released GPT-2 weights into the custom GPT implementation by aligning configurations and carefully mapping parameters to corresponding layers. Successful loading is validated through coherent text generation, providing a stronger starting point for future tasks such as finetuning for classification and instruction following in subsequent chapters.

A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights.
An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.
Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, which are detokenized into a text representation.
For each of the three input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. The token IDs associated with the highest probability scores are selected and mapped back into text, which represents the text generated by the model.
We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training.
Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized.
Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6.
After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training.
When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training.
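A minimal sketch of this split-tokenize-chunk-batch procedure (the token IDs and the make_chunks helper are illustrative, not the chapter's exact dataset class):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_chunks(token_ids, max_length):
    # Slide over the tokenized text in steps of max_length, pairing each
    # input chunk with its target: the same chunk shifted right by one token.
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, max_length):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return torch.tensor(inputs), torch.tensor(targets)

# Hypothetical token IDs standing in for the tokenized text.
token_ids = list(range(100))
split = int(len(token_ids) * 0.9)  # 90/10 train/validation split
train_x, train_y = make_chunks(token_ids[:split], max_length=6)

# Shuffle the rows and organize the chunks into batches of size 2.
train_loader = DataLoader(TensorDataset(train_x, train_y),
                          batch_size=2, shuffle=True)
```

Each target row is simply the input row shifted by one position, which is exactly the next-token-prediction setup used throughout the chapter.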
We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will move on to the training function and pretrain the LLM.
A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized.
At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it's overfitting to the training set past epoch 2.
Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.
A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here "forward") will have an even higher probability score. Conversely, increasing the temperature to 5 makes the distribution more uniform.
Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value of 0 assigned to all non-top-k tokens.
After training and inspecting the model, it is often helpful to save the model so that we can use or continue training it later, which is the topic of this section before we load the pretrained model weights from OpenAI in the final section of this chapter.
GPT-2 LLMs come in several different model sizes, ranging from 124 million to 1,558 million parameters. The core architecture is the same, with the only difference being the embedding sizes and the number of times individual components like the attention heads and transformer blocks are repeated.

Summary

  • When LLMs generate text, they output one token at a time.
  • By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as "greedy decoding."
  • Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text.
  • Training and validation set losses can be used to gauge the quality of text generated by the LLM during training.
  • Pretraining an LLM involves changing its weights to minimize the training loss.
  • The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer.
  • Pretraining an LLM on a large text corpus is time- and resource-intensive, so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves.

FAQ

How is cross-entropy loss computed for next-token prediction in this chapter?
It measures how well the model predicts the correct next token. Practically:
  • The model outputs logits with shape [batch, tokens, vocab].
  • The targets are token IDs with shape [batch, tokens] (the inputs shifted by one position).
  • Flatten these to [batch*tokens, vocab] and [batch*tokens].
  • Apply torch.nn.functional.cross_entropy(logits_flat, targets_flat), which internally applies the softmax, selects the target probabilities, takes the negative log, and averages.
Lower loss means better predictions.
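A minimal sketch of this computation with made-up shapes (random logits and targets stand in for real model outputs):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 2 sequences, 3 tokens each, vocab of 5.
batch, tokens, vocab = 2, 3, 5
logits = torch.randn(batch, tokens, vocab)          # model outputs
targets = torch.randint(0, vocab, (batch, tokens))  # next-token IDs

# Flatten so each row is one next-token prediction, then average the
# negative log-probabilities of the correct tokens.
loss = F.cross_entropy(
    logits.flatten(0, 1),  # shape [batch*tokens, vocab]
    targets.flatten(),     # shape [batch*tokens]
)
```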
What is perplexity and how do I interpret it?
Perplexity = exp(cross-entropy loss). It is a more interpretable metric: lower is better. Roughly, it reflects the model's effective uncertainty about the next token (e.g., an untrained model with a 50k-token vocabulary can show very high perplexity, while a trained model's perplexity drops as it becomes more confident and accurate).
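For example (the loss value 4.5 is hypothetical):

```python
import torch

# Perplexity is simply the exponential of the cross-entropy loss.
loss = torch.tensor(4.5)       # hypothetical cross-entropy value
ppl = torch.exp(loss)          # roughly 90: "as uncertain as 90 tokens"

# A uniform guess over a 50,257-token vocabulary gives
# loss = ln(50257), i.e. perplexity equal to the vocabulary size.
uniform_loss = torch.log(torch.tensor(50257.0))
uniform_ppl = torch.exp(uniform_loss)
```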
How are training and validation losses computed over the dataset?
  • calc_loss_batch: moves a batch to the device, runs the model, and computes the cross-entropy loss.
  • calc_loss_loader: iterates over all (or a limited number of) batches, then sums and averages the losses.
  • For evaluation: set model.eval() and wrap the computation in torch.no_grad() to disable dropout and gradient tracking, then restore model.train() afterward.
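A sketch of these utilities under the function names used in the chapter; the evaluate_model wrapper and the exact signatures are illustrative:

```python
import torch
import torch.nn.functional as F

def calc_loss_batch(input_batch, target_batch, model, device):
    # Move one batch to the device and compute its cross-entropy loss.
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    return F.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

def calc_loss_loader(data_loader, model, device, num_batches=None):
    # Average the loss over all (or the first num_batches) batches.
    if len(data_loader) == 0:
        return float("nan")
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    total_loss = 0.0
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= num_batches:
            break
        total_loss += calc_loss_batch(inputs, targets, model, device).item()
    return total_loss / num_batches

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    # Disable dropout and gradient tracking during evaluation,
    # then restore training mode.
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, eval_iter)
    model.train()
    return train_loss, val_loss
```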
What does the training loop do step by step?
In each epoch: call model.train(); then, for each batch: optimizer.zero_grad(), compute the loss, loss.backward(), optimizer.step(). The loop also tracks tokens seen and global steps, periodically evaluates the train/validation losses, and prints a generated sample to monitor quality. This simple loop is implemented in train_model_simple and can be extended with learning-rate warmup, cosine decay, gradient clipping, and more.
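A condensed sketch of such a loop under the chapter's function name; periodic evaluation and sample printing are omitted for brevity, and the signature is illustrative:

```python
import torch
import torch.nn.functional as F

def train_model_simple(model, train_loader, optimizer, device, num_epochs):
    # Minimal training loop: forward pass, loss, backprop, weight update.
    tokens_seen, global_step = 0, 0
    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)
            optimizer.zero_grad()            # reset accumulated gradients
            logits = model(input_batch)
            loss = F.cross_entropy(logits.flatten(0, 1),
                                   target_batch.flatten())
            loss.backward()                  # backpropagate the loss
            optimizer.step()                 # update the model weights
            tokens_seen += input_batch.numel()
            global_step += 1
    return tokens_seen, global_step
```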
Why does the example overfit, and how can I tell?
The dataset is tiny (~5k tokens) and the model trains for multiple epochs. Signs:
  • The training loss keeps dropping while the validation loss stagnates or diverges.
  • Generated text contains verbatim passages from the training text.
Mitigations: train on much larger data (preferred), reduce the number of epochs, tune weight decay/dropout, use early stopping, or adjust the evaluation protocol. Decoding tricks (temperature/top-k) can reduce verbatim outputs at inference time but don't fix overfitting.
How do temperature scaling and top-k sampling change text generation?
  • Temperature: divide the logits by T before the softmax. T < 1 makes the distribution sharper (more deterministic); T > 1 makes it flatter (more diverse but riskier).
  • Top-k: keep only the k highest-logit tokens and mask the rest before the softmax; sampling then draws from this focused subset.
Use both to balance coherence and novelty. For deterministic output, use greedy decoding (argmax) or set the temperature to 0 (or very low) with no sampling.
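A sketch combining both controls in one hypothetical helper (the function name and example logits are illustrative):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: 1-D tensor of raw scores over the vocabulary.
    if top_k is not None:
        # Keep the k highest logits; set all others to -inf so the
        # softmax assigns them probability 0.
        top_logits, _ = torch.topk(logits, top_k)
        logits = torch.where(
            logits < top_logits[-1],
            torch.tensor(float("-inf")),
            logits,
        )
    if temperature > 0.0:
        # Divide by the temperature before the softmax, then sample.
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    # Temperature 0 falls back to greedy decoding (argmax).
    return torch.argmax(logits, dim=-1, keepdim=True)

# Example: a vocabulary of 5 tokens.
logits = torch.tensor([1.0, 2.0, 3.0, 0.5, 0.1])
next_token = sample_next_token(logits, temperature=0.8, top_k=3)
```

With top_k=3, only the three highest-scoring tokens can ever be drawn; with temperature 0, the helper degenerates to greedy decoding.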
How do I save and load models (and resume training) in PyTorch?
  • Save model weights: torch.save(model.state_dict(), "model.pth").
  • Load weights: model.load_state_dict(torch.load("model.pth")), then call model.eval() for inference.
  • To resume training, also save the optimizer state: torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, "checkpoint.pth"), then load both and call model.train().
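A self-contained sketch of the resume-training path; the small Linear model stands in for the GPT model, and the checkpoint path is illustrative:

```python
import os
import tempfile
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

# Save model and optimizer state together so training can be resumed.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    ckpt_path,
)

# Reload both states and switch back to training mode.
checkpoint = torch.load(ckpt_path, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()
```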
How are OpenAI's GPT-2 pretrained weights loaded into the custom GPTModel?
  • Download the GPT-2 weights and settings (e.g., the 124M model) via a helper script.
  • Update the configuration to match GPT-2: n_ctx=1024, the proper emb_dim/n_layers/n_heads, and qkv_bias=True.
  • Initialize a fresh GPTModel with this configuration.
  • Map the TensorFlow-style parameter tensors into the model using a load_weights_into_gpt function that assigns each corresponding module (embeddings, attention projections, MLP, layer norms, and output head).
  • Verify the result by generating coherent text.
Why was the context length reduced to 256 during training, and how can it be changed later?
Attention cost scales roughly quadratically with the context length, so a length of 256 reduces compute and memory, enabling training on a laptop. When using OpenAI's pretrained GPT-2 weights, update the model configuration to context_length=1024 to match the original training setup.
Why use AdamW instead of Adam in the training loop?
AdamW decouples weight decay from the gradient update, leading to more effective regularization and better generalization in practice for LLMs. In the chapter, AdamW with a small learning rate and weight_decay=0.1 is used as a sensible default.
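For instance (the learning rate here is illustrative; weight_decay=0.1 matches the value mentioned above):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the GPT model
# AdamW applies weight decay directly to the weights during the update,
# instead of folding an L2 penalty into the gradients as classic Adam does.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,           # illustrative "small" learning rate
    weight_decay=0.1,  # decoupled weight decay
)
```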
