Table of contents

1 Understanding large language models

1.1 What is an LLM?

1.2 Applications of LLMs

1.3 Stages of building and using LLMs

1.4 Introducing the transformer architecture

1.5 Utilizing large datasets

1.6 A closer look at the GPT architecture

1.7 Building a large language model

2 Working with text data

2.1 Understanding word embeddings

2.2 Tokenizing text

2.3 Converting tokens into token IDs

2.4 Adding special context tokens

2.5 Byte pair encoding

2.6 Data sampling with a sliding window

2.7 Creating token embeddings

2.8 Encoding word positions

3 Coding attention mechanisms

3.1 The problem with modeling long sequences

3.2 Capturing data dependencies with attention mechanisms

3.3 Attending to different parts of the input with self-attention

3.3.1 A simple self-attention mechanism without trainable weights

3.3.2 Computing attention weights for all input tokens

3.4 Implementing self-attention with trainable weights

3.4.1 Computing the attention weights step by step

3.4.2 Implementing a compact self-attention Python class

3.5 Hiding future words with causal attention

3.5.1 Applying a causal attention mask

3.5.2 Masking additional attention weights with dropout

3.5.3 Implementing a compact causal attention class

3.6 Extending single-head attention to multi-head attention

3.6.1 Stacking multiple single-head attention layers

3.6.2 Implementing multi-head attention with weight splits

4 Implementing a GPT model from scratch to generate text

4.1 Coding an LLM architecture

4.2 Normalizing activations with layer normalization

4.3 Implementing a feed forward network with GELU activations

4.4 Adding shortcut connections

4.5 Connecting attention and linear layers in a transformer block

4.6 Coding the GPT model

4.7 Generating text

5 Pretraining on unlabeled data

5.1 Evaluating generative text models

5.1.1 Using GPT to generate text

5.1.2 Calculating the text generation loss

5.1.3 Calculating the training and validation set losses

5.2 Training an LLM

5.3 Decoding strategies to control randomness

5.3.1 Temperature scaling

5.3.2 Top-k sampling

5.3.3 Modifying the text generation function

5.4 Loading and saving model weights in PyTorch

5.5 Loading pretrained weights from OpenAI

6 Fine-tuning for classification

6.1 Different categories of fine-tuning

6.2 Preparing the dataset

6.3 Creating data loaders

6.4 Initializing a model with pretrained weights

6.5 Adding a classification head

6.6 Calculating the classification loss and accuracy

6.7 Fine-tuning the model on supervised data

6.8 Using the LLM as a spam classifier

7 Fine-tuning to follow instructions

7.1 Introduction to instruction fine-tuning

7.2 Preparing a dataset for supervised instruction fine-tuning

7.3 Organizing data into training batches

7.4 Creating data loaders for an instruction dataset

7.5 Loading a pretrained LLM

7.6 Fine-tuning the LLM on instruction data

7.7 Extracting and saving responses

7.8 Evaluating the fine-tuned LLM

7.9 Conclusions

7.9.1 What’s next?

7.9.2 Staying up to date in a fast-moving field

7.9.3 Final words

Appendixes

Appendix A: Introduction to PyTorch

A.1 What is PyTorch?

A.1.1 The three core components of PyTorch

A.1.2 Defining deep learning

A.1.3 Installing PyTorch

A.2 Understanding tensors

A.2.1 Scalars, vectors, matrices, and tensors

A.2.2 Tensor data types

A.2.3 Common PyTorch tensor operations

A.3 Seeing models as computation graphs

A.4 Automatic differentiation made easy

A.5 Implementing multilayer neural networks

A.6 Setting up efficient data loaders

A.7 A typical training loop

A.8 Saving and loading models

A.9 Optimizing training performance with GPUs

A.9.1 PyTorch computations on GPU devices

A.9.2 Single-GPU training

A.9.3 Training with multiple GPUs

Appendix B: References and further reading

Appendix C: Exercise solutions

Appendix D: Adding bells and whistles to the training loop

D.1 Learning rate warmup

D.2 Cosine decay

D.3 Gradient clipping

D.4 The modified training function

Appendix E: Parameter-efficient fine-tuning with LoRA

E.1 Introduction to LoRA

E.2 Preparing the dataset

E.3 Initializing the model

E.4 Parameter-efficient fine-tuning with LoRA

Overview

Appendix A. Introduction to PyTorch

This appendix is a practical primer that equips readers to put deep learning into practice with PyTorch as the core tool for building large language models from scratch. It introduces PyTorch’s goals and strengths—combining usability with flexibility—and focuses on the essential features needed throughout the book rather than exhaustive coverage. Readers are guided through environment setup and installation choices, including CPU/GPU variants and version compatibility, with the aim of establishing a reliable, reproducible workspace before moving on to model implementation and training.

The appendix centers on PyTorch’s three pillars: tensors, automatic differentiation, and deep learning utilities. It explains tensors as the fundamental data container (covering ranks, dtypes, shapes, reshaping, transposition, and matrix multiplication) and shows how PyTorch mirrors NumPy while adding GPU support. It then introduces computation graphs and autograd to compute gradients automatically via backward passes, enabling backpropagation without manual calculus. Building on this, the text demonstrates how to define neural networks by subclassing torch.nn.Module, structuring layers (often with Sequential), running forward passes to produce logits, selecting suitable losses and optimizers, and executing the canonical training loop (zero_grad → backward → step) with proper train/eval modes, no_grad for inference, and model persistence via state_dict save/load.

Efficient data handling is addressed through custom Dataset classes and DataLoader configuration, including batching, shuffling, drop_last, and num_workers trade-offs to avoid CPU bottlenecks. The appendix concludes with performance optimization: moving tensors and models across devices, running on a single GPU with minimal code changes, and scaling to multiple GPUs using DistributedDataParallel with DistributedSampler and synchronized gradients. Along the way, it stresses practical considerations (accuracy evaluation, logits-to-probabilities conversion when needed, and careful resource management) so that readers are prepared to train, evaluate, save, and scale PyTorch models effectively.

Figure A.1 PyTorch's three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.

Figure A.2 Deep learning is a subcategory of machine learning that is focused on the implementation of deep neural networks. In turn, machine learning is a subcategory of AI that is concerned with algorithms that learn from data. AI is the broader concept of machines being able to perform tasks that typically require human intelligence.

Figure A.3 The supervised learning workflow for predictive modeling consists of a training stage where a model is trained on labeled examples in a training dataset. The trained model can then be used to predict the labels of new observations.

Figure A.4 Access the PyTorch installation recommendation on https://pytorch.org to customize and select the installation command for your system.

Figure A.5 Select a GPU device for Google Colab under the Runtime/Change runtime type menu.

Figure A.6 An illustration of tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. Note that a 3D vector, which consists of 3 elements, is still a rank 1 tensor.

Figure A.7 A logistic regression forward pass as a computation graph. The input feature x1 is multiplied by a model weight w1 and passed through an activation function σ after adding the bias. The loss is computed by comparing the model output a with a given label y.

Figure A.8 The most common way of computing the loss gradients in a computation graph involves applying the chain rule from right to left, which is also called reverse-mode automatic differentiation or backpropagation. It means we start from the output layer (or the loss itself) and work backward through the network to the input layer. This is done to compute the gradient of the loss with respect to each parameter (weights and biases) in the network, which informs how we update these parameters during training.

Figure A.9 An illustration of a multilayer perceptron with 2 hidden layers. Each node represents a unit in the respective layer. Each layer has only a very small number of nodes for illustration purposes.

Figure A.10 PyTorch implements a Dataset and a DataLoader class. The Dataset class is used to instantiate objects that define how each data record is loaded. The DataLoader handles how the data is shuffled and assembled into batches.

Figure A.11 Loading data without multiple workers (setting num_workers=0) will create a data loading bottleneck where the model sits idle until the next batch is loaded as illustrated in the left subpanel. If multiple workers are enabled, the data loader can already queue up the next batch in the background as shown in the right subpanel.

Figure A.12 The model and data transfer in DDP involves two key steps. First, we create a copy of the model on each of the GPUs. Then we divide the input data into unique minibatches that we pass on to each model copy.

Figure A.13 The forward and backward pass in DDP are executed independently on each GPU with its corresponding data subset. Once the forward and backward passes are completed, gradients from each model replica (on each GPU) are synchronized across all GPUs. This ensures that every model replica has the same updated weights.

FAQ

What is PyTorch, and why is it popular for deep learning?

PyTorch is an open-source, Python-based deep learning library. It is popular because it combines a user-friendly interface with strong performance and flexibility: you can write code that feels like NumPy, customize low-level components when needed, and seamlessly accelerate workloads on GPUs. It has been the leading framework in research for years and is widely adopted in industry.

What are the three core components of PyTorch?

- Tensor library: NumPy-like arrays with optional GPU acceleration.
- Automatic differentiation (autograd): Builds computation graphs and computes gradients for backpropagation.
- Deep learning utilities: Modules, layers, loss functions, optimizers, pretrained models, and more.

How do I install PyTorch (CPU or GPU) and verify the version?

- Basic install: pip install torch
- Pin to the book’s version for full compatibility: pip install torch==2.0.1
- For CUDA-enabled installs (NVIDIA GPUs), use the command recommended on https://pytorch.org for your OS/CUDA setup.
- Check the installed version in Python:

import torch
print(torch.__version__)

Tip: Many scientific packages lag behind the newest Python releases, so prefer a version one or two releases older than the latest (for example, Python 3.10 or 3.11 if 3.13 is the latest).

How can I tell whether my system can use a GPU (CUDA, ROCm, or Apple Silicon)?

- NVIDIA CUDA:

import torch
print(torch.cuda.is_available())  # True means CUDA GPU usable

- AMD ROCm (experimental): follow instructions on https://pytorch.org.
- Apple Silicon (M1/M2/M3+):

import torch
print(torch.backends.mps.is_available())  # True means MPS usable

- In Google Colab: Runtime → Change runtime type → select GPU.
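
The checks above can be combined into one helper that picks whichever accelerator is present. This is a minimal sketch; the function name pick_device is hypothetical and not part of PyTorch, and the CUDA-then-MPS-then-CPU preference order is an assumption.

```python
import torch

def pick_device() -> torch.device:
    """Return an available accelerator, falling back to the CPU (hypothetical helper)."""
    if torch.cuda.is_available():            # CUDA-capable GPU
        return torch.device("cuda")
    if torch.backends.mps.is_available():    # Apple Silicon GPU
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)  # allocate directly on the chosen device
print(device, x.device.type)
```

On a machine without a supported GPU, the helper simply returns the CPU device, so the same script runs everywhere.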

What are tensors, ranks (orders), and dtypes in PyTorch?

- Tensors generalize scalars (rank 0), vectors (rank 1), and matrices (rank 2) to higher dimensions.
- Create tensors:

import torch
t0 = torch.tensor(1)              # 0D
t1 = torch.tensor([1, 2, 3])      # 1D
t2 = torch.tensor([[1, 2], [3, 4]])  # 2D

- Dtypes: ints default to torch.int64, floats default to torch.float32 (efficient for DL and GPUs). Change dtype with .to:

t = torch.tensor([1, 2, 3])       # int64
t = t.to(torch.float32)           # float32
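
To confirm the rank and dtype of a tensor, you can inspect its .ndim and .dtype attributes. The short sketch below uses illustrative variable names and checks the defaults described above.

```python
import torch

scalar = torch.tensor(1)                    # rank 0, created from a Python int
vector = torch.tensor([1.0, 2.0, 3.0])      # rank 1, created from Python floats
matrix = torch.tensor([[1, 2], [3, 4]])     # rank 2, created from Python ints

print(scalar.ndim, scalar.dtype)   # 0 torch.int64
print(vector.ndim, vector.dtype)   # 1 torch.float32
print(matrix.ndim, matrix.dtype)   # 2 torch.int64

# .to returns a new tensor; the original keeps its dtype
matrix_f32 = matrix.to(torch.float32)
print(matrix_f32.dtype)            # torch.float32
```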

What are the most common tensor operations I should know first?

- Shape: tensor.shape
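- Reshape: tensor.reshape(new_h, new_w) or tensor.view(new_h, new_w); view requires a compatible (typically contiguous) memory layout, whereas reshape copies the data when it has to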
- Transpose (2D): tensor.T
- Matrix multiply: tensor.matmul(other) or tensor @ other
Example:

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(x.shape)      # torch.Size([2, 3])
print(x.T)          # transpose
print(x @ x.T)      # 2x3 @ 3x2 -> 2x2
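
The difference between view and reshape is easiest to see on a transposed tensor. The following sketch is illustrative rather than taken from the book: a view shares memory with the original tensor, and view fails on a non-contiguous tensor where reshape succeeds by copying.

```python
import torch

x = torch.arange(6)          # tensor([0, 1, 2, 3, 4, 5])
a = x.reshape(2, 3)          # x is contiguous, so no copy is needed here
b = x.view(3, 2)             # a view shares memory with x
b[0, 0] = 100                # writing through the view changes x as well
print(x[0].item())           # 100

t = a.T                      # transposing yields a non-contiguous tensor
print(t.is_contiguous())     # False
r = t.reshape(6)             # reshape copies the data when it has to

view_failed = False
try:
    t.view(6)                # view cannot reinterpret this memory layout
except RuntimeError:
    view_failed = True
print(view_failed)           # True
```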

How does autograd work, and how do I compute gradients?

PyTorch builds a computation graph as you do tensor ops. If a tensor has requires_grad=True, PyTorch tracks ops on it. After computing a scalar loss, call loss.backward() to populate .grad on leaf parameters. Example:

import torch
import torch.nn.functional as F
x = torch.tensor([1.1])
w = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
y = torch.tensor([1.0])

a = torch.sigmoid(x * w + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()
print(w.grad, b.grad)  # gradients
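
As a sanity check, the autograd result can be compared with the closed-form gradient. For a sigmoid output a combined with binary cross-entropy, the derivative of the loss with respect to the pre-activation z = x * w + b simplifies to a - y, so the gradient is (a - y) * x for w and a - y for b. The sketch below assumes the same toy values as above; the names manual_w and manual_b are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.1])
w = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
y = torch.tensor([1.0])

a = torch.sigmoid(x * w + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()                      # autograd fills w.grad and b.grad

# Closed-form gradients via the chain rule: dL/dz = a - y
with torch.no_grad():
    manual_w = (a - y) * x           # dz/dw = x
    manual_b = a - y                 # dz/db = 1

print(torch.allclose(w.grad, manual_w, atol=1e-5))  # True
print(torch.allclose(b.grad, manual_b, atol=1e-5))  # True
```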

How do I build a simple neural network in PyTorch, and what are logits vs probabilities?

- Subclass torch.nn.Module, define layers in __init__, wire them in forward. Example (MLP):

import torch

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        # Two hidden layers (30 and 20 units) followed by the output layer
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, 30), torch.nn.ReLU(),
            torch.nn.Linear(30, 20),         torch.nn.ReLU(),
            torch.nn.Linear(20, num_outputs)
        )

    def forward(self, x):
        return self.layers(x)  # logits (no softmax applied)

- Logits are raw scores (no softmax). Many loss functions (e.g., cross_entropy) expect logits and apply softmax internally for stability.
- For probabilities at inference:

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)

- Use model.train() during training and model.eval() for evaluation/inference; wrap inference in torch.no_grad().
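
To make the points above concrete, the sketch below instantiates the network, counts its trainable parameters, and converts logits to probabilities. The input size of 50, the output size of 3, and the random input batch are arbitrary values chosen for illustration.

```python
import torch

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, 30), torch.nn.ReLU(),
            torch.nn.Linear(30, 20), torch.nn.ReLU(),
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        return self.layers(x)  # logits

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=50, num_outputs=3)

# (50*30 + 30) + (30*20 + 20) + (20*3 + 3) weights and biases
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(num_params)  # 2213

x = torch.rand(4, 50)            # a batch of 4 random examples
model.eval()                     # evaluation mode
with torch.no_grad():            # no graph is built during inference
    logits = model(x)
    probs = torch.softmax(logits, dim=1)

print(logits.shape)              # torch.Size([4, 3])
print(probs.sum(dim=1))          # each row sums to 1 (up to rounding)
```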

How do Dataset and DataLoader work, and what do batch size, shuffle, drop_last, and num_workers do?

- Create a custom Dataset with __len__ and __getitem__ to return one (features, label) pair.
- Wrap it in a DataLoader to handle batching and shuffling:

from torch.utils.data import DataLoader
loader = DataLoader(dataset=train_ds,
                    batch_size=32,
                    shuffle=True,
                    drop_last=True,   # avoid tiny last batch
                    num_workers=4)    # parallel data loading

- num_workers>0 can greatly speed up training by loading data in parallel; for tiny datasets or some notebook environments, 0 may be simpler.
- Labels for classification should start at 0 and be less than the number of output classes.
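
A custom Dataset along these lines might look as follows. This is a minimal sketch: the class name ToyDataset and the five-example toy data are assumptions made for illustration, and num_workers=0 is used so that the snippet also runs in notebook environments.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):  # hypothetical example class
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        # Return exactly one (features, label) pair
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.labels.shape[0]

features = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6],
                         [2.3, -1.1], [2.7, -1.5]])
labels = torch.tensor([0, 0, 0, 1, 1])   # class labels start at 0
train_ds = ToyDataset(features, labels)

torch.manual_seed(123)
loader = DataLoader(train_ds, batch_size=2, shuffle=True,
                    drop_last=True, num_workers=0)

batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(len(train_ds))   # 5
print(len(loader))     # 2, because the incomplete third batch is dropped
print(batch_shapes)    # [((2, 2), (2,)), ((2, 2), (2,))]
```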

What does a typical PyTorch training loop look like, and why call zero_grad/backward/step?

Core steps (call model.train() once per epoch, then repeat steps 2 to 6 for each batch):
1) model.train()
2) Forward pass: logits = model(features)
3) Loss: e.g., loss = F.cross_entropy(logits, labels) (pass logits, not softmax)
4) Zero grads: optimizer.zero_grad() (prevents gradient accumulation)
5) Backprop: loss.backward()
6) Update: optimizer.step()
For evaluation: model.eval() and wrap inference in torch.no_grad(). To save and load a trained model:

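# save only the learned parameters (the state_dict), not the whole object
torch.save(model.state_dict(), "model.pth")
# load: recreate the same architecture first, then restore the parameters
model = NeuralNetwork(in_dim, out_dim)
model.load_state_dict(torch.load("model.pth", weights_only=True))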
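
Putting the steps together, a complete loop could be sketched as below. The toy data, the small Sequential model, the learning rate of 0.1, and the 30 epochs are assumptions chosen so that the example runs in well under a second; they are not settings from the book.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(123)
features = torch.tensor([[-1.2, 3.1], [-0.9, 2.9], [-0.5, 2.6],
                         [2.3, -1.1], [2.7, -1.5]])
labels = torch.tensor([0, 0, 0, 1, 1])
train_loader = DataLoader(TensorDataset(features, labels),
                          batch_size=2, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(2, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(30):
    model.train()                                # training mode
    running = 0.0
    for xb, yb in train_loader:
        logits = model(xb)                       # forward pass
        loss = F.cross_entropy(logits, yb)       # expects logits, not softmax
        optimizer.zero_grad()                    # clear gradients from the last step
        loss.backward()                          # compute new gradients
        optimizer.step()                         # update the parameters
        running += loss.item()
    epoch_losses.append(running / len(train_loader))

model.eval()                                     # evaluation mode
with torch.no_grad():
    preds = torch.argmax(model(features), dim=1)
accuracy = (preds == labels).float().mean().item()
print(f"first/last epoch loss: {epoch_losses[0]:.3f} / {epoch_losses[-1]:.3f}")
print(f"training accuracy: {accuracy:.2f}")
```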

How do I accelerate training on GPUs (single and multi-GPU)?

- Single GPU: three changes are needed: define the device, move the model to it, and move each batch to it inside the training loop.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for features, labels in train_loader:
    features, labels = features.to(device), labels.to(device)

- All tensors in an op must be on the same device.
- Apple Silicon: use mps instead of cuda if available: torch.device("mps" if torch.backends.mps.is_available() else "cpu").
- Multi-GPU: use DistributedDataParallel (DDP). Each GPU gets a model replica and unique data shard (via DistributedSampler); gradients are synchronized across GPUs. Run DDP in a Python script (not inside notebooks) and initialize process groups; expect near-linear speedups with more GPUs (minus communication overhead).
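
The sharding performed by DistributedSampler can be inspected without any GPUs by passing num_replicas and rank explicitly. The sketch below only simulates two ranks in a single process to show that each would receive a disjoint subset; it is not a DDP training script, and the eight-element dataset is an illustrative assumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))   # eight toy samples: 0..7
world_size = 2                             # pretend there are two GPUs

shards = {}
for rank in range(world_size):
    # Passing num_replicas and rank explicitly avoids needing a process group
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    shards[rank] = [int(i) for (batch,) in loader for i in batch]

print(shards)  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```

In a real DDP run, each process would create only its own sampler, and the sampler would read the rank and world size from the initialized process group.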
