2 Working with Text Data
This chapter explains how to turn raw text into the numeric inputs required to pretrain decoder-only transformer language models. It outlines a complete data pipeline: preparing and tokenizing text, converting tokens to integer IDs, sampling input–target pairs for next-token prediction, and transforming those IDs into learnable vector representations. Along the way, it clarifies why neural networks cannot operate on categorical text directly, motivates embeddings as continuous representations, and sets up the components that will feed the model built in later chapters.
The chapter begins with simple tokenization, splitting words and punctuation into tokens, building a vocabulary, and implementing encode/decode mappings between text and token IDs. It introduces special tokens for unknown words and segment boundaries, then transitions to byte pair encoding (BPE), which avoids unknown tokens by decomposing any string into subwords or bytes—an approach used by GPT-style models with large vocabularies. Practical guidance includes using a fast BPE tokenizer, keeping capitalization, and understanding when to include or omit whitespace tokens depending on the task.
Next, the text is organized into training examples via a sliding window that creates input sequences and targets shifted by one position, with stride controlling overlap. A PyTorch Dataset and DataLoader turn these into batches of token ID tensors. Finally, token IDs are mapped to dense vectors through an embedding layer, and absolute positional embeddings are added to encode order, yielding input embeddings of shape batch size by context length by embedding dimension. These learned representations, initialized randomly and optimized during training, form the ready-to-use inputs for the transformer layers that follow.
A mental model of the three main stages: coding an LLM, pretraining it on a general text dataset, and finetuning it on a labeled dataset. This chapter explains and codes the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.
Deep learning models cannot process data formats like video, audio, and text in their raw form. Thus, we use an embedding model to transform this raw data into a dense vector representation that deep learning architectures can easily understand and process. Specifically, this figure illustrates the process of converting raw data into a three-dimensional numerical vector.
If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization purposes as shown here. When using word embedding techniques, such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space compared to countries and cities.
A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings.
The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text gets split into 10 individual tokens.
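To make this concrete, here is a minimal sketch of such a word-and-punctuation tokenizer using Python's re module; the sample sentence is purely illustrative.

```python
import re

text = "Hello, world. Is this-- a test?"   # illustrative sample sentence

# Split on punctuation characters, double dashes, and whitespace, keeping the delimiters
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]   # drop empty strings and whitespace-only tokens
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```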
We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.
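As a minimal, self-contained sketch (with a toy token list standing in for a full tokenized training corpus), building such a vocabulary takes only a few lines:

```python
# Toy token list standing in for a tokenized training corpus
tokens = ["the", "cat", "sat", "on", "the", "mat"]

unique_tokens = sorted(set(tokens))                          # deduplicate and sort alphabetically
vocab = {token: i for i, token in enumerate(unique_tokens)}  # map each unique token to an integer ID
print(vocab)
# {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```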
Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity.
Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.
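A minimal sketch of such a tokenizer class is shown below; the class name SimpleTokenizer is illustrative, and it assumes a word-level vocabulary like the one built above.

```python
import re

class SimpleTokenizer:
    """Illustrative word-level tokenizer with encode and decode methods."""

    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # token ID -> token

    def encode(self, text):
        # Split text into word and punctuation tokens, then map them to IDs
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        # Map IDs back to tokens and rejoin them into natural-looking text
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)    # remove spaces before punctuation
```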
We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.
When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM. A small sketch of both ideas follows.
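The helper name encode_with_unk and the sample strings below are hypothetical; the snippet only illustrates how the two special tokens might be used.

```python
# Hypothetical helper: fall back to <|unk|> for tokens missing from the vocabulary
def encode_with_unk(tokens, vocab):
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

# Joining two unrelated documents with an <|endoftext|> marker before tokenization
documents = ["First, unrelated document.", "Second, unrelated document."]
combined_text = " <|endoftext|> ".join(documents)
```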
BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn't need to replace unknown words with special tokens, such as <|unk|>.
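For example, the tiktoken library provides the GPT-2 BPE tokenizer; the made-up word below is only meant to show that unseen strings are decomposed into known subword or byte tokens rather than mapped to <|unk|>.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

ids = tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})
print(ids)                    # the unknown word is split into known subword/byte token IDs
print(tokenizer.decode(ids))  # 'Akwirw ier' -- reconstructed exactly, nothing is lost
```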
Given a text sample, we extract input blocks as subsamples that serve as input to the LLM; the LLM's prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity.
To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.
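A minimal sketch with illustrative token IDs:

```python
import torch

token_ids = torch.tensor([40, 367, 2885, 1464, 1807, 3619])   # illustrative token IDs
context_length = 4

x = token_ids[:context_length]         # inputs:  tensor([  40,  367, 2885, 1464])
y = token_ids[1:context_length + 1]    # targets: tensor([ 367, 2885, 1464, 1807])
```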
When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1, we shift the input window by 1 position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches.
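A minimal sketch of such a data loader is shown below; the names GPTDataset and create_dataloader, as well as the default parameter values, are illustrative choices rather than a fixed API.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    """Illustrative sliding-window dataset over a list of token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        # Slide a window of max_length tokens over the data, moving by `stride` each step
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

def create_dataloader(token_ids, batch_size=4, max_length=4, stride=4):
    dataset = GPTDataset(token_ids, max_length, stride)
    # Each batch holds input and target tensors of shape (batch_size, max_length)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
```

Setting stride equal to max_length, as in the defaults above, gives non-overlapping windows; a stride of 1 shifts the window by a single token per example.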
Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into embedding vectors. In this section, we use the token IDs created in previous sections to create the token embedding vectors.
Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix. For instance, the embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is the sixth instead of the fifth row because Python starts counting at 0). For illustration purposes, we assume that the token IDs were produced by the small vocabulary we used in section 2.3.
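A minimal sketch using PyTorch's torch.nn.Embedding, assuming a tiny vocabulary of six tokens and an illustrative embedding dimension of three:

```python
import torch

torch.manual_seed(123)
vocab_size, embedding_dim = 6, 3                      # illustrative sizes
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_id = torch.tensor([5])
print(embedding_layer(token_id))      # embedding vector for token ID 5
print(embedding_layer.weight[5])      # the same values: the sixth row of the weight matrix
```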
The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it's in the first or third position in the token ID input vector, will result in the same embedding vector.
Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity.
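A minimal sketch, assuming the GPT-2 vocabulary size of 50,257 and otherwise illustrative dimensions and token IDs:

```python
import torch

vocab_size, context_length, embedding_dim = 50257, 4, 256   # illustrative sizes

token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

token_ids = torch.tensor([[40, 367, 2885, 1464]])                   # shape (batch, context_length)
token_embeddings = token_embedding_layer(token_ids)                 # shape (1, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape (4, 256), broadcast over the batch

input_embeddings = token_embeddings + pos_embeddings                # shape (1, 4, 256)
```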
As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers.
Summary
- LLMs require textual data to be converted into numerical vectors, known as embeddings, since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
- As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.
- Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.
- The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
- We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.
- Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.
- While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.
FAQ
Why can’t large language models (LLMs) process raw text, and what are embeddings?
LLMs operate on numbers, not categorical text. Embeddings map discrete tokens (like words or subwords) to continuous vectors so neural networks can compute on them. These vectors are learned during training and capture semantic relationships. GPT-style models typically use high-dimensional embeddings (for example, 768 in small GPT-2, 12,288 in GPT-3).
What does “tokenization” mean, and why keep punctuation and casing?
Tokenization splits text into pieces (tokens) that models consume. Simple schemes split into words and punctuation; keeping punctuation and case helps the model learn sentence structure, proper nouns, and realistic generation. More advanced schemes use subword tokenization (like byte pair encoding) to balance vocabulary size and coverage.
How do vocabularies and token IDs work?
A vocabulary maps each unique token in the training data to a unique integer (token ID). Encoding converts text → tokens → IDs; decoding performs the reverse. This integer representation is the bridge between text and the embedding layer that produces numeric vectors for the model.
What happens with words not seen during training?
With a simple word-level tokenizer, unseen words are out-of-vocabulary. A common fix is an unknown token, like <|unk|>, to stand in for any unseen token. Another special token, <|endoftext|>, is often inserted between unrelated documents to signal boundaries.
Do GPT tokenizers use an unknown token?
No. GPT models use a byte pair encoding (BPE) tokenizer that decomposes any word into subwords or characters, so an <|unk|> token isn’t needed. GPT’s tokenizer does use <|endoftext|> (ID 50256 in the GPT-2/3 vocabulary of 50,257 tokens) and typically relies on attention masking when padding.