2 Working with Text Data
This chapter explains how to turn raw text into the numeric inputs required to pretrain decoder-only transformer language models. It outlines a complete data pipeline: preparing and tokenizing text, converting tokens to integer IDs, sampling input–target pairs for next-token prediction, and transforming those IDs into learnable vector representations. Along the way, it clarifies why neural networks cannot operate on categorical text directly, motivates embeddings as continuous representations, and sets up the components that will feed the model built in later chapters.
The chapter begins with simple tokenization, splitting words and punctuation into tokens, building a vocabulary, and implementing encode/decode mappings between text and token IDs. It introduces special tokens for unknown words and segment boundaries, then transitions to byte pair encoding (BPE), which avoids unknown tokens by decomposing any string into subwords or bytes—an approach used by GPT-style models with large vocabularies. Practical guidance includes using a fast BPE tokenizer, keeping capitalization, and understanding when to include or omit whitespace tokens depending on the task.
Next, the text is organized into training examples via a sliding window that creates input sequences and targets shifted by one position, with stride controlling overlap. A PyTorch Dataset and DataLoader turn these into batches of token ID tensors. Finally, token IDs are mapped to dense vectors through an embedding layer, and absolute positional embeddings are added to encode order, yielding input embeddings of shape batch size by context length by embedding dimension. These learned representations, initialized randomly and optimized during training, form the ready-to-use inputs for the transformer layers that follow.
A mental model of the three main stages: coding an LLM, pretraining it on a general text dataset, and finetuning it on a labeled dataset. This chapter explains and codes the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.
Deep learning models cannot process data formats like video, audio, and text in their raw form. Thus, we use an embedding model to transform this raw data into a dense vector representation that deep learning architectures can easily understand and process. Specifically, this figure illustrates the process of converting raw data into a three-dimensional numerical vector.
If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization purposes as shown here. When using word embedding techniques, such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space compared to countries and cities.
A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings.
The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text gets split into 10 individual tokens.
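To make this concrete, here is a minimal sketch of such a word-and-punctuation tokenizer using Python's re module; the sample sentence is purely illustrative.

```python
import re

text = "Hello, world. Is this-- a test?"   # illustrative sample sentence

# Split on punctuation characters, double dashes, and whitespace, keeping the delimiters
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]   # drop empty strings and whitespace-only tokens
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```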
We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity.
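As a minimal, self-contained sketch (with a toy token list standing in for a full tokenized training corpus), building such a vocabulary takes only a few lines:

```python
# Toy token list standing in for a tokenized training corpus
tokens = ["the", "cat", "sat", "on", "the", "mat"]

unique_tokens = sorted(set(tokens))                          # deduplicate and sort alphabetically
vocab = {token: i for i, token in enumerate(unique_tokens)}  # map each unique token to an integer ID
print(vocab)
# {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```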
Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity.
Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.
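A minimal sketch of such a tokenizer class is shown below; the class name SimpleTokenizer is illustrative, and it assumes a word-level vocabulary like the one built above.

```python
import re

class SimpleTokenizer:
    """Illustrative word-level tokenizer with encode and decode methods."""

    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # token ID -> token

    def encode(self, text):
        # Split text into word and punctuation tokens, then map them to IDs
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        # Map IDs back to tokens and rejoin them into natural-looking text
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)    # remove spaces before punctuation
```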
We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.
When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM. A small sketch of both ideas follows.
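The helper name encode_with_unk and the sample strings below are hypothetical; the snippet only illustrates how the two special tokens might be used.

```python
# Hypothetical helper: fall back to <|unk|> for tokens missing from the vocabulary
def encode_with_unk(tokens, vocab):
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

# Joining two unrelated documents with an <|endoftext|> marker before tokenization
documents = ["First, unrelated document.", "Second, unrelated document."]
combined_text = " <|endoftext|> ".join(documents)
```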
BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn't need to replace unknown words with special tokens, such as <|unk|>.
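For example, the tiktoken library provides the GPT-2 BPE tokenizer; the made-up word below is only meant to show that unseen strings are decomposed into known subword or byte tokens rather than mapped to <|unk|>.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

ids = tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})
print(ids)                    # the unknown word is split into known subword/byte token IDs
print(tokenizer.decode(ids))  # 'Akwirw ier' -- reconstructed exactly, nothing is lost
```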
Given a text sample, we extract input blocks as subsamples that serve as input to the LLM; the LLM's prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity.
To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.
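A minimal sketch with illustrative token IDs:

```python
import torch

token_ids = torch.tensor([40, 367, 2885, 1464, 1807, 3619])   # illustrative token IDs
context_length = 4

x = token_ids[:context_length]         # inputs:  tensor([  40,  367, 2885, 1464])
y = token_ids[1:context_length + 1]    # targets: tensor([ 367, 2885, 1464, 1807])
```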
When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1, we shift the input window by 1 position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches.
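A minimal sketch of such a data loader is shown below; the names GPTDataset and create_dataloader, as well as the default parameter values, are illustrative choices rather than a fixed API.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    """Illustrative sliding-window dataset over a list of token IDs."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        # Slide a window of max_length tokens over the data, moving by `stride` each step
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

def create_dataloader(token_ids, batch_size=4, max_length=4, stride=4):
    dataset = GPTDataset(token_ids, max_length, stride)
    # Each batch holds input and target tensors of shape (batch_size, max_length)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
```

Setting stride equal to max_length, as in the defaults above, gives non-overlapping windows; a stride of 1 shifts the window by a single token per example.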
Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into embedding vectors. In this section, we use the token IDs created in previous sections to create the token embedding vectors.
Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix. For instance, the embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is the sixth instead of the fifth row because Python starts counting at 0). For illustration purposes, we assume that the token IDs were produced by the small vocabulary we used in section 2.3.
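A minimal sketch using PyTorch's torch.nn.Embedding, assuming a tiny vocabulary of six tokens and an illustrative embedding dimension of three:

```python
import torch

torch.manual_seed(123)
vocab_size, embedding_dim = 6, 3                      # illustrative sizes
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_id = torch.tensor([5])
print(embedding_layer(token_id))      # embedding vector for token ID 5
print(embedding_layer.weight[5])      # the same values: the sixth row of the weight matrix
```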
The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it's in the first or third position in the token ID input vector, will result in the same embedding vector.
Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity.
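A minimal sketch, assuming the GPT-2 vocabulary size of 50,257 and otherwise illustrative dimensions and token IDs:

```python
import torch

vocab_size, context_length, embedding_dim = 50257, 4, 256   # illustrative sizes

token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

token_ids = torch.tensor([[40, 367, 2885, 1464]])                   # shape (batch, context_length)
token_embeddings = token_embedding_layer(token_ids)                 # shape (1, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # shape (4, 256), broadcast over the batch

input_embeddings = token_embeddings + pos_embeddings                # shape (1, 4, 256)
```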
As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers.
Summary
- LLMs require textual data to be converted into numerical vectors, known as embeddings, since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
- As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.
- Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.
- The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
- We use a sliding window approach on tokenized data to generate input-target pairs for LLM training.
- Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.
- While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training.
FAQ
Why can’t large language models (LLMs) process raw text, and what are embeddings?
LLMs operate on numbers, not categorical text. Embeddings map discrete tokens (like words or subwords) to continuous vectors so neural networks can compute on them. These vectors are learned during training and capture semantic relationships. GPT-style models typically use high-dimensional embeddings (for example, 768 in small GPT-2, 12,288 in GPT-3).
What does “tokenization” mean, and why keep punctuation and casing?
Tokenization splits text into pieces (tokens) that models consume. Simple schemes split into words and punctuation; keeping punctuation and case helps the model learn sentence structure, proper nouns, and realistic generation. More advanced schemes use subword tokenization (like byte pair encoding) to balance vocabulary size and coverage.
How do vocabularies and token IDs work?
A vocabulary maps each unique token in the training data to a unique integer (token ID). Encoding converts text → tokens → IDs; decoding performs the reverse. This integer representation is the bridge between text and the embedding layer that produces numeric vectors for the model.
What happens with words not seen during training?
With a simple word-level tokenizer, unseen words are out-of-vocabulary. A common fix is an unknown token, like <|unk|>, to stand in for any unseen token. Another special token, <|endoftext|>, is often inserted between unrelated documents to signal boundaries.
Do GPT tokenizers use an unknown token?
No. GPT models use a byte pair encoding (BPE) tokenizer that decomposes any word into subwords or characters, so an <|unk|> token isn’t needed. GPT’s tokenizer does use <|endoftext|> (ID 50256 in the GPT-2/3 vocabulary of 50,257 tokens) and typically relies on attention masking when padding.