Overview

7 Finetuning to Follow Instructions

This chapter shifts from building and pretraining a text-completion LLM to adapting it to reliably follow human instructions. It introduces supervised instruction finetuning on a compact dataset of instruction–response pairs, emphasizing prompt design and choosing a consistent format (here, the Alpaca style) so the model learns to produce responses in a predictable structure. The data is split into training, validation, and test sets, and the chapter motivates why instruction finetuning is essential for practical assistants and chatbots that must execute tasks like editing, converting styles, or answering direct queries.
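The split into training, validation, and test sets can be sketched in a few lines of plain Python. This is a minimal illustration, not the chapter's code: the function name `split_dataset`, the toy entries, and the 85/10/5 fractions are assumptions chosen for the example.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.10):
    """Split a list of entries into train, validation, and test portions.

    The fractions are illustrative assumptions; whatever remains after the
    train and test portions becomes the validation set.
    """
    train_end = int(len(data) * train_frac)
    test_end = train_end + int(len(data) * test_frac)
    train_data = data[:train_end]
    test_data = data[train_end:test_end]
    val_data = data[test_end:]
    return train_data, val_data, test_data


# Toy stand-in for the instruction dataset (hypothetical entries).
data = [
    {"instruction": f"Task {i}", "input": "", "output": f"Answer {i}"}
    for i in range(100)
]
train_data, val_data, test_data = split_dataset(data)
print(len(train_data), len(val_data), len(test_data))
```

Keeping the test portion untouched until the very end is what makes the later scoring step a fair estimate of instruction-following quality.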

The implementation centers on preparing data the model can learn from efficiently. Entries are formatted into prompt–response text, tokenized, and organized into batches with a custom collate function that pads per batch (using the end-of-text token) to minimize unnecessary padding. Targets are created by shifting inputs by one position, and padding is masked with an ignore index so it does not influence the loss; optionally, instruction tokens can be masked as well. With these pieces in place, PyTorch data loaders provide shuffled, device-ready batches for training.
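The padding, shifting, and masking steps described above can be sketched with plain Python lists. The function name `collate_sketch` is hypothetical, and this version returns lists rather than tensors; in a PyTorch pipeline the results would typically be converted to tensors and moved to the target device.

```python
PAD_TOKEN_ID = 50256   # GPT-2 end-of-text token, reused as padding
IGNORE_INDEX = -100    # default ignore_index of PyTorch's cross_entropy


def collate_sketch(batch, pad_token_id=PAD_TOKEN_ID,
                   ignore_index=IGNORE_INDEX, allowed_max_length=None):
    """Pad, shift, and mask a batch of token-ID lists."""
    # Every sequence receives one end-of-text token, so the longest
    # sequence plus one sets the padded length for this batch only.
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_batch, targets_batch = [], []

    for item in batch:
        padded = list(item) + [pad_token_id] * (batch_max_length - len(item))
        inputs = padded[:-1]   # drop the last token
        targets = padded[1:]   # target i is the token that follows input i

        # Keep the first end-of-text token in the targets, mask the rest.
        seen_pad = False
        for i, token_id in enumerate(targets):
            if token_id == pad_token_id:
                if seen_pad:
                    targets[i] = ignore_index
                seen_pad = True

        # Optionally truncate to the model's context length.
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_batch.append(inputs)
        targets_batch.append(targets)

    return inputs_batch, targets_batch


inputs, targets = collate_sketch([[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]])
for row_in, row_tgt in zip(inputs, targets):
    print(row_in, row_tgt)
```

Because padding is computed per batch, a batch of short examples stays short instead of being padded to the longest example in the whole dataset.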

For finetuning, a pretrained GPT-2 medium model is loaded to ensure sufficient capacity. A quick baseline shows poor instruction following before training; after two epochs with AdamW, training and validation losses drop and responses improve noticeably. The chapter then demonstrates how to extract test-set generations, store them, and score them automatically using a local evaluator LLM (Llama 3 via Ollama), yielding an average score a bit above 50 on a 0–100 scale. It closes with guidance on improving results—tuning hyperparameters, scaling data or model size, experimenting with prompt styles, and considering follow-up steps such as preference tuning—rounding out a complete, from-scratch workflow for instruction finetuning.

A mental model of the three main stages: coding an LLM, pretraining the LLM on a general text dataset, and finetuning it. This chapter focuses on finetuning a pretrained LLM to follow human instructions.
This figure shows examples of instructions that are processed by an LLM to generate desired responses.
Illustration of the three-stage process for instruction finetuning the LLM in this chapter. Stage 1 involves dataset preparation. Stage 2 focuses on model setup and finetuning. Stage 3 covers the evaluation of the model.
Comparison of prompt styles for instruction finetuning in LLMs. The Alpaca style (left) uses a structured format with defined sections for instruction, input, and response, while the Phi-3 style (right) employs a simpler format with designated <|user|> and <|assistant|> tokens.
After downloading the dataset and implementing the text formatting utility function in the previous section, this section focuses on assembling the training batches.
An illustration of the five substeps involved in implementing the batching process: applying the prompt template defined in the previous section, using tokenization from previous chapters, adding padding tokens, creating target token IDs, and replacing padding tokens with the -100 placeholder value so that they are masked in the loss function.
This diagram shows how entries are first formatted using a specific prompt template and then tokenized, resulting in a sequence of token IDs that the model can process.
This figure shows the padding of training examples in batches using token ID 50256 to ensure uniform length within each batch. Different batches may have different lengths, as shown by the first and second batches in this figure.
An illustration of the five substeps involved in implementing the batching process. We are now focusing on step 2.4, the creation of target token IDs. This step is essential as it enables the model to learn and predict the tokens it needs to generate.
This figure illustrates the input and target token alignment used in the instruction finetuning process of an LLM. For each input sequence, the corresponding target sequence is created by shifting the token IDs by one position: the first token of the input is omitted and an end-of-text token is appended.
This figure illustrates step 2.5 in the token replacement process we apply to the data batches. After creating the target sequence by shifting the token IDs by one position and appending an end-of-text token, step 2.5 replaces the end-of-text padding tokens with a placeholder value (-100).
This figure also illustrates step 2.5, the token replacement in the target batch during training data preparation. It shows the replacement of all but the first instance of the end-of-text token, which we use as padding, with the placeholder value -100, while keeping the initial end-of-text token in each target sequence.
The left side shows the formatted input text we tokenize and then feed to the LLM during training. The right side shows the target text we prepare for the LLM where we can optionally mask out the instruction section, which means replacing the corresponding token IDs with the -100 ignore_index value.
In previous sections, we prepared the dataset and implemented a custom collate function for batching the instruction dataset. In this section, we create and apply the data loaders to the training, validation, and test sets that we need for the LLM instruction finetuning and evaluation.
After the dataset preparation, the process of finetuning an LLM for instruction-following begins with loading a pretrained LLM, which serves as the foundation for subsequent training. This pretrained model, having already learned general language patterns and knowledge from vast amounts of text data, is then adapted for instruction following through the finetuning process in the next section.
In step 5 of finetuning the LLM for instruction-following, we train the pretrained model loaded in the previous section on the instruction dataset prepared earlier in this chapter.
A plot showing the training and validation loss trends over two epochs. The solid line represents the training loss, showing a sharp decrease before stabilizing, while the dotted line represents the validation loss, which follows a similar pattern.
This section is focused on extracting and collecting the model responses on the held-out test dataset for further analysis. The next section covers model evaluation to quantify the performance of the instruction-finetuned LLM.
In this last step of the instruction finetuning pipeline, we implement a method to quantify the performance of the finetuned model by scoring the responses it generated for the test set.
Two options for running Ollama. The left panel illustrates starting Ollama using ollama serve. The right panel shows a second option on macOS: running the Ollama application in the background instead of using the ollama serve command to start the application.
An overview of the different stages of implementing, pretraining, and finetuning an LLM covered in this book.

Summary

  • The instruction finetuning process adapts a pretrained LLM to follow human instructions and generate desired responses.
  • Preparing the dataset involves downloading an instruction-response dataset, formatting the entries, and splitting it into training, validation, and test sets.
  • Training batches are constructed using a custom collate function that pads sequences, creates target token IDs, and masks padding tokens.
  • We load a pretrained GPT-2 medium model with 355M parameters to serve as the starting point for instruction finetuning.
  • The pretrained model is finetuned on the instruction dataset using a training loop similar to pretraining.
  • Evaluation involves extracting model responses on a test set and scoring them, e.g. using another LLM.
  • The Ollama application with an 8B parameter Llama model can be used to automatically score the finetuned model's responses on the test set, providing an average score to quantify performance.

FAQ

What is instruction finetuning, and how does it differ from classification finetuning?
Instruction finetuning (supervised instruction finetuning) trains a pretrained LLM to follow natural-language instructions by learning from instruction–input–output pairs. Unlike classification finetuning, which adds a head and optimizes for class labels, instruction finetuning teaches next-token prediction over formatted instruction-response text so the model learns to produce coherent task-oriented replies.
How is the instruction dataset structured, and what prompt style is used?
The dataset entries are dictionaries with keys: instruction, input (optional), and output. The chapter formats them using the Alpaca prompt style:
  • Preamble: “Below is an instruction…”
  • Section headers: “### Instruction:” and optionally “### Input:”
  • Target section: “### Response:” followed by the expected output
If input is empty, the “### Input:” section is omitted.
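A formatting helper along these lines could look as follows. This is a sketch under assumptions: the names `format_input` and `format_training_text` and the sample entries are illustrative, and the preamble is kept abbreviated exactly as it appears above, so the full preamble wording would need to be substituted in practice.

```python
# Abbreviated as in the text; replace with the full preamble wording.
PREAMBLE = "Below is an instruction…"


def format_input(entry):
    """Build the prompt part of an Alpaca-style example."""
    text = f"{PREAMBLE}\n\n### Instruction:\n{entry['instruction']}"
    # The input section is optional and omitted when empty or missing.
    if entry.get("input"):
        text += f"\n\n### Input:\n{entry['input']}"
    return text


def format_training_text(entry):
    """Prompt plus the expected response, as used for training."""
    return format_input(entry) + f"\n\n### Response:\n{entry['output']}"


# Hypothetical sample entries.
with_input = {
    "instruction": "Convert the sentence to passive voice.",
    "input": "The chef cooks the meal.",
    "output": "The meal is cooked by the chef.",
}
without_input = {
    "instruction": "Name a primary color.",
    "input": "",
    "output": "Red.",
}

print(format_training_text(with_input))
print(format_training_text(without_input))
```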
Why do we need a custom collate function for batching, and what does it do?
Sequences have varying lengths, so a custom collate function does the following:
  • Applies per-batch padding to the longest sequence in that batch (minimizes waste)
  • Uses the end-of-text token ID 50256 for padding
  • Creates inputs and targets, where the targets are the inputs shifted by one position so that each target is the next token to predict
  • Optionally truncates to a maximum context length (for GPT‑2, 1024)
Why replace padding tokens with -100 in the targets, and why keep the first end-of-text?
PyTorch’s cross_entropy ignores targets equal to ignore_index (default -100). Replacing padding tokens with -100 prevents padding from contributing to the loss. One 50256 token (end-of-text) is kept in each target sequence so the model learns when to end a response.
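The effect of the ignore index can be illustrated without PyTorch. The sketch below mimics the averaging behavior in plain Python; `cross_entropy_sketch` is a hypothetical helper written for this example, not PyTorch's implementation, and the logits are made-up numbers.

```python
import math

IGNORE_INDEX = -100


def cross_entropy_sketch(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions that are not masked.

    Assumes at least one target differs from ignore_index.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Log of the softmax denominator, computed stably by
        # subtracting the row maximum before exponentiating.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits_two = [[-1.0, 1.0], [-0.5, 1.5]]
logits_three = logits_two + [[-0.5, 1.5]]

loss_two = cross_entropy_sketch(logits_two, [0, 1])
loss_masked = cross_entropy_sketch(logits_three, [0, 1, IGNORE_INDEX])
print(loss_two, loss_masked)
```

The two losses match because the third position is skipped entirely, which is the behavior that keeps padding from influencing training.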
Should we mask instruction tokens in the targets?
Masking instruction tokens (setting their target IDs to -100) makes the loss focus on the response and can reduce overfitting. However, research is mixed; some work (“Instruction Tuning With Loss Over Instructions”) suggests not masking can help. The chapter leaves instruction masking as an optional experiment.
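One way to try this experiment is sketched below. The helper name `mask_instruction_tokens` and the toy token IDs are assumptions; the key detail is the off-by-one that comes from the targets being shifted by one position relative to the inputs.

```python
IGNORE_INDEX = -100


def mask_instruction_tokens(targets, instruction_length,
                            ignore_index=IGNORE_INDEX):
    """Return a copy of targets with the prompt portion masked.

    instruction_length is the number of tokens in the formatted prompt.
    Targets are shifted by one, so only the first instruction_length - 1
    target positions hold prompt tokens.
    """
    masked = list(targets)
    for i in range(min(instruction_length - 1, len(masked))):
        masked[i] = ignore_index
    return masked


# Toy example: three prompt tokens, two response tokens, end-of-text.
tokens = [10, 11, 12, 20, 21, 50256]
inputs = tokens[:-1]
targets = tokens[1:]

masked_targets = mask_instruction_tokens(targets, instruction_length=3)
print(inputs)
print(masked_targets)
```

With this masking, the first unmasked target is the first response token, so the model is still trained to begin the response right after the prompt.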
How do we handle long samples that exceed the model’s context window?
Use the allowed_max_length option in the custom collate function to truncate inputs and targets to the model’s context size (1024 for GPT‑2). This prevents out-of-range sequences and helps avoid memory issues.
Which pretrained model should I start with, and what about hardware limits?
The chapter finetunes GPT‑2 medium (355M) because GPT‑2 small (124M) is often too limited for good instruction following. If you face hardware constraints, you can:
  • Switch to GPT‑2 small
  • Reduce batch size
  • Lower allowed_max_length (e.g., 512 or 256)
  • Use a GPU for speed; runtimes improve dramatically on GPUs
Why does generated text include the prompt, and how do I extract only the response?
The generate function returns prompt plus completion (text completion behavior). To isolate the response, remove the input prefix from the generated text and optionally strip the “### Response:” header. This yields only the model’s answer for evaluation.
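The extraction step amounts to simple string slicing. In this sketch, `extract_response` is a hypothetical helper name and the prompt and generated strings are made up for illustration.

```python
def extract_response(generated_text, input_text):
    """Strip the prompt prefix and the response header from generated text."""
    # The generated text starts with the prompt, so slice it off by length.
    response = generated_text[len(input_text):]
    return response.replace("### Response:", "").strip()


# Hypothetical prompt and model output.
input_text = "### Instruction:\nName a primary color."
generated_text = input_text + "\n\n### Response:\nRed."

print(extract_response(generated_text, input_text))
```

Slicing by the prompt length assumes the decoded output reproduces the prompt character for character, which is worth checking if a different tokenizer or decoding step is used.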
How can I automatically evaluate the finetuned model’s responses?
The chapter uses a larger local LLM (Llama 3 via Ollama) to score responses:
  • Install and start Ollama
  • Query the model via its REST API with deterministic settings (temperature=0, fixed seed)
  • Prompt it to score each response from 0–100 compared to the reference output
  • Average scores across the test set for a concise performance metric
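The scoring loop can be sketched with the Python standard library. Several details here are assumptions: the helper names, the wording of the scoring prompt, the model tag "llama3", the seed value, and the default local address http://localhost:11434/api/chat. The `query_model` function only works while an Ollama server is running locally with the model already pulled.

```python
import json
import urllib.request


def build_scoring_prompt(prompt_text, reference, response):
    """Hypothetical scoring prompt; the wording is an assumption."""
    return (
        f"Given the input `{prompt_text}` and correct output `{reference}`, "
        f"score the model response `{response}` on a scale from 0 to 100, "
        "where 100 is the best score. Respond with the integer number only."
    )


def build_payload(prompt, model="llama3"):
    """Request body for Ollama's chat endpoint with deterministic settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON object, not a stream
        "options": {"seed": 123, "temperature": 0},
    }


def query_model(prompt, model="llama3",
                url="http://localhost:11434/api/chat"):
    """Send the prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    return body["message"]["content"]


def average_score(raw_scores):
    """Average the replies that parse as integers, skipping the rest."""
    scores = [int(s) for s in raw_scores if s.strip().isdigit()]
    return sum(scores) / len(scores) if scores else None
```

A typical use would call `query_model(build_scoring_prompt(...))` once per test entry, collect the replies in a list, and pass that list to `average_score`. Replies that are not plain integers are skipped here, which is one simple way to handle an evaluator that occasionally adds extra text.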
What are practical ways to improve instruction-following performance further?
Try:
  • Hyperparameter tuning (learning rate, batch size, epochs)
  • Larger or more diverse instruction datasets (e.g., Alpaca 52k)
  • Different prompt styles (e.g., Phi‑3 style) or better response formatting
  • Larger base models (if resources allow)
  • Parameter-efficient finetuning (LoRA)
  • Preference tuning (e.g., DPO) after SFT to better align outputs with preferences
