Overview

5 How do we constrain the behavior of LLMs?

Constraining an LLM’s behavior makes it more useful, because a base model simply continues text and can drift off-topic, produce undesirable content, or violate strict formatting needs. The chapter explains why constraints are essential and outlines four levers for control: curate training data, alter the base training process, fine-tune after pretraining, and post-process outputs with code. Fine-tuning is emphasized as the most practical and impactful approach, turning a general “base” or “foundation” model into an instruction-following system tailored to specific tasks. Motivations include keeping models safe and on-task, coping with missing or new information, and meeting rigid output formats that probabilistic decoding alone cannot guarantee. No single method is perfect, so practitioners typically layer techniques to achieve reliability.

Supervised fine-tuning (SFT) extends next-token training on high-quality, task-specific examples to inject domain knowledge and style, but it does not change the model’s incentives and can suffer from catastrophic forgetting and privacy risks. Reinforcement Learning from Human Feedback (RLHF) tackles abstract goals like helpfulness and harmlessness by training a reward model from human-rated examples, then optimizing the LLM to maximize predicted quality while staying close to base behavior via an explicit similarity constraint. This balance stabilizes learning and reduces reward hacking, but RLHF is data- and compute-intensive, works best on known issues, and does not add new reasoning capabilities. In practice it is often combined with SFT and careful prompt design to create usable chatbots that avoid many base-model failure modes, while still requiring continuous evaluation.

Beyond fine-tuning, behavior can be shaped by curating data (quality, diversity, and tokenization choices), modifying base training to protect privacy (such as with differential privacy), and enforcing constraints at inference time via decoding rules, guardrails, and schema-aware validators that regenerate tokens on parse errors. Practical systems also integrate LLMs into broader workflows, notably Retrieval-Augmented Generation, which retrieves relevant documents and conditions the model on them to improve factuality and transparency. Emerging tools for LLM “programming” help orchestrate multi-step pipelines, automate prompt construction and tuning, and make it easier to swap models or data sources. The overarching theme is to combine data, training, and runtime controls with rigorous testing to align outputs to task, safety, and formatting requirements.

There are four places where one may intervene to change or constrain an LLM’s behavior. The two stages of model training are shown in the middle of the diagram, where the model’s parameters are altered. On the left, one could also alter the training data before model training. On the right, one could intercept the model outputs after model training and write code to handle specific situations.
figure
Commercial LLMs like ChatGPT are designed to follow instructions (within some limits) and can perform many low-cognition or pattern-matching tasks with very high efficacy. This includes stylized writing, such as pattern matching, or instruction following, such as roleplaying as a car salesman.
figure
Supervised Fine-Tuning (SFT) is a simple approach to improving model results. You repeat the same process used to build the base model: once the base model is trained on a large amount of general data, you continue training on a smaller, specialized data collection.
figure
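As a minimal sketch of the idea that SFT reuses the base training procedure on a smaller, specialized collection, the toy below swaps the neural network for a bigram counter. The `train` function, both tiny corpora, and the `weight` parameter are hypothetical stand-ins rather than part of any real LLM toolkit. The point is only that the same update routine runs twice, first on general data and then on specialized data.

```python
from collections import Counter, defaultdict

def train(model, corpus, weight=1):
    """Update bigram counts in place; the same routine serves both stages."""
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += weight
    return model

def most_likely_next(model, word):
    """Return the most frequent continuation seen after `word`, or None."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

# Stage 1: "pretraining" on general text (hypothetical toy data).
general = ["the car is red", "the car is fast", "the car is red today"]
model = train(defaultdict(Counter), general)
before = most_likely_next(model, "is")

# Stage 2: "fine-tuning" continues the same process on specialized text.
# The weight is an assumption standing in for several passes over the small set.
specialized = ["the car is available", "this model is available now"]
train(model, specialized, weight=3)
after = most_likely_next(model, "is")

print(before, "->", after)  # prints "red -> available"
```

Because the second stage modifies the same counts, the earlier preference for "red" is overridden rather than preserved, which loosely mirrors why fine-tuning is not purely additive.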
RL is about iterative interactions, where the “reward” for your actions may not materialize for a long time and requires multiple steps to achieve. For a chatbot like GPT, the “environment” is the conversation with a user, and the “actions” are the infinite possible texts that GPT might complete. The reward becomes, in some sense, the user’s satisfaction with the chatbot at the end of the conversation.
figure
RLHF is quite good at getting LLMs to avoid known, specific issues. However, it does not endow the model with new tools to handle novel issues. Here, the model’s pull toward mentioning the Miami Dolphins, the logical thing to say next after a question about football in Miami, violates the earlier request to never mention dolphins.
figure
A naive and incomplete version of RLHF. The dashed lines represent text being sent from one component to another. Since text is incompatible with gradient descent, a more difficult RL algorithm must be used instead. This allows us to alter the weights of the LLM based on a quality score for the LLM’s outputs.
figure
The reward model is trained like a standard supervised classification algorithm. A neural network, which could be an LLM itself or a simpler network such as a convolutional or recurrent neural network, is trained to predict how a human would score a prompt-completion pair. Because neural networks are differentiable, this training works and provides a tool that stands in as the “human” in RLHF.
figure
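To make the "trained like a standard supervised classifier" point concrete, here is a hedged sketch in which logistic regression over bag-of-words features stands in for the neural network. The rated examples, the feature scheme, and every function name are assumptions made for illustration. A real reward model would be far larger and would be trained on actual human ratings.

```python
import math

def featurize(prompt, completion, vocab):
    """Bag-of-words indicator features over the prompt-completion pair."""
    words = set((prompt + " " + completion).lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def predict(weights, bias, x):
    """Sigmoid of a linear score: predicted probability of a good rating."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(examples, vocab, epochs=200, lr=0.5):
    """Fit logistic regression by gradient descent on (prompt, completion, label)."""
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for prompt, completion, label in examples:
            x = featurize(prompt, completion, vocab)
            error = predict(weights, bias, x) - label  # log-loss gradient w.r.t. z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical human ratings: 1 = good completion, 0 = bad completion.
rated = [
    ("how do i reset my password", "happy to help, click forgot password", 1),
    ("how do i reset my password", "that is a stupid question", 0),
    ("what are your hours", "we are open nine to five, happy to help", 1),
    ("what are your hours", "figure it out yourself", 0),
]
vocab = sorted({w for p, c, _ in rated for w in (p + " " + c).lower().split()})
weights, bias = train_reward_model(rated, vocab)

question = "where are you located"
polite = predict(weights, bias, featurize(question, "happy to help, we are downtown", vocab))
rude = predict(weights, bias, featurize(question, "figure it out yourself", vocab))
print(polite > rude)  # prints True
```

Once trained, `predict` can score completions no human has rated, which is the role the reward model plays inside the RLHF loop.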
The full version of RLHF. The dashed lines are text and require reinforcement learning to update the parameters. The Original LLM is the base model without any alterations, while the LLM to fine-tune starts as the base model but is altered to improve the quality of its outputs. The similarity and quality reward components are provided with word probabilities to improve calculation. RL adjusts the parameters by combining the quality and similarity scores.
figure
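One common way to combine the two scores is to subtract a scaled divergence between the fine-tuned and original word probabilities from the quality score. The sketch below assumes that form; the KL divergence, the `beta` coefficient, and the function names are illustrative choices, not something the caption specifies.

```python
import math

def kl_divergence(p_tuned, p_base):
    """KL(p_tuned || p_base) for one next-word distribution (word -> probability)."""
    return sum(p * math.log(p / p_base[w]) for w, p in p_tuned.items() if p > 0)

def combined_reward(quality, tuned_steps, base_steps, beta=0.1):
    """Quality score minus a penalty for drifting away from the original LLM."""
    drift = sum(kl_divergence(t, b) for t, b in zip(tuned_steps, base_steps))
    return quality - beta * drift

# Hypothetical next-word probabilities at a single generation step.
base = [{"yes": 0.5, "no": 0.5}]
unchanged = [{"yes": 0.5, "no": 0.5}]
drifted = [{"yes": 0.9, "no": 0.1}]

print(combined_reward(1.0, unchanged, base))  # no drift, so the reward is 1.0
print(combined_reward(1.0, drifted, base))    # slightly below 1.0 because of the penalty
```

A larger `beta` keeps the fine-tuned model closer to the original, while a smaller one lets the quality score dominate.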
In addition to fine-tuning, one can change the model’s behavior by altering the training data, altering the base model training process, or modifying the model outputs by writing code to handle specific situations.
figure
By writing code that enforces a format specification, you can catch invalid output from an LLM as it is being generated. Once an invalid token is detected, a simple remedy is to have the LLM fall back to the next most likely token until a valid output is found.
figure
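The fallback described above can be sketched with a toy format and a scripted stand-in for the model. `TEMPLATE`, `fake_llm_ranked_tokens`, and the one-character tokens are all hypothetical. A real system would ask the LLM for its ranked next-token candidates and check them against a real grammar.

```python
TEMPLATE = "ddd-dddd"  # hypothetical strict format: three digits, a dash, four digits

def is_valid_prefix(text):
    """True if `text` could still grow into a string matching TEMPLATE."""
    if len(text) > len(TEMPLATE):
        return False
    return all(ch.isdigit() if slot == "d" else ch == slot
               for ch, slot in zip(text, TEMPLATE))

def fake_llm_ranked_tokens(prefix):
    """Stand-in for an LLM: candidate next tokens, most likely first."""
    script = ["5", "5", "5", " ", "0", "1", "9", "9"]  # the model "wants" a space, not a dash
    return [script[len(prefix)], "-", "0"]

def constrained_decode():
    """At each step, keep the most likely token that leaves the output valid."""
    output = ""
    while len(output) < len(TEMPLATE):
        for token in fake_llm_ranked_tokens(output):
            if is_valid_prefix(output + token):
                output += token
                break
        else:
            raise ValueError("no candidate token keeps the output valid")
    return output

print(constrained_decode())  # prints "555-0199"
```

At the fourth position the top-ranked token is a space, which the validator rejects, so the decoder takes the next candidate, a dash, and continues.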
On the left, we show the normal use of an LLM, with a user asking how to write JSON. LLMs always have some chance of producing errant outputs, which we want to minimize. On the right, we show the RAG approach. By using a search engine, we can find documents that are relevant to a query and combine them into a new prompt, giving the LLM more information and context to produce a better answer.
figure
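A rough sketch of the RAG flow in the caption follows, with word overlap standing in for a search engine. The documents, the scoring rule, and the prompt wording are assumptions. Production systems typically use a search index or vector database for the retrieval step.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Combine the retrieved documents and the query into one new prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (f"Use the following documents to answer.\n{context}\n\n"
            f"Question: {query}\nAnswer:")

# Hypothetical document collection.
docs = [
    "JSON objects are written as key and value pairs inside curly braces.",
    "JSON strings must use double quotes.",
    "The Miami Dolphins play football in Florida.",
]
prompt = build_prompt("How do I write a JSON string?", docs)
print(prompt)
```

The resulting prompt would then be sent to the LLM in place of the bare question, so the answer can draw on the retrieved text.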

Summary

  • There are four places you can intervene to change a model’s behavior: the data collection/tokenization, training the initial base model, fine-tuning the base model, and intercepting the predicted tokens. All four places are important, but fine-tuning is the most effective place for most users to make a change, because it costs less than training a base model and makes it possible to change the model’s goals.
  • Supervised Fine-Tuning (SFT) performs the normal training process on a smaller bespoke data collection and is useful for refining the model’s knowledge of a particular domain.
  • Reinforcement Learning from Human Feedback (RLHF) requires more data, but allows us to specify objectives more complex than “predict the next token”.
  • You can use existing tools like syntax checkers to detect incorrect LLM outputs in cases where the output format must be strict, such as for JSON or XML. Generation and syntax checking can be run in a loop until the output satisfies the necessary syntax constraints.
  • Retrieval-Augmented Generation (RAG) is a popular method of augmenting the input of an LLM by first finding relevant content via a search engine or database and inserting it into the prompt.
  • Coding frameworks like DSPy are beginning to emerge that separate the specific LLM, vectorization, and prompt definition from the logic of how inputs and outputs from the LLM are modified for a specific task. This allows you to build more reliable and repeatable LLM solutions that can quickly adapt to new models and methods.
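
The generate-and-check loop mentioned in the summary can be sketched with Python's standard `json` module as the syntax checker. The `fake_llm` function and its `attempt` argument are hypothetical stand-ins for a real model call, and the retry limit is an arbitrary choice.

```python
import json

def generate_until_valid(generate, prompt, max_attempts=3):
    """Call `generate` until its output parses as JSON, or give up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        text = generate(prompt, attempt)
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            last_error = err  # remember why this attempt failed, then retry
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")

def fake_llm(prompt, attempt):
    """Hypothetical stand-in for an LLM call: the first reply is malformed."""
    return '{"name": "Ada", }' if attempt == 1 else '{"name": "Ada"}'

result = generate_until_valid(fake_llm, "Return the user as JSON.")
print(result)  # prints {'name': 'Ada'}
```

Bounding the number of attempts matters in practice, since nothing guarantees that a model will ever produce valid output.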

FAQ

Why is it necessary to constrain an LLM’s behavior?
LLMs are trained to continue text, not to follow goals. Without constraints, they can go off-topic, produce unsafe or legally risky content, or fail to meet task requirements. Constraining behavior aligns outputs with intended use (e.g., a car-sales bot staying on script).

What are the four places we can constrain an LLM?
There are four intervention points: 1) curate training data before pre-training, 2) alter the base model training process, 3) fine-tune the model (e.g., SFT, RLHF), and 4) post-process or intercept outputs with code after training.

Why is fine-tuning the primary method for changing behavior?
Fine-tuning updates model parameters to add knowledge and align behavior using far less data and cost than pre-training. It is widely supported (open- and closed-source) and can be layered on top of base models to achieve instruction following and domain specificity.

How does supervised fine-tuning (SFT) work, and what is it good for?
SFT continues next-token training on high-quality, domain-specific text (manuals, transcripts, scripts). It’s excellent for injecting new knowledge and adapting to a domain but is less effective for abstract rules like “be polite” or “refuse unsafe requests.”

What are the pitfalls of fine-tuning?
Key risks include catastrophic forgetting (new training overwrites prior knowledge), data leakage or privacy exposure (the model may reproduce fine-tuning data), and the need to balance specialization with general capabilities. Fine-tuning is not purely additive.

What is RLHF and how does it constrain behavior?
RLHF uses human feedback (via a learned reward model) to score outputs and adjust the LLM toward helpful, safe, and instruction-following behavior. A second “similarity” objective keeps the model close to base behavior to prevent reward hacking and gibberish. It’s powerful but data- and compute-intensive and can be brittle on novel cases.

Why aren’t base models very usable out of the box?
Base models are optimized only for next-token prediction; they aren’t trained to be chatbots, stay on topic, or avoid harmful content. Without additional alignment (e.g., fine-tuning/RLHF), they can be unhelpful or unsafe.

How can we enforce strict output formats like JSON?
Use decoding-time constraints and validators that parse partial output and force regeneration on errors. You can also intercept tokens, apply “go/no-go” word filters, and delay responses to run checks before sending content to users.

How do data curation and base training choices affect behavior and safety?
Careful curation reduces harmful language and misinformation but must preserve enough examples to recognize and reject bad content. Tokenization choices are locked in at training. Techniques like differential privacy can mitigate training-data leakage at some performance and cost trade-off.

What is Retrieval Augmented Generation (RAG), and when should I use it?
RAG retrieves relevant documents and feeds them with the query to the LLM, improving factuality and enabling citations. It reduces hallucinations by grounding answers in sources, but its quality depends on the search/index. Tools like DSPy help build robust multi-step pipelines that combine retrieval, prompting, and validation.
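
As a small illustration of the "go/no-go" word filter and delayed response mentioned in the FAQ, the sketch below buffers a streamed reply and releases it only if no blocked word appears. The blocked list, the fallback message, and the function name are hypothetical.

```python
import re

BLOCKED = frozenset({"dolphin", "dolphins"})  # hypothetical no-go list

def release_if_allowed(token_stream, blocked=BLOCKED,
                       fallback="Sorry, I can't help with that."):
    """Buffer the whole response, then send it only if it passes the filter."""
    buffered = "".join(token_stream)  # delay: nothing is shown until checks finish
    words = set(re.findall(r"[a-z']+", buffered.lower()))
    return fallback if words & blocked else buffered

print(release_if_allowed(["The ", "Miami ", "Dolphins ", "play ", "there."]))
print(release_if_allowed(["The ", "Heat ", "play ", "there."]))
```

The first call prints the fallback message and the second prints the original reply. The trade-off is latency, since the user sees nothing until the full response has been checked.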
