1 Evaluation and Alignment for AI
This chapter introduces evaluation and alignment as complementary disciplines that determine whether AI systems work and how to make them work as intended. Evaluation measures actual behavior—accuracy, reliability, robustness—while alignment steers behavior toward goals, constraints, and human values. Because modern AI produces open-ended language rather than deterministic outputs, the chapter explains why traditional software testing and classical ML metrics are insufficient, and argues that mastering both evaluation and alignment is essential for dependable, production-ready systems. Framed for practitioners, the book grounds these ideas in seminal papers and hands-on techniques that translate directly into engineering practice.
The chapter organizes evaluation around a spectrum from verifiable to open-ended tasks. Verifiable tasks permit deterministic checks (e.g., unit tests, numeric answers), yet still pose pitfalls such as answer extraction, verifier reliability, test coverage gaps, and “right answer, wrong reasoning.” Open-ended tasks require judgments of quality using reference-based metrics (e.g., word overlap and semantic similarity) and reference-free methods (intrinsic metrics, human raters, and LLM-as-a-judge), typically combined with production A/B tests and ongoing monitoring. Hallucination is highlighted as the dominant silent failure mode; the chapter reframes it as claim verification, introduces faithfulness as a quantification target, and surveys mitigations such as detection pipelines, retrieval grounding, constitutional training, and calibrated uncertainty—each with trade-offs and limits.
On alignment, the chapter presents three interacting pillars: principled (values and ethics), policy-based (explicit rules and instruction hierarchies), and personality (tone, brand voice, and persona consistency). It then unifies evaluation and alignment in an iterative loop: define success metrics, measure, analyze errors, align, and repeat across prompts, model selection, RAG quality, fine-tuning, and production operations. Because evaluations often become reward signals (e.g., RLHF, verifiable rewards, self-improving frameworks), the chapter warns about Goodhart’s Law and advocates working backwards from clear outcomes to robust, task-relevant metrics. The result is a practical methodology for designing scalable evaluation infrastructure and alignment strategies that balance helpfulness, safety, compliance, and user experience.
A taxonomy of AI evaluation methods, distinguishing between verifiable tasks and open-ended tasks.
The AI’s answer is correct but requires parsing. The model produces a chain-of-thought (CoT) response where the final answer must be extracted from the surrounding text.
Open-ended evaluation tasks require navigating a landscape of subjective quality judgments and are broadly classified into reference-based and reference-free evaluations.
Reference-based evaluation workflow. Human experts create reference answers from a test set. The AI system generates candidate responses from the same inputs. The evaluator then compares each candidate response with its corresponding reference, producing scores.
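The comparison step in this workflow can be sketched with the simplest word-overlap metric, a token-level F1 between candidate and reference. This is a minimal illustration; production pipelines typically use ROUGE, BLEU, or embedding-based semantic similarity instead.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between a candidate response and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat on the mat", "a cat sat on a mat"), 2))  # 0.67
```

Word overlap is cheap and transparent, but it rewards surface similarity: a paraphrase with the same meaning and different words scores poorly, which is why the chapter pairs it with semantic-similarity metrics.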
Reference-free evaluation workflow. Quality signals are defined for the task. The AI generates outputs for test inputs. Metrics assess each output’s intrinsic properties without reference answers. Scores reveal quality patterns and failure modes.
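As a minimal sketch of reference-free scoring, the function below computes two intrinsic signals, output length and distinct-token ratio (a crude repetition detector), without any reference answer. The signal names are illustrative assumptions; real pipelines layer fluency, coherence, and safety scorers on top of heuristics like these.

```python
def intrinsic_signals(output: str) -> dict:
    """Reference-free quality signals for a single generated output.

    distinct_ratio near 1.0 suggests varied text; low values flag
    degenerate, repetitive generations.
    """
    tokens = output.lower().split()
    distinct = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {"length": len(tokens), "distinct_ratio": round(distinct, 2)}

print(intrinsic_signals("the model repeats the same the same phrase"))
```

Aggregating such signals over a test set is what surfaces the "quality patterns and failure modes" the workflow describes, e.g., a cluster of outputs with very low distinct-token ratios points to a repetition failure.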
Example task evaluation instructions provided to Amazon Mechanical Turk data labelers.
Training progression showing how models learn format before knowledge. Source: “Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability” by Chang et al., 2023.
Example of hallucinated authorship in an AI response.
Three pillars of alignment and how they relate to each other.
The iterative cycle of AI engineering. Define establishes what success looks like, Evaluate measures the current system against the designed evaluation metrics, Error analysis identifies failure modes, and then Align optimizes the system using different methods. The cycle then repeats.
Summary
- Evaluation measures whether a model, agent, or AI system performs the intended task accurately, reliably, and across edge cases. Alignment ensures the model’s behavior matches human goals, intentions, and values. Together, they ensure AI systems follow user intent and complete tasks safely.
- AI systems span a spectrum from verifiable to open-ended tasks.
- Verifiable tasks have objectively checkable outputs—coding and math are common examples.
- Open-ended tasks lack objective verification; examples include translation, report generation, and creative writing.
- There are two broad approaches to evaluating open-ended tasks, both aiming to approximate human judgment as closely as possible.
- Reference-based evaluations compare model outputs to expert-produced gold standards.
- Reference-free evaluations rely on intrinsic properties of generated content, sometimes guided by expert criteria.
- Evaluations may use heuristic, statistical, semantic, or LLM-as-a-Judge methods to align with human evaluators, even though humans often disagree.
- Hallucinations occur because AI systems learn structure and form before learning factual grounding—they master how to sound correct before knowing what is correct. Hallucination can be measured using faithfulness metrics.
- The three pillars of alignment are principled, policy-based, and personality-based.
- Principled alignment aligns AI with foundational human values.
- Policy-based alignment aligns AI behavior with specific rules, norms, or institutional policies.
- Personality-based alignment shapes consistent traits and communication styles in AI behavior.
- Evaluation and alignment are interdependent. Evaluation metrics become optimization targets, creating feedback loops that can lead to emergent or unintended behaviors.
- Working backward from desired outcomes distinguishes AI engineering from research and other forms of engineering: we define metrics first, then build systems to optimize them.
- There is no one-size-fits-all solution. Every evaluation and alignment technique involves trade-offs that must be carefully balanced for specific use cases.
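The faithfulness idea from the summary can be sketched as claim verification against a source context. The heuristic below (splitting on sentences and checking content-word support) is a deliberately crude, hypothetical proxy; the detection pipelines the chapter surveys use entailment models or LLM-as-a-judge for this step.

```python
import re

def faithfulness(response: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of response sentences whose words are supported by the context.

    A word-overlap proxy for faithfulness; real detectors replace the
    overlap test with an entailment model or an LLM judge.
    """
    ctx = set(re.findall(r"\w+", context.lower()))
    claims = [s.strip() for s in response.split(".") if s.strip()]
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        words = re.findall(r"\w+", claim.lower())
        if sum(w in ctx for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(claims)

context = "Marie Curie won the Nobel Prize in Physics in 1903."
response = "Marie Curie won the Nobel Prize in 1903. She was born in Berlin."
print(faithfulness(response, context))  # 0.5: the birthplace claim is unsupported
```

Scoring hallucination this way, per claim rather than per response, is what turns the vague question "did the model hallucinate?" into the measurable quantity the chapter calls faithfulness.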