Overview

1 Evaluations and alignments for AI

This chapter introduces evaluation and alignment as complementary disciplines that determine whether AI systems work and how to make them work as intended. Evaluation measures actual behavior—accuracy, reliability, robustness—while alignment steers behavior toward goals, constraints, and human values. Because modern AI produces open-ended language rather than deterministic outputs, the chapter explains why traditional software testing and classical ML metrics are insufficient, and argues that mastering both evaluation and alignment is essential for dependable, production-ready systems. Framed for practitioners, the book grounds these ideas in seminal papers and hands-on techniques that translate directly into engineering practice.

The chapter organizes evaluation around a spectrum from verifiable to open-ended tasks. Verifiable tasks permit deterministic checks (e.g., unit tests, numeric answers), yet still pose pitfalls such as answer extraction, verifier reliability, test coverage gaps, and “right answer, wrong reasoning.” Open-ended tasks require judgments of quality using reference-based metrics (e.g., word overlap and semantic similarity) and reference-free methods (intrinsic metrics, human raters, and LLM-as-a-judge), typically combined with production A/B tests and ongoing monitoring. Hallucination is highlighted as the dominant silent failure mode; the chapter reframes it as claim verification, introduces faithfulness as a quantification target, and surveys mitigations such as detection pipelines, retrieval grounding, constitutional training, and calibrated uncertainty—each with trade-offs and limits.

On alignment, the chapter presents three interacting pillars: principled (values and ethics), policy-based (explicit rules and instruction hierarchies), and personality (tone, brand voice, and persona consistency). It then unifies evaluation and alignment in an iterative loop: define success metrics, measure, analyze errors, align, and repeat across prompts, model selection, RAG quality, fine-tuning, and production operations. Because evaluations often become reward signals (e.g., RLHF, verifiable rewards, self-improving frameworks), the chapter warns about Goodhart’s Law and advocates working backwards from clear outcomes to robust, task-relevant metrics. The result is a practical methodology for designing scalable evaluation infrastructure and alignment strategies that balance helpfulness, safety, compliance, and user experience.

A taxonomy of AI evaluation methods, distinguishing between verifiable tasks and open-ended tasks.
AI’s answer is correct but requires parsing. The model produces a chain-of-thought (COT) response where the final answer must be extracted from the surrounding text.
Open-ended evaluation tasks require navigating a landscape of subjective quality judgements and are broadly classified into Reference-based and reference-free evaluations.
Reference-based evaluation workflow. Human experts create reference answers from a test set. The AI system generates candidate responses from the same inputs. The evaluator then compares each candidate's response with the corresponding reference response, producing scores.
Reference-free evaluation workflow. Quality signals are defined for the task. The AI generates outputs for test inputs. Metrics assess each output’s intrinsic properties without reference answers. Scores reveal quality patterns and failure modes.
Example task evaluation instructions provided to Amazon Mechanical Turk data labeler.
Training progression showing how models learn format before knowledge. Source: “Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability” by Chang et al, 2023.
Example of hallucinated authorship in an AI response.
Three pillars of alignment and how they relate to each other.
The iterative cycle of AI engineering. Define establishes what success looks like, Evaluate measures the current system against the designed evaluation metrics, Error analysis identifies failure modes, and then Align optimizes the system using different methods. The cycle then repeats.

Summary

  • Evaluation measures whether a model, agent, or AI system performs the intended task accurately, reliably, and across edge cases. Alignment ensures the model’s behavior matches human goals, intentions, and values. Together, they ensure AI systems follow user intent and complete tasks safely.
  • AI systems span a spectrum from verifiable to open-ended tasks.
    • Verifiable tasks have objectively checkable outputs—coding and math are common examples.
    • Open-ended tasks lack objective verification; examples include translation, report generation, and creative writing.
  • There are two broad approaches to evaluating open-ended tasks, both aiming to approximate human judgment as closely as possible.
    • Reference-based evaluations compare model outputs to expert-produced gold standards.
    • Reference-free evaluations rely on intrinsic properties of generated content, sometimes guided by expert criteria.
  • Evaluations may use heuristic, statistical, semantic, or LLM-as-a-Judge methods to align with human evaluators, even though humans often disagree.
  • Hallucinations occur because AI systems learn structure and form before learning factual grounding—they master how to sound correct before knowing what is correct. Hallucination can be measured using faithfulness metrics.
  • The three pillars of alignment are principled, policy-based, and personality-based.
    • Principled alignment aligns AI with foundational human values.
    • Policy-based alignment aligns AI behavior with specific rules, norms, or institutional policies.
    • Personality-based alignment shapes consistent traits and communication styles in AI behavior.
  • Evaluation and alignment are interdependent. Evaluation metrics become optimization targets, creating feedback loops that can lead to emergent or unintended behaviors.
  • Working backward from desired outcomes distinguishes AI engineering from research and other forms of engineering — we define metrics first, then build systems to optimize them.
  • There is no one-size-fits-all solution. Every evaluation and alignment technique involves trade-offs that must be carefully balanced for specific use cases.

FAQ

What’s the difference between evaluation and alignment in AI systems?Evaluation measures what a model or system actually does—its accuracy, reliability, and behavior across edge cases. Alignment ensures it behaves as intended—following instructions, respecting constraints, and reflecting human values. Applied together across models, agents, and full stacks, they form a feedback loop: measure behavior, then apply corrections to make systems production-ready.
How do verifiable and open-ended tasks differ, and why does that shape evaluation?Verifiable tasks are convergent: they have objectively correct answers that can be checked programmatically with clear criteria (e.g., unit-tested code, math). Open-ended tasks are divergent: many responses can be valid and quality is partly subjective (e.g., summaries, creative writing). For open-ended work, choose between reference-based (compare to gold answers) and reference-free (judge intrinsic quality) strategies.
What are the main approaches to evaluating open-ended tasks?Three complementary approaches are used: reference-based metrics (e.g., BLEU, ROUGE, BERTScore/COMET) that compare to human references; reference-free intrinsic metrics (readability, coherence, task-specific checks); and human or LLM-as-a-Judge evaluations that approximate human judgments at scale. In practice, teams blend these to balance cost, speed, and judgment quality.
What challenges arise even for “verifiable” evaluations?Common pitfalls include answer extraction from free-form outputs (mitigate with structured outputs/JSON or constrained decoding), verifier reliability (AI verifiers can be gamed—ensure the verification step is more deterministic than the task), coverage gaps in test suites, correct answers via wrong reasoning, and multiple valid solutions where quality dimensions (readability, efficiency) aren’t captured by binary checks.
How can we measure and track hallucinations?View outputs as sets of claims, extract them, verify each against evidence, then compute faithfulness: supported claims divided by total claims. Challenges include ambiguous or contextual statements, inference chains, half-truths, and “scheming” where incorrect reasoning yields a correct conclusion. Faithfulness rises as unsupported claims fall and helps quantify progress.
Why do LLMs hallucinate in the first place?During training, models learn the format and tone of language before fully learning tasks and facts. They assemble plausible patterns even when evidence is missing, and they don’t “know what they don’t know.” This can yield confident, fluent fabrications with real-world costs, so systems must detect, mitigate, and communicate uncertainty.
What are the three pillars of alignment?Principled alignment encodes core values (helpful, harmless, fair, honest). Policy-based alignment enforces concrete rules and instruction hierarchies (e.g., safety over user overrides, tool outputs don’t supersede system goals). Personality alignment shapes tone and style for consistent user experience. These pillars interact and can conflict, requiring careful trade-off design.
How do practitioners combine evaluation methods in real products?Use a composite strategy: automated metrics for fast iteration, sampled human reviews for calibration, LLM-as-a-Judge to scale between human cycles, A/B tests for real user preference, and error analysis to uncover failure modes. Monitor online after launch; strong offline metrics don’t always predict user satisfaction.
What is the iterative cycle connecting evaluation and alignment?Define success criteria, evaluate against them, analyze errors, align the system (via prompts, retrieval, policies, training), then repeat. Apply this loop to model selection, prompt optimization, RAG, fine-tuning, and production monitoring. When evaluations become rewards (e.g., RLHF/RLVR), beware Goodhart’s Law—flawed metrics yield misaligned behavior.
What tools and setup do I need to follow the book’s examples?Use Python 3.10+ with standard scientific tooling (NumPy, pandas) and a package manager. For LLM-as-a-Judge, obtain API access (e.g., OpenAI, Anthropic, Gemini). Most lexical metrics run on any laptop; semantic metrics or local models benefit from hardware acceleration (Apple M-series or modest GPUs). Fine-tuning exercises can use a T4-class GPU via Colab or cloud.

pro $24.99 per month

  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose one free eBook per month to keep
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime

lite $19.99 per month

  • access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more


choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Evaluation and Alignment, The Seminal Papers ebook for free
choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Evaluation and Alignment, The Seminal Papers ebook for free