Overview

1 Small Language Models

This chapter cuts through the hype surrounding large language models to clarify where small language models fit and why they matter. SLMs are compact, Transformer-based models—typically up to a few billion parameters—engineered for efficiency in memory, speed, and energy. Because they can run locally on CPUs, consumer GPUs, mobile, and edge devices, they keep data on-premises and enable offline, near–real-time use. Built on the same architectural principles as larger models, they trade raw scale for deployability and privacy, and they are especially attractive because they can be specialized to domains at relatively low cost. The chapter also highlights the growing view that SLMs are well suited to agentic AI, often as part of heterogeneous systems that combine multiple models.
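To make the deployability claim concrete, here is a back-of-the-envelope sketch of the memory needed just to hold a model's weights at different precisions. The 3-billion-parameter figure is illustrative, and real deployments also need memory for activations and the KV cache:

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to store the weights."""
    return num_params * bits_per_param / 8 / 1e9

# A hypothetical 3-billion-parameter SLM at common precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_memory_gb(3e9, bits):.1f} GB")
```

At 16-bit precision such a model needs roughly 6 GB, which already fits a consumer GPU; 4-bit quantization brings it down to about 1.5 GB, within reach of many mobile and edge devices.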

After introducing SLMs, the chapter provides a concise tour of the foundations behind modern language models. It revisits the Transformer breakthrough—self-attention, parallel processing, and embeddings—that enabled large-scale self-supervised training and unlocked broad generalization. It outlines key variants (encoder-only for understanding tasks and decoder-only for generation) and describes how techniques like reinforcement learning from human feedback refine model behavior. With these advances, models now handle far more than translation: comprehension, classification, summarization, question answering, code generation, basic math, and multi-step reasoning are all within scope.
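To illustrate the parallelism that self-attention enables, the following is a minimal single-head sketch in NumPy. For clarity it uses the token embeddings directly as queries, keys, and values; real Transformers first apply learned projections (W_Q, W_K, W_V) and use multiple heads:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Minimal single-head self-attention: every token attends to all
    tokens in one matrix product rather than word by word."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # weighted mix of all tokens

# Four "token" embeddings of dimension 8, processed in a single pass
X = np.random.default_rng(0).normal(size=(4, 8))
out = self_attention(X)
print(out.shape)  # (4, 8): one updated vector per token
```

The key point is that the whole sequence is transformed by a few matrix multiplications, which is what makes Transformer training so amenable to parallel hardware.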

The chapter then turns to practical considerations: the rapid rise of open-source models offers credible alternatives to proprietary systems and can dramatically lower costs by starting from pretrained checkpoints rather than training from scratch. It weighs the risks of closed, generalist LLMs—external data handling, leakage, opacity, bias, hallucinations, and fragile guardrails—against the advantages of private, domain-specific models tailored via transfer learning for regulated or high-stakes settings. The case is made that small, specialized models can deliver better accuracy, privacy, sustainability, and cost-efficiency, especially when optimized and quantized to run on constrained infrastructure. Finally, the chapter sets expectations for the rest of the book: hands-on techniques for optimizing and serving customized SLMs, integrating patterns like RAG and agentic workflows, and the basic skills readers should bring to make the most of these methods.
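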

Figures

  • Some examples of diverse content an LLM can generate.
  • The timeline of LLMs since 2019 (image taken from paper [3]).
  • Order of magnitude of costs for each phase of LLM implementation from scratch.
  • Order of magnitude of costs for each phase of LLM implementation when starting from a pretrained model.
  • Ratios of data source types used to train some popular existing LLMs.
  • Generic model specialization to a given domain.
  • An LLM trained for tasks on molecule structures (generation and captioning).

Summary

  • SLMs are compact, Transformer-based language models, typically under ten billion parameters, designed for efficient local deployment.
  • Transformers use self-attention mechanisms to process entire text sequences at once instead of word by word.
  • Self-supervised learning creates training labels automatically from text data without human annotation.
  • BERT models use only the encoder part of Transformers for classification and prediction tasks.
  • GPT models use only the decoder part of Transformers for text generation tasks.
  • Word embeddings convert words into numerical vectors that capture semantic relationships.
  • RLHF uses reinforcement learning to improve LLM responses based on human feedback.
  • LLMs can generate any symbolic content including code, math expressions, and structured data.
  • Open source LLMs reduce development costs by providing pre-trained models as starting points.
  • Transfer learning adapts pre-trained models to specific domains using domain-specific data.
  • Generalist LLMs risk data leakage when deployed outside organizational networks.
  • Closed source models lack transparency about training data and model architecture.
  • Domain-specific LLMs provide better accuracy for specialized tasks than generalist models.
  • Smaller specialized models require less computational power than large generalist models.
  • Fine-tuning costs significantly less than training models from scratch.
  • Regulatory compliance often requires domain-specific models with known training data.
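As a toy illustration of the embedding bullet above, cosine similarity over hand-made word vectors. The three-dimensional vectors are invented purely for this example; real embeddings are learned during training and have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings, invented for illustration only:
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}
print(cosine(emb["king"], emb["queen"]))  # higher: related meanings
print(cosine(emb["king"], emb["apple"]))  # lower: unrelated meanings
```

In a trained model, geometric closeness of vectors like these is what lets the network treat semantically related words similarly.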

FAQ

What is a Small Language Model (SLM)?
SLMs are Transformer-based language models designed to handle NLP tasks like their larger counterparts but with far fewer parameters (typically from a few hundred million to under 10 billion). They have smaller memory footprints and lower computational requirements, making them suitable for mobile, edge, and on-prem deployments.

How do SLMs differ from LLMs beyond size?
Both use the same Transformer fundamentals; the main difference is scale. SLMs prioritize efficiency, speed, and energy use, can run well on CPUs or consumer GPUs, and keep data local (an advantage for privacy-sensitive or offline use cases), while LLMs focus on raw capability at much higher resource cost.

Why did Transformers change the game for NLP?
Transformers introduced self-attention and removed recurrence, enabling parallel processing of entire sequences and significantly faster training. Combined with word embeddings and large-scale self-supervised training, they capture rich syntax and semantics, unlocking diverse capabilities from text generation to code completion.

What training paradigm do LLMs/SLMs typically follow?
They are primarily trained via self-supervised learning (e.g., predicting the next token in text) using vast unlabeled corpora. Many modern systems are later refined with Reinforcement Learning from Human Feedback (RLHF) to optimize behavior for helpfulness, truthfulness, and safety.
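A minimal sketch of how self-supervised next-token targets are derived from raw text, with no human labeling involved. Whitespace splitting stands in for a real subword tokenizer, which would produce integer token IDs instead:

```python
# Self-supervised next-token objective: inputs and labels come from the text itself.
text = "small models run locally"
tokens = text.split()  # a real tokenizer would emit subword IDs

# Each training pair is (context so far, next token to predict):
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['small'] -> models
# ['small', 'models'] -> run
# ['small', 'models', 'run'] -> locally
```

Because every position in every document yields a training example this way, enormous unlabeled corpora become usable training data.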
What are the main Transformer variants, and when should I use each?
Encoder-only models (e.g., BERT) excel at understanding tasks like classification and prediction. Decoder-only models (e.g., GPT) are typically best for generative tasks. Encoder-decoder hybrids are used where both encoding and generation are central; the choice depends on your target task.

What kinds of tasks and content can these models handle?
They can perform language understanding, text classification and generation, question answering, summarization, semantic parsing, pattern recognition, basic math, code generation, dialogue, general-knowledge recall, and logical chains. Beyond natural language, they can generate or interpret other symbolic text formats (e.g., code or domain-specific notations).

What are the key risks of using closed-source generalist LLMs?
Risks include data leaving your network, potential data leakage, lack of transparency and reproducibility, unknown training data (bias and IP concerns), hallucinations (especially extrinsic ones, where sources are unverifiable), and the possibility of unsafe code generation if guardrails are bypassed.

When do domain-specific models provide greater business value?
They shine in regulated and high-stakes domains requiring specialized knowledge, verifiable sources, and strict privacy. By applying transfer learning to a pretrained model using domain data (including private data), you can boost accuracy and compliance, keep data on-prem, and reduce environmental impact thanks to smaller model sizes.

How does the open-source ecosystem change costs and feasibility?
Starting from a pretrained open-source model dramatically reduces development and training costs compared to building from scratch, though data collection, preparation, and fine-tuning still require investment. Deployment and inference challenges remain, but optimization and quantization let you serve models cost-effectively on constrained hardware; always review licenses for intended use.
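As a rough sketch of why quantization shrinks serving costs, here is symmetric linear int8 quantization of a weight array. This is a deliberately simplified scheme; production methods (per-channel scales, 4-bit formats, calibration) are more involved:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric linear quantization: int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes, "bytes instead of", w.nbytes)  # 4x smaller than float32
print(np.abs(dequantize(q, scale) - w).max())  # small reconstruction error
```

The storage drops from 4 bytes to 1 byte per weight while the rounding error stays bounded by half the scale, which is why quantized SLMs remain accurate enough for many tasks on constrained hardware.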
Why are SLMs considered well suited to agentic AI?
A 2025 NVIDIA paper argues that SLMs are sufficiently capable, more naturally suited, and more economical for many agentic invocations. It also suggests that heterogeneous agentic systems, in which agents invoke both SLMs and LLMs, are a practical approach when general conversational breadth and specialized efficiency are both needed.
