Domain-Specific Small Language Models

Efficient AI for local deployment

Guglielmo Iozzia

MEAP began April 2025
Last updated December 2025
Publication in April 2026 (estimated)

ISBN 9781633436701
347 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Japanese, Russian


table of contents

PART 1: FIRST STEPS

1 Small Language Models

1.1 What are Small Language Models?

1.2 10000 feet overview

1.3 The Transformer Architecture

1.4 Evolutions of Transformers

1.5 Areas of application

1.6 The Open Source revolution

1.7 Risks and challenges with generalist LLMs

1.8 When do domain specific LLMs provide greater business value than generalist ones?

1.9 Prerequisites

1.10 References

1.11 Summary

PART 2: CORE DOMAIN-SPECIFIC LLMS

2 Tuning for a Specific Domain

2.1 Data Preparation

2.1.1 Data Preparation for BERT Fine Tuning

2.1.2 Data Preparation for GPT Fine Tuning

2.1.3 Data Preparation for RAG

2.2 Retrieval Augmented Generation

2.3 Fine tuning

2.4 LoRA

2.5 RAG or fine tuning?

2.6 Summary

3 End-to-end Transformer Fine Tuning

3.1 Data preparation

3.2 Fine tuning

3.3 Testing the fine-tuned model

3.3.1 Domain-specific evaluation

3.4 Summary

4 Running Inference

4.1 How to generate content

4.1.1 Text completion

4.1.2 Few-shot learning

4.1.3 Code generation

4.1.4 Evaluating the generated content

4.2 Inference cost calculation

4.3 Areas for improvement (cost savings and performance)

4.3.1 Get the most from your GPU

4.3.2 Batching

4.3.3 Estimating the generation time

4.3.4 Optimizing GPU usage with DeepSpeed

4.4 Summary

5 Exploring ONNX

5.1 The ONNX format

5.2 ONNX operators and types

5.3 The ONNX runtime

5.4 ONNX runtime providers

5.5 ONNX for LLMs on CPU

5.6 ONNX for LLMs on GPU

5.6.1 ONNX for GPT on GPU

5.6.2 I/O binding

5.7 Summary

6 Quantizing for Your Production Environment

6.1 Transformers precision formats

6.2 8-bit quantization

6.2.1 Hands-on 8-bit quantization

6.2.2 LLM.int8() and quantization

6.3 8-bit quantization with ONNX

6.4 4-bit quantization

6.4.1 4-bit quantization with GPTQ

6.4.2 4-bit quantization with ggml

6.5 Summary

PART 3: REAL-WORLD USE CASES

7 Generating Python Code

7.1 Transformers for programming language generation

7.2 Hands-on with Python language generation using a Transformer architecture

7.2.1 Python code generation with CodeGen

7.2.2 ONNX conversion and quantization of models not supported by Optimum

7.2.3 Model evaluation

7.2.4 Python code generation with better models

7.3 Inference (coding assistance) on commodity hardware

7.4 Summary

8 Generating Protein Structures

8.1 Application of Transformers in Chemistry

8.2 From natural language to protein structures

8.3 Antibody generation with a small language model

8.4 From CIF files to crystal structures

8.5 Summary

PART 4: ADVANCED CONCEPTS

9 Advanced Quantization Techniques

9.1 What if a domain-specific model isn’t small?

9.2 FlexGen

9.3 SmoothQuant

9.4 BitNet

9.4.1 BitNet and Python

9.5 Summary

10 Profiling Insights

10.1 Profiling ONNX ported LLMs

10.2 Transforming raw ONNX profiling data into insights

10.3 Optimization of ONNX graphs for LLMs

10.4 Summary

11 Deployment and Serving

11.1 vLLM

11.1.1 Offline serving

11.1.2 Online serving

11.2 FastAPI

11.2.1 Benchmarking various models

11.2.2 Deploy most performance model with FastAPI

11.3 MLC LLM

11.4 Deployment and inference on Android devices

11.4.1 MLC LLM Framework

11.4.2 MLLM Framework

11.4.3 HF’s Transformers

11.5 Summary

12 Running on Your Laptop

12.1 Why a personal local assistant

12.2 Running an LLM locally with Ollama

12.2.1 Importing a custom model into Ollama

12.2.2 User privacy in Ollama

12.3 Running an LLM locally with LM Studio

12.3.1 The LM Studio Python SDK

12.4 Running an LLM locally with Jan

12.4.1 The Cortex local LLM engine

12.5 Summary

13 Creating End-to-end LLM applications

13.1 Why LLMs alone aren’t enough

13.2 Combining a domain-specific small language model with RAG

13.2.1 Using a vector database

13.3 Building an Agent

13.4 Summary

14 Advanced Components for LLM Applications

14.1 Graph RAG

14.1.1 Microsoft’s OS GraphRAG capabilities

14.1.2 Evaluation metrics

14.2 RAG + Agentic AI

14.3 Long- and short-term memory management

14.4 Summary

15 Test-time Compute and Small Language Models

15.1 Test-time compute

15.2 The OptiLLM inference proxy

15.3 SLMs with embedded test-time compute

15.4 Building a reasoning domain-specific SLM

15.5 Summary

Overview

2 Tuning for a Specific Domain

This chapter provides a practical roadmap for adapting open-source foundation models to a specific domain. It begins by framing the core options for specialization—full fine-tuning, parameter-efficient techniques, and retrieval augmented generation—and emphasizes the importance of disciplined data preparation regardless of approach. The material is hands-on, showing how to structure datasets, tokenize consistently, manage sequence lengths, and set up training and evaluation pipelines, while also introducing retrieval workflows that pair external knowledge with pretrained models.

Concrete data-preparation examples span both encoder-only and decoder-only models. For a BERT-based classifier, the chapter walks through labeling, tokenization with special tokens, consistent padding and truncation, label encoding, train/validation splits, dataset classes, and data loaders, culminating in loading a sequence-classification head. For GPT-style completion, it mirrors the same flow—pairing context and target, tokenizing, padding to a uniform length, and batching—before loading a causal language model. When fine-tuning is not ideal, the chapter shows how to prepare for RAG by embedding a knowledge base, indexing it in a vector store, and retrieving semantically similar content efficiently (illustrated end to end with sentence embeddings and nearest-neighbor search), highlighting scalability to production workloads.

The fine-tuning section demonstrates full task adaptation with a DistilBERT question-answering example: selecting a dataset with question, context, and answer spans; preprocessing to map character offsets to token positions; configuring tokenization and batching; and training with standard trainer utilities before running inference. The chapter then introduces parameter-efficient fine-tuning via LoRA using a small FLAN-T5 model quantized for efficiency, configuring low-rank adapters, training only a tiny fraction of parameters, and loading adapters for inference. It closes with guidance on choosing between RAG and fine-tuning: RAG excels for up-to-date, source-grounded answers with lower training cost but added infrastructure considerations, while fine-tuning (and PEFT) delivers superior task performance and domain nuance for complex patterns at higher engineering cost. The key takeaway is to match the strategy to the use case—there is no one-size-fits-all solution.

Figure: The typical RAG workflow.

Figure: The typical fine tuning workflow.

Figure: The LoRA fine tuning process.

Summary

- BERT models require tokenization with the special tokens [CLS] and [SEP] to mark sequence boundaries for classification tasks.
- GPT models use context-target pairs: the model learns to generate the target text given the context input.
- Tokenizers convert text into numerical representations that models can process during training and inference.
- FAISS builds vector indexes for efficient similarity search across large datasets of text embeddings.
- RAG combines information retrieval with text generation to provide models with external knowledge sources.
- RAG retrieves relevant information from vector databases before generating responses to improve accuracy.
- Fine tuning adjusts all parameters of a pre-trained model using task-specific labeled datasets.
- LoRA trains only a small set of added low-rank adapter parameters while keeping the original weights frozen.
- PEFT techniques like LoRA reduce computational costs compared to full fine tuning while maintaining performance.
- RAG works better for up-to-date information and reduces hallucinations by grounding answers in traceable sources.
- Fine tuning produces better results for complex domain-specific tasks requiring specialized knowledge.
- Vector databases store embeddings and enable fast similarity search for RAG implementations.

FAQ

How do I prepare data for fine-tuning BERT on a classification task?

- Gather a labeled dataset as (text, label) pairs.
- Load the model-specific tokenizer (bert-base-uncased).
- Insert special tokens ([CLS] at start, [SEP] as separator) as required by BERT and tokenize with padding/truncation and a max_length.
- Produce tensors for input_ids and attention_mask; encode labels (e.g., with a label encoder).
- Split into train/validation (e.g., scikit-learn train_test_split).
- Build a CustomDataset returning input_ids, attention_mask, labels; wrap with DataLoader.
- Load BertForSequenceClassification with num_labels set to your label set size, then train.
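
The label-encoding step above can be sketched in plain Python. This is a minimal illustration, not the chapter's code: the helper name fit_label_encoder and the sample labels are hypothetical, and the sorted ordering is chosen to mirror how scikit-learn's LabelEncoder assigns ids.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}

# Hypothetical (text, label) dataset labels.
labels = ["sports", "finance", "sports", "health"]

label_to_id = fit_label_encoder(labels)
encoded = [label_to_id[label] for label in labels]

# num_labels is the value a sequence-classification head would be sized with.
num_labels = len(label_to_id)
```

The resulting num_labels is what you would pass when loading the classification model, so the output layer matches the label set.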

Why are special tokens, padding, and truncation important in data preparation?

- Special tokens help the model understand sequence structure: [CLS] marks the start, [SEP] separates/ends segments (especially needed by encoder-only models like BERT).
- Padding ensures all sequences have the same length in a batch, enabling efficient tensor operations.
- Truncation keeps inputs within the model’s maximum context window, controlling memory/latency.
- Tokenization choices directly affect vocabulary handling, OOV behavior, and computational cost.
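
To make the roles of special tokens, truncation, and padding concrete, here is a small sketch that uses a toy whitespace tokenizer. The vocabulary, the encode helper, and the sample sentences are assumptions made for illustration; a real BERT tokenizer uses subword units and its own ids.

```python
# Toy illustration only: whitespace "tokenizer" with a hypothetical vocabulary.
PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"

def build_vocab(texts):
    """Assign an integer id to each special token and each word seen."""
    vocab = {tok: i for i, tok in enumerate([PAD, UNK, CLS, SEP])}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep room for the two special tokens.
    words = text.lower().split()[: max_length - 2]
    tokens = [CLS] + words + [SEP]
    input_ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    attention_mask = [1] * len(input_ids)
    # Padding: fill up to max_length; the mask marks padded positions with 0.
    n_pad = max_length - len(input_ids)
    input_ids += [vocab[PAD]] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

texts = [
    "the model is small",
    "a much longer sentence that will be truncated here",
]
vocab = build_vocab(texts)
batch = [encode(text, vocab, max_length=8) for text in texts]
```

Every item in batch has the same length, which is what allows the sequences to be stacked into a single tensor.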

How do I split data and create PyTorch DataLoaders for fine-tuning?

- Use train_test_split to create training and validation sets for inputs, masks, and labels.
- Implement a torch.utils.data.Dataset that returns a dict with input_ids, attention_mask, and labels.
- Wrap datasets with DataLoader; set batch_size and shuffle (True for training, False for validation).
- Feed these loaders into your training loop or a Hugging Face Trainer.
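
The split-and-batch flow above can be sketched without any framework. The helpers train_val_split and iter_batches are hypothetical stand-ins for scikit-learn's train_test_split and a PyTorch DataLoader, written only to show the shape of the data that reaches the training loop.

```python
import random

def train_val_split(items, val_fraction=0.2, seed=42):
    """Shuffle indices once, then carve off a validation slice."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_val = int(round(len(items) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [items[i] for i in train_idx], [items[i] for i in val_idx]

def iter_batches(items, batch_size, shuffle, seed=0):
    """Yield dict batches with the three fields a classifier expects."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [items[i] for i in order[start:start + batch_size]]
        yield {
            "input_ids": [ex["input_ids"] for ex in chunk],
            "attention_mask": [ex["attention_mask"] for ex in chunk],
            "labels": [ex["labels"] for ex in chunk],
        }

# Hypothetical pre-tokenized examples.
examples = [
    {"input_ids": [2, i + 4, 3], "attention_mask": [1, 1, 1], "labels": i % 2}
    for i in range(10)
]

train, val = train_val_split(examples)
train_batches = list(iter_batches(train, batch_size=4, shuffle=True))
val_batches = list(iter_batches(val, batch_size=4, shuffle=False))
```

Shuffling is applied to the training batches only, so validation results stay comparable from one run to the next.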

What changes when preparing data for GPT-2 text completion versus BERT classification?

- Task format: GPT-2 learns to predict a target completion given a context; pair each (context, target).
- Concatenation: Join context and target with clear separators/special tokens consistent with the tokenizer.
- Tokenization: Use GPT2Tokenizer; encode the combined sequence and pad to a uniform length.
- Dataset items: Typically only input_ids (causal language modeling infers labels by shifting).
- Model: Load GPT2LMHeadModel rather than a sequence classification head.
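
The remark that causal language modeling infers labels by shifting can be illustrated with plain lists. The token ids and the separator id below are made up for the example; in practice the shift usually happens inside the model when labels are set equal to input_ids.

```python
SEP_ID = 99  # hypothetical separator token id

def build_sequence(context_ids, target_ids, sep_id=SEP_ID):
    """Join context and target into one training sequence."""
    return context_ids + [sep_id] + target_ids

def shift_for_causal_lm(input_ids):
    """Position t is trained to predict the token at position t + 1."""
    return input_ids[:-1], input_ids[1:]

context_ids = [11, 12, 13]   # made-up ids for the context
target_ids = [21, 22]        # made-up ids for the target completion

sequence = build_sequence(context_ids, target_ids)
inputs, labels = shift_for_causal_lm(sequence)
```

Because the labels are derived from the inputs, a dataset item only needs to carry input_ids (plus an attention mask when padding is used).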

What is Retrieval Augmented Generation (RAG) and what are its phases?

- RAG augments an LLM’s prompt with retrieved, relevant external context to improve accuracy and freshness.
- Two phases:
  1) Retrieval: Fetch relevant chunks/embeddings from private/public sources (often via a vector DB).
  2) Generation: Provide the retrieved context to the LLM to generate answers, often with source attribution.
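
A rough sketch of the two phases follows. It is an assumption-laden toy: retrieval is done by word overlap instead of embeddings, the chunk texts and source ids are invented, and the generation phase stops at building the prompt that would be sent to an LLM.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def retrieve(query, chunks, k=2):
    """Phase 1: rank chunks by word overlap (stand-in for vector similarity)."""
    q = tokenize(query)
    ranked = sorted(
        chunks, key=lambda c: len(q & tokenize(c["text"])), reverse=True
    )
    return ranked[:k]

def build_prompt(query, retrieved):
    """Phase 2 input: retrieved context plus the question, tagged by source."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below and cite sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical knowledge-base chunks.
chunks = [
    {"source": "doc-1", "text": "LoRA trains small low-rank adapter matrices."},
    {"source": "doc-2", "text": "FAISS builds indexes for similarity search."},
    {"source": "doc-3", "text": "Padding makes sequences the same length."},
]

query = "What does LoRA train?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
```

Keeping the source id next to each chunk in the prompt is one simple way to let the model attribute its answer.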

How do I build a simple similarity search pipeline with FAISS?

- Prepare a small corpus with labels (e.g., texts about superheroes).
- Encode texts into embeddings using sentence-transformers (e.g., paraphrase-mpnet-base-v2).
- Create an IndexFlatL2 in FAISS; L2-normalize and add vectors to the index.
- Encode the query text, normalize it, then call index.search with k neighbors.
- Join results (distances and indices) with the original data to inspect nearest matches.
- FAISS scales to million-to-billion vector collections and supports GPU via CUDA.
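
The logic of a flat L2 index can be sketched in pure Python. This is not FAISS: the search function below is a hypothetical exhaustive scan that returns squared L2 distances and indices, closest first, and the three-dimensional vectors stand in for real sentence embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(index_vectors, query, k):
    """Exhaustive nearest-neighbor search: (distances, indices), closest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(index_vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Made-up embeddings for a four-document corpus.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
index_vectors = [l2_normalize(v) for v in embeddings]

query = l2_normalize([2.0, 0.1, 0.0])
distances, indices = search(index_vectors, query, k=2)
```

On unit-length vectors the squared L2 distance equals 2 - 2 * cosine similarity, so normalizing before indexing makes an L2 index rank results in the same order as cosine similarity.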

How does full fine-tuning differ from PEFT, and what does LoRA change?

- Full fine-tuning updates all model weights on a task-specific dataset; it’s accurate but compute/data heavy.
- PEFT trains only a small number of parameters and keeps most base weights frozen; it’s faster and cheaper.
- LoRA (Low-Rank Adaptation) injects low-rank adapters (e.g., into attention q/v layers), keeping base weights frozen and training only adapter params—often under 1% of total—while preserving quality.
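
The low-rank idea can be made concrete with a few lines of arithmetic. The sketch below follows the standard LoRA formulation, W_eff = W + (alpha / r) * B A, with tiny hand-built matrices; the helper names and the matrix sizes are assumptions for illustration, not the chapter's code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """W_eff = W + (alpha / r) * B @ A; W stays frozen, only A and B train."""
    delta = matmul(B, A)  # (d x r) @ (r x k) -> d x k
    scale = alpha / r
    return [
        [w + scale * dw for w, dw in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

def lora_param_fraction(d, k, r):
    """Adapter parameters, r * (d + k), relative to the d * k frozen weights."""
    return r * (d + k) / (d * k)

d, k, r, alpha = 4, 3, 2, 4
W = [[float(i * k + j) for j in range(k)] for i in range(d)]  # frozen weights
A = [[0.1] * k for _ in range(r)]                             # r x k, trainable
B_zero = [[0.0] * r for _ in range(d)]                        # d x r, zero init
B_trained = [[1.0, 0.0] for _ in range(d)]                    # made-up values

W_start = lora_effective_weight(W, A, B_zero, alpha, r)
W_adapted = lora_effective_weight(W, A, B_trained, alpha, r)
fraction = lora_param_fraction(4096, 4096, 8)
```

With B initialized to zero the adapted layer starts out identical to the base layer, and for a 4096 x 4096 matrix at rank 8 the adapter holds about 0.39% as many parameters as the matrix it modifies.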

What are the key steps to fine-tune DistilBERT for question answering?

- Load a QA dataset containing question, context, and answer spans.
- Tokenize questions and contexts with truncation (truncation="only_second", so that only the context is truncated) and return offset mappings (return_offsets_mapping=True).
- Map character-level answer spans to token start/end positions using offsets; handle cases where the answer falls outside the truncated window.
- Remove unused columns and create tokenized train/test datasets.
- Load AutoModelForQuestionAnswering, set up a DefaultDataCollator, and define TrainingArguments.
- Instantiate Trainer and call train().
- For inference, compute start/end argmax over logits and decode the span.
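
The offset-mapping and span-decoding steps can be sketched with a toy whitespace tokenizer. Everything here is illustrative: the helper names, the sample context, and the logits are made up, and returning (0, 0) when the answer is not fully inside the window is one common convention, assumed here.

```python
def whitespace_offsets(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character-level answer span to inclusive token positions."""
    # Assumed convention: (0, 0) if the answer is not fully inside the window.
    if not offsets or start_char < offsets[0][0] or end_char > offsets[-1][1]:
        return 0, 0
    start_tok = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    end_tok = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_tok, end_tok

def best_span(start_logits, end_logits):
    """Inference: pick the argmax start and end positions."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    return start, end

context = "LoRA keeps the base weights frozen"
answer = "base weights"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
token_span = char_span_to_token_span(offsets, start_char, end_char)
truncated_span = char_span_to_token_span(offsets[:3], start_char, end_char)

# Made-up logits standing in for model output.
start, end = best_span([0.1, 0.2, 0.1, 2.5, 0.3, 0.0], [0.0, 0.1, 0.2, 0.4, 3.1, 0.2])
decoded = " ".join(context.split()[start:end + 1])
```

The same offsets that turn character spans into training labels are what let you turn predicted token positions back into text at inference time.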

What LoRA configuration and training flow does the chapter demonstrate?

- Model: FLAN-T5-small loaded in 8-bit for efficiency.
- Data: samsum (dialogue → summary). Preprocess with max lengths, tokenize inputs with “summarize: ” prefix, mask label padding with -100.
- LoRA config: r (rank), lora_alpha, lora_dropout, target_modules (e.g., “q”, “v”), task_type set to seq2seq.
- Prepare model for k-bit training; wrap with get_peft_model to add adapters.
- Use DataCollatorForSeq2Seq and Seq2SeqTrainer with modest epochs (e.g., 3).
- Save only adapter weights; at inference, load base model and attach PEFT adapters.
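
Two of the preprocessing details above, the task prefix and masking label padding with -100, can be shown in isolation. The token ids, the pad id, and the helper names are assumptions for the example; -100 is the target value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default
PAD_ID = 0           # assumed pad token id for this illustration

def add_task_prefix(dialogue, prefix="summarize: "):
    """Prepend the task instruction to the model input text."""
    return prefix + dialogue

def pad_to(ids, max_length, pad_id=PAD_ID):
    """Pad or truncate a list of ids to exactly max_length."""
    return (ids + [pad_id] * max_length)[:max_length]

def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace pad ids in the labels so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]

# Hypothetical dialogue and made-up summary token ids.
model_input = add_task_prefix("A: hi\nB: hello")
summary_ids = [37, 512, 9, 1]

padded = pad_to(summary_ids, max_length=6)
labels = mask_label_padding(padded)
```

Without the masking step the model would also be trained to predict padding tokens, which adds noise to the loss.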

How should I choose between RAG and fine-tuning (including PEFT) for a domain-specific task?

- Prefer RAG when:
  - You need up-to-date, traceable answers with source attribution.
  - You have limited labeling budget and want to avoid training costs.
  - The base model already performs reasonably; extra context can close the gap.
- Prefer fine-tuning/PEFT when:
  - The task requires learning domain-specific patterns/jargon or complex reasoning.
  - You need higher accuracy on in-domain generation beyond what context can provide.
- Consider costs: RAG requires a vector DB and pipelines; fine-tuning needs compute and ML ops.
- Hallucinations: fine-tuning/PEFT mitigates domain-specific ones; RAG helps with general knowledge grounding.
- No one-size-fits-all: evaluate per use case.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $31.19

you save $16.80 (35%)
