7 Generating Python Code
This chapter introduces practical ways to build and run domain-specific small language models for Python code generation. It contrasts closed, commercial assistants with open-source options and shows why specialized, compact models can rival much larger generalist systems on coding tasks. Benchmarks such as SWE-bench highlight that even strong models still require expert oversight, and the text argues that programmers remain central: coding assistants amplify skilled engineers rather than replace them. Specialization trims irrelevant knowledge, reduces cost and risk, and makes on-device deployment feasible.
A hands-on workflow starts with Salesforce’s CodeGen (350M mono) as a baseline, demonstrating prompts from function headers to natural language and comments. The chapter then optimizes inference by exporting to ONNX with Optimum and applying dynamic quantization, yielding sizable latency reductions, higher throughput, and more consistent runtimes; when a model isn’t supported, it shows a mid-level path using the Transformers ONNX exporter and ONNX Runtime quantization. For quality, it adopts a robust evaluation pipeline inspired by ReCode on HumanEval: tokenize and generate, validate syntax via Python’s AST, execute provided unit tests for correctness, and aggregate pass rates—while emphasizing safe execution practices.
Next, the chapter explores stronger and still-efficient models: CodeGen 2.5 (7B) with infill and FlashAttention for faster sampling, and BigCode’s StarCoder2 (3B) with Grouped Query Attention and long context windows. It demonstrates lightweight deployment through BitsAndBytes (8/4-bit) and GGUF formats, and shows how to run a 4-bit StarCoder2 locally on a MacBook Air (M1) using llama.cpp/llama-cpp-python, achieving usable, sub–half-second CPU-only latencies. It also covers serving via an OpenAI-compatible endpoint for IDE integration and converting HF checkpoints to GGUF when needed. The overarching message: with careful optimization, quantization, and disciplined evaluation, capable Python coding assistants can run efficiently on commodity hardware.
Figures
The SWE-bench leaderboard.
Sizes of the most popular code-assistant LLMs at the time CodeGen was released.
Average inference time on CPU for the vanilla CodeGen model and its ONNX-converted version.
Latency and throughput benchmarks for the CodeGen 350M mono vanilla model and its ONNX-converted version.
Duration of 100 inference runs for the CodeGen 350M mono vanilla model and its ONNX-converted version.
Average inference time on CPU for the ONNX CodeGen model and its quantized version.
Latency and throughput benchmarks for the CodeGen 350M mono ONNX model and its quantized version.
Summary
- Specialized open source language models for code generation require smaller parameter counts than generalist models while achieving comparable performance.
- CodeGen models come in a multi family that supports several programming languages and a mono family specialized for Python-only code generation.
- ONNX conversion of code generation models can reduce average inference time by approximately 19% compared to vanilla PyTorch implementations.
- 8-bit quantization of ONNX models can provide an additional ~60% reduction in inference time while maintaining code quality.
- Models not supported by Optimum require mid-level conversion using the transformers.onnx package with manual configuration.
- Abstract Syntax Tree parsing validates generated Python code for syntactic correctness before execution.
- HumanEval dataset provides 164 Python programming problems for evaluating code generation model performance.
- Code generation models support various prompt formats including function signatures, natural language comments, and multiline docstrings.
- StarCoder2 uses Grouped Query Attention to achieve faster inference times with reduced computational overhead.
- llama-cpp-python provides Python bindings for running GGUF models locally without GPU requirements.
- Code infilling requires specific token formatting with mask and separator tokens for proper context understanding.
- 4-bit quantization reduces model size by approximately 75% while maintaining acceptable code generation quality.
- Local deployment of code generation models enables offline coding assistance without external API dependencies.
- Metal Performance Shaders backend enables optimized inference on Apple Silicon processors.
- Code generation evaluation requires both syntactic validation and functional correctness testing against unit tests.
- ReCode evaluation framework provides robustness metrics for code generation under different prompt perturbations.
FAQ
What kinds of models does the chapter focus on, and why is Python the target language?
The chapter focuses on open-source, domain-specific language models specialized for coding assistance and code generation. It targets Python because most readers are familiar with it, which makes evaluating model output easier and more meaningful.
Will coding LLMs replace software engineers?
No. Benchmarks like SWE-bench show that even leading models still need expert supervision. Many software engineering tasks (requirements understanding, design, coordination across files and systems, and iterative polishing) are human-centric and go beyond token prediction. LLMs can make good engineers more productive but do not replace core engineering expertise.
What is CodeGen, and which versions and sizes are available?
CodeGen is Salesforce's Transformer-based family of code generation models, released in two training variants: "mono" (Python-only) and "multi" (several languages). First-generation model sizes include 350M, 2.7B, 6.1B, and 16.1B parameters. Later releases include CodeGen 2.0 and CodeGen 2.5 (7B, mono and multi). CodeGen can complete functions from headers, respond to natural-language prompts (larger models), generate unit tests, and more.
How do I speed up CodeGen inference using ONNX and Optimum?
- Export to ONNX with Optimum's ORTModelForCausalLM and run on CPU or other execution providers.
- Apply dynamic quantization with Optimum's ORTQuantizer (a sketch of this flow follows the list).
- Observed benefits in the chapter’s CPU setup: average latency dropped from ~1186 ms (PyTorch) to ~958 ms (ONNX), and further improved with 8-bit quantization (about 60% average latency reduction vs. ONNX fp32). Model size also shrank from ~1.33 GB to ~346 MB after 8-bit quantization.
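A minimal sketch of this export-and-quantize flow, assuming the Salesforce/codegen-350M-mono checkpoint and the documented Optimum/ONNX Runtime APIs; the AVX2 quantization preset and output directories are illustrative, not necessarily the chapter's exact configuration:

```python
# Sketch: export CodeGen 350M mono to ONNX with Optimum, then apply dynamic
# 8-bit quantization. Paths and the AVX2 preset are illustrative assumptions.
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly.
ort_model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained("codegen-350m-onnx")

# Dynamic (weight-only) INT8 quantization via Optimum's ORTQuantizer.
quantizer = ORTQuantizer.from_pretrained(ort_model)
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="codegen-350m-onnx-int8", quantization_config=qconfig)

# The exported model keeps the familiar generate() interface.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```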
What if my model architecture isn’t supported by Optimum?
Use a mid-level route: export to ONNX with the Transformers ONNX CLI (feature "causal-lm"), then quantize with onnxruntime's quantize_dynamic (e.g., QInt8). This path was useful before Optimum added support for some architectures (like early CodeGen generations); a sketch follows.
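A rough sketch of that mid-level path, assuming the older transformers.onnx exporter (since superseded by Optimum in recent Transformers releases) and ONNX Runtime's quantize_dynamic; file names are illustrative:

```python
# Step 1 (shell): export the checkpoint with the Transformers ONNX CLI, e.g.
#   python -m transformers.onnx --model=Salesforce/codegen-350M-mono --feature=causal-lm onnx_out/
# Step 2: quantize the exported graph directly with ONNX Runtime.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="onnx_out/model.onnx",        # file produced by the exporter
    model_output="onnx_out/model-int8.onnx",  # dynamically quantized copy
    weight_type=QuantType.QInt8,              # 8-bit integer weights
)
```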
How can I evaluate the quality of generated Python code?
- Use HumanEval (164 Python tasks with prompts and tests).
- Generate code with a consistent config (e.g., max length, caching); a sketch of the full pipeline follows this list.
- Validate syntax by parsing to an AST (ast.parse); discard invalid code.
- Run the provided unit tests to check correctness (with caution: executing arbitrary code has risks).
- The chapter draws on Amazon Science’s ReCode approach to assess robustness under prompt perturbations.
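A compact sketch of the pipeline described above, assuming the 350M CodeGen baseline and the openai_humaneval dataset on the Hugging Face Hub; the exec-based test runner is deliberately simplified and should be sandboxed in practice:

```python
# Generate -> validate syntax with the AST -> run HumanEval's unit tests -> aggregate pass rate.
import ast
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"  # assumed baseline model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

problems = load_dataset("openai_humaneval", split="test")  # 164 tasks

passed = 0
for task in problems:
    inputs = tokenizer(task["prompt"], return_tensors="pt")
    out = model.generate(**inputs, max_length=256, use_cache=True)
    completion = tokenizer.decode(out[0], skip_special_tokens=True)

    # 1) Syntactic validation: discard completions that do not parse.
    try:
        ast.parse(completion)
    except SyntaxError:
        continue

    # 2) Functional correctness: run the task's unit tests.
    # WARNING: exec() runs model-generated code; use a subprocess/sandbox with a timeout in practice.
    env = {}
    try:
        exec(completion + "\n" + task["test"], env)
        env["check"](env[task["entry_point"]])  # HumanEval tests expose check(candidate)
        passed += 1
    except Exception:
        pass

print(f"pass rate: {passed / len(problems):.2%}")
```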
What’s new in CodeGen 2.5 compared to earlier releases?
CodeGen 2.5 (7B) supports robust code infilling (it looks at both left and right context) and is optimized for fast sampling via Flash Attention, improving efficiency for training and inference. Mono and multi variants use a permissive Apache 2.0 license; the "instruct" variant is research-only, not for commercial use. A sketch of the infill prompt format follows.
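A minimal infilling sketch; the mask/separator layout follows the Salesforce CodeGen2/2.5 model cards (verify it against the card of the checkpoint you use), and the snippet assumes enough memory for the 7B model:

```python
# CodeGen 2.5 infilling: mark the span to fill with <mask_1>, close the prompt with
# <|endoftext|><sep><mask_1>, and read back only the newly generated tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen25-7b-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "def bubble_sort(items):\n    "
suffix = "\n    return items\n"
prompt = prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the infilled span (tokens generated after the prompt).
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```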
Why consider StarCoder2-3B, and what are its strengths?
StarCoder2-3B is a compact, high-performing coding SLM trained on The Stack v2 and curated text, with Grouped Query Attention for faster inference, a 16k-token context window (with a 4k sliding window), and strong multi-language support including Python. It can be used via Transformers and quantized with bitsandbytes to 8-bit or 4-bit to reduce memory and improve speed (see the sketch below).
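A short sketch of loading StarCoder2-3B in 4-bit with bitsandbytes; this path assumes a CUDA GPU, and the NF4 settings are illustrative defaults rather than the chapter's exact configuration:

```python
# Load StarCoder2-3B with 4-bit NF4 quantization to cut memory use roughly 4x.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```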
How can I run a Python code LLM on a laptop with limited resources?
- Use a GGUF-quantized model (e.g., StarCoder2-3B Q4_K_M) and llama.cpp via the llama-cpp-python bindings.
- Download the GGUF from Hugging Face, load it with a small CPU thread count, and generate locally (even offline after the first download); a sketch follows this list.
- On a MacBook Air M1 (8 GB), the chapter reports ~432 ms average latency and ~2.31 TPS using 4 threads and CPU-only.
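A minimal local-inference sketch with llama-cpp-python; the GGUF repository and file name are assumptions, so point them at whichever StarCoder2 GGUF you downloaded:

```python
# Run a 4-bit StarCoder2 GGUF on CPU only, mirroring the chapter's low-resource setup.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="second-state/StarCoder2-3B-GGUF",  # assumed Hugging Face GGUF repo
    filename="*Q4_K_M.gguf",                    # 4-bit K-quant variant
    n_ctx=2048,       # context window
    n_threads=4,      # small CPU thread count, as in the MacBook Air M1 run
    n_gpu_layers=0,   # CPU only; raise to offload layers to Metal on Apple Silicon
    verbose=False,
)

out = llm("def fibonacci(n):", max_tokens=64, temperature=0.2, stop=["\n\n"])
print(out["choices"][0]["text"])
```

For IDE integration, the same GGUF can also be served through llama-cpp-python's OpenAI-compatible server (for example, python -m llama_cpp.server --model <path-to-gguf>), so editors and plugins can talk to it like any OpenAI-style endpoint.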