8 Generating Protein Structures

Overview

This chapter broadens the focus from code-generation SLMs to chemistry and drug discovery, showing how Transformer models adapted to sequential chemical representations can generate meaningful biological and materials structures. It motivates domain-specific small language models as practical alternatives to large closed systems in settings where specialized data, task tuning, and unusual data formats limit general LLMs. Across three case studies—de novo protein design, therapeutic antibody generation, and crystal structure synthesis—the chapter blends modeling guidance with hands-on inference workflows, validation strategies tailored to each domain, and lightweight performance optimizations suitable for accessible hardware.

For proteins, ProtGPT2 is a 738M-parameter, decoder-only GPT-2 variant trained with self-supervision on UniRef50 to predict the next “oligomer” token, producing de novo sequences that preserve natural amino acid propensities, secondary structure, and globularity. Inference is straightforward via a standard pipeline and fast enough on commodity CPUs; sequence quality can be triaged with perplexity. The Antibody Generator (AntibodyGPT) fine-tunes ProGen2 models (small/medium/large) on 5,000 antibody–antigen crystal structures to propose therapeutic antibody sequences. Inference typically needs a GPU for reasonable latency; enabling expandable CUDA allocator segments and applying 8-bit quantization via BitsAndBytes substantially reduce memory use and time (e.g., shrinking the small model from ~617 MB to ~157 MB and cutting average latency by about 40%). Biological plausibility is assessed with ANARCI for numbering and receptor classification, with expert review advised.

For materials, CrystaLLM treats crystallographic information files (CIF) as token sequences and performs GPT-style next-token prediction to generate valid inorganic crystal structures conditioned on prompts such as a target composition (for example, data_Na2Cl2). A small 25M-parameter model runs efficiently on CPU: the workflow creates a prompt, samples raw CIF content, post-processes it, and validates outputs for space-group and atomic consistency and reasonable bond lengths; an MCTS-based decoding option enhances structural sampling. The chapter also outlines how to wrap pretrained checkpoints for use with common tooling by adding a config, converting weights, and packaging for distribution. Overall, it demonstrates that compact, task-tuned SLMs—combined with domain-aware validation and modest inference optimizations—can deliver practical pipelines for protein and crystal structure generation.
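
As a sketch of that wrapping step, here is one way to load a GPT-style checkpoint into Hugging Face Transformers, assuming a GPT-2-compatible layout; the file name and hyperparameters below are assumptions, not CrystaLLM's actual values.

```python
# Hedged sketch: wrap a GPT-style checkpoint for use with Transformers.
# The path, hyperparameters, and state-dict key layout are assumptions;
# adjust them to match the actual CrystaLLM checkpoint.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=371,    # assumption: size of the CIF token vocabulary
    n_positions=3072,  # assumption: maximum context length
    n_embd=512,        # assumption: hidden size of the ~25M model
    n_layer=8,
    n_head=8,
)

model = GPT2LMHeadModel(config)
state_dict = torch.load("crystallm_small.pt", map_location="cpu")  # hypothetical file
model.load_state_dict(state_dict, strict=False)  # key names may need remapping

# Package for distribution; model.push_to_hub("...") publishes it to the Hub.
model.save_pretrained("crystallm-small-hf")
```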

Figure: The different options available for the ProGen2 and Antibody Generator model families.
Figure: The output of the online ANARCI tool for one of the antibody sequences generated in our example.
Figure: The CrystaLLM training (a) and generation with MCTS decoding (b) processes.

Summary

  • Chemistry language models adapt NLP techniques to represent molecular structures as text sequences for drug discovery tasks.
  • ProtGPT2 generates protein sequences using a decoder-only GPT-2 architecture trained on 87,162 protein sequences from the UniRef50 database.
  • Protein generation is prompted with the <|endoftext|> token and produces amino acid sequences that follow natural protein patterns.
  • Perplexity scoring evaluates generated protein sequences, with lower scores indicating higher-quality outputs.
  • AntibodyGPT fine-tunes ProGen2 models on 5,000 experimentally resolved crystal structures for therapeutic antibody generation.
  • The ProGenForCausalLM class enables loading and inference with ProGen2-based models from custom repositories.
  • CUDA memory allocation settings can improve inference latency by enabling expandable segments for better memory management.
  • 8-bit quantization reduces the AntibodyGPT small model from ~617 MB to ~157 MB with a roughly 40% improvement in average latency.
  • ANARCI tool validates generated antibody sequences for conformity to known antibody structural patterns.
  • CrystaLLM generates crystal structures from textual representations of crystallographic information files.
  • CIF file generation involves tokenizing crystallographic data and using autoregressive prediction for structure completion.
  • Monte Carlo Tree Search algorithm improves structural sampling quality for crystal structure generation.
  • Chemical structure validation requires domain-specific tools rather than standard text generation metrics.
  • Domain-specific models achieve effective results with smaller parameter counts than generalist language models.
  • Specialized chemistry models require subject matter expert validation for biological relevance and functional viability.

FAQ

What does Chapter 8 focus on?
It demonstrates how small, domain-specific language models can generate biological and chemical structures and how to run them efficiently. Concretely, it covers generating protein sequences, creating crystal structures from CIF-like text, and practical inference optimizations tailored to chemistry and drug discovery workflows.

How are Transformers adapted to chemistry tasks?
Chemistry problems are reframed as sequence modeling. Molecules, reactions, and lab procedures are expressed as text (e.g., SMILES), enabling models to learn from open datasets. A notable example is the Molecular Transformer, which treats reaction prediction as machine translation of SMILES, achieving top-1 accuracy above 90% on common benchmarks without handcrafted rules.

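To make the reframing concrete, here is a small illustration (not from the chapter): a reaction written purely as text in SMILES, split into tokens with a regular expression of the kind commonly used in SMILES-based sequence models.

```python
import re

# A reaction as text: reactants >> product, all in SMILES.
# Esterification: acetic acid + ethanol -> ethyl acetate.
reaction = "CC(=O)O.OCC>>CC(=O)OCC"

# Regex tokenizer of the kind used in sequence-to-sequence chemistry models.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

tokens = SMILES_TOKEN.findall(reaction)
print(tokens)  # ['C', 'C', '(', '=', 'O', ')', 'O', '.', 'O', 'C', 'C', '>', '>', ...]
```
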
Why prefer small open-source models for drug discovery?
Large closed models often underperform on specialized, proprietary chemistry tasks and are costly or infeasible to fine-tune. Small open-source models offer full control, can be trained or adapted on domain-specific data, and are released with permissive licenses and datasets, enabling production-grade performance at reasonable cost.

What is ProtGPT2 and how is it trained?
ProtGPT2 is a 738M-parameter, decoder-only GPT-2 model trained to predict the next token (amino acid/oligomer) on 87,162 raw protein sequences from UniRef50 (UR50_2021_04) without annotations. It learns natural protein features (amino acid propensities, secondary structure content, globularity) and explores novel regions of protein space for de novo design.

How do I generate protein sequences with ProtGPT2, and how fast is it?
You can load it from the Hugging Face Hub using a text-generation pipeline and sample multiple sequences with parameters like top_k, temperature, and repetition_penalty. On a Colab free-tier CPU, generating a batch of 10 sequences typically completes in under 2 minutes, and you can seed generation with any starting amino acid, fragment, or protein prompt.

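A minimal sketch of that workflow, following the sampling parameters suggested on the ProtGPT2 model card (the values are illustrative defaults, not prescriptions):

```python
from transformers import pipeline

# Load ProtGPT2 from the Hugging Face Hub; a CPU is sufficient, if slow.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# Seeding with the end-of-text token samples de novo sequences; any amino
# acid fragment can be used instead to steer generation.
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    eos_token_id=0,
)
for s in sequences:
    print(s["generated_text"])
```
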
How is the quality of ProtGPT2 outputs assessed?
A simple, widely used proxy is perplexity: lower perplexity indicates the model finds a sequence more likely under its learned distribution. While helpful, perplexity should be complemented with domain checks when possible (e.g., downstream structure predictions or expert review) depending on the use case.

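A minimal sketch of perplexity scoring via the model's own loss; the candidate sequences below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
model.eval()

def perplexity(sequence: str) -> float:
    """Exponentiated mean next-token loss; lower means 'more natural'."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Rank candidates so the most plausible sequences come first.
candidates = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MAAAAAAAAAAAAAAAA"]
for seq in sorted(candidates, key=perplexity):
    print(round(perplexity(seq), 1), seq)
```
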
What is AntibodyGPT (Antibody Generator) and what models exist?
AntibodyGPT fine-tunes Salesforce’s ProGen2 (decoder-only) on 5,000 experimentally resolved antibody–antigen crystal structures to generate therapeutic antibody sequences. Three variants—small (151M), medium (765M), and large (2.7B)—are available on the Hugging Face Hub.

How do I run AntibodyGPT efficiently?
Use the custom ProGenForCausalLM AutoClass and tokenizer from the referenced GitHub repo to load checkpoints, then condition generation on a target antigen sequence. A GPU is recommended; on CPU, generation can take hours. Enabling PyTorch’s CUDA caching allocator expandable_segments and quantizing to 8-bit with BitsAndBytes can reduce average latency by roughly 40% versus the baseline.

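A hedged sketch of the optimized loading path. The Hub repository id below is hypothetical, and trust_remote_code is one way to pick up a custom model class such as ProGenForCausalLM; the chapter's repository may wire this up differently.

```python
import os
# Opt into expandable allocator segments before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "AntibodyGeneration/fine-tuned-progen2-small"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,  # loads the repo's custom ProGenForCausalLM class
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",       # requires a CUDA GPU and the accelerate package
)

# Condition generation on a (shortened, illustrative) target antigen sequence.
antigen = "MFVFLVLLPLVSSQ"
inputs = tokenizer(antigen, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0]))
```
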
How do I validate generated antibody sequences?
Use ANARCI (web tool or Python API) for antibody numbering and receptor classification. It checks conformity to known sequence patterns and structural frameworks, helping ensure outputs are biologically plausible and potentially functional; expert review remains essential.

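A sketch using ANARCI's Python API as shown in its README; the input is a made-up heavy-chain fragment, not a sequence from the chapter:

```python
from anarci import anarci  # e.g., conda install -c bioconda anarci

# Hypothetical generated heavy-chain variable-domain fragment.
sequences = [("generated_1",
              "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY"
              "ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR")]

numbering, alignment_details, hit_tables = anarci(
    sequences, scheme="imgt", output=False
)

if numbering[0] is None:
    print("Not recognized as an antibody domain.")
else:
    details = alignment_details[0][0]
    print("Chain type:", details["chain_type"])  # e.g., 'H' for heavy
    print("Species:", details["species"])
```
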
What is CrystaLLM, and how do I generate and validate CIFs?
CrystaLLM is a GPT-2–style, decoder-only model trained on textual CIF representations to generate physically plausible inorganic crystal structures. Workflow: clone the repo, download small (25M) or large (200M) weights, create a prompt (e.g., “data_Na2Cl2\n”), sample sequences to produce raw CIFs, post-process them, and validate with the provided script (checks include space-group and bond-length reasonableness). The small model runs on CPU; generating two CIFs with 3,000 max tokens typically takes under a minute. An optional MCTS-based decoding script improves structural sampling. You can also convert checkpoints to the Transformers format and push to the Hugging Face Hub.

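The chapter relies on CrystaLLM's bundled validation script; purely as an illustration of the same kinds of checks, here is a sketch using pymatgen (an assumption, not the chapter's tooling):

```python
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def basic_cif_checks(cif_text: str, min_bond: float = 0.5, max_bond: float = 4.0) -> bool:
    """Illustrative sanity checks: the CIF must parse, a space group must be
    detectable, and the shortest interatomic distance must look physical.
    The thresholds are placeholders, not the chapter's criteria."""
    try:
        structure = Structure.from_str(cif_text, fmt="cif")
    except Exception as err:
        print("CIF failed to parse:", err)
        return False

    print("Formula:", structure.composition.reduced_formula)
    print("Space group:", SpacegroupAnalyzer(structure).get_space_group_symbol())

    if len(structure) < 2:
        return True  # nothing to measure in a single-site cell

    shortest = min(
        structure.get_distance(i, j)
        for i in range(len(structure))
        for j in range(i + 1, len(structure))
    )
    print("Shortest distance (Å):", round(shortest, 2))
    return min_bond <= shortest <= max_bond
```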
