Overview

Appendix C. Qwen3 LLM source code

This appendix presents the concise, readable source code for the Qwen3 model used throughout the book, clarifying that “from scratch” refers to the reasoning methods rather than building an LLM end to end. The implementation mirrors a GPT-2–style, decoder-only transformer while adopting modern architectural updates common in contemporary LLMs. Instead of a step-by-step deep dive, the appendix offers a guided overview that connects design choices to code, so readers can see how the major components fit together and reuse the model via the reasoning_from_scratch package.

Key updates over a classic GPT-2 include RMSNorm in place of LayerNorm for lower cost and stable training, and a SwiGLU (SiLU-activated GLU) feed-forward module that improves expressivity while often using fewer parameters than the standard two-layer MLP. Positional information is injected with rotary position embeddings (RoPE), implemented in a clear “two-halves” style that rotates query and key vectors. The attention module uses grouped query attention (GQA), which shares keys and values across groups of query heads to reduce parameters and KV-cache bandwidth, typically with little or no loss in modeling quality; Qwen3 additionally applies optional QK normalization (RMSNorm on queries/keys) to further stabilize training. KV-cache support is integrated so that generation can be accelerated in streaming or incremental decoding scenarios.

The transformer block combines RMSNorm, RoPE-applied masked GQA, and the SwiGLU feed-forward module with residual connections, and is stacked repeatedly (28 times in the 0.6B variant). The Qwen3Model wraps these blocks with token embeddings, a final RMSNorm, and an output projection; it also precomputes the RoPE buffers and manages cache-aware causal masks, while a small KVCache utility holds per-layer keys and values during generation. A flexible configuration system supports multiple sizes; the 0.6B setup uses 1024-dimensional embeddings, 16 attention heads with a head dimension of 128, 28 layers, a 3072-wide intermediate layer, 8 KV groups, a large vocabulary, and a long context window. The tokenizer reimplementation mirrors the official behavior, handling numerous special and chat tokens and a hybrid “thinking” mode with intentionally nuanced prompt rules, enabling consistent formatting for both standard and reasoning-style interactions.
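
To make the 0.6B numbers above concrete, the following sketch collects them in a plain dictionary and derives a few quantities from them. The key names are illustrative assumptions and may differ from those in the book's configuration listing; the vocabulary size and context length are left out because this overview does not state them.

```python
# Values are taken from the overview text; the key names are hypothetical.
QWEN3_06B_SKETCH = {
    "emb_dim": 1024,     # embedding width
    "n_heads": 16,       # number of query heads
    "head_dim": 128,     # width of each attention head
    "n_layers": 28,      # number of transformer blocks
    "hidden_dim": 3072,  # feed-forward intermediate width
    "n_kv_groups": 8,    # number of shared key/value heads (GQA)
}

cfg = QWEN3_06B_SKETCH
attn_width = cfg["n_heads"] * cfg["head_dim"]          # total query width
queries_per_kv = cfg["n_heads"] // cfg["n_kv_groups"]  # query heads per KV group
kv_width = cfg["n_kv_groups"] * cfg["head_dim"]        # total key (or value) width

print(attn_width, queries_per_kv, kv_width)  # 2048 2 1024
```

Under these numbers, the combined query width (2048) is larger than the embedding width (1024), so the query projection would widen the input and the attention output projection would map it back, while keys and values are only half as wide as the queries.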

Figure C.1 Architectural comparison between Qwen3 and GPT-2. Both models process text through embedding layers and stacked transformer blocks, but they differ in certain design choices.
Figure C.2 Comparison of LayerNorm (used in GPT-2) and RMSNorm (used in Qwen3). LayerNorm (left) normalizes activations so that their average value (mean) is exactly zero and their spread (variance) is exactly one. RMSNorm (right) instead scales activations based on their root mean square, which does not enforce zero mean or unit variance, but still keeps the mean and variance within a reasonable range for stable training.
Figure C.3 In GPT-2 (top), the feed forward module consists of two fully connected (linear) layers separated by a non-linear activation function. In Qwen3 (bottom), this module is a gated linear unit (GLU) variant, which adds a third linear layer (linear layer 3) and multiplies the output of this linear layer 3 elementwise with the activated output of linear layer 1.
Figure C.4 Different activation functions that can be used in a feed forward module (neural network). GELU and SiLU (Swish) offer smooth alternatives to ReLU, which has a sharp kink at input zero.
Figure C.5 A comparison between multi-head attention (MHA) and grouped query attention (GQA). Here, the group size is 2, so each key-value pair is shared by 2 query heads.
Figure C.6 The structure of the transformer block in Qwen3. Each block includes RMSNorm, RoPE, masked grouped-query attention, and a feed-forward module, and is repeated 28 times in the 0.6B-parameter model.
Figure C.7 Architecture of the Qwen3 0.6B model. The model consists of a token embedding layer followed by 28 transformer blocks, each containing RMSNorm, RoPE, QKNorm, masked grouped-query attention with 16 heads, and a feed-forward module with an intermediate size of 3,072.

C.9 Using the model

Let's now instantiate and use the model to confirm that the code works by reusing the text generation approach from chapter 2.

First, we instantiate the model using the pre-trained model weights:

The output shows the structure of the instantiated model, which should match the values we used in the configuration file in listing C.7:

Next, we re-use the text generation functions from chapter 2 to generate text:

Since we used the same prompt as in chapter 2, the generated text matches the chapter 2 output exactly:

While the main chapters use the 0.6-billion-parameter variant of Qwen3 to lower the resource requirements for this book, interested readers can find more information on how to use the larger models in appendix D.

FAQ

What does “from scratch” mean in this book’s context? It refers to building reasoning techniques and training/evaluation utilities from the ground up, not implementing a full LLM from raw primitives. The full “implement an LLM from scratch” topic is covered in the author’s separate Build a Large Language Model (From Scratch) book.
How is Qwen3 similar to and different from GPT-2? Both are decoder-only transformer architectures with token embeddings, stacked transformer blocks, and a final projection. Qwen3 adopts modern choices absent in GPT-2: RMSNorm instead of LayerNorm, a SwiGLU feed-forward module, rotary position embeddings (RoPE), grouped query attention (GQA), and optional QKNorm on queries/keys.
Why does Qwen3 use RMSNorm instead of LayerNorm? RMSNorm rescales activations using their root mean square without mean-centering. It is slightly cheaper, removes the bias term by default, and needs only one cross-feature reduction (the mean of squares) where LayerNorm needs two (mean and variance). This reduces GPU communication and improves training efficiency while maintaining stability.
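
The difference described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than PyTorch, so it runs without dependencies; the function names, the `eps` default, and the example vector are assumptions made for this sketch, not the book's implementation.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """Scale a vector by its root mean square: no mean-centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # single reduction
    if weight is None:
        weight = [1.0] * len(x)  # learnable scale, initialized to ones
    return [w * v / rms for w, v in zip(weight, x)]

def layer_norm(x, eps=1e-6):
    """Reference LayerNorm without learnable parameters (two reductions)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 6.0]
y_rms = rms_norm(x)
y_ln = layer_norm(x)

print(sum(y_ln) / len(y_ln))                   # approximately 0: the mean is removed
print(sum(y_rms) / len(y_rms))                 # nonzero: the mean is not removed
print(sum(v * v for v in y_rms) / len(y_rms))  # approximately 1: unit mean square
```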
How does the SwiGLU feed-forward module work, and why can it use fewer parameters? SwiGLU replaces the standard 2-layer MLP with three linear layers and a gated interaction: SiLU(fc1(x)) * fc2(x), followed by fc3. In practice, fc1 and fc2 are typically each about half as wide as the expansion layer of a standard MLP, so the total parameter count is often lower, while the multiplicative gate increases expressivity and performance.
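
The following sketch spells out the gated formula above and the parameter arithmetic behind the "often lower" claim. It is plain Python with bias-free layers; the 4x expansion for the standard MLP (the GPT-2 convention) and the exactly-half width for the gated variant are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

def silu(z):
    """SiLU (Swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def linear(x, W):
    """Bias-free linear layer; W is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    """Compute fc3(SiLU(fc1(x)) * fc2(x))."""
    gate = [silu(v) for v in linear(x, W1)]
    up = linear(x, W2)
    return linear([g * u for g, u in zip(gate, up)], W3)

def count_params(d, hidden, n_matrices):
    """Number of weights in n bias-free matrices of shape d x hidden."""
    return n_matrices * d * hidden

# Toy forward pass: 2 inputs, 2 hidden units, 1 output.
out = swiglu_ffn([1.0, -2.0], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 1]])

d = 1024
standard = count_params(d, 4 * d, 2)  # two layers with 4x expansion (assumed)
gated = count_params(d, 2 * d, 3)     # three layers at half that width
print(standard, gated)  # 8388608 6291456
```

With these assumed widths the gated module has three quarters of the weights of the standard one. The actual ratio depends on the chosen intermediate width, which is 3072 for the 0.6B configuration.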
What are Rotary Position Embeddings (RoPE), and how are they implemented here? RoPE encodes position by rotating attention queries and keys with position-dependent cos/sin phases, avoiding added positional vectors. The appendix implements the split-halves variant, which pairs each dimension in the first half of a head vector with its counterpart in the second half. This is equivalent, up to a reordering of dimensions, to the interleaved even/odd style used in some repos.
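
A minimal sketch of the split-halves rotation for a single head vector is shown below, in plain Python. The function names and the `base` default of 10,000 are assumptions for illustration (the book's configuration may use a different base), and a real implementation would precompute the cos/sin tables once rather than per call.

```python
import math

def rope_angles(head_dim, position, base=10_000.0):
    """One rotation angle per dimension pair (i, i + head_dim // 2)."""
    half = head_dim // 2
    return [position * base ** (-2.0 * i / head_dim) for i in range(half)]

def apply_rope(x, position, base=10_000.0):
    """Rotate one head vector using the split-halves layout."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    angles = rope_angles(len(x), position, base)
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    # Equivalent to x * cos + concat(-x2, x1) * sin
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (2 in both cases).
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
print(dot(apply_rope(q, 2), apply_rope(k, 0)))
```

The two printed scores agree up to floating-point error, which is the property that lets RoPE express relative position without adding positional vectors to the embeddings.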
What is Grouped Query Attention (GQA), and why is it used? GQA shares key/value projections across groups of attention heads, reducing parameters and KV-cache bandwidth during inference. Empirically, it performs comparably to standard multi-head attention. Qwen3 also supports QKNorm (RMSNorm on queries/keys) to improve stability.
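
The sharing pattern and the resulting cache savings can be worked out with simple arithmetic. The sketch below uses the 0.6B numbers from this appendix (16 heads, 8 KV groups, head dimension 128, 28 layers); the helper names are hypothetical, and assigning consecutive query heads to the same group is an assumption of this sketch.

```python
def kv_group_for_head(q_head, n_heads, n_kv_groups):
    """Index of the shared key/value head that a query head attends with."""
    group_size = n_heads // n_kv_groups
    return q_head // group_size

def kv_cache_values_per_token(n_kv_heads, head_dim, n_layers):
    """Cached numbers per token: keys and values, for every layer."""
    return 2 * n_kv_heads * head_dim * n_layers

n_heads, n_kv_groups, head_dim, n_layers = 16, 8, 128, 28

mapping = [kv_group_for_head(h, n_heads, n_kv_groups) for h in range(n_heads)]
mha = kv_cache_values_per_token(n_heads, head_dim, n_layers)      # one KV head per query head
gqa = kv_cache_values_per_token(n_kv_groups, head_dim, n_layers)  # shared KV heads

print(mapping)   # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
print(mha, gqa)  # 114688 57344
```

With a group size of 2, the cache holds half as many values per token as it would with one key/value head per query head.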
What is the KV cache, and how does it speed up generation? The KV cache stores past keys/values per layer so the model doesn’t recompute them for prior tokens at each decoding step. During generation, new keys/values are appended to the cache, the causal mask is sliced to the active window, and only the latest tokens are processed, yielding significant speedups.
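
The append-and-slice pattern described above can be sketched as follows. `TinyKVCache` and `causal_mask_rows` are hypothetical names invented for this illustration and are not the book's KVCache utility; strings stand in for key and value tensors so the bookkeeping is easy to follow.

```python
class TinyKVCache:
    """Hypothetical per-layer store for keys and values (one entry per token)."""

    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer, new_keys, new_values):
        """Add entries for the new tokens and return the full history."""
        self.keys[layer].extend(new_keys)
        self.values[layer].extend(new_values)
        return self.keys[layer], self.values[layer]

    def seq_len(self, layer=0):
        return len(self.keys[layer])


def causal_mask_rows(pos_start, pos_end):
    """Mask rows for the new query positions only; True blocks a future key."""
    return [[k > q for k in range(pos_end)] for q in range(pos_start, pos_end)]


cache = TinyKVCache(n_layers=1)
cache.append(0, ["k0", "k1", "k2"], ["v0", "v1", "v2"])  # prompt tokens
start = cache.seq_len()
keys, values = cache.append(0, ["k3"], ["v3"])           # one decoding step
mask = causal_mask_rows(start, cache.seq_len())

print(len(keys), mask)  # 4 [[False, False, False, False]]
```

In the decoding step only one new query row is needed, and it may attend to all four cached positions, so nothing is masked. For the initial prompt, `causal_mask_rows(0, 3)` produces the usual lower-triangular pattern.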
What does a Qwen3 transformer block contain? Each block applies RMSNorm → masked attention (with RoPE and GQA) → residual add, then RMSNorm → SwiGLU feed-forward → residual add. In the 0.6B model, this block is repeated 28 times.
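
The order of operations above (normalize first, then add the result back to the unnormalized input) can be sketched with stand-in functions. Everything in this example is a hypothetical placeholder: the sublayers are passed in as plain callables, and the toy "attention" and "feed-forward" functions only scale their input so the residual arithmetic is easy to check by hand.

```python
def transformer_block(x, norm1, attention, norm2, feed_forward):
    """Pre-norm block: norm -> sublayer -> residual add, applied twice."""
    shortcut = x
    x = attention(norm1(x))
    x = [a + b for a, b in zip(x, shortcut)]  # first residual connection
    shortcut = x
    x = feed_forward(norm2(x))
    return [a + b for a, b in zip(x, shortcut)]  # second residual connection


calls = []

def traced(name, fn):
    """Wrap a stand-in sublayer so the call order is recorded."""
    def wrapper(v):
        calls.append(name)
        return fn(v)
    return wrapper

identity = lambda v: list(v)

out = transformer_block(
    [1.0, 2.0],
    norm1=traced("norm1", identity),
    attention=traced("attention", lambda v: [0.5 * a for a in v]),
    norm2=traced("norm2", identity),
    feed_forward=traced("feed_forward", lambda v: [2.0 * a for a in v]),
)

print(calls)  # ['norm1', 'attention', 'norm2', 'feed_forward']
print(out)    # [4.5, 9.0]
```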
What are the main components of the Qwen3Model forward pass? The model embeds tokens, builds a causal mask (with special handling when a KV cache is present), applies precomputed RoPE cos/sin buffers inside attention, passes through the stack of transformer blocks, applies a final RMSNorm, and projects to vocabulary logits. With caching, it tracks the current position and updates per-layer cached K/V tensors.
How does the Qwen3 tokenizer differ between base and reasoning modes? It supports many special tokens and a chat template. The effective end-of-sequence token differs between base and chat/reasoning variants. A noteworthy quirk: when add_thinking=True, no explicit empty “<think></think>” block is inserted; when add_thinking=False, that block is added. This mirrors the hybrid behavior of the official Qwen3 0.6B reasoning-capable model.
