Overview

1 Introduction to DeepSeek

Large language models now write, code, and reason at near-human levels—but this chapter invites you to go a step further: build one yourself to truly understand how it works. It frames DeepSeek as a pivotal moment for open-source AI, showing that freely available models can rival proprietary systems in capability and efficiency. Grounded in a spirit of democratization, the book sets the goal of reconstructing DeepSeek’s core ideas from first principles so you gain both intuition and hands-on skill, not just surface-level familiarity.

At a high level, the roadmap centers on four pillars that distinguish DeepSeek from a standard Transformer: Multi-Head Latent Attention to ease attention’s speed and memory bottlenecks, a Mixture-of-Experts layer to expand capacity efficiently, Multi-Token Prediction to accelerate learning and inference, and FP8 quantization to push computational efficiency. These are complemented by an optimized training pipeline that overlaps workloads for high device utilization, and a multi-step post-training process that blends supervised fine-tuning, pure reinforcement learning, rejection sampling and self-labeling, and a final RL phase to cultivate strong reasoning. The chapter also highlights knowledge distillation, compressing a massive teacher into compact student models (spanning roughly 1.5B to 70B parameters) that retain much of the performance while remaining practical.
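To make one of these pillars concrete, here is a minimal, self-contained sketch of the multi-token prediction idea: several small heads predict tokens t+1 through t+k from the same hidden state, and their losses are averaged. This is an illustration only; DeepSeek's actual MTP module is organized differently, and the class and parameter names below (MultiTokenHead, k) are assumptions made for this sketch.

# Illustrative multi-token prediction loss (not DeepSeek's actual MTP module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k: int = 2):
        super().__init__()
        self.k = k
        # one small prediction head per future offset (t+1, ..., t+k)
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden, targets):
        # hidden:  (batch, seq, d_model) final hidden states
        # targets: (batch, seq) token ids; head i at position t predicts token t+1+i
        losses = []
        for i, head in enumerate(self.heads):
            logits = head(hidden[:, : hidden.size(1) - (i + 1)])  # drop last i+1 positions
            labels = targets[:, i + 1 :]                          # shift labels by i+1
            losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1)))
        return torch.stack(losses).mean()

# toy usage
torch.manual_seed(0)
mtp = MultiTokenHead(d_model=32, vocab_size=100, k=2)
hidden = torch.randn(4, 16, 32)
tokens = torch.randint(0, 100, (4, 16))
print(mtp(hidden, tokens))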

The book is organized into four progressive stages: foundations for efficient inference (including the KV cache), core architectural innovations (implementing MLA and MoE), advanced training techniques (MTP, FP8, and pipeline scheduling), and post-training methods (SFT, RL, and distillation) that culminate in a cohesive mini-DeepSeek. The scope prioritizes clarity and reproducibility over reproducing proprietary data, exact weights, or massive distributed infrastructure, and it omits production deployment concerns. To follow along, readers need Python proficiency, basic deep learning and PyTorch familiarity, and a working grasp of Transformers; experiments are designed to run on modest hardware—from CPU-only laptops (slower) to single GPUs with 8–12GB VRAM, with larger GPUs simply enabling more ambitious MoE exercises.

Figures

A simple interaction with the DeepSeek chat interface.
The title and abstract of the DeepSeek-R1 research paper.
A detailed view of a standard Transformer block, the foundational architecture used in models like LLaMA and the GPT series. It is composed of a multi-head attention block and a feed-forward network (FFN).
A simplified view of the DeepSeek model architecture. It modifies the standard Transformer by replacing the core components with Multi-Head Latent Attention (MLA) and a Mixture-of-Experts (MoE) layer. This design also utilizes RMS Norm (Root Mean Square Normalization) and a specialized Decoupled RoPE (Rotary Position Embedding).
An illustration of the DualPipe training pipeline on a single device. By overlapping the forward pass (the initial blocks), backward pass (the hatched blocks), and combined computations, this scheduling strategy minimizes GPU idle time and maximizes hardware utilization during large-scale training.
The multi-step post-training pipeline used to create DeepSeek-R1 from the DeepSeek-V3 base model. This process involves a combination of reinforcement learning (Pure RL), data generation (Rejection sampling), and fine-tuning to instill advanced reasoning capabilities.
Benchmark performance of DeepSeek-R1 against other leading models (as of January 2025).
The concept of knowledge distillation. A large, powerful "teacher" model (like DeepSeek-R1) is used to generate training data to teach a much smaller, more efficient "student" model, transferring its capabilities without the high computational cost.
The four-stage roadmap for building a mini-DeepSeek model in this book. We will progress from foundational concepts (Stage 1) and core architecture (Stage 2) to advanced training (Stage 3) and post-training techniques (Stage 4), implementing each key innovation along the way.

Summary

  • Large Language Models (LLMs) have become a dominant force in technology, but the knowledge to build them has often been confined to a few large labs.
  • DeepSeek marked a pivotal moment by releasing open-source models with performance that rivaled the best proprietary systems, demonstrating that cutting-edge AI could be developed and shared openly.
  • This book will guide you through a hands-on process of building a mini-DeepSeek model, focusing on its key technical innovations to provide a deep, practical understanding of modern LLM architecture and training.
  • The core innovations we will implement are divided into four stages: (1) KV Cache Foundation, (2) Core Architecture (MLA & MoE), (3) Advanced Training Techniques (MTP & FP8), and (4) Post-training (RL & Distillation).
  • By building these components yourself, you will gain not just theoretical knowledge but also the practical skills to implement and adapt state-of-the-art AI techniques.

FAQ

What makes DeepSeek a turning point in open-source AI?
DeepSeek showed that openly released models can rival top proprietary systems. With DeepSeek-R1 (early 2025), the team released weights and methods that achieved state-of-the-art reasoning performance while training at a fraction of the cost of closed models—narrowing the open vs. closed gap more than ever before.
What exactly will we build in this book?
A mini-DeepSeek model that implements DeepSeek's core innovations: Multi-Head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), FP8 quantization, an efficient DualPipe training pipeline, and post-training techniques (RL and distillation).
How does DeepSeek's architecture differ from a standard Transformer?
It replaces standard multi-head attention with Multi-Head Latent Attention (MLA) and the feed-forward block with a DeepSeek-style Mixture-of-Experts (MoE). It also uses RMSNorm and a specialized decoupled RoPE for position encoding, targeting speed, memory, and capacity bottlenecks.
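As a rough picture of that layout, the following sketch wires RMSNorm, an attention layer, and an expert layer into a pre-norm residual block. The attention and expert modules are stand-ins (nn.MultiheadAttention and an injected moe_layer), not DeepSeek's actual MLA or MoE implementations, which are built step by step in later chapters.

# A minimal, illustrative DeepSeek-style block layout (stand-in components).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square of the features, then rescale
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class DeepSeekStyleBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, moe_layer: nn.Module):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        # stand-in for MLA; a real MLA layer compresses keys/values into a
        # latent vector and uses decoupled RoPE for positions
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = RMSNorm(d_model)
        # stand-in for the MoE layer that replaces the dense FFN
        self.moe = moe_layer

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.moe(self.norm2(x))
        return x

# toy usage; here a dense FFN stands in for the MoE layer
block = DeepSeekStyleBlock(
    d_model=32, n_heads=4,
    moe_layer=nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 32)),
)
print(block(torch.randn(2, 10, 32)).shape)  # torch.Size([2, 10, 32])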
What problems do MLA, MoE, MTP, and FP8 each address?
  • MLA: mitigates attention speed and memory bottlenecks for long sequences.
  • MoE: increases effective model capacity without linearly increasing compute.
  • MTP: predicts multiple future tokens to accelerate learning and inference.
  • FP8: 8-bit floating-point quantization to boost efficiency and reduce resource usage.
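For the MoE entry above, a minimal top-2 router shows the core idea: each token is processed by only a few experts, so capacity grows without a matching growth in per-token compute. DeepSeek's MoE additionally uses shared experts and its own load-balancing scheme; the TinyMoE class and its sizes here are assumptions made for illustration.

# A toy top-k token router, sketching the MoE idea only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model); each token is routed to its top_k experts only
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask][:, slot].unsqueeze(-1) * expert(x[mask])
        return out

# toy usage: 8 tokens of width 16
x = torch.randn(8, 16)
print(TinyMoE(16)(x).shape)  # torch.Size([8, 16])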
What is the DualPipe training pipeline and why does it matter?
DualPipe overlaps the forward pass of the next batch with the backward pass of the current batch, keeping GPUs busy and minimizing idle time. By coordinating data loading, preprocessing, and computation, it raises hardware utilization during large-scale training.
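A toy, framework-free sketch of the scheduling intuition (not DeepSeek's actual DualPipe code): in a naive schedule each micro-batch's forward is followed by a wait before its backward can run, while an interleaved schedule fills those gaps with the next micro-batch's forward.

# Conceptual toy: compare a naive per-micro-batch schedule with an overlapped one.
def naive_schedule(n_microbatches):
    steps = []
    for i in range(n_microbatches):
        steps += [f"F{i}", "idle", f"B{i}"]  # device waits between forward and backward
    return steps

def overlapped_schedule(n_microbatches):
    steps = ["F0"]
    for i in range(1, n_microbatches):
        steps += [f"F{i}", f"B{i-1}"]        # next forward fills the gap before the previous backward
    steps.append(f"B{n_microbatches - 1}")
    return steps

print(naive_schedule(3))      # ['F0', 'idle', 'B0', 'F1', 'idle', 'B1', 'F2', 'idle', 'B2']
print(overlapped_schedule(3)) # ['F0', 'F1', 'B0', 'F2', 'B1', 'B2']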
How was DeepSeek-R1 created from the DeepSeek-V3 base model?
A five-step post-training process: (1) start from a lightly fine-tuned DeepSeek-V3, (2) apply pure RL to develop reasoning skills, (3) use rejection sampling to self-label high-quality outputs, (4) blend synthetic and supervised data, and (5) perform a final RL phase over diverse prompts to improve robustness.
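Step 3 can be pictured with a short schematic sketch: sample several candidate answers per prompt, keep only those a verifier or reward scores highly, and reuse the survivors as fine-tuning data. The generate and score callables below are hypothetical stand-ins, not DeepSeek's actual tooling.

# Schematic rejection sampling: keep only high-scoring generations as training data.
import random

def rejection_sample(prompts, generate, score, n_samples=8, threshold=0.8):
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        kept = [c for c in candidates if score(prompt, c) >= threshold]
        dataset += [(prompt, answer) for answer in kept]
    return dataset

# toy usage with dummy stand-ins for the model and the verifier
data = rejection_sample(
    ["What is 2 + 2?"],
    generate=lambda p: random.choice(["4", "5"]),
    score=lambda p, a: 1.0 if a == "4" else 0.0,
)
print(data)  # only the correct answers survive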
How does DeepSeek-R1 perform compared to proprietary models, and at what cost?
At release, R1 matched or outperformed leading models like OpenAI's o1-1217 on tough reasoning benchmarks (e.g., AIME 2024, competitive coding on Codeforces) while being trained at a fraction of the cost reported for top closed models.
What is knowledge distillation here, and which checkpoints are available?
A large "teacher" (DeepSeek-R1) generates signals to train smaller "student" models that are cheaper to run. DeepSeek released distilled checkpoints around 1.5B, 7B, 8B, 14B, 32B, and 70B parameters (based on the Qwen2.5 and Llama 3 series), offering strong performance with practical resource needs.
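DeepSeek's distilled checkpoints were produced by fine-tuning the students on R1-generated samples; the classic logit-matching formulation of distillation, shown below as a minimal sketch, instead trains the student to match the teacher's temperature-softened output distribution with a KL loss. It is included only to illustrate the teacher/student idea.

# Minimal distillation loss: student matches the teacher's softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # both logits: (batch, vocab)
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t ** 2)

# toy usage
student = torch.randn(4, 100)
teacher = torch.randn(4, 100)
print(distillation_loss(student, teacher))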
How is the book structured, and what's in or out of scope?
Four stages: (1) foundations (KV cache), (2) core architecture (MLA, MoE), (3) advanced training (MTP, FP8, DualPipe), and (4) post-training (SFT, RL, distillation). In scope: theory plus hands-on implementations of these components. Out of scope: proprietary data, exact DeepSeek weights, massive distributed training for 100B+ models, and production deployment concerns.
What background and hardware do I need to follow along?
Prerequisites: Python, basic deep learning and backprop, some PyTorch, and a high-level understanding of Transformers. Hardware: a CPU-only laptop can run the examples (slowly); a single GPU with 8–12 GB of VRAM is recommended; 24–48 GB helps for the MoE experiments. Colab configs and small datasets are provided—no supercomputer required.
