Overview

1 Introduction to DeepSeek

Large language models have reshaped how we interact with technology, and this chapter invites you to go beyond using them to actually build one. It frames DeepSeek as a watershed in open-source AI—an openly available model that rivals closed, proprietary systems—and sets the tone for a hands-on journey where theory and code advance together. The chapter introduces the goals of the book, explains why DeepSeek is the ideal case study, and previews the path you will take to understand and implement its most important ideas.

At the core of DeepSeek’s leap are four technical pillars: Multi-Head Latent Attention (MLA) to relieve attention’s speed and memory bottlenecks, a Mixture-of-Experts (MoE) layer to expand capacity efficiently, Multi-Token Prediction (MTP) to accelerate learning and inference, and FP8 quantization to push compute and memory efficiency. These are complemented by an optimized training pipeline that overlaps tasks to maximize hardware utilization and a post-training recipe that layers reinforcement learning, rejection sampling, and fine-tuning to instill strong reasoning skills. The chapter also highlights DeepSeek-R1’s headline results—top-tier benchmark performance at markedly lower training cost—and the team’s commitment to democratization through open weights and distilled checkpoints from small (1.5B) to large (70B) models.

The book is organized into four stages: foundational inference mechanics (including the KV cache), architectural advances (MLA and MoE), training efficiency (MTP, FP8, and pipeline scheduling), and post-training methods (supervised fine-tuning, RL, and distillation). You will learn to reason from first principles and implement each component, but the scope stops short of reproducing proprietary data, training on trillion-token corpora, or production-scale serving. To follow along, you should be comfortable with Python, basic deep learning and PyTorch, and have a general grasp of transformers; a CPU is sufficient for most exercises, while a single consumer GPU (8–12 GB VRAM) makes experimentation smoother. Overall, the chapter sets expectations, tools, and mindset for building a mini-DeepSeek and understanding modern LLM design end to end.

Figures

  • A simple interaction with the DeepSeek chat interface.
  • The title and abstract of the DeepSeek-R1 research paper.
  • A detailed view of a standard Transformer block, the foundational architecture used in models like LLaMA and the GPT series. It is composed of a multi-head attention block and a feed-forward network (FFN).
  • A simplified view of the DeepSeek model architecture. It modifies the standard Transformer by replacing the core components with Multi-Head Latent Attention (MLA) and a Mixture-of-Experts (MoE) layer. This design also uses RMS Norm (Root Mean Square Normalization) and a specialized Decoupled RoPE (Rotary Position Embedding).
  • An illustration of the DualPipe training pipeline on a single device. By overlapping the forward pass (the initial blocks), the backward pass (the hatched blocks), and combined computations, this scheduling strategy minimizes GPU idle time and maximizes hardware utilization during large-scale training.
  • The multi-step post-training pipeline used to create DeepSeek-R1 from the DeepSeek-V3 base model. This process combines reinforcement learning (pure RL), data generation (rejection sampling), and fine-tuning to instill advanced reasoning capabilities.
  • Benchmark performance of DeepSeek-R1 against other leading models (as of January 2025).
  • The concept of knowledge distillation. A large, powerful "teacher" model (like DeepSeek-R1) is used to generate training data to teach a much smaller, more efficient "student" model, transferring its capabilities without the high computational cost.
  • The four-stage roadmap for building a mini-DeepSeek model in this book. We progress from foundational concepts (Stage 1) and core architecture (Stage 2) to advanced training (Stage 3) and post-training techniques (Stage 4), implementing each key innovation along the way.

Summary

  • Large Language Models (LLMs) have become a dominant force in technology, but the knowledge to build them has often been confined to a few large labs.
  • DeepSeek marked a pivotal moment by releasing open-source models with performance that rivaled the best proprietary systems, demonstrating that cutting-edge AI could be developed and shared openly.
  • This book will guide you through a hands-on process of building a mini-DeepSeek model, focusing on its key technical innovations to provide a deep, practical understanding of modern LLM architecture and training.
  • The core innovations we will implement are divided into four stages: (1) KV Cache Foundation, (2) Core Architecture (MLA & MoE), (3) Advanced Training Techniques (MTP & FP8), and (4) Post-training (RL & Distillation).
  • By building these components yourself, you will gain not just theoretical knowledge but also the practical skills to implement and adapt state-of-the-art AI techniques.

FAQ

What is DeepSeek, and why is it considered a turning point in open-source AI?
DeepSeek is an open-source large language model initiative whose R1 model matched or surpassed leading proprietary systems on tough reasoning benchmarks, while being released openly and reportedly trained at a fraction of typical costs. It significantly narrowed the gap between open and closed AI.

Which core innovations of DeepSeek will this book help me build and understand?
The book focuses on four pillars: Multi-Head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and FP8 quantization, plus an efficient training pipeline (DualPipe) and post-training methods (RL and distillation).

How does Multi-Head Latent Attention (MLA) differ from standard attention?
MLA replaces standard multi-head attention to address speed and memory bottlenecks, especially for long sequences. It reduces memory usage while maintaining quality and builds on concepts like the Key-Value (KV) cache.

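To make the memory argument concrete, here is a minimal back-of-the-envelope sketch comparing a standard per-head KV cache with an MLA-style compressed latent cache. The dimensions (the latent width d_latent in particular) are illustrative assumptions rather than DeepSeek's actual configuration, and the learned up-projections that reconstruct K and V from the latent are omitted:

```python
import torch

# Illustrative sketch (not DeepSeek's exact implementation): standard multi-head
# attention caches full K and V per position, while an MLA-style design caches a
# much smaller shared latent that K and V are re-projected from at compute time.
batch, seq_len, n_heads, head_dim = 1, 4096, 32, 128
d_latent = 512                        # hypothetical compressed latent width

# Standard KV cache: keys and values for every head and position.
k_cache = torch.zeros(batch, seq_len, n_heads, head_dim)
v_cache = torch.zeros(batch, seq_len, n_heads, head_dim)
standard_bytes = (k_cache.numel() + v_cache.numel()) * k_cache.element_size()

# MLA-style cache: one compressed latent per position (up-projections omitted).
latent_cache = torch.zeros(batch, seq_len, d_latent)
mla_bytes = latent_cache.numel() * latent_cache.element_size()

print(f"standard KV cache: {standard_bytes / 2**20:.1f} MiB")
print(f"latent cache:      {mla_bytes / 2**20:.1f} MiB")
print(f"reduction:         {standard_bytes / mla_bytes:.0f}x")
```

Because only the small latent is stored per position, the cache shrinks roughly by the ratio of the two widths, which is what relieves the long-sequence memory bottleneck.
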
What problem does Mixture-of-Experts (MoE) solve in DeepSeek?
MoE replaces the standard feed-forward network with expert sub-networks. Tokens are routed to specialized experts, increasing effective model capacity and scaling efficiency without proportionally increasing compute for every token.

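The routing idea fits in a few lines. The toy layer below sends each token to its top-2 experts and mixes their outputs with the router weights; the sizes, gating, and plain top-k routing are illustrative simplifications (DeepSeek's MoE additionally uses shared experts and load-balancing strategies):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Minimal top-2 token-routing sketch, not DeepSeek's production MoE."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick top-k experts
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)   # torch.Size([10, 64])
```

Each token activates only its selected experts, so total parameters grow with the number of experts while per-token compute stays close to that of a single feed-forward network.
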
What is Multi-Token Prediction (MTP) and why does it matter?
MTP trains the model to predict multiple future tokens at once. This improves learning efficiency and can speed up both training and inference compared to next-token-only objectives.

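As a rough illustration of the objective, the sketch below adds a second prediction head so each position is also trained to predict the token two steps ahead. The heads, shapes, and loss weighting are assumptions for clarity; DeepSeek's MTP uses dedicated sequential prediction modules rather than independent heads:

```python
import torch
import torch.nn.functional as F

vocab, d_model, seq = 100, 32, 16
hidden = torch.randn(seq, d_model)            # stand-in for transformer outputs
targets = torch.randint(0, vocab, (seq,))     # token ids of the sequence

head_1 = torch.nn.Linear(d_model, vocab)      # predicts the token at t+1
head_2 = torch.nn.Linear(d_model, vocab)      # predicts the token at t+2

# Standard next-token objective: position t predicts token t+1.
loss_next = F.cross_entropy(head_1(hidden[:-1]), targets[1:])

# Extra MTP term: position t also predicts token t+2.
loss_plus2 = F.cross_entropy(head_2(hidden[:-2]), targets[2:])

loss = loss_next + 0.5 * loss_plus2           # illustrative weighting
print(loss.item())
```

The extra targets give the model more learning signal per sequence, and at inference time the additional predictions can seed speculative decoding.
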
What does FP8 quantization achieve in DeepSeek?
FP8 is an 8-bit floating-point format used to compress weights and activations, improving computational efficiency and reducing memory and bandwidth demands while preserving capability.

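A minimal per-tensor quantize/dequantize round-trip gives a feel for the format. This sketch assumes PyTorch 2.1 or newer for the torch.float8_e4m3fn dtype, and it deliberately skips the scaled matmul kernels and finer-grained scaling that real FP8 training relies on:

```python
import torch

w = torch.randn(256, 256)                     # a weight matrix in FP32

fp8_max = 448.0                               # largest finite E4M3 value
scale = w.abs().max() / fp8_max               # per-tensor scale factor
w_fp8 = (w / scale).to(torch.float8_e4m3fn)   # 1 byte per element
w_restored = w_fp8.to(torch.float32) * scale  # dequantize to inspect the error

print("bytes per element:", w_fp8.element_size())            # 1
print("max abs error:", (w - w_restored).abs().max().item())
```

Halving storage relative to FP16 (and quartering it relative to FP32) cuts memory and bandwidth, while the scale factor keeps values inside the narrow FP8 range.
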
How is DeepSeek trained efficiently, and what is DualPipe?
DeepSeek uses an optimized pipeline that overlaps tasks to minimize GPU idle time. DualPipe overlaps forward passes of new batches with backward passes of previous ones, keeping hardware utilization high.

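A toy cost model (not the real DualPipe schedule) shows why overlapping helps. Assume each microbatch needs f time units of forward work and b units of backward work; if every forward after the first can be hidden behind the previous microbatch's backward phase, the exposed time shrinks:

```python
# Toy cost model, not the actual DualPipe algorithm: compare a strictly
# sequential schedule with one where each forward (after the first) overlaps
# the previous microbatch's backward pass.
n, f, b = 8, 1.0, 2.0                       # microbatches, forward/backward cost

sequential = n * (f + b)                    # run F then B for each microbatch
overlapped = f + n * b                      # only the first forward is exposed
print(f"sequential: {sequential:.0f} units, overlapped: {overlapped:.0f} units")
```
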
What post-training steps produced DeepSeek-R1 from DeepSeek-V3?
A five-step process: (1) start from a lightly fine-tuned base, (2) pure RL to discover reasoning patterns, (3) rejection sampling for self-labeled data, (4) blend synthetic and supervised data, and (5) a final broad RL phase to boost robustness and generalization.

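Step (3) is the easiest to picture in code. The toy sketch below samples several candidate answers per prompt from a stand-in generator and keeps only those an automatic checker accepts; the prompts, generator, and verifier are hypothetical placeholders, not DeepSeek's actual pipeline:

```python
import random

def fake_model(prompt, k=4):
    """Pretend to sample k candidate answers for an arithmetic prompt."""
    return [str(random.randint(0, 15)) for _ in range(k)]

def is_correct(prompt, answer):
    """Verifier: accept a candidate only if it matches the ground-truth sum."""
    return int(answer) == eval(prompt)        # fine for these toy prompts

prompts = ["2+5", "3+4", "1+6"]
accepted = [(p, a) for p in prompts for a in fake_model(p) if is_correct(p, a)]
print(f"kept {len(accepted)} of {len(prompts) * 4} samples for fine-tuning")
```

The accepted pairs become self-labeled training data for the subsequent fine-tuning stages.
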
What is knowledge distillation in DeepSeek, and which model sizes were released?
Distillation transfers capabilities from a large "teacher" (DeepSeek-R1) to smaller "student" models. DeepSeek released efficient checkpoints around 1.5B, 7B, 8B, 14B, 32B, and 70B parameters based on the Qwen2.5 and Llama3 series.

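For intuition, here is a minimal classic distillation loss in which the student matches the teacher's softened output distribution. The shapes and temperature are illustrative assumptions, and note that DeepSeek's released distilled checkpoints were produced by fine-tuning students on R1-generated data rather than by this logit-matching form:

```python
import torch
import torch.nn.functional as F

vocab, batch = 100, 8
teacher_logits = torch.randn(batch, vocab)          # stand-in for the teacher
student_logits = torch.randn(batch, vocab, requires_grad=True)
T = 2.0                                             # softening temperature

# KL divergence between softened student and teacher distributions.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
kd_loss.backward()
print(kd_loss.item())
```
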
What is the book’s scope, structure, and prerequisites?
The book progresses through four stages: foundational inference (KV cache), core architecture (MLA, MoE), advanced training (MTP, FP8, DualPipe), and post-training (SFT, RL, distillation). You’ll need Python, basic machine learning and backpropagation, and some familiarity with PyTorch and transformers. Most examples run on CPU (slowly); a single 8–12 GB GPU is recommended, with 24–48 GB helpful for larger MoE experiments. Configs for Colab are provided. The book does not replicate proprietary data, massive-scale distributed training, or production deployment.
