Overview

2 Tokenizers: How Large Language Models See The World

Large language models cannot work directly with raw text; they rely on tokenization to convert sentences into numeric sequences the model can process. Tokens are typically sub-words that balance the granularity of letters with the semantic coherence of words, allowing models to generalize to unfamiliar terms while keeping vocabularies manageable. Because models “see” only token IDs—without inherent relationships like capitalization or morphology—tokenization becomes the core feature engineering step for LLMs, shaping both model size and the kinds of patterns it can learn.

The tokenization pipeline involves four steps: receiving text, normalizing it, segmenting it into tokens, and mapping each token to a unique identifier while building the vocabulary. Normalization (e.g., lowercasing, punctuation handling) can shrink vocabularies and reduce ambiguity from typos, but it may also erase meaning (such as proper nouns) and reduce a model’s ability to correct errors. Modern systems therefore use minimal normalization and data-driven segmentation, most commonly via Byte-Pair Encoding, which learns frequent sub-word units to optimize coverage and efficiency. Vocabulary design involves trade-offs among accuracy, speed, and memory, along with handling special tokens and guarding against pitfalls such as inconsistent tokenization, homoglyphs, and other encoding quirks that can create security and reliability issues.
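
The four steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production tokenizer works: the function names (`normalize`, `segment`, `build_vocab`, `encode`), the `<unk>` placeholder, the lowercase-and-strip-punctuation normalizer, and the whitespace segmenter are all hypothetical choices made for brevity.

```python
import string

def normalize(text: str) -> str:
    """Toy normalizer: lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def segment(text: str) -> list[str]:
    """Toy segmenter: split on whitespace."""
    return text.split()

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Give each distinct token the next free integer ID, in order of first appearance."""
    vocab: dict[str, int] = {"<unk>": 0}  # ID 0 is reserved for unseen tokens
    for text in corpus:
        for token in segment(normalize(text)):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Steps 1-4: receive text, normalize, segment, map tokens to IDs."""
    return [vocab.get(token, vocab["<unk>"]) for token in segment(normalize(text))]

vocab = build_vocab(["The cat sat.", "The dog sat!"])
ids = encode("The bird sat?", vocab)
print(vocab)  # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(ids)    # [1, 0, 3]  ("bird" was never seen, so it maps to <unk>)
```

Note how the normalizer makes "sat.", "sat!", and "sat?" share one ID, which is exactly the vocabulary-shrinking effect described above, at the cost of discarding the punctuation.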

Tokenization choices have direct consequences for model capabilities and equity. Tasks that depend on exact character sequences—word games, rare or misspelled drug names, poetry constraints—are challenging when models lack letter-level visibility. Math improves when digits are tokenized individually, illustrating how tailored token design can boost reasoning performance. Across languages, tokenization efficiency varies, affecting latency and cost: languages underrepresented in tokenizer training often require more tokens for the same content, increasing usage and potentially amplifying economic inequities. In practice, tokenization determines what an LLM can represent, how well it learns, and how fairly and efficiently it operates.

To understand text, LLMs must break text into tokens. Each unique token has a numeric identifier associated with it.
figure
Generically, tokenization involves processing input to produce numeric identifiers for tokens.
figure
The normalization process commonly involves changing text to remove upper-case characters and punctuation.
figure
The segmentation process breaks normalized text into words or tokens so that each can be processed independently.
figure
A simplified Byte-Pair Encoding algorithm for creating tokens: First, find the most frequent pair of characters, “ng”. Next, replace all instances of “ng” with a placeholder token “T” and add “ng” to the vocabulary. Repeat the process until no common byte pairs remain.
figure
Tokenizing two different sentences.
figure
The tokenization approach means that GPT cannot really “see” single characters or word lengths. If you ask questions that require identifying individual characters and vary them in an unusual way, GPT starts to fail. The correct middle character is “a”, but GPT insists that it is the letter “e”. What GPT sees is three tokens, representing P, ine, and apple, respectively.
figure
A comparison of how two LLMs learn to perform arithmetic computations over time. Time is shown on the x-axis. The upper curve is a typical BPE tokenizer, while the lower curve is the same tokenizer modified to use tokens that represent individual digits. The y-axis describes the ability of the LLM to perform accurately, where a smaller number means fewer errors. The bottom line is that LLMs that use digit-level tokenization can learn how to do math better and faster.
figure

Summary

  • Tokenization is the fundamental process that Large Language Models use to understand text by converting sentences into tokens.
  • Tokens are the smallest units of information in text that represent content. Sometimes, they correspond to full words, but often, they represent pieces of words or sub-words.
  • Tokenization involves normalizing text into a standard representation, which may involve converting characters to lowercase or translating the byte encoding of Unicode characters so that visibly identical characters employ the same encoding.
  • Tokenization also involves segmentation, which is breaking up text into words or sub-words. Algorithms like Byte-Pair Encoding (BPE) provide a mechanism to automatically learn how to efficiently segment text based on the statistical occurrence of combinations of letters in a training data set.
  • The result of building a tokenizer is known as a vocabulary, which is the unique collection of word and sub-word tokens that a tokenizer can use to represent text that it has processed.
  • The size of a tokenizer’s vocabulary affects the LLM’s ability to accurately represent data and the storage and computational resources required to understand and predict text.
  • Inside the LLM, tokens are represented as numbers. As a result, the model has no inherent notion of relationships between tokens, such as shared prefixes and suffixes, or the fact that two tokens share a similar set of letters.
  • To support specific domains of knowledge, tokenizers trained automatically may be augmented to provide tokens that are important to their application.
  • Tokenizers that do not expose individual letters or digits leave the LLM struggling with arithmetic operations and simple word games.

FAQ

Why do large language models tokenize text into numbers? Neural networks operate on numbers, not raw text. Tokenization converts text into a sequence of discrete tokens, each mapped to a unique numeric identifier, so models can learn statistical relationships between them. Tokens are the basic units LLMs “see” and manipulate during training and inference.
What are sub-word tokens, and why not use only letters or whole words? Sub-word tokens sit between letters and words, capturing meaningful chunks like prefixes, roots, and suffixes (for example, “schoolhouse” → “school”, “house”; “thoughtful” → “thought”, “ful”). This balances efficiency and flexibility: frequent words can be single tokens, while rare or new words can be composed from known parts. It mirrors how humans often parse unfamiliar terms by breaking them into recognizable components.
What are the main steps of the tokenization process? The pipeline typically has four steps: (1) receive a text string, (2) normalize it (optional transformations like lowercasing or removing punctuation), (3) segment it into tokens, and (4) map tokens to unique numeric IDs. The first and last steps are straightforward necessities; the key design choices live in normalization and segmentation. Those choices shape vocabulary size, model capability, and robustness.
How does vocabulary size affect performance, capability, and deployability? A larger vocabulary captures more nuance and reduces ambiguity, often improving capability. However, it increases storage, memory use, and compute cost, potentially slowing the model and complicating deployment. Building vocabularies is a trade-off between expressiveness and resource constraints; even publicly available models can require many gigabytes just to store token vocabularies.
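
As a rough back-of-the-envelope illustration of the storage side of this trade-off, the sketch below assumes the model keeps one embedding vector per vocabulary entry, so the table grows linearly with vocabulary size. The function name `embedding_bytes` and the sizes used (50,000 or 100,000 tokens, 4,096 dimensions, 2 bytes per value) are hypothetical and not taken from any particular model.

```python
def embedding_bytes(vocab_size: int, embedding_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed for a table holding one vector per token (assumed layout)."""
    return vocab_size * embedding_dim * bytes_per_value

small = embedding_bytes(50_000, 4_096)   # hypothetical 50k-token vocabulary
large = embedding_bytes(100_000, 4_096)  # doubling the vocabulary doubles the table

print(f"{small / 1e6:.1f} MB vs {large / 1e6:.1f} MB")  # 409.6 MB vs 819.2 MB
```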
What is normalization, and when does lowercasing or stripping punctuation help or hurt? Normalization reduces variability by transforming text (for example, lowercasing, removing punctuation), shrinking the vocabulary and simplifying learning. This can unify typos and casing variants, but it can also erase meaningful distinctions (for example, “Bill” the name versus “bill” the invoice). Modern LLMs often keep case and punctuation to preserve nuance and enable capabilities like typo correction, accepting a larger vocabulary as a trade-off.
Why isn’t whitespace-based segmentation enough? Splitting on spaces is simple but leads to issues: punctuation-bound tokens like “hello,” differ from “hello”, inflating the vocabulary and adding ambiguity. Handwritten rules struggle with edge cases and with languages that lack spaces (for example, Chinese). Data-driven sub-word methods better capture consistent, reusable token pieces across languages and contexts.
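
A quick way to see both problems is to split a few strings on whitespace with Python's built-in `str.split`. The example sentences are made up for illustration.

```python
sentences = ["hello, world", "hello world", "Hello world."]

# Collect every distinct whitespace-delimited piece.
vocab = set()
for sentence in sentences:
    vocab.update(sentence.split())

# Two underlying words yield five vocabulary entries because
# punctuation and casing stick to the tokens.
print(sorted(vocab))  # ['Hello', 'hello', 'hello,', 'world', 'world.']

# A sentence written without spaces comes back as one giant "word".
chinese_tokens = "我喜欢猫".split()
print(len(chinese_tokens))  # 1
```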
How does Byte-Pair Encoding (BPE) build token vocabularies? BPE starts from characters and repeatedly merges the most frequent adjacent pairs into new tokens, creating common sub-words efficiently. It typically uses minimal normalization and a learned segmenter; tokenizer authors then often add domain-specific and special tokens (for unknowns, system prompts, or multimodal boundaries). Alternatives like WordPiece and SentencePiece exist with different counting and whitespace handling, but BPE is widely used.
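
The merge loop can be sketched as follows. This is a simplified, character-level version written for illustration: real implementations usually operate on bytes, handle word boundaries, and are heavily optimized. The helper names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the four-word corpus are assumptions, not part of any library.

```python
from collections import Counter

def most_frequent_pair(words: Counter):
    """Count adjacent symbol pairs across all words; return (pair, count) or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus: list[str], num_merges: int):
    """Learn up to `num_merges` merges; stop early once no pair occurs twice."""
    words = Counter(tuple(word) for word in corpus)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(words)
        if best is None or best[1] < 2:
            break  # no common pairs remain
        pair = best[0]
        merges.append(pair)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, pair)] += freq
        words = new_words
    return merges, words

merges, words = train_bpe(["sing", "ring", "long", "song"], num_merges=1)
print(merges)       # [('n', 'g')]
print(list(words))  # [('s', 'i', 'ng'), ('r', 'i', 'ng'), ('l', 'o', 'ng'), ('s', 'o', 'ng')]
```

In this toy corpus the pair "n" + "g" appears four times, so it is merged first; each later merge would then build on the new "ng" symbol.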
Can a longer string use fewer tokens? Why does tokenizer choice matter? Yes. If a longer string matches a frequent sub-word (for example, “running”), it may map to a single token, whereas a shorter variant (“runnin”) might split into multiple tokens. Different tokenizer versions can segment the same text differently, affecting costs, model behavior, and evaluation consistency, all of which are important considerations for testing and production.
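
The effect can be reproduced with a toy encoder. The sketch below uses greedy longest-match lookup, which is the strategy WordPiece-style tokenizers use at encoding time; a BPE tokenizer would instead replay its learned merges, but the outcome for this example is the same in spirit. The function `greedy_tokenize` and the four-entry vocabulary are hypothetical.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """At each position, take the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

toy_vocab = {"running", "run", "n", "in"}

print(greedy_tokenize("running", toy_vocab))  # ['running']
print(greedy_tokenize("runnin", toy_vocab))   # ['run', 'n', 'in']
```

The seven-letter string costs one token while the six-letter variant costs three, simply because only the former exists as a whole entry in the vocabulary.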
What are homoglyphs, and why are they risky? Homoglyphs are visually identical characters with different byte encodings (for example, Latin “H” vs. Cyrillic “Н”); invisible characters such as the zero-width space raise similar problems. Tokenizers treat them as different symbols, which can inflate token counts, alter meaning, or enable adversarial inputs. Many systems mitigate this with normalization that removes or canonicalizes suspicious characters before tokenization.
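
One possible mitigation can be sketched with Python's standard `unicodedata` module. The function name `canonicalize` and the specific policy (NFKC normalization followed by dropping format characters) are assumptions for illustration, not a recommendation from the source. Note the limitation shown in the last line: Unicode normalization does not unify look-alike letters from different scripts, so cross-script homoglyphs need a separate confusables mapping.

```python
import unicodedata

def canonicalize(text: str) -> str:
    """Hypothetical pre-tokenization cleanup for encoding quirks."""
    # NFKC folds compatibility variants (such as fullwidth letters) and
    # composes base letter + combining accent into a single code point.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters (category "Cf"), e.g. U+200B zero-width space.
    # Caveat: this also removes joiners that some scripts and emoji rely on.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(canonicalize("pass\u200bword") == "password")  # True: zero-width space removed
print(canonicalize("\uff28ello") == "Hello")         # True: fullwidth H folded to Latin H
print(canonicalize("e\u0301") == "\u00e9")           # True: e + combining accent composed
print(canonicalize("\u041dello") == "Hello")         # False: Cyrillic Н is left unchanged
```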
How does tokenization shape strengths and weaknesses in word games, rare terms, and math? Because models operate on sub-words, they don’t directly track letters or exact word lengths, making letter-by-letter puzzles and rhyme schemes challenging. Rare or misspelled domain terms can tokenize inconsistently (for example, drug names), increasing confusion risk; adding domain tokens or improving input hygiene can help. For math, digit-level tokens often improve arithmetic by letting models reason over individual digits; some systems normalize numbers (for example, spacing digits) to encourage this behavior.
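
The digit-spacing idea can be sketched with a single regular expression. This is one hypothetical way to do it; the function name `space_digits` is made up, and whether spacing actually yields one token per digit depends on the tokenizer in use.

```python
import re

def space_digits(text: str) -> str:
    """Insert a space between every pair of adjacent digits."""
    # The pattern matches the empty position that has a digit on both sides.
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(space_digits("12345 + 678"))  # 1 2 3 4 5 + 6 7 8
print(space_digits("pi is 3.14"))   # pi is 3.1 4
```

Only digit-to-digit boundaries are touched, so the decimal point and surrounding words pass through unchanged.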
