Module 02 · Section 2.1

Why Tokenization Matters

The first and most consequential decision in any language model pipeline

"Tokenization is easy," I said, right before the Thai sentence without spaces reduced me to individual Unicode code points.

Naive Splitter, a whitespace-trusting tokenizer

Introduction: The Invisible Gateway

In Module 01, you learned how to represent words as vectors using techniques from Bag-of-Words to Word2Vec. But all those methods assumed that the "words" were already given to you. How does a model decide where one word ends and the next begins? How does it handle misspellings, compound words, or languages that do not use spaces? That is the problem of tokenization.

When you type a prompt into ChatGPT, Claude, or any other language model, your text does not enter the model as characters or words. Instead, it passes through a tokenizer, a preprocessing step that chops your input into discrete units called tokens. These tokens are the atoms of the model's universe: every parameter, every computation, and every output is defined in terms of them. Yet tokenization is often treated as a footnote, a plumbing detail that receives far less attention than attention heads or loss functions.

This section argues that tokenization deserves center stage. The way you split text into tokens determines how large your vocabulary is, how long your sequences become, how much each API call costs, and what kinds of errors the model makes. A poor tokenization scheme can cripple an otherwise excellent model; a thoughtful one can quietly improve everything from multilingual performance to arithmetic reasoning.

Big Picture

Think of tokenization as choosing the alphabet for your model's language. If your alphabet has too few symbols, you need long strings to express simple ideas. If it has too many, you waste memory storing symbols you rarely use. Every modern LLM navigates this tradeoff, and the choices have real consequences for users.

The Vocabulary Size Tradeoff

At one extreme, you could tokenize text one character at a time. English has roughly 100 printable characters, so your vocabulary would be tiny and your embedding table would fit on a smart watch. But the sequence "machine learning" would become 16 tokens, forcing the model to spend precious context window space and computation just to reconstruct familiar words.

At the other extreme, you could give every word in the language its own token. English has hundreds of thousands of distinct word forms (including conjugations, pluralizations, and compounds), so your embedding table would balloon to gigabytes. Worse, any word not in your vocabulary (a typo, a new brand name, a word from another language) would be unrepresentable.

Modern tokenizers live between these extremes by using subword units. Common words like "the" and "machine" get their own tokens, while rarer words are broken into recognizable pieces: "tokenization" might become ["token", "ization"], and "unhelpfulness" might become ["un", "help", "ful", "ness"]. This strategy keeps the vocabulary manageable (typically 32,000 to 128,000 tokens) while ensuring that any string can be encoded.
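To make the fallback idea concrete, here is a minimal greedy longest-match splitter over a toy vocabulary. This is an illustrative sketch of the subword principle, not BPE itself; the contents of `TOY_VOCAB` and the single-character fallback are assumptions chosen for demonstration, while real tokenizers learn their vocabularies from data.

```python
# A minimal greedy longest-match subword splitter (illustrative sketch,
# not BPE). The toy vocabulary below is an assumption for demonstration.

TOY_VOCAB = {
    "token", "ization", "un", "help", "ful", "ness", "the", "machine",
}

def subword_tokenize(word: str) -> list[str]:
    """Repeatedly take the longest vocab entry matching at the current
    position, falling back to single characters when nothing matches."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocab entry matches: emit one character, so any
            # string is always encodable.
            pieces.append(word[i])
            i += 1
    return pieces

print(subword_tokenize("tokenization"))   # ['token', 'ization']
print(subword_tokenize("unhelpfulness"))  # ['un', 'help', 'ful', 'ness']
print(subword_tokenize("xyz"))            # ['x', 'y', 'z']
```

The character fallback is what guarantees complete coverage: even a string with no recognizable pieces degrades gracefully instead of becoming an unknown token.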

The Core Equation

The fundamental relationship is simple:

Larger vocabulary ⇒ fewer tokens per text ⇒ shorter sequences ⇒ more text fits in context window
Smaller vocabulary ⇒ more tokens per text ⇒ longer sequences ⇒ less text fits in context window

But vocabulary size also affects model parameters. Every token in the vocabulary needs an embedding vector (typically 4,096 to 12,288 dimensions in modern LLMs). A vocabulary of 128,000 tokens with 4,096-dimensional embeddings consumes about 2 GB of parameters just for the embedding and output layers. That is not free.
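The 2 GB figure can be verified with quick arithmetic. The sketch below assumes fp16 storage (2 bytes per parameter) and untied input/output embedding matrices; tied embeddings would halve the total.

```python
# Back-of-envelope check of the embedding memory claim, assuming fp16
# and separate (untied) input and output embedding matrices.

vocab_size = 128_000
embed_dim = 4_096
bytes_per_param = 2  # fp16

params_per_table = vocab_size * embed_dim   # one embedding matrix
total_params = 2 * params_per_table         # input embeddings + output layer
total_gb = total_params * bytes_per_param / 1e9

print(f"Parameters per table: {params_per_table:,}")  # 524,288,000
print(f"Total (in + out):     {total_params:,}")      # 1,048,576,000
print(f"Memory at fp16:       {total_gb:.1f} GB")     # 2.1 GB
```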

The Vocabulary Size Spectrum

                    Character-Level          Subword (BPE/WordPiece)       Whole-Word
Vocabulary size     ~256                     32K to 128K                   200K+
Sequence length     Very long                Moderate                      Short
Embedding table     Tiny                     Medium                        Huge
OOV words           None                     None                          Many
Example             "hello" = 5 tokens       "hello" = 1 token             "hello" = 1 token;
                    ['h','e','l','l','o']                                  "xyzabc" = [UNK]
Verdict             Slow; word boundaries    Sweet spot: best balance      Wastes parameters;
                    are hard to learn        of efficiency and coverage    cannot handle new words
Figure 2.1: The vocabulary size spectrum. Subword tokenization occupies the sweet spot between character and word tokenization.

Seeing the Tradeoff in Numbers

Let us make the tradeoff concrete with a quick Python experiment. We will compare how many tokens different granularities produce for the same English sentence.

# Comparing tokenization granularities
text = "Tokenization determines the model's vocabulary and sequence length."

# Character-level
char_tokens = list(text)
print(f"Character tokens: {len(char_tokens)} tokens")
print(f"  Sample: {char_tokens[:10]}...")

# Whitespace word-level
word_tokens = text.split()
print(f"\nWord tokens: {len(word_tokens)} tokens")
print(f"  Tokens: {word_tokens}")

# Subword-level (using tiktoken, GPT-4's tokenizer)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
subword_tokens = enc.encode(text)
print(f"\nSubword tokens (GPT-4): {len(subword_tokens)} tokens")
print(f"  Decoded: {[enc.decode([t]) for t in subword_tokens]}")
Character tokens: 67 tokens
  Sample: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i']...

Word tokens: 8 tokens
  Tokens: ['Tokenization', 'determines', 'the', "model's", 'vocabulary', 'and', 'sequence', 'length.']

Subword tokens (GPT-4): 11 tokens
  Decoded: ['Token', 'ization', ' determines', ' the', ' model', "'s", ' vocabulary', ' and', ' sequence', ' length', '.']

Notice that the subword tokenizer produces 11 tokens, compared to 67 for characters and 8 for words. The subword approach is nearly as compact as word-level, yet it treats the possessive "'s" and the suffix "ization" as separate reusable pieces. It can also handle any misspelling or novel word by falling back to smaller subword fragments.

Context Window and Cost Impact

Modern LLMs have a fixed context window measured in tokens: 2,048 tokens for the original GPT-3, 4,096 for GPT-3.5, 128,000 for GPT-4 Turbo, and up to 1,000,000 for Gemini 1.5 Pro. The tokenizer determines how much raw text fits into that window. A tokenizer that is inefficient (uses too many tokens per word) effectively shrinks the model's context window from the user's perspective.
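A rough way to reason about this is to translate a context window into "words that fit." In the sketch below, the tokens-per-word ratios are illustrative assumptions consistent with the measurements made later in this section.

```python
# Rough effective context size in words for tokenizers of varying
# efficiency. The tokens-per-word ratios are illustrative assumptions.

def effective_words(context_tokens: int, tokens_per_word: float) -> int:
    """How many whitespace-delimited words roughly fit in the window."""
    return int(context_tokens / tokens_per_word)

window = 128_000  # e.g. GPT-4 Turbo
for label, ratio in [("English", 1.2), ("Mixed", 2.0), ("CJK-heavy", 3.5)]:
    print(f"{label:10s}: ~{effective_words(window, ratio):,} words fit")
```

The same 128K window holds roughly three times as much English prose as heavily fragmented text, which is exactly the "effective shrinkage" described above.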

The Token Tax on Different Languages

This becomes especially important for non-English languages. Most popular tokenizers were trained primarily on English text, so English words tend to get their own tokens while words in other languages are split into many small pieces. The same semantic content in Japanese, Hindi, or Thai might consume 2x to 5x as many tokens as the English equivalent. This means non-English users get a smaller effective context window and pay more per API call for the same amount of meaning.

# Demonstrating the "token tax" across languages
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

texts = {
    "English":  "Artificial intelligence is transforming the world.",
    "Spanish":  "La inteligencia artificial está transformando el mundo.",
    "Japanese": "人工知能は世界を変えつつある。",
    "Hindi":    "कृत्रिम बुद्धिमत्ता दुनिया को बदल रही है।",
}

for lang, text in texts.items():
    tokens = enc.encode(text)
    # Note: split() undercounts "words" for languages written without
    # spaces (like Japanese), which inflates the tokens/word ratio there.
    ratio = len(tokens) / len(text.split())
    print(f"{lang:10s}: {len(tokens):3d} tokens, "
          f"{len(text.split()):2d} words, "
          f"ratio = {ratio:.1f} tokens/word")
English   :   8 tokens,  7 words, ratio = 1.1 tokens/word
Spanish   :  11 tokens,  7 words, ratio = 1.6 tokens/word
Japanese  :  14 tokens,  1 words, ratio = 14.0 tokens/word
Hindi     :  28 tokens,  7 words, ratio = 4.0 tokens/word
Warning: The Multilingual Token Tax

A user writing in Hindi effectively pays 3 to 4 times more per API call than an English user expressing the same idea. This is not a flaw in the model architecture; it is a direct consequence of tokenizer training data being skewed toward English. Newer models (Llama 3, GPT-4o) are addressing this by training tokenizers on more balanced multilingual corpora, but the gap has not been fully closed.

Cost Arithmetic

API providers charge per token. As of 2024, GPT-4 Turbo charges roughly $10 per million input tokens. If your application processes 10 million words per day, the choice of tokenizer directly affects your monthly bill:

Tokenizer Efficiency          Tokens per Word   Tokens / Day   Monthly Cost (approx.)
Efficient (English text)      1.2               12M            $3,600
Average (mixed languages)     2.0               20M            $6,000
Inefficient (CJK-heavy)       3.5               35M            $10,500

The difference between 1.2 and 3.5 tokens per word is nearly a 3x cost multiplier. Understanding your tokenizer's behavior on your specific data is not an academic exercise; it has direct financial implications.
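The table's arithmetic is easy to reproduce. In the sketch below, the $10 per million tokens rate and the 30-day month are assumptions carried over from the text; substitute your provider's actual pricing.

```python
# Monthly-cost sketch behind the table above. The default price and
# 30-day month are assumptions; plug in your provider's real rates.

def monthly_cost(words_per_day: float, tokens_per_word: float,
                 usd_per_million_tokens: float = 10.0,
                 days: int = 30) -> float:
    tokens_per_day = words_per_day * tokens_per_word
    return tokens_per_day * days * usd_per_million_tokens / 1e6

for label, ratio in [("Efficient", 1.2), ("Average", 2.0), ("Inefficient", 3.5)]:
    cost = monthly_cost(10e6, ratio)
    print(f"{label:12s} ({ratio} tokens/word): ${cost:,.0f}/month")
```

Running the same function over a sample of your real traffic, encoded with your production tokenizer, gives a far better estimate than any per-word rule of thumb.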

Same Content, Different Token Counts: "Hello, how are you?" in four languages, within a 20-token context window.

English :  5 tokens used, 75% of the context remaining
Spanish :  7 tokens used, 65% remaining
Japanese: 14 tokens used, 30% remaining
Hindi   : 18 tokens used, 10% remaining

(Token counts are illustrative; exact values vary by tokenizer version.)
Figure 2.2: The same greeting consumes vastly different amounts of the context window depending on language, due to tokenizer efficiency differences.

Tokenization Artifacts and Their Downstream Effects

Tokenization is not a lossless compression of text. The boundaries where the tokenizer decides to split (or not split) create artifacts that propagate through the model's behavior. Some of these artifacts are subtle; others cause spectacular failures.

Artifact 1: Inconsistent Splitting

The same word can be tokenized differently depending on context. Leading spaces, capitalization, and surrounding punctuation all affect how a subword tokenizer segments text. Consider how GPT-4's tokenizer handles the word "token" in different contexts:

# Demonstrating context-sensitive tokenization
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

examples = [
    "token",           # bare word
    " token",          # with leading space
    "Token",           # capitalized
    "TOKEN",           # all caps
    "tokenization",    # as part of longer word
    " tokenization",   # with leading space, longer word
]

for ex in examples:
    ids = enc.encode(ex)
    pieces = [enc.decode([i]) for i in ids]
    print(f"  {repr(ex):25s} => {pieces}")
  'token'                   => ['token']
  ' token'                  => [' token']
  'Token'                   => ['Token']
  'TOKEN'                   => ['TOKEN']
  'tokenization'            => ['token', 'ization']
  ' tokenization'           => [' token', 'ization']

Notice that "token" and " token" (with a leading space) are entirely different tokens in the vocabulary. This is by design: leading spaces are attached to the following word so that the tokenizer can reconstruct the original text faithfully. But it means the model sees different input IDs for what a human would consider the same word. The model must learn that these represent the same concept, which requires extra training data and capacity.

Artifact 2: Arithmetic Failures

One of the most widely discussed tokenization artifacts is the difficulty LLMs have with arithmetic. Numbers are tokenized inconsistently: "380" might be a single token, "381" might be split into ["38", "1"], and "3810" might become ["38", "10"]. The model has no built-in notion that these tokens represent digits in a positional number system. It must learn addition, subtraction, and other operations from patterns in the training data, and the inconsistent tokenization makes this much harder.

Note: Why LLMs Struggle with Math

When a model sees "What is 1234 + 5678?", the tokenizer might produce ["12", "34", " +", " ", "56", "78"]. The model does not see individual digits aligned in columns the way a human would when doing manual addition. It must learn to parse multi-digit numbers from arbitrary token boundaries, align them mentally, and compute carries. This is one reason why tool-use (calling a calculator) is so important for production LLM systems.
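To see how merged digit tokens destroy column alignment, consider the toy chunker below. It is our own construction, not a real tokenizer: real BPE merges depend on corpus frequency, so actual splits vary by tokenizer. But greedy left-to-right grouping captures the essential problem.

```python
# Toy digit chunking (our construction, not a real tokenizer): greedily
# group digits left-to-right into pieces of up to two characters, the way
# a BPE vocabulary with common two-digit merges might.

def chunk_digits(number: str, max_len: int = 2) -> list[str]:
    """Split a digit string left-to-right into pieces of up to max_len."""
    return [number[i:i + max_len] for i in range(0, len(number), max_len)]

for n in ["1234", "123", "381", "3810"]:
    print(f"{n:>5s} -> {chunk_digits(n)}")
# "1234" and "123" both start with the piece "12", even though the
# leading "1" has a different place value in each: the pieces carry
# no positional information at all.
```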

Artifact 3: The "Trailing Space" Problem

Because many tokenizers attach leading whitespace to tokens, the model treats " Hello" and "Hello" as fundamentally different inputs. This can cause unexpected behavior when building prompts programmatically. If you accidentally include or omit a space before a key word, the model may interpret it differently. This is especially tricky in few-shot prompting, where consistent formatting is critical.
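One defensive pattern is to decide, once, which side of the boundary owns the space and enforce it when assembling prompts. The helper below is a hypothetical sketch (`build_prompt` and the Q:/A: format are our assumptions, not a library API); the point is the consistent spacing, not the specific format.

```python
# Hypothetical prompt-assembly helper illustrating trailing-space hygiene.
# The Q:/A: few-shot format is an assumption for demonstration.

def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    lines = []
    for question, answer in examples:
        # Exactly one space after "A:" in every shot, so each answer
        # begins with the same leading-space token shape.
        lines.append(f"Q: {question}\nA: {answer}")
    # End with "A:" and no trailing space: the model then generates the
    # " answer" (leading-space) token itself, matching the shots above.
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

prompt = build_prompt([("2+2?", "4"), ("3+3?", "6")], "4+4?")
print(prompt)
```

Ending the prompt with "A: " (trailing space included) would instead ask the model to continue a sequence it rarely saw during training, since the space normally belongs to the next token.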

Artifact 4: Tokenization of Code

Programming languages create unique challenges. Indentation is semantically meaningful in Python, yet a tokenizer may split indentation inconsistently. Four spaces might be one token in one context and two tokens in another. Variable names in camelCase or snake_case get split at different points. Modern tokenizers (like those used in code-focused models such as Codex or StarCoder) address this by including common indentation patterns and code-specific tokens in their vocabulary.

# How code gets tokenized (using tiktoken)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

code = """def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)"""

tokens = enc.encode(code)
pieces = [enc.decode([t]) for t in tokens]

print(f"Total tokens: {len(tokens)}")
print(f"Token pieces: {pieces}")
Total tokens: 28
Token pieces: ['def', ' fibonacci', '(n', '):\n', ' ', ' if', ' n', ' <=', ' ', '1', ':\n', ' ', ' return', ' n', '\n', ' ', ' return', ' fibonacci', '(n', '-', '1', ')', ' +', ' fibonacci', '(n', '-', '2', ')']

Notice how indentation, newlines, and even the function name get merged into multi-character tokens. The four-space indentation appears as a single token in some lines but might be split differently in others, depending on what follows.

How Tokenization Artifacts Cascade

Raw text "What is 128+256?" reaches the tokenizer, which emits ["What", " is", " 12", "8", "+", "256", "?"]; the model therefore sees "12" and "8" as separate concepts, and the output may be "384" (correct) or "374" (wrong).

Inconsistent splitting: 128 = ["12", "8"] while 256 = ["256"]; digits split unpredictably.
Lost structure: no column alignment and no carry-propagation cues; the model must learn arithmetic from data.
Unreliable results: correct on common sums, failures on unusual numbers; tool use mitigates this.
Figure 2.3: Tokenization artifacts propagate through the model pipeline, causing unexpected failures in downstream tasks like arithmetic.

Practical Implications for Builders

If you are building applications on top of LLMs, tokenization behavior should inform several design decisions: measure token counts on your actual data with the production tokenizer rather than estimating from word counts, budget costs per language if your users are multilingual, keep whitespace and formatting consistent in programmatically assembled prompts, and route arithmetic to external tools instead of trusting in-model computation.

Key Insight

Tokenization is the lens through which your model sees the world. Understanding that lens, including its distortions, is essential for building reliable AI applications. Every time a model behaves unexpectedly, ask yourself: how did the tokenizer represent this input?

Check Your Understanding

1. Why do modern LLMs use subword tokenization instead of word-level or character-level tokenization?
Reveal Answer
Subword tokenization balances the vocabulary size tradeoff. Word-level tokenization creates enormous vocabularies and cannot handle out-of-vocabulary words. Character-level tokenization produces extremely long sequences, wasting context window space and making it hard for the model to learn word-level patterns. Subword tokenization keeps common words as single tokens while breaking rare words into reusable pieces, achieving both compact sequences and complete coverage.
2. A model has a 4,096-token context window. You want to process a 3,000-word English document and a 3,000-word Japanese document. Will both fit?
Reveal Answer
Probably not both. The English document will likely produce roughly 3,600 to 4,200 tokens (about 1.2 to 1.4 tokens per word), which is already close to the limit. The Japanese document may produce 6,000 to 15,000 tokens depending on the tokenizer, because Japanese text typically has a much higher token-to-word ratio in tokenizers trained primarily on English. You would need to measure with the specific tokenizer and possibly truncate or summarize the documents.
3. Why do LLMs sometimes make arithmetic mistakes, and how does tokenization contribute to the problem?
Reveal Answer
LLMs make arithmetic mistakes partly because numbers are tokenized inconsistently. A number like "1234" might become tokens ["12", "34"] or ["1", "234"] or even a single token, depending on the number and tokenizer. The model never sees individual digits aligned in positional columns the way humans do when computing by hand. It must learn arithmetic from statistical patterns in training data, and the inconsistent digit boundaries make this task much harder.
4. You notice your multilingual chatbot costs 3x more when users write in Thai compared to English. What is the likely cause, and what could you do about it?
Reveal Answer
The likely cause is tokenizer fertility: Thai text gets split into many more tokens per semantic unit than English, because the tokenizer was trained primarily on English data. Possible mitigations include: (1) switching to a model with a more balanced multilingual tokenizer, (2) using a model with a larger vocabulary that includes more Thai-specific tokens, (3) translating Thai inputs to English before processing (though this adds latency and may lose nuance), or (4) using a different pricing tier or provider that is more cost-effective for multilingual workloads.
5. In the code example, why are " token" (with a leading space) and "token" (without) different tokens in GPT-4's vocabulary?
Reveal Answer
Most modern tokenizers attach leading whitespace to the following word so that the original text can be perfectly reconstructed from the token sequence. Without this convention, the tokenizer would lose information about where spaces appeared in the original text. The consequence is that " token" and "token" occupy different entries in the vocabulary, and the model must learn that they refer to the same concept. This is a necessary tradeoff for lossless round-trip encoding.

Key Takeaways