Module 02 · Section 2.3

Tokenization in Practice & Multilingual Considerations

Special tokens, chat templates, fertility analysis, multimodal tokenization, and cost estimation

My chat template puts the system prompt in just the right place. My therapist says I have the same issue with boundaries.

Template Tara, a multilingual special-token wrangler

Special Tokens

Beyond the subword vocabulary, every tokenizer includes a set of special tokens that serve structural purposes. These tokens are never produced by the subword algorithm itself; they are manually added to the vocabulary and carry specific meanings that the model learns during training. Understanding special tokens is essential for correctly formatting inputs and interpreting outputs.

Common Special Tokens

Token                  Typical Symbol                        Purpose
Beginning of Sequence  <s>, [CLS], <|begin_of_text|>         Marks the start of input; signals the model to begin processing
End of Sequence        </s>, [SEP], <|end_of_text|>          Marks the end of input or a boundary between segments
Padding                [PAD], <pad>                          Fills sequences to uniform length in batches; attention masks ignore these
Unknown                [UNK], <unk>                          Placeholder for tokens not in vocabulary (rare with subword tokenizers)
Mask                   [MASK]                                Used in masked language modeling (BERT-style); replaced during pretraining
Role markers           <|system|>, <|user|>, <|assistant|>   Delineate speaker roles in chat-format models

Note: Special Tokens Are Model-Specific

There is no universal standard for special token names or IDs. BERT uses [CLS] and [SEP]. Llama 2 uses <s> and </s>, while Llama 3 switched to <|begin_of_text|> and <|end_of_text|>. GPT-4 uses <|endoftext|>. When working with a new model, always check its tokenizer configuration to learn which special tokens it expects and what IDs they map to.
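To make the padding row of the table concrete, here is a minimal sketch of how a batch is padded and how the matching attention mask is built. The helper name pad_batch, the token IDs, and PAD_ID = 0 are assumptions for illustration only; the real padding ID is model-specific and comes from the tokenizer configuration.

```python
# Sketch: pad a batch of token-ID sequences and build attention masks.
# PAD_ID = 0 is an assumed value; look up the real ID for your model.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]  # made-up IDs
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Library tokenizers typically do this for you when asked to pad a batch; the sketch only shows what the padding token and the mask represent.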

Chat Templates

Modern LLMs that support conversation (ChatGPT, Claude, Llama Chat, Mistral Instruct) use a chat template that wraps user messages, system prompts, and assistant responses in a specific format using special tokens. The model was trained to expect this exact format, and deviating from it can degrade performance or cause unexpected behavior.

Example: ChatML Format

The ChatML format (used by some OpenAI models) wraps each message with role tags:

# ChatML template structure
template = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is tokenization?<|im_end|>
<|im_start|>assistant
"""

# The model generates its response here, ending with <|im_end|>
print(template)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is tokenization?<|im_end|>
<|im_start|>assistant

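The mechanics of the ChatML layout can be sketched in a few lines. The function render_chatml below is a hypothetical helper written for this illustration, not part of any library; it only reproduces the layout shown above so you can see where each role tag and newline goes. For real models, rely on the template that ships with the tokenizer rather than a hand-written renderer.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = []
    for msg in messages:
        # Each message: opening tag + role, newline, content, closing tag
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
]
prompt = render_chatml(messages)
print(prompt)
```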

Example: Llama 3 Chat Format

# Llama 3 chat template
template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is tokenization?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

Notice that the special tokens differ between models, and the exact placement of newlines matters. The Hugging Face transformers library provides an apply_chat_template() method that handles this formatting automatically:

# Using Hugging Face chat templates
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is tokenization?"},
]

formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,         # return string, not token IDs
    add_generation_prompt=True  # add the assistant header
)
print(formatted)
Warning: Always Use the Official Template

Manually constructing chat prompts by guessing the format is a common source of bugs. If the model expects <|im_start|> and you provide [INST], the model will treat your role markers as ordinary text rather than structural delimiters. Always use the tokenizer's built-in apply_chat_template() or consult the model's documentation.

[Figure: Anatomy of a Chat Template]
System block:    <|im_start|>system / You are a helpful assistant specialized in NLP. / <|im_end|>
User block:      <|im_start|>user / Explain the difference between BPE and WordPiece. / <|im_end|>
Assistant block: <|im_start|>assistant / [Model generates response here...] / <|im_end|>
Special tokens serve as structural delimiters. The model was trained to respect these boundaries.
Figure 2.7: A chat template uses special tokens to delineate system instructions, user messages, and assistant responses.

Multilingual Fertility Analysis

Fertility is the average number of tokens a tokenizer produces per word (or per character, or per semantic unit) in a given language. It directly measures how efficiently a tokenizer represents that language. A fertility of 1.0 means every word maps to a single token; higher values indicate less efficient encoding.
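As a rough sketch of the definition, fertility can be computed as token count divided by word count. The function names and the toy tokenizer below are assumptions made up for this example; toy_tokenize simply cuts words into fixed-size pieces and stands in for a real subword tokenizer.

```python
def fertility(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)

def toy_tokenize(text, max_piece_len=4):
    """Stand-in tokenizer: cut each word into pieces of at most 4 characters."""
    pieces = []
    for word in text.split():
        pieces.extend(
            word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)
        )
    return pieces

print(fertility("the cat sat", toy_tokenize))           # short words: one token each
print(fertility("tokenization matters", toy_tokenize))  # long words: several tokens each
```

Splitting on whitespace only works for languages that mark word boundaries with spaces. For languages such as Chinese, a per-character or per-sentence denominator is the more practical choice.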

Lab: Comparing Tokenizer Fertility Across Languages

In this lab, we compare the fertility of three different tokenizers on the same set of parallel sentences across multiple languages. This reveals how tokenizer design decisions affect different language communities.

# Lab: Multilingual fertility comparison
import tiktoken
from transformers import AutoTokenizer

# Load tokenizers
gpt4_enc = tiktoken.encoding_for_model("gpt-4")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Parallel sentences (same meaning, different languages)
sentences = {
    "English":    "The quick brown fox jumps over the lazy dog.",
    "French":     "Le rapide renard brun saute par-dessus le chien paresseux.",
    "German":     "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Chinese":    "敏捷的棕色狐狸跳过懒惰的狗。",
    "Arabic":     "الثعلب البني السريع يقفز فوق الكلب الكسول.",
    "Korean":     "빠른 갈색 여우가 게으른 개를 뛰어넘는다.",
}

print(f"{'Language':<12} {'GPT-4':>8} {'Llama3':>8} {'mBERT':>8}")
print("-" * 40)

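for lang, text in sentences.items():
    # Note: the Hugging Face encode() calls add special tokens by default
    # (a BOS token for Llama 3, [CLS] and [SEP] for mBERT), so those counts
    # include them; pass add_special_tokens=False to count text tokens only.
    n_gpt4  = len(gpt4_enc.encode(text))
    n_llama = len(llama3_tok.encode(text))
    n_bert  = len(bert_tok.encode(text))
    print(f"{lang:<12} {n_gpt4:>8} {n_llama:>8} {n_bert:>8}")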
Language        GPT-4   Llama3    mBERT
----------------------------------------
English            10       11       12
French             14       15       16
German             12       13       15
Chinese            14       14       18
Arabic             18       20       27
Korean             12       14       25

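Several patterns emerge from this comparison: English needs the fewest tokens under every tokenizer, Arabic needs the most, and mBERT's counts for Arabic and Korean (27 and 25) are roughly double its count for English (12).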

Big Picture: Tokenizer Equity

Tokenizer fertility is a fairness issue. Users of languages that tokenize inefficiently pay more per API call, get less context per request, and experience slower inference. The research community is increasingly recognizing this, and newer models allocate more vocabulary space to non-English languages. Llama 3's expanded vocabulary (128K tokens) and GPT-4o's larger o200k_base vocabulary (about 200K tokens) represent steps toward more equitable tokenization.
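The cost side of this claim can be made concrete with the counts from the lab output above. The sketch below divides each language's token count by the English count for the same tokenizer; the function name premium_vs_english is made up for this example, and because the data is a single sentence per language, the ratios are illustrative rather than a benchmark.

```python
# Token counts copied from the lab output (columns: GPT-4, Llama3, mBERT).
counts = {
    "English": (10, 11, 12),
    "French":  (14, 15, 16),
    "German":  (12, 13, 15),
    "Chinese": (14, 14, 18),
    "Arabic":  (18, 20, 27),
    "Korean":  (12, 14, 25),
}

def premium_vs_english(counts):
    """Token count of each language relative to English, per tokenizer."""
    base = counts["English"]
    return {
        lang: tuple(round(n / b, 2) for n, b in zip(row, base))
        for lang, row in counts.items()
    }

premiums = premium_vs_english(counts)
print(f"{'Language':<10}{'GPT-4':>8}{'Llama3':>8}{'mBERT':>8}")
for lang, row in premiums.items():
    print(f"{lang:<10}" + "".join(f"{p:>8.2f}" for p in row))
```

A ratio of 1.80 means the same sentence costs 80% more tokens than its English version, which translates directly into a higher per-request price and less usable context.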

Multimodal Tokenization

As LLMs evolve into multimodal models that process images, audio, and video alongside text, tokenization extends beyond text. The core idea remains the same: convert continuous input into discrete tokens that a transformer can process.

Image Tokenization

Vision transformers (ViT) divide an image into fixed-size patches (typically 16x16 or 14x14 pixels), flatten each patch into a vector, and project it into the model's embedding space. Each patch becomes one "token." A 224x224 image with 16x16 patches produces 196 image tokens. Higher-resolution images or smaller patches produce more tokens, consuming more of the context window.
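The patch arithmetic is easy to check in code. This is a minimal sketch that assumes square, non-overlapping patches and an image size that divides evenly; the function name image_token_count is made up for this example, and real models may add extra tokens (such as a class token) or resize images before patching.

```python
def image_token_count(height, width, patch_size=16):
    """Number of non-overlapping patch tokens for an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

print(image_token_count(224, 224))      # 14 x 14 grid of 16x16 patches
print(image_token_count(224, 224, 14))  # 16 x 16 grid of 14x14 patches
print(image_token_count(768, 768))      # 48 x 48 grid of 16x16 patches
```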

[Figure: Multimodal Tokenization: Text + Image]
Text path:  raw text ("A cat on a mat") -> BPE tokenizer (subword splits) -> text tokens
Image path: raw image (224 x 224 px) -> patch embedding (16x16 patches) -> 196 image tokens (14 x 14 grid)
Both token sequences are concatenated and fed to the transformer.
Context budget impact: a single 224x224 image consumes 196 tokens, equivalent to roughly 150 English words. Higher-resolution images (768x768) can consume 2,304+ tokens per image.
Figure 2.8: In multimodal models, images are converted to token sequences via patch embedding. A single image can consume hundreds of tokens from the context budget.

Audio Tokenization

Audio models like Whisper convert speech to spectrograms, then divide them into overlapping frames. Each frame is projected into the token embedding space. A 30-second audio clip typically produces 1,500 tokens (at 50 tokens per second). Discrete audio codecs like EnCodec (used by Meta's AudioCraft) quantize audio into discrete codes from a learned codebook, producing token-like representations that can be processed by transformers.
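The same budgeting logic applies to audio. The sketch below uses the 50 tokens per second figure quoted above as a default; the function names and the 8,000-token budget in the usage line are assumptions chosen for illustration.

```python
def audio_token_count(seconds, tokens_per_second=50):
    """Approximate number of audio positions for a clip of the given length."""
    return int(seconds * tokens_per_second)

def max_audio_seconds(token_budget, tokens_per_second=50):
    """How many seconds of audio fit into a given token budget."""
    return token_budget / tokens_per_second

print(audio_token_count(30))    # the 30-second clip from the text
print(max_audio_seconds(8000))  # seconds of audio that fit in 8,000 tokens
```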

API Cost Estimation

For production applications, estimating token-based costs accurately can save thousands of dollars per month. Here is a practical workflow for cost estimation:

# API cost estimation utility
import tiktoken

def estimate_cost(
    text: str,
    model: str = "gpt-4",
    input_cost_per_1k: float = 0.01,
    output_cost_per_1k: float = 0.03,
    estimated_output_ratio: float = 1.5,
):
    """Estimate API cost for a single request.

    Args:
        text: The input prompt text.
        model: Model name for tokenizer selection.
        input_cost_per_1k: Cost per 1,000 input tokens.
        output_cost_per_1k: Cost per 1,000 output tokens.
        estimated_output_ratio: Expected output tokens as a
            multiple of input tokens.

    Returns:
        dict with token counts and cost estimates.
    """
    enc = tiktoken.encoding_for_model(model)
    input_tokens = len(enc.encode(text))
    est_output_tokens = int(input_tokens * estimated_output_ratio)

    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (est_output_tokens / 1000) * output_cost_per_1k
    total_cost = input_cost + output_cost

    return {
        "input_tokens": input_tokens,
        "est_output_tokens": est_output_tokens,
        "input_cost": f"${input_cost:.4f}",
        "output_cost": f"${output_cost:.4f}",
        "total_cost": f"${total_cost:.4f}",
        "monthly_cost_at_1k_req_per_day": f"${total_cost * 1000 * 30:.2f}",
    }

# Example: estimate cost for a RAG prompt
prompt = """You are a helpful assistant. Use the following context to answer.

Context: [imagine 500 words of retrieved document text here]

Question: What are the key benefits of subword tokenization?

Answer:"""

result = estimate_cost(prompt, model="gpt-4")
for key, val in result.items():
    print(f"  {key}: {val}")
  input_tokens: 42
  est_output_tokens: 63
  input_cost: $0.0004
  output_cost: $0.0019
  total_cost: $0.0023
  monthly_cost_at_1k_req_per_day: $69.30
Key Insight: Output Tokens Cost More

Most API providers charge two to four times as much per output token as per input token. This means that controlling the length of model responses (via system prompts or max_tokens parameters) has an outsized impact on cost: at a 3x multiplier, trimming 100 tokens from the response saves as much as trimming 300 tokens from the prompt. Doubling the response length doubles the output portion of the bill, so every unnecessary output token is the most expensive kind of token to leave in.
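The effect of the higher output price can be checked with a few lines of arithmetic. This sketch reuses the default prices from estimate_cost above ($0.01 and $0.03 per 1,000 tokens); the function name request_cost and the 200/300 token split are assumptions chosen for illustration.

```python
def request_cost(input_tokens, output_tokens,
                 input_cost_per_1k=0.01, output_cost_per_1k=0.03):
    """Cost in dollars of one request at the given per-1K-token prices."""
    return ((input_tokens / 1000) * input_cost_per_1k
            + (output_tokens / 1000) * output_cost_per_1k)

baseline = request_cost(200, 300)
trim_input = request_cost(100, 300)    # 100 fewer prompt tokens
trim_output = request_cost(200, 200)   # 100 fewer response tokens

print(f"baseline:           ${baseline:.4f}")
print(f"trim 100 input:     ${trim_input:.4f} (saves ${baseline - trim_input:.4f})")
print(f"trim 100 output:    ${trim_output:.4f} (saves ${baseline - trim_output:.4f})")
```

With these prices, removing 100 output tokens saves three times as much as removing 100 input tokens.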

Cost Reduction Strategies

Lab: Comparing Tokenizers Head-to-Head

In this hands-on exercise, we load tokenizers from several popular models and compare their behavior on identical inputs. This reveals differences in vocabulary size, token boundaries, and handling of edge cases.

# Lab: Head-to-head tokenizer comparison
from transformers import AutoTokenizer

# Load tokenizers from different model families
tokenizers = {
    "BERT":     AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2":    AutoTokenizer.from_pretrained("gpt2"),
    "Llama-3":  AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B"),
    "T5":       AutoTokenizer.from_pretrained("google-t5/t5-base"),
}

# Print vocabulary sizes
print("Vocabulary sizes:")
for name, tok in tokenizers.items():
    print(f"  {name:10s}: {tok.vocab_size:,} tokens")

# Compare tokenization of a tricky input
test_input = "GPT-4o costs $0.01/1K tokens. That's 10x cheaper!"

print(f"\nInput: {test_input}\n")
for name, tok in tokenizers.items():
    ids = tok.encode(test_input)
    tokens = tok.convert_ids_to_tokens(ids)
    print(f"{name:10s} ({len(ids):2d} tokens): {tokens}")
Vocabulary sizes:
  BERT      : 30,522 tokens
  GPT-2     : 50,257 tokens
  Llama-3   : 128,256 tokens
  T5        : 32,100 tokens

Input: GPT-4o costs $0.01/1K tokens. That's 10x cheaper!

BERT       (24 tokens): ['[CLS]', 'gp', '##t', '-', '4', '##o', 'costs', '$', '0', '.', '01', '/', '1', '##k', 'tokens', '.', 'that', "'", '##s', '10', '##x', 'cheaper', '!', '[SEP]']
GPT-2      (21 tokens): ['G', 'PT', '-', '4', 'o', ' costs', ' $', '0', '.', '01', '/', '1', 'K', ' tokens', '.', ' That', "'s", ' 10', 'x', ' cheaper', '!']
Llama-3    (21 tokens): ['GPT', '-', '4', 'o', ' costs', ' $', '0', '.', '01', '/', '1', 'K', ' tokens', '.', ' That', "'s", ' ', '10', 'x', ' cheaper', '!']
T5         (24 tokens): ['G', 'PT', '-', '4', 'o', 'cost', 's', '$', '0', '.', '01', '/', '1', 'K', 'token', 's', '.', 'That', "'", 's', '10', 'x', 'cheaper', '!']

Key observations from this comparison:

[Figure: Tokenizer Ecosystem: Which Models Use What]
Byte-Level BPE: GPT-2 / GPT-3 / GPT-4, Llama 2 / Llama 3, Mistral / Mixtral, Claude (Anthropic), StarCoder / CodeLlama. Library: tiktoken, HF tokenizers. Vocab: 32K to 128K.
WordPiece: BERT / mBERT, DistilBERT, RoBERTa*, ELECTRA (*RoBERTa uses BPE, not WordPiece). Library: HF tokenizers. Vocab: 28K to 30K.
Unigram (SentencePiece): T5 / Flan-T5, ALBERT, XLNet, mBART, Gemma (Google). Library: SentencePiece. Vocab: 32K to 256K.
Figure 2.9: The tokenizer landscape, showing which algorithm each major model family uses.

Check Your Understanding

1. What happens if you format a prompt for Llama 3 using ChatML tags (<|im_start|>) instead of Llama's own special tokens?
Answer:
The model will treat the ChatML tags as ordinary text rather than structural delimiters, because Llama 3 was not trained to recognize them. The model will lose the ability to distinguish between system instructions, user messages, and assistant responses. This typically results in degraded response quality, confusion about the conversation structure, or the model echoing the tags as text. Always use the model's native chat template.
2. You are building a multilingual customer support bot. The BERT-based model works well for English but poorly for Korean queries. Looking at fertility data, what might explain this?
Answer:
Multilingual BERT (bert-base-multilingual-cased) shares a single WordPiece vocabulary of roughly 120K tokens across 100+ languages, so each language gets only a small slice of it. Korean text gets fragmented into many small subword pieces (high fertility), which means (1) the model's context window fills up faster for Korean input, (2) each Korean morpheme may be split across multiple tokens, making it harder for the model to learn meaningful representations, and (3) the model has fewer dedicated tokens for Korean compared to English. Switching to a model whose tokenizer handles Korean more efficiently (in the fertility lab, Llama 3 needs 14 tokens for the Korean sentence versus 25 for mBERT) or to a Korean-specific model would likely improve performance.
3. A single high-resolution image (768x768, 16x16 patches) consumes how many tokens? How does this compare to text?
Answer:
A 768x768 image with 16x16 patches produces (768/16) x (768/16) = 48 x 48 = 2,304 image tokens. This is equivalent to roughly 1,500 to 2,000 English words, which is a substantial fraction of a typical context window. This explains why multimodal models need very large context windows and why image resolution directly impacts cost and capacity.
4. Your application sends 50,000 requests per day, each with 200 input tokens and 300 output tokens. At $0.01/1K input and $0.03/1K output, what is the monthly cost?
Answer:
Input cost per request: (200 / 1000) * $0.01 = $0.002. Output cost per request: (300 / 1000) * $0.03 = $0.009. Total per request: $0.011. Daily cost: 50,000 * $0.011 = $550. Monthly cost: $550 * 30 = $16,500. Notice that output tokens (300 at $0.03/1K = $0.009) account for 82% of the cost despite being only 60% of the total tokens, because of the higher output token price.
5. Why does T5's tokenizer split "costs" into ["cost", "s"] while GPT-2 keeps it as a single token " costs"?
Answer:
T5 uses the Unigram (SentencePiece) tokenizer, which operates by finding the most probable segmentation of each word. The Unigram model may have learned that "cost" and "s" are both high-probability subwords, and their combined probability exceeds that of treating "costs" as a single unit. GPT-2 uses byte-level BPE, where the merge history during training happened to merge the characters of " costs" (with leading space) into a single token. The different algorithms and training corpora lead to different segmentation decisions.
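
The "most probable segmentation" idea behind the Unigram answer can be sketched with a small Viterbi search. Everything here is an assumption for illustration: the function name unigram_segment is made up, and the probabilities are invented to show the mechanism, not taken from T5's actual vocabulary.

```python
import math

# Hypothetical unigram probabilities, chosen only to illustrate the mechanism.
PROBS = {
    "cost": 0.02,
    "s": 0.05,
    "costs": 0.0005,
    "c": 0.01, "o": 0.01, "t": 0.01,
}

def unigram_segment(word, probs):
    """Most probable split of `word` into vocabulary pieces (Viterbi search)."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

print(unigram_segment("costs", PROBS))
print(unigram_segment("costs", {**PROBS, "costs": 0.005}))
```

With the first set of numbers, P("cost") x P("s") = 0.001 beats P("costs") = 0.0005, so the word is split. Raising P("costs") to 0.005 flips the decision and the word stays whole, which is the kind of difference a different training corpus can produce.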

Key Takeaways

What Comes Next

You now know how text becomes token IDs. In Module 03, you will learn how those token sequences are processed: first by recurrent neural networks that read one token at a time, then by the attention mechanism that lets the model look at all tokens simultaneously.