Module 02 · Section 2.2

Subword Tokenization Algorithms

BPE, WordPiece, Unigram, byte-level BPE, and the path toward tokenizer-free models

I merged byte pairs until "un" and "believable" became one token. Then I tried "un" and "fortunate." We don't talk about that.

Merge Marvin, a BPE algorithm with regrets

Overview of Subword Methods

All subword tokenization algorithms solve the same problem: given a training corpus, learn a vocabulary of subword units that can represent any text efficiently. They differ in how they build the vocabulary and how they segment new text at inference time. The three dominant families are:

  1. Byte Pair Encoding (BPE): a bottom-up, merge-based approach used by GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs.
  2. WordPiece: a variant of BPE that uses a likelihood-based merge criterion, used by BERT and its derivatives.
  3. Unigram Language Model: a top-down, pruning-based approach that assigns probabilities to all subword candidates and uses the Viterbi algorithm for segmentation, used by T5, ALBERT, and XLNet.

In addition, we will cover byte-level BPE (which eliminates the need for character-level preprocessing) and tokenizer-free models (which bypass discrete tokenization entirely).

Byte Pair Encoding (BPE)

The Algorithm

BPE was originally a data compression algorithm introduced by Philip Gage in 1994. Sennrich et al. (2016) adapted it for NLP by applying it to sequences of characters rather than bytes. The algorithm is elegantly simple:

  1. Start with a vocabulary containing all individual characters found in the training corpus (plus any special tokens).
  2. Count the frequency of every adjacent pair of tokens in the corpus.
  3. Find the most frequent pair and merge it into a new token.
  4. Add the new token to the vocabulary.
  5. Replace all occurrences of the pair in the corpus with the new merged token.
  6. Repeat from step 2 until the desired vocabulary size is reached.

The sequence of merges is recorded in a merge table, which is all you need (along with the base vocabulary) to tokenize new text at inference time. To encode a new word, you start with its characters and apply the merges in the same order they were learned.

BPE Merge Process for Corpus "low lower lowest"

Step 0 (initial characters):              l o w </w>  |  l o w e r </w>  |  l o w e s t </w>
Step 1: merge (l, o) → "lo"  [freq 3]:    lo w </w>   |  lo w e r </w>   |  lo w e s t </w>
Step 2: merge (lo, w) → "low" [freq 3]:   low </w>    |  low e r </w>   |  low e s t </w>
Step 3: merge (e, r) → "er" [freq 1, tied with (e, s), freq 1]:
                                          low </w>    |  low er </w>    |  low e s t </w>

Merge Table (ordered):  1. (l, o) → lo    2. (lo, w) → low    3. (e, r) → er    ...
Figure 2.4: BPE iteratively merges the most frequent character pair. The merge table records all merges in order.

Lab: Implementing BPE from Scratch

Below is a minimal but complete BPE implementation in Python. Study the code carefully; every line maps directly to a step in the algorithm described above.

# Minimal BPE implementation from scratch
import re
from collections import Counter

def get_vocab(corpus):
    """Split each word into characters and add end-of-word marker."""
    words = corpus.split()
    word_freqs = Counter(words)
    vocab = {}
    for word, freq in word_freqs.items():
        # Each word becomes a tuple of characters + end-of-word symbol
        symbols = tuple(list(word) + ["</w>"])
        vocab[symbols] = freq
    return vocab

def get_pair_counts(vocab):
    """Count frequency of all adjacent symbol pairs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge all occurrences of a pair in the vocabulary."""
    new_vocab = {}
    bigram = pair
    for symbols, freq in vocab.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            # Look for the pair at position i
            if i < len(symbols) - 1 and symbols[i] == bigram[0] and symbols[i+1] == bigram[1]:
                new_symbols.append(bigram[0] + bigram[1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        new_vocab[tuple(new_symbols)] = freq
    return new_vocab

def train_bpe(corpus, num_merges):
    """Train BPE and return the merge table and final vocabulary."""
    vocab = get_vocab(corpus)
    merges = []

    for step in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        merges.append(best_pair)
        vocab = merge_pair(best_pair, vocab)
        print(f"Merge {step+1}: {best_pair} -> "
              f"'{best_pair[0]+best_pair[1]}' (freq={pairs[best_pair]})")

    return merges, vocab

# Train on a small corpus
corpus = "low low low lower lower lowest newest widest"
merges, final_vocab = train_bpe(corpus, num_merges=10)

print("\nFinal vocabulary:")
for symbols, freq in sorted(final_vocab.items(), key=lambda x: -x[1]):
    print(f"  {' '.join(symbols):30s}  (freq={freq})")
Merge 1: ('l', 'o') -> 'lo' (freq=6)
Merge 2: ('lo', 'w') -> 'low' (freq=6)
Merge 3: ('low', '</w>') -> 'low</w>' (freq=3)
Merge 4: ('low', 'e') -> 'lowe' (freq=3)
Merge 5: ('s', 't') -> 'st' (freq=3)
Merge 6: ('st', '</w>') -> 'st</w>' (freq=3)
Merge 7: ('lowe', 'r') -> 'lower' (freq=2)
Merge 8: ('lower', '</w>') -> 'lower</w>' (freq=2)
Merge 9: ('e', 'st</w>') -> 'est</w>' (freq=2)
Merge 10: ('lowe', 'st</w>') -> 'lowest</w>' (freq=1)

Final vocabulary:
  low</w>                         (freq=3)
  lower</w>                       (freq=2)
  lowest</w>                      (freq=1)
  n e w est</w>                   (freq=1)
  w i d est</w>                   (freq=1)

Note that many pairs tie in frequency on a corpus this small (for example, ('low', '</w>'), ('low', 'e'), ('s', 't'), and ('t', '</w>') all occur 3 times at merge 3). Python's max returns the first maximal key in the Counter's insertion order, so the exact merge sequence depends on this tie-breaking rule.
Note: BPE Encoding at Inference Time

To tokenize a new word, you start with its characters, then apply the learned merges in priority order (the order they were learned during training). The first merge in the table that matches any adjacent pair in the current sequence is applied, and you repeat until no more merges apply. This greedy strategy is deterministic and fast.

Encoding New Text with Learned Merges

# Encode a new word using the learned merge table
def encode_word(word, merges):
    """Tokenize a single word using learned BPE merges."""
    symbols = list(word) + ["</w>"]

    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols = symbols[:i] + [left + right] + symbols[i+2:]
            else:
                i += 1
    return symbols

# Test on known and unknown words
test_words = ["low", "lower", "lowest", "newer", "slowly"]

for word in test_words:
    tokens = encode_word(word, merges)
    print(f"  '{word}' -> {tokens}")
  'low' -> ['low</w>']
  'lower' -> ['lower</w>']
  'lowest' -> ['lowest</w>']
  'newer' -> ['n', 'e', 'w', 'e', 'r', '</w>']
  'slowly' -> ['s', 'low', 'l', 'y', '</w>']

Notice how the three words seen during training ("low", "lower", "lowest") each collapse to a single token, because their complete merge sequences were learned. The unseen words fall back to smaller units: "slowly" reuses the learned "low" token, while "newer" remains at the character level because none of the learned merges apply to it.

WordPiece

WordPiece, developed at Google originally for Japanese and Korean voice-search segmentation (Schuster and Nakajima, 2012) and later used in Google's neural machine translation system before being adopted by BERT, is closely related to BPE. The key difference is in the merge criterion. While BPE merges the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data. Specifically, it merges the pair (A, B) that maximizes:

score(A, B) = freq(AB) / (freq(A) × freq(B))

This score favors merging pairs where the combination AB appears much more often than you would expect if A and B occurred independently. In practice, this tends to produce linguistically meaningful subwords, because morphological boundaries (prefixes, suffixes, roots) tend to co-occur at rates much higher than chance.

WordPiece at Inference: MaxMatch

At inference time, WordPiece uses a greedy longest-match-first (MaxMatch) strategy. Given a word, it tries to match the longest possible token from the vocabulary starting from the left. If no match is found, it falls back to the longest match for the remaining substring. Continuation tokens are typically prefixed with ## to indicate they are not word-initial.

# Simulated WordPiece MaxMatch tokenization
def wordpiece_tokenize(word, vocab, max_token_len=20):
    """Greedy longest-match-first tokenization."""
    tokens = []
    start = 0
    while start < len(word):
        end = min(start + max_token_len, len(word))
        found = False
        while end > start:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # continuation prefix
            if substr in vocab:
                tokens.append(substr)
                found = True
                break
            end -= 1
        if not found:
            tokens.append("[UNK]")
            start += 1
            continue
        start = end
    return tokens

# Example vocabulary (simplified)
vocab = {"un", "##help", "##ful", "##ness", "##able", "token",
         "##ization", "##ize", "the", "##s", "play", "##ing",
         "##er", "##ed", "help", "##less"}

for word in ["tokenization", "unhelpfulness", "players", "helpless"]:
    tokens = wordpiece_tokenize(word, vocab)
    print(f"  '{word}' -> {tokens}")
  'tokenization' -> ['token', '##ization']
  'unhelpfulness' -> ['un', '##help', '##ful', '##ness']
  'players' -> ['play', '##er', '##s']
  'helpless' -> ['help', '##less']
Key Insight: BPE vs. WordPiece

BPE builds its vocabulary bottom-up by frequency counting; WordPiece builds its vocabulary bottom-up by likelihood maximization. At inference, BPE replays its merge sequence; WordPiece uses greedy longest-match. In practice, both produce similar results, and the choice between them often comes down to the framework and codebase rather than a fundamental quality difference.

The core difference in one sentence: BPE asks "what pair appears most often?" WordPiece asks "what pair appears most surprisingly often relative to its parts?"

Concrete example: Suppose in our training corpus the pair ("t", "h") appears 10,000 times, while ("x", "z") appears only 50 times. BPE merges ("t", "h") first because 10,000 > 50. But WordPiece computes the score as: score(a, b) = freq(ab) / (freq(a) × freq(b)). If "t" and "h" are each extremely common on their own (say 500,000 and 400,000 occurrences), then 10,000 / (500,000 × 400,000) is tiny. Meanwhile, if "x" and "z" are rare (200 and 180), then 50 / (200 × 180) is much larger. WordPiece would merge ("x", "z") first because their co-occurrence is surprising given how rare each part is individually.
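The arithmetic above is easy to check in a few lines of Python. The counts are the hypothetical ones from this example, and wordpiece_score is our own helper implementing the score formula, not a library API:

```python
def wordpiece_score(freq_ab, freq_a, freq_b):
    """WordPiece merge score: freq(AB) / (freq(A) * freq(B))."""
    return freq_ab / (freq_a * freq_b)

# ("t", "h"): very frequent as a pair, but both parts are extremely common
th = wordpiece_score(10_000, 500_000, 400_000)
# ("x", "z"): rare as a pair, but both parts are rare too
xz = wordpiece_score(50, 200, 180)

print(f"score(t, h) = {th:.2e}")   # 5.00e-08
print(f"score(x, z) = {xz:.2e}")   # 1.39e-03

# BPE would merge (t, h) first, since 10,000 > 50.
# WordPiece would merge (x, z) first, since 1.39e-03 > 5.00e-08.
```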

Unigram Language Model

The Unigram model, proposed by Kudo (2018) and implemented in the SentencePiece library, takes the opposite approach from BPE. Instead of starting small and merging upward, it starts with a large initial vocabulary of candidate subwords and iteratively prunes the least useful ones.

Training Procedure

  1. Begin with a large candidate vocabulary (for example, all substrings up to a certain length that appear in the training data).
  2. Assign a probability to each subword based on its frequency in the corpus, forming a unigram language model: P(token) = freq(token) / total_freq.
  3. For each word in the training data, find the most probable segmentation using the Viterbi algorithm (dynamic programming over all possible tokenizations).
  4. Compute the overall log-likelihood of the corpus under the current model.
  5. For each subword in the vocabulary, estimate how much the log-likelihood would decrease if that subword were removed.
  6. Remove the subwords whose removal causes the smallest decrease (they are the least useful).
  7. Repeat from step 2 until the vocabulary reaches the target size.

Worked Example: Viterbi Segmentation Step by Step

To see how Viterbi decoding works in practice, consider tokenizing the word "cats" with a small unigram vocabulary. We process the word left to right, at each position finding the best way to reach that point.

Suppose our vocabulary has these subwords and log-probabilities:

Subword   log P
c         -2.5
a         -2.3
t         -2.4
s         -2.6
ca        -1.8
cat       -1.2
cats      -3.0
at        -1.9
ats       -2.1
ts        -2.0

Step 1: Initialize. Create a score array for positions 0 through 4 (one past each character). Set score[0] = 0 (the start has no cost).

Step 2: Fill position 1 (after "c"). The only subword ending here is "c" (log P = -2.5). Best path to position 1: ["c"], score = -2.5.

Step 3: Fill position 2 (after "ca"). Two candidates reach here:

  - "ca" covering positions 0 to 2: score[0] + log P("ca") = 0 + (-1.8) = -1.8
  - "a" covering positions 1 to 2: score[1] + log P("a") = -2.5 + (-2.3) = -4.8

Best path to position 2: ["ca"], score = -1.8.

Step 4: Fill position 3 (after "cat"). Three candidates:

  - "cat" covering positions 0 to 3: score[0] + log P("cat") = 0 + (-1.2) = -1.2
  - "at" covering positions 1 to 3: score[1] + log P("at") = -2.5 + (-1.9) = -4.4
  - "t" covering positions 2 to 3: score[2] + log P("t") = -1.8 + (-2.4) = -4.2

Best path to position 3: ["cat"], score = -1.2.

Step 5: Fill position 4 (after "cats"). Four candidates:

  - "cats" covering positions 0 to 4: score[0] + log P("cats") = 0 + (-3.0) = -3.0
  - "ats" covering positions 1 to 4: score[1] + log P("ats") = -2.5 + (-2.1) = -4.6
  - "ts" covering positions 2 to 4: score[2] + log P("ts") = -1.8 + (-2.0) = -3.8
  - "s" covering positions 3 to 4: score[3] + log P("s") = -1.2 + (-2.6) = -3.8

Best path to position 4: ["cats"], score = -3.0. But if "cats" were not in the vocabulary, the best score would instead be -3.8, with ["cat", "s"] and ["ca", "ts"] tied.
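The walk above can be turned into a short implementation. The sketch below is illustrative, not the SentencePiece code; viterbi_segment is our own helper, and log_probs holds the hypothetical vocabulary from the table:

```python
import math

# Hypothetical unigram vocabulary from the table above (log-probabilities)
log_probs = {
    "c": -2.5, "a": -2.3, "t": -2.4, "s": -2.6,
    "ca": -1.8, "cat": -1.2, "cats": -3.0,
    "at": -1.9, "ats": -2.1, "ts": -2.0,
}

def viterbi_segment(word, log_probs):
    """Return the most probable segmentation and its total log-probability.

    Assumes every single character of the word is in the vocabulary,
    so a full path always exists.
    """
    n = len(word)
    # best[i] = (score of the best path reaching position i, start of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow backpointers from the end to recover the token sequence
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(word[start:pos])
        pos = start
    return tokens[::-1], best[n][0]

print(viterbi_segment("cats", log_probs))   # (['cats'], -3.0)
```

Deleting "cats" from log_probs and re-running drops the best score to -3.8, exactly as in Step 5.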

Note: Why Dynamic Programming?

A 10-character word can have 512 possible segmentations. A 20-character word: over 500,000. Viterbi avoids enumerating all of them by building optimal partial solutions left to right. At each position, it only keeps the single best path. This reduces exponential time to O(n × m), where n is the word length and m is the maximum subword length.
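The exponential count is easy to verify by brute force. The helper below is our own illustration (fine for short words, hopeless for long ones, which is precisely the point):

```python
def all_segmentations(word):
    """Enumerate every way to split a word into contiguous pieces."""
    if not word:
        return [[]]
    segs = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        for rest in all_segmentations(word[i:]):
            segs.append([head] + rest)
    return segs

# A word of n characters has 2**(n - 1) segmentations: there are n - 1
# gaps between characters, and each gap is either a split point or not.
print(len(all_segmentations("tokenizers")))   # 512 for a 10-letter word
```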

Unigram: Viterbi Segmentation of "unbreakable"
Finding the most probable tokenization among all candidates

Characters:  u n b r e a k a b l e
Best:   un | break | able
Alt 1:  un | bre | ak | able
Alt 2:  u | n | break | able

Best path:  log P("un") + log P("break") + log P("able") = -8.2
Alt 2:      log P("u") + log P("n") + log P("break") + log P("able") = -12.1
Figure 2.5: The Unigram model considers all possible segmentations and selects the one with the highest probability (lowest negative log-likelihood) using the Viterbi algorithm.
Big Picture: Bottom-Up vs. Top-Down

BPE and WordPiece are bottom-up: they start with characters and build larger tokens by merging. Unigram is top-down: it starts with a large set of candidates and prunes down. A practical advantage of Unigram is that it produces multiple possible segmentations with associated probabilities, which can be used for data augmentation during training (by sampling different segmentations of the same word).

Comparison of the Three Methods

Property          | BPE                       | WordPiece            | Unigram
Build direction   | Bottom-up (merge)         | Bottom-up (merge)    | Top-down (prune)
Merge criterion   | Most frequent pair        | Max likelihood gain  | Min likelihood loss on removal
Inference method  | Replay merges             | Greedy MaxMatch      | Viterbi (dynamic programming)
Segmentation      | Deterministic             | Deterministic        | Probabilistic (can sample)
Notable users     | GPT-2/3/4, Llama, Mistral | BERT, DistilBERT     | T5, ALBERT, XLNet
Library           | tiktoken, HF tokenizers   | HF tokenizers        | SentencePiece

Byte-Level BPE

Standard BPE operates on Unicode characters, which means the initial vocabulary must include all characters that appear in the training data. For multilingual models, this can mean thousands of distinct characters from Chinese, Japanese, Korean, Arabic, Devanagari, and other scripts before any merges even begin.

Byte-level BPE, introduced by the GPT-2 team (Radford et al., 2019), solves this by operating on raw bytes instead of Unicode characters. Since there are only 256 possible byte values, the initial vocabulary is always exactly 256, regardless of what languages or scripts appear in the data. Multi-byte Unicode characters (such as Chinese characters, which are typically 3 bytes in UTF-8) are initially represented as sequences of byte tokens, and BPE merges then learn to combine them into higher-level units.
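You can inspect exactly what a byte-level tokenizer sees before any merges using Python's built-in UTF-8 encoder (the example strings are our own):

```python
# Raw UTF-8 bytes that byte-level BPE starts from, before any merges
for text in ["Hi", "你好", "café"]:
    data = text.encode("utf-8")
    print(f"{text}: {len(data)} bytes -> {[f'0x{b:02X}' for b in data]}")

# Hi:   2 bytes -> ['0x48', '0x69']                                  (1 byte per ASCII char)
# 你好: 6 bytes -> ['0xE4', '0xBD', '0xA0', '0xE5', '0xA5', '0xBD']  (3 bytes per CJK char)
# café: 5 bytes -> the accented é takes 2 bytes ('0xC3', '0xA9')
```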

Note: The 256-Byte Alphabet

In UTF-8 encoding, ASCII characters (English letters, digits, common punctuation) are each 1 byte. Characters from most European languages are 2 bytes. CJK characters are 3 bytes, and some emoji are 4 bytes. Byte-level BPE starts with these raw bytes and lets the merge process discover character and word boundaries automatically. GPT-2, GPT-3, GPT-4, Llama 2, and Llama 3 all use byte-level BPE.

Byte-Level BPE: How Multi-Byte Characters Are Handled

Characters:     H     i     你              好
UTF-8 bytes:    0x48  0x69  0xE4 0xBD 0xA0  0xE5 0xA5 0xBD    (8 bytes total)
After merges:   "Hi" (1 token)   "你" (1 token)   "好" (1 token)    (3 tokens)

BPE merges learn to recombine the 3-byte sequences for common CJK characters into single tokens. Rare characters may remain as 2 or 3 byte-tokens.
Figure 2.6: Byte-level BPE starts with 256 byte tokens and merges them upward. Common multi-byte characters become single tokens after training.

Tokenizer-Free Models

What if we eliminated the tokenizer entirely and let the model operate directly on raw bytes? This idea has gained traction with models like ByT5 (Xue et al., 2022) and MegaByte (Yu et al., 2023), which process byte sequences without any subword segmentation step.

ByT5: Bytes Are All You Need

ByT5 modifies the T5 architecture to accept raw UTF-8 bytes as input. It uses a vocabulary of only 259 tokens (256 byte values plus 3 special tokens). The obvious downside is that sequences become very long: a 100-word English sentence might be 400 to 600 bytes, compared to 120 to 150 subword tokens. ByT5 compensates with a deeper encoder (relative to the decoder) and achieves competitive performance on many NLP benchmarks while being more robust to noise, spelling errors, and cross-lingual transfer.
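The sequence-length penalty is easy to check. The pangram below is our own example, and the 1.3-tokens-per-word figure is a rough rule of thumb for English subword tokenizers, not a measured ByT5 number:

```python
# Byte length vs word count for a sample English sentence
sentence = "The quick brown fox jumps over the lazy dog."
n_bytes = len(sentence.encode("utf-8"))   # ByT5-style input length
n_words = len(sentence.split())
print(n_bytes, n_words)                   # 44 bytes for 9 words, ~4.9 bytes/word
```

At roughly five bytes per English word, a 100-word sentence lands in the 400-to-600-byte range quoted above, versus perhaps 130 subword tokens for the same text.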

MegaByte: Hierarchical Byte Models

MegaByte addresses the long-sequence problem by introducing a hierarchical architecture. It divides the byte sequence into fixed-size patches (for example, 8 bytes each) and processes these patches with a large "global" transformer. Within each patch, a smaller "local" transformer predicts individual bytes. This two-level approach reduces the quadratic cost of self-attention on long byte sequences, making byte-level modeling practical for longer documents.
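The patch layout can be sketched in a few lines. to_patches is our own illustrative helper: the real model also embeds each patch and uses dedicated padding tokens, which we simplify here to zero bytes:

```python
def to_patches(data: bytes, patch_size: int = 8) -> list[bytes]:
    """Split a byte sequence into fixed-size patches, zero-padding the last one."""
    remainder = len(data) % patch_size
    if remainder:
        data += bytes(patch_size - remainder)   # pad with 0x00 bytes
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

# 32 bytes -> 4 patches for the global transformer; the local transformer
# then predicts the 8 bytes inside each patch.
patches = to_patches("Byte-level models need patching.".encode("utf-8"))
print(len(patches), patches[0])   # 4 b'Byte-lev'
```

The global model thus attends over len(data) / patch_size positions instead of len(data), which is where the quadratic savings come from.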

Tradeoffs of Tokenizer-Free Models

Advantage                                   | Disadvantage
No tokenizer training required              | Much longer sequences (3x to 5x)
Robust to typos and noise                   | Higher computational cost per text
Inherently multilingual and script-agnostic | Harder to learn long-range dependencies
No out-of-vocabulary tokens                 | Less interpretable intermediate representations
Uniform treatment of all inputs             | Not yet competitive at the largest scales
Warning: Tokenizer-Free Is Not Yet Mainstream

As of 2025, all widely deployed production LLMs (GPT-4, Claude, Llama, Gemini, Mistral) use subword tokenization. Tokenizer-free models are an active research direction with promising results on specific benchmarks, but they have not yet been proven at the scale and breadth of tasks required for general-purpose deployment. Keep an eye on this space, but do not assume that tokenizers are going away anytime soon.

Check Your Understanding

1. In BPE, what determines which pair gets merged at each step?
Answer:
The pair with the highest frequency count across the entire training corpus is merged at each step. BPE counts all adjacent token pairs, finds the most frequent one, creates a new token by concatenating the pair, replaces all occurrences, and repeats.
2. How does WordPiece's merge criterion differ from BPE's?
Answer:
WordPiece merges the pair that maximizes the likelihood of the training data, using the score: freq(AB) / (freq(A) * freq(B)). This favors pairs that co-occur much more than expected by chance, rather than simply the most frequent pairs. In practice, this tends to produce slightly more linguistically meaningful subwords.
3. What is the key advantage of the Unigram model over BPE and WordPiece?
Answer:
The Unigram model produces probabilistic segmentations. For any input word, it can generate multiple possible tokenizations with associated probabilities. This enables data augmentation by sampling different segmentations during training (subword regularization). BPE and WordPiece always produce a single deterministic segmentation.
4. Why does byte-level BPE always start with a base vocabulary of exactly 256 tokens?
Answer:
There are exactly 256 possible byte values (0x00 through 0xFF). Since byte-level BPE operates on raw bytes rather than Unicode characters, its initial vocabulary is always these 256 values regardless of what languages, scripts, or special characters appear in the training data. This eliminates the need for any character-level preprocessing.
5. A colleague suggests replacing your model's BPE tokenizer with ByT5's byte-level approach. What tradeoff should you warn them about?
Answer:
The primary tradeoff is sequence length and computational cost. Byte-level input produces sequences roughly 3x to 5x longer than subword tokenization for the same text. Since transformer self-attention scales quadratically with sequence length (optimized kernels such as FlashAttention reduce memory traffic, not the quadratic compute), this significantly increases training and inference cost. The model also needs to learn word and phrase boundaries from scratch, which requires more training data and compute. However, it gains robustness to typos, eliminates OOV tokens, and provides more uniform treatment of all languages.

Key Takeaways