Module 07 · Section 7.4

Multilingual & Cross-Cultural LLMs

Cross-lingual transfer, low-resource languages, cultural bias, and adapting models beyond English

A truly multilingual model must master the grammar of a thousand languages, the idioms of a hundred cultures, and the quiet indignity of being evaluated almost exclusively in English.

An Underappreciated Polyglot
★ Big Picture

Language technology is not linguistically neutral. The vast majority of LLM training data, evaluation benchmarks, and engineering effort is concentrated on English and a handful of other high-resource languages. This creates a world where the roughly 1.5 billion English speakers enjoy capable AI assistants, while the remaining 6.5 billion people receive a degraded experience, if they receive one at all. Multilingual LLMs attempt to bridge this gap through cross-lingual transfer, multilingual pre-training, and targeted adaptation. However, language is deeply intertwined with culture, and serving diverse populations requires more than translation. This section examines how multilingual models work, where they fall short, and what techniques exist for extending LLMs to new languages and cultural contexts.

⚙ Prerequisites

This section assumes familiarity with tokenization from Module 02 (particularly BPE and vocabulary construction) and the open model families from Section 7.2. Understanding of cross-lingual transfer builds on the attention mechanism from Module 04.

1. Multilingual Pre-Training: How One Model Learns Many Languages

Modern multilingual LLMs are trained on corpora containing text in dozens to hundreds of languages. The model is never explicitly told which language it is processing; it must learn to handle all languages within a shared parameter space. This approach produces a remarkable phenomenon: cross-lingual transfer, where knowledge learned in one language becomes available in others.

1.1 Cross-Lingual Transfer

Cross-lingual transfer occurs because languages share deep structural similarities despite surface-level differences. When a model learns that "The cat sat on the mat" has a subject-verb-object structure in English, this structural knowledge partially transfers to French ("Le chat s'est assis sur le tapis") and even to languages with different word orders, because the underlying conceptual relationships are similar.

The mechanism behind cross-lingual transfer operates at multiple levels:

[Figure: English ("The cat sleeps"), French ("Le chat dort"), Japanese ("猫が眠る"), and Arabic ("القط ينام") inputs map into a shared representation of language-agnostic concepts ([ANIMAL], [SLEEP], [PRESENT]), enabling zero-shot transfer (train on EN, works in FR), partial transfer (works but degraded for low-resource languages), and knowledge sharing (facts learned in EN available in JA).]

Key finding: models develop a "universal" middle layer where representations converge across languages, with language-specific encoding in early layers and language-specific decoding in late layers.
Figure 7.4.1: Multilingual models map different languages into a shared representation space, enabling knowledge learned in one language to transfer to others.
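The shared-space idea can be made concrete with a toy example. The vectors below are hand-crafted for illustration (a real model learns thousands of dimensions from data), but they exhibit the property that matters: translation equivalents end up closer to each other than sentences with different meanings in the same language.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-crafted 4-d vectors over concepts [ANIMAL, SLEEP, WEATHER, PRESENT]
sentence_vectors = {
    "en: The cat sleeps": [0.90, 0.80, 0.00, 0.70],
    "fr: Le chat dort":   [0.85, 0.82, 0.05, 0.68],
    "en: It is sunny":    [0.00, 0.05, 0.90, 0.70],
}

same_meaning = cosine(sentence_vectors["en: The cat sleeps"],
                      sentence_vectors["fr: Le chat dort"])
diff_meaning = cosine(sentence_vectors["en: The cat sleeps"],
                      sentence_vectors["en: It is sunny"])
print(f"EN/FR, same meaning:      cos = {same_meaning:.2f}")
print(f"EN/EN, different meaning: cos = {diff_meaning:.2f}")
```

Probing studies measure exactly this kind of alignment in real models' middle layers.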

1.2 The Curse of Multilinguality

Cross-lingual transfer is not free. Training a single model on many languages introduces a fundamental tension known as the curse of multilinguality: for a fixed model capacity, adding more languages improves low-resource language performance (through transfer from high-resource languages) but degrades high-resource language performance (because capacity is shared across more languages).

This trade-off can be expressed informally as:

Performance(lang) ∝ Data(lang) + Transfer(other_langs) − Interference(other_langs)

The interference term grows with the number of languages and the dissimilarity between them. Languages with different scripts, typological features, or morphological systems create more interference than closely related languages. Research by Conneau et al. (2020) showed that for XLM-R, performance on English decreased measurably when the model was trained on 100 languages versus 10, despite the English data remaining constant.
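The informal relation above can be turned into a toy simulation. Everything here (the constants, the saturating transfer term, the "data" units) is invented purely to make the trade-off visible, not fitted to any real model:

```python
# Toy model of the curse of multilinguality (illustrative constants only).

def toy_performance(own_data, other_data_total, n_langs):
    # Transfer helps most when a language has little data of its own,
    # so the term saturates as own_data grows...
    transfer = 0.05 * other_data_total / (1.0 + own_data)
    # ...while interference grows with the number of languages
    # competing for the same fixed capacity.
    interference = 0.3 * (n_langs - 1)
    return own_data + transfer - interference

# High-resource English vs. low-resource Yoruba (arbitrary data units).
en_10  = toy_performance(own_data=1000, other_data_total=900,   n_langs=10)
en_100 = toy_performance(own_data=1000, other_data_total=9900,  n_langs=100)
yo_10  = toy_performance(own_data=5,    other_data_total=1895,  n_langs=10)
yo_100 = toy_performance(own_data=5,    other_data_total=11795, n_langs=100)

print(f"English: {en_10:7.1f} (10 langs) -> {en_100:7.1f} (100 langs)")
print(f"Yoruba:  {yo_10:7.1f} (10 langs) -> {yo_100:7.1f} (100 langs)")
```

With these constants, English declines as languages are added while Yoruba improves, which is the qualitative pattern reported for XLM-R.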

Key Insight

The curse of multilinguality explains why the largest multilingual models (GPT-4, Claude, Gemini) outperform smaller ones so dramatically on non-English languages. With hundreds of billions of parameters, these models have enough capacity to represent many languages without severe interference. Smaller models must make sharper trade-offs, which is why a 7B multilingual model often performs significantly worse on low-resource languages than a 70B model, even when both have seen similar multilingual data.

2. Low-Resource Language Challenges

Of the roughly 7,000 languages spoken worldwide, the vast majority are considered "low-resource" in the context of NLP. A language is low-resource when there is insufficient digital text data to train a capable model. The distribution is extremely skewed: English alone accounts for roughly 50% of internet content, and the top 10 languages cover over 80%. Thousands of languages have virtually no digital presence.

2.1 The Data Scarcity Problem

Low-resource languages face several compounding challenges. The most immediate is tokenization inefficiency: a tokenizer trained predominantly on English fragments text in other scripts into many short, semantically meaningless pieces:

# Demonstrating tokenization inefficiency across languages
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# The same concept in different languages
sentences = {
    "English":  "The weather is beautiful today.",
    "French":   "Le temps est magnifique aujourd'hui.",
    "Chinese":  "今天天气很好。",
    "Thai":     "วันนี้อากาศดีมาก",
    "Amharic":  "ዛሬ የአየር ሁኔታ ቆንጆ ነው።",
    "Khmer":    "អាកាសធាតុល្អនៅថ្ងៃនេះ។",
}

print(f"{'Language':<12} {'Tokens':>6}  {'Chars/Token':>11}")
print("-" * 35)
for lang, text in sentences.items():
    tokens = tokenizer.encode(text)
    ratio = len(text) / len(tokens)
    print(f"{lang:<12} {len(tokens):>6}  {ratio:>11.2f}")

# Typical output:
# Language     Tokens  Chars/Token
# -----------------------------------
# English           7         4.71
# French            9         4.11
# Chinese          10         0.70
# Thai             23         0.78
# Amharic          33         0.52
# Khmer            43         0.47
⚡ Key Insight

The tokenization tax. A Khmer user may need 6x more tokens than an English user to express the same meaning. This means they pay 6x more per API call, receive 6x shorter responses within the same token budget, and experience 6x slower generation. Tokenizer design is not a neutral technical choice; it directly determines who benefits from language model capabilities and who is underserved.
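The tax is easy to quantify. A back-of-the-envelope sketch using the illustrative token counts from the example above and an assumed 8K context window:

```python
# Back-of-the-envelope impact of tokenizer fertility, using the
# illustrative token counts from the example above.
token_counts = {"English": 7, "French": 9, "Chinese": 10,
                "Thai": 23, "Amharic": 33, "Khmer": 43}
context_window = 8192  # tokens; window size chosen for illustration

baseline = token_counts["English"]
for lang, n in token_counts.items():
    multiplier = n / baseline                       # relative cost per sentence
    effective = int(context_window * baseline / n)  # usable context, rescaled
    print(f"{lang:<8} cost x{multiplier:4.1f}  "
          f"effective context ~{effective} tokens")
```

The "effective context" column rescales the window by fertility: the same 8192-token budget holds roughly six times less Khmer meaning than English meaning.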

2.2 Solutions for Low-Resource Languages

Several strategies have been developed to improve LLM performance on low-resource languages, including upsampling low-resource data during pre-training, vocabulary extension, continued pre-training on target-language text, and cross-lingual instruction tuning; the latter three form the adaptation pipeline detailed in Section 5.
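One common pre-training strategy is temperature-based sampling: draw from each language's corpus with probability proportional to its size raised to a power alpha < 1, which upsamples low-resource languages relative to their raw share. A minimal sketch with made-up corpus sizes:

```python
# Temperature-based language sampling: p_i proportional to n_i ** alpha.
# Corpus sizes below are made up for illustration.

def sampling_probs(corpus_sizes, alpha=0.3):
    """alpha < 1 flattens the distribution toward low-resource languages."""
    scaled = {lang: n ** alpha for lang, n in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

corpus_sizes = {"English": 1_000_000, "Hindi": 50_000, "Yoruba": 1_000}
raw_total = sum(corpus_sizes.values())
probs = sampling_probs(corpus_sizes)

for lang, n in corpus_sizes.items():
    print(f"{lang:<8} raw share {n / raw_total:6.1%} "
          f"-> sampled {probs[lang]:6.1%}")
```

At alpha = 1 sampling matches the raw data distribution; at alpha = 0 every language is sampled equally. Values around 0.3 to 0.7 are typical middle grounds.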

3. Cultural Bias in LLMs

Language models inherit the cultural perspectives embedded in their training data. Since most training data originates from English-language internet sources, these models tend to encode Western (and specifically American) cultural norms, values, and assumptions as defaults. This manifests in several ways.

3.1 Western-Centric Defaults

When asked culturally dependent questions, LLMs overwhelmingly default to Western contexts:

# Demonstrating cultural bias in LLM responses
# (Illustrative; actual responses vary by model)

prompts_and_bias = [
    {
        "prompt": "What should I bring to a dinner party?",
        "typical_response": "Wine or dessert",
        "cultural_note": "In many Asian cultures, bringing fruit or "
                         "specialty items from your region is preferred. "
                         "In some Middle Eastern cultures, the host may "
                         "be offended by gifts implying they cannot "
                         "provide enough."
    },
    {
        "prompt": "How should I greet my colleague's parents?",
        "typical_response": "A firm handshake and eye contact",
        "cultural_note": "In Japan, a bow is appropriate. In many South "
                         "Asian cultures, touching elders' feet is a sign "
                         "of respect. In some Middle Eastern cultures, "
                         "cross-gender handshakes may be inappropriate."
    },
    {
        "prompt": "What is a healthy breakfast?",
        "typical_response": "Oatmeal, eggs, or yogurt with fruit",
        "cultural_note": "In Japan: miso soup, rice, grilled fish. "
                         "In India: idli, dosa, upma. "
                         "In Mexico: chilaquiles, huevos rancheros. "
                         "The concept of 'healthy' itself varies by culture."
    },
]

# A culturally aware system should detect the user's context
# and adapt responses accordingly, or explicitly acknowledge
# that the answer is culturally dependent.

3.2 Measuring Cultural Bias

Researchers have developed several approaches to quantify cultural bias in LLMs, including culturally adapted benchmarks such as Global MMLU (Section 4), which separates questions that merely need translation from those whose correct answer depends on cultural context.

📝 Note: Bias Is Not Just "Wrong Answers"

Cultural bias in LLMs is subtle. The model may produce factually correct responses that are nevertheless culturally inappropriate. Recommending a "firm handshake" as a greeting is not wrong in an absolute sense, but it reflects a specific cultural norm that may not apply to the user. Addressing this requires not just better data, but a fundamental shift toward culturally adaptive systems that consider the user's context rather than imposing a single cultural default.
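What a culturally adaptive layer might look like can be sketched in a few lines. The locale map and advice strings below are hypothetical placeholders; the point is the fallback behavior, which acknowledges cultural dependence instead of silently imposing one default:

```python
# Sketch of a culturally adaptive response layer. The templates are
# hypothetical; a real system would infer context from the conversation
# rather than rely on a hardcoded locale map.

GREETING_ADVICE = {
    "default": "Customs vary; when unsure, follow your colleague's lead.",
    "en-US":   "A firm handshake and eye contact are customary.",
    "ja-JP":   "A bow is the customary greeting.",
}

def greeting_advice(locale: str) -> str:
    # Unknown locale -> hedged answer, not the en-US default.
    return GREETING_ADVICE.get(locale, GREETING_ADVICE["default"])

print(greeting_advice("ja-JP"))
print(greeting_advice("sw-KE"))  # no entry, so the hedged default is used
```

The key design choice is the fallback: when the user's context is unknown, the system states that the answer is culturally dependent rather than assuming one culture.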

4. Multilingual Evaluation Benchmarks

Evaluating multilingual LLMs requires benchmarks that span multiple languages and tasks. Several important benchmarks have emerged:

| Benchmark | Languages | Tasks | Key Feature |
|---|---|---|---|
| XTREME | 40 | Classification, QA, retrieval, structured prediction | Broad task coverage; includes low-resource languages |
| XTREME-R | 50 | Classification, QA, retrieval, structured prediction | Harder successor to XTREME; adds challenging cross-lingual retrieval tasks |
| MEGA | 70+ | NLU, generation, reasoning | Specifically designed for generative LLMs |
| FLORES-200 | 200 | Machine translation | Covers 200 languages with professional translations |
| Belebele | 122 | Reading comprehension | Parallel QA across 122 languages; isolates language ability from knowledge |
| Global MMLU | 42 | Multitask knowledge | Culturally adapted MMLU; not just translated |
[Figure: bar chart of accuracy by language resource level. High-resource: EN 88%, DE 85%, ZH 84%. Medium-resource: AR 70%, HI 65%, SW 60%. Low-resource: YO 43%, QU 34%. (EN=English, DE=German, ZH=Chinese, AR=Arabic, HI=Hindi, SW=Swahili, YO=Yoruba, QU=Quechua)]
Figure 7.4.2: Illustrative performance gap on multilingual QA benchmarks. Low-resource languages can trail English by 40+ percentage points on the same task.
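Aggregating the illustrative scores from the figure by resource tier makes the gap explicit:

```python
# Per-tier averages of the illustrative scores shown in Figure 7.4.2.
scores = {
    "high":   {"EN": 88, "DE": 85, "ZH": 84},
    "medium": {"AR": 70, "HI": 65, "SW": 60},
    "low":    {"YO": 43, "QU": 34},
}

tier_means = {tier: sum(s.values()) / len(s) for tier, s in scores.items()}
for tier, mean in tier_means.items():
    print(f"{tier:<7} mean accuracy {mean:5.1f}%")

print(f"EN vs. QU gap: {scores['high']['EN'] - scores['low']['QU']} points")
```

Low-resource languages average roughly half the accuracy of high-resource ones on the same task, which is the gap the adaptation techniques in the next section try to close.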

5. Adapting English-Centric Models to New Languages

Given the dominance of English in LLM training data, a practical question arises: how can we adapt an existing English-centric model to serve a new target language well? Several techniques have proven effective.

5.1 Vocabulary Extension

The first bottleneck is often the tokenizer. A tokenizer trained primarily on English will fragment text in other scripts into many small, semantically meaningless tokens. Vocabulary extension adds new tokens specific to the target language:

# Vocabulary extension for a new language
from transformers import AutoTokenizer, AutoModelForCausalLM
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
import torch

def extend_tokenizer(base_model_name, target_corpus_path, n_new_tokens=10000):
    """
    Extend a model's tokenizer with tokens from a new language.
    Steps:
    1. Train a new tokenizer on the target language corpus
    2. Find tokens in the new tokenizer that are not in the base
    3. Add the most frequent new tokens to the base tokenizer
    4. Resize the model's embedding matrix
    """
    base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_vocab = set(base_tokenizer.get_vocab().keys())

    # Train a BPE tokenizer on the target language; whitespace
    # pre-tokenization keeps merges within word boundaries
    target_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    target_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=32000,
        special_tokens=["[UNK]", "[PAD]"],
    )
    target_tokenizer.train([target_corpus_path], trainer)
    target_vocab = target_tokenizer.get_vocab()

    # Find new tokens not in the base vocabulary
    new_tokens = [
        tok for tok in target_vocab
        if tok not in base_vocab
    ]

    # Lower BPE token IDs correspond to earlier (more frequent) merges,
    # so sort ascending by ID to keep the most frequent new tokens
    new_tokens = sorted(
        new_tokens,
        key=lambda t: target_vocab[t],
    )[:n_new_tokens]

    # Add to base tokenizer
    base_tokenizer.add_tokens(new_tokens)

    # Resize model embeddings
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    model.resize_token_embeddings(len(base_tokenizer))

    # Initialize new embeddings as average of existing ones
    with torch.no_grad():
        avg_embedding = model.get_input_embeddings().weight[
            :len(base_vocab)
        ].mean(dim=0)
        for i in range(len(base_vocab), len(base_tokenizer)):
            model.get_input_embeddings().weight[i] = avg_embedding

    return model, base_tokenizer

# After extension, fine-tune on target language data

5.2 Continued Pre-Training

After extending the vocabulary, the model needs exposure to text in the target language. Continued pre-training (also called language-adaptive pre-training) trains the model on a large corpus of target language text using the same next-token prediction objective as the original pre-training. This approach is simpler than training from scratch and leverages the model's existing knowledge.

Key considerations for continued pre-training include mixing in English data to prevent catastrophic forgetting, using a conservative learning rate to preserve existing knowledge, and budgeting enough tokens to build fluency. A representative configuration:

# Continued pre-training configuration for language adaptation
training_config = {
    # Data mixing: target language + English to prevent forgetting
    "datasets": [
        {"path": "target_language_corpus/", "weight": 0.70},
        {"path": "english_corpus/",         "weight": 0.30},
    ],

    # Conservative learning rate to preserve existing knowledge
    "learning_rate": 2e-5,       # ~1/10 of original pre-training LR
    "lr_scheduler": "cosine",
    "warmup_steps": 500,

    # Training duration
    "max_steps": 50000,          # ~50B tokens at a global batch of
                                 # 512 sequences of 2048 tokens
    "per_device_batch_size": 4,
    "gradient_accumulation": 128,

    # Regularization to prevent forgetting
    "weight_decay": 0.1,
    "max_grad_norm": 1.0,

    # Precision
    "bf16": True,
}

# After continued pre-training, fine-tune on
# task-specific data in the target language

5.3 Cross-Lingual Instruction Tuning

After language-adaptive pre-training, the model understands the target language but may not follow instructions well in it. Cross-lingual instruction tuning fine-tunes the model on instruction-following examples in the target language. These can be translated from existing English instruction datasets, written natively by speakers of the target language, or a combination of both; natively written examples tend to be more natural and culturally appropriate.
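A minimal sketch of how such a mixed instruction dataset might be represented (the records and field names are hypothetical):

```python
# Tracking the provenance of instruction-tuning examples so the
# translated/native mix can be monitored. Records are hypothetical.

instruction_data = [
    {"instruction": "(written natively by a target-language speaker)",
     "response": "(native response)",
     "source": "native"},
    {"instruction": "(machine-translated from an English dataset)",
     "response": "(translated response)",
     "source": "translated"},
]

def mix_report(examples):
    """Count examples per provenance; useful for balancing the mix."""
    counts = {}
    for ex in examples:
        counts[ex["source"]] = counts.get(ex["source"], 0) + 1
    return counts

print(mix_report(instruction_data))
```

Tagging provenance this way lets you upweight natively written examples, which projects like Aya (discussed below) have found produce more culturally appropriate behavior.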

Key Insight

The full pipeline for language adaptation follows three stages: (1) vocabulary extension to improve tokenization efficiency, (2) continued pre-training on target language data to build language understanding, and (3) cross-lingual instruction tuning to enable instruction following. Each stage addresses a different aspect of the problem. Skipping vocabulary extension leads to high per-token costs; skipping continued pre-training leads to poor fluency; skipping instruction tuning leads to a model that understands the language but cannot follow user requests.

6. Multilingual Model Families

Several model families have been designed with multilingual capability as a core goal rather than an afterthought:

| Model | Languages | Strength | Approach |
|---|---|---|---|
| Qwen 2.5 | 29+ | CJK languages, English | Large-scale multilingual pre-training with balanced sampling |
| Aya (Cohere) | 101 | Broad coverage, instruction-tuned | Community-sourced multilingual instruction data from native speakers |
| BLOOM | 46+ | African and Southeast Asian languages | Deliberate inclusion of underrepresented languages in pre-training |
| Gemma 3 | 35+ | European and Asian languages | Curated multilingual data with quality filtering per language |
| Llama 3.1 | 8 | Major world languages | High quality on supported languages; limited coverage |
📝 Note: The Aya Initiative

Cohere's Aya project is notable for its community-driven approach to multilingual AI. Rather than relying solely on web-scraped data or machine translation, Aya recruited native speakers from over 100 countries to create instruction-following examples in their own languages. This produces more natural, culturally appropriate training data than translation-based approaches. The resulting Aya 101 model covers more languages than any other open instruction-tuned model, though it remains smaller than frontier models in raw capability.

Section 7.4 Quiz

1. What is cross-lingual transfer, and what mechanisms enable it in multilingual models?
Reveal Answer
Cross-lingual transfer is the phenomenon where knowledge learned in one language becomes available in other languages within the same model. It is enabled by three mechanisms: (1) shared vocabulary through subword tokenizers that create overlapping tokens across languages (cognates, numbers, shared scripts), (2) structural alignment where languages with similar syntax develop aligned internal representations, and (3) conceptual universals where language-agnostic concepts (numbers, spatial relationships, causality) force the model to develop shared representations.
2. Explain the "curse of multilinguality" and its implications for model design.
Reveal Answer
The curse of multilinguality is the trade-off in multilingual models where, for a fixed model capacity, adding more languages improves low-resource language performance (through transfer from high-resource languages) but degrades high-resource language performance (because capacity is shared across more languages). The implication is that larger models suffer less from this curse because they have more capacity to represent all languages without severe interference. This is why frontier models with hundreds of billions of parameters dramatically outperform smaller models on non-English tasks.
3. Why is tokenization efficiency a critical issue for multilingual LLMs, and what are its practical consequences?
Reveal Answer
Tokenizers trained primarily on English fragment text in other scripts (Thai, Khmer, Amharic, etc.) into many small tokens. A single word that takes 1 to 2 tokens in English may require 5 to 10 tokens in a low-resource language. The practical consequences are significant: API costs scale with token count (so users of these languages pay more), the effective context window shrinks proportionally (less information fits in the model's context), and generation speed decreases (more tokens must be generated). This "tokenization tax" compounds existing disadvantages for underserved languages.
4. Describe the three-stage pipeline for adapting an English-centric model to a new target language.
Reveal Answer
Stage 1, Vocabulary Extension: Train a new tokenizer on target language data, identify tokens not present in the base vocabulary, and add the most frequent new tokens. Resize the model's embedding matrix and initialize new embeddings. Stage 2, Continued Pre-training: Train the model on a mix of target language data (roughly 70%) and English data (roughly 30%) using a conservative learning rate to prevent catastrophic forgetting. Stage 3, Cross-lingual Instruction Tuning: Fine-tune on instruction-following examples in the target language, using a combination of translated and natively written examples. Each stage addresses a different limitation: tokenization efficiency, language understanding, and instruction following.
5. How do Western-centric cultural biases manifest in LLMs, and why is this more than just a translation problem?
Reveal Answer
Western-centric biases manifest as default assumptions about geography (assuming U.S. context), cultural norms (recommending handshakes as greetings), historical perspective (presenting events from Western viewpoints), and values (defaulting to Western individualist frameworks for moral reasoning). This is more than a translation problem because culture is not just about language. A perfectly translated response can still be culturally inappropriate. Recommending wine as a dinner party gift is linguistically correct in Arabic but may be culturally inappropriate for a Muslim host. Addressing this requires culturally adaptive systems that consider the user's cultural context, not just their language.

Key Takeaways