Section 1.4 · Module 01

Contextual Embeddings: ELMo & the Path to Transformers

Same word, different meaning: solving the fatal flaw of static embeddings

A word's meaning depends on context, they said. So does my performance review, but nobody built ELMo for that.

Context Colleen, a bidirectional thinker

Learning Objectives

After completing this section, you will be able to:

  1. Explain the polysemy problem and why static embeddings cannot solve it
  2. Describe how ELMo builds contextual embeddings from a bidirectional language model
  3. Explain why different LSTM layers capture different linguistic information
  4. Describe the pre-train-then-fine-tune paradigm and why it became dominant
  5. Compare ELMo with static embeddings and Transformer-based models

The Polysemy Problem

Word2Vec, GloVe, and FastText share a fundamental limitation: each word gets exactly one vector, regardless of context. But language is deeply contextual.

Consider the word "bank":

  1. "I deposited money at the bank." (financial institution)
  2. "We sat on the river bank." (edge of a river)
  3. "Don't bank on it." (to rely on)

With Word2Vec, all three uses map to the same vector: a compromise that captures none of the meanings well. This is called the polysemy problem.

[Diagram: static embeddings (Word2Vec) map "deposited money at the bank", "sat on the river bank", and "don't bank on it" all to the SAME vector; contextual embeddings (ELMo/BERT) give "bank" a different vector in each context: financial, riverbank, to rely on.]
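The distinction can be sketched in a few lines of Python (a toy illustration with made-up two-dimensional vectors, not a real model):

```python
import numpy as np

# Toy static embedding table: one fixed vector per word (values are made up)
static = {"bank": np.array([0.5, 0.5]), "river": np.array([0.9, 0.1])}

def static_embed(sentence, word):
    """Static lookup: the sentence is ignored entirely."""
    return static[word]

v1 = static_embed("deposited money at the bank", "bank")
v2 = static_embed("sat on the river bank", "bank")
print(np.array_equal(v1, v2))  # True: identical vectors despite different senses
```

A contextual model, by contrast, feeds the whole sentence through the network, so the vector for "bank" depends on its neighbors (the BERT demo later in this section shows this with real embeddings).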

Quick Check: Polysemy Spotting

How many different meanings does the word "run" have in these sentences? Would Word2Vec give them different or identical vectors?

  1. "I went for a run this morning." (exercise)
  2. "There was a run on the bank." (financial panic)
  3. "She has a run in her stockings." (tear in fabric)
  4. "The program takes a long time to run." (execute)
Answer:

Four completely different meanings, but Word2Vec assigns a single vector to all of them. That vector would be a blurry average of all four meanings, capturing none of them well. This is exactly the problem contextual embeddings solve.

ELMo: Embeddings from Language Models (2018)

ELMo (Peters et al., 2018) was the first widely successful contextual embedding model. The key idea: run the entire sentence through a deep bidirectional LSTM, and use the hidden states as word representations. Since the LSTM has seen the whole sentence, each word's representation is influenced by its context.

How it works:

  1. Train a bidirectional language model (forward LSTM + backward LSTM) on a large corpus
  2. For each word in a sentence, extract hidden states from all layers
  3. The ELMo embedding is a learned weighted combination of all layer representations
[Diagram: the sentence "The river bank is …" passes through the embedding layer (Layer 0) and then a forward and a backward LSTM; the final representation is a weighted combination of all layers:]

ELMo("bank") = α₀h₀ + α₁h₁ + α₂h₂
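The weighted combination in step 3 can be computed directly. A minimal NumPy sketch, using random vectors as stand-ins for the three layer states h₀, h₁, h₂ (in the real model the raw scores s and the scalar γ are learned per downstream task):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Stand-ins for the three layer representations of one word position
h = [rng.standard_normal(dim) for _ in range(3)]  # h0, h1, h2

s = np.array([0.2, 1.0, 0.5])        # raw per-layer scores (learned per task)
alpha = np.exp(s) / np.exp(s).sum()  # softmax-normalize to get the α weights
gamma = 1.0                          # task-specific scaling scalar

# ELMo("word") = γ · Σ_j α_j · h_j
elmo = gamma * sum(a * hj for a, hj in zip(alpha, h))
print(elmo.shape)  # (4,)
```

Because the α weights are softmax-normalized, they always sum to 1; the model just learns how to divide attention among the layers.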

The breakthrough: "bank" in "river bank" now gets a different vector than "bank" in "bank account", because the LSTM hidden states are conditioned on the entire sentence.

Why Different Layers Capture Different Information

A remarkable finding from the ELMo paper: different layers of the LSTM capture different types of linguistic information:

Layer 2 (Semantics): word sense, topic, meaning in context
Layer 1 (Syntax): part-of-speech, grammatical role, phrase structure
Layer 0 (Morphology): word identity, character patterns, inflection

This is why ELMo uses a weighted combination of all layers (the α weights in the diagram above): different downstream tasks benefit from different layers. A POS tagger might weight Layer 1 heavily, while a sentiment classifier might rely more on Layer 2. The weights are learned during fine-tuning for each specific task.
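In a fine-tuning setup those mixing weights are ordinary trainable parameters. A minimal PyTorch sketch (the `ScalarMix` name and the shapes are illustrative, not ELMo's actual implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned softmax-weighted sum of layer representations (ELMo-style)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))  # s_j, learned per task
        self.gamma = nn.Parameter(torch.ones(1))             # task-specific scale

    def forward(self, layers):  # layers: (num_layers, seq_len, dim)
        alpha = torch.softmax(self.scores, dim=0)            # α weights sum to 1
        return self.gamma * (alpha[:, None, None] * layers).sum(dim=0)

mix = ScalarMix(num_layers=3)
layers = torch.randn(3, 5, 8)  # 3 layers, 5 tokens, 8-dim hidden states
out = mix(layers)
print(out.shape)  # torch.Size([5, 8])
```

During fine-tuning, gradients flow into `scores`, so a POS tagger can learn to up-weight the syntax layer while a sentiment classifier up-weights the semantic layer.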

Historical Context

ELMo improved the state of the art on every NLP benchmark it was tested on: question answering, sentiment analysis, NER, coreference resolution, and more. The gains were typically several percentage points of absolute improvement, which was enormous by the standards of 2018. This proved definitively that contextual representations were the future, and set the stage for BERT just months later.

The Paradigm Shift: Pre-train, Then Fine-tune

ELMo introduced what would become the dominant paradigm in NLP: pre-train a model on a large unlabeled corpus (learning general language understanding), then fine-tune or use the representations for specific tasks. This is exactly what BERT, GPT, and all modern LLMs do, just at a much larger scale.
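The paradigm can be sketched in PyTorch with a stand-in encoder: the pre-trained weights are frozen and only a small task-specific head is trained (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (in practice: ELMo's biLM or BERT)
encoder = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False   # freeze: the pre-trained weights stay fixed

head = nn.Linear(32, 2)       # small task head, trained from scratch

x = torch.randn(4, 10, 16)    # batch of 4 sequences, 10 tokens, 16-dim inputs
with torch.no_grad():
    features, _ = encoder(x)  # contextual features from the frozen encoder
logits = head(features[:, -1])  # classify from the last hidden state

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable head params: {trainable}, frozen encoder params: {frozen}")
```

Only the handful of head parameters need task-specific labels; the bulk of the model's knowledge comes from unlabeled pre-training data.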

Contextual Embeddings in Code

While ELMo itself is rarely used directly today (BERT and transformers have superseded it), we can demonstrate the concept of contextual embeddings using Hugging Face Transformers, which makes it easy to extract hidden states from any model:

# Demonstrating contextual embeddings: same word, different vectors
from transformers import AutoTokenizer, AutoModel
import torch
from scipy.spatial.distance import cosine

# Load a BERT model (the modern successor to ELMo)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_word_embedding(sentence, word):
    """Extract the contextual embedding for a specific word in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the token position for our word
    # (assumes the word survives tokenization as a single token,
    #  which holds for common words like "bank")
    tokens = tokenizer.tokenize(sentence)
    word_idx = tokens.index(word) + 1  # +1 to skip the [CLS] token

    # Return the hidden state at that position
    return outputs.last_hidden_state[0, word_idx].numpy()

# "bank" in two different contexts
bank_river = get_word_embedding("I sat by the river bank", "bank")
bank_money = get_word_embedding("I went to the bank to deposit money", "bank")

# Measure how different the two "bank" vectors are
distance = cosine(bank_river, bank_money)
print(f"Cosine distance between 'bank' in different contexts: {distance:.3f}")
# Output: ~0.35: substantially different vectors for the same word!

# Compare: "bank" (river) is closer to "shore" than to "bank" (money)
shore = get_word_embedding("We walked along the shore", "shore")
print(f"Distance bank(river) to shore:      {cosine(bank_river, shore):.3f}")
print(f"Distance bank(river) to bank(money): {cosine(bank_river, bank_money):.3f}")
# bank(river) is CLOSER to "shore" than to bank(money)!
# This is exactly what contextual embeddings solve.

Why We Used BERT Instead of ELMo

ELMo (2018) proved the concept, but BERT (2018, released just months later) does the same thing better and faster using transformers instead of LSTMs. Both produce contextual embeddings; the code above works identically with either. We use BERT here because it is readily available via Hugging Face and is the tool you would actually use in practice. The concept (same word gets different vectors in different contexts) is ELMo's contribution; the implementation is modern.

From ELMo to Transformers: What Changed

Let us compare the approaches side by side to see the progression clearly:

| Property | Word2Vec / GloVe | ELMo | BERT / GPT (next modules) |
| --- | --- | --- | --- |
| Context-aware? | No (static) | Yes (bi-LSTM) | Yes (self-attention) |
| Pre-trained? | Yes | Yes | Yes (much larger scale) |
| Architecture | Shallow network | Deep bi-LSTM | Transformer |
| Handles polysemy? | No | Yes | Yes (better) |
| Parallelizable? | N/A | No (sequential) | Yes (all at once) |
| Long-range context? | Window only | Limited by LSTM memory | Full sequence via attention |

The key limitation of ELMo was its reliance on LSTMs, which process text sequentially (one word at a time). This made training slow and limited the model's ability to capture very long-range dependencies. The Transformer architecture (Module 4) solves this by processing all words simultaneously using self-attention.
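The difference shows up directly in code: an LSTM-style model must loop over time steps one by one, while self-attention handles all positions in a single matrix multiplication. A minimal sketch with random inputs (the recurrence and attention here are simplified stand-ins, not full implementations):

```python
import torch

seq_len, dim = 6, 8
x = torch.randn(seq_len, dim)

# LSTM-style: a sequential loop; step t cannot start until step t-1 is done
W = torch.randn(dim, dim) * 0.1
h = torch.zeros(dim)
states = []
for t in range(seq_len):
    h = torch.tanh(x[t] + h @ W)   # each state depends on the previous one
    states.append(h)

# Attention-style: every token attends to every other token at once
scores = x @ x.T / dim ** 0.5      # all pairwise similarities in one matmul
weights = torch.softmax(scores, dim=-1)
context = weights @ x              # all positions computed simultaneously
print(context.shape)  # torch.Size([6, 8])
```

The loop has an unavoidable length-`seq_len` dependency chain; the attention path is three matrix operations that a GPU can parallelize across all positions.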

Summary: The Representation Journey

This module traced the evolution of how we represent text for machines. Each step solved a problem that the previous approach could not handle:

Bag-of-Words / TF-IDF (sparse, no semantics, no order)
    → "Need meaning!"
Word2Vec / GloVe (dense, semantic, but static)
    → "Need context!"
ELMo (contextual, but sequential)
    → "Need speed & scale!"
Transformers (BERT, GPT) (contextual, parallel, scalable)
    → "Scale it up!"
Large Language Models (GPT-4, Claude, Llama): billions of parameters, emergent abilities, general intelligence

The Thread That Connects Everything

The entire history of NLP can be read as a quest for better representations of meaning. Each breakthrough, from TF-IDF to Word2Vec to ELMo to Transformers, made the representation denser (fewer dimensions, more information per number), more contextual (same word, different meaning in different contexts), and more general (works across tasks without task-specific engineering). Understanding this trajectory is the key to understanding where the field is heading next.

What is Next

In Module 2, we will explore tokenization, the critical first step that determines how text is broken into pieces before being fed to any model. You will learn the BPE algorithm that powers GPT-4 and Llama, and understand why tokenizer choice affects everything from model quality to API cost.

In Module 3, we will dive deep into attention, the mechanism that solved the sequential bottleneck of RNNs and enabled the Transformer revolution.

And in Module 4, we will build a complete Transformer from scratch in PyTorch, the architecture that underpins every model we will work with for the rest of the course.

* * *

What You Built in This Chapter

You now have hands-on experience with every major text representation technique from the past 30 years. Not bad for one chapter.

Exercises & Self-Check Questions

How to use these exercises: The conceptual questions test your understanding of the why behind each technique. Try answering them in your own words before moving on. The coding exercises are hands-on challenges you should run in a Jupyter notebook.

Conceptual Questions

  1. Representation evolution: In your own words, explain why the transition from sparse vectors (BoW) to dense vectors (Word2Vec) was such a big deal. What specific problems did it solve, and what new capabilities did it unlock?
  2. The distributional hypothesis: The phrase "you shall know a word by the company it keeps" is the foundation of Word2Vec. Can you think of cases where this assumption breaks down? (Hint: think about antonyms. "Hot" and "cold" appear in very similar contexts...)
  3. Static vs. contextual: Give three sentences where the word "play" means different things. Explain why Word2Vec would struggle with these but ELMo would handle them well.
  4. Why pre-train? ELMo was pre-trained on a large corpus, then used for specific tasks. Why is this better than training a model from scratch for each task? What does the pre-training capture that task-specific training would miss?
  5. Trade-offs: A colleague argues "TF-IDF is obsolete; just use embeddings for everything." Give two scenarios where TF-IDF would actually be the better choice, and explain why.

Coding Exercises

  1. Preprocessing exploration: Take a paragraph from a news article. Run it through the preprocessing pipeline from Section 1.2. Then experiment: what happens if you do not remove stop words? What if you use stemming instead of lemmatization? How do the resulting BoW vectors differ?
  2. Analogy hunting: Using the pre-trained word2vec-google-news-300 vectors, find 5 analogies that work well (beyond king/queen) and 3 that fail. Can you explain why the failures happen?
  3. Similarity exploration: Pick 20 words from three different categories (e.g., sports, food, technology). Compute all pairwise cosine similarities. Do words within a category have higher similarity than cross-category pairs? Visualize the similarity matrix as a heatmap.
  4. Word2Vec from scratch (challenge): Implement the Skip-gram model with negative sampling in pure PyTorch (no Gensim). Train it on a small text corpus and verify that similar words end up with similar vectors. Compare your results with Gensim's output.

Further Reading

| Topic | Paper / Resource | Why Read It |
| --- | --- | --- |
| Word2Vec | Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) | The original paper. Surprisingly short and readable. |
| GloVe | Pennington et al., "GloVe: Global Vectors for Word Representation" (2014) | Elegant math showing why co-occurrence ratios encode meaning. |
| FastText | Bojanowski et al., "Enriching Word Vectors with Subword Information" (2017) | The subword approach that later influenced BPE tokenizers. |
| ELMo | Peters et al., "Deep contextualized word representations" (2018) | The paper that proved contextual embeddings work on every task. |
| Word2Vec explained | Jay Alammar, "The Illustrated Word2Vec" | The best visual explanation of Word2Vec on the internet. |
| Embeddings theory | Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (2014) | Shows that Skip-gram with negative sampling implicitly factorizes a word-context PMI matrix. |

Where This Leads Next

You now understand how text becomes numbers: from sparse one-hot vectors to dense Word2Vec embeddings to contextual ELMo representations. But all these methods assumed words were already given to you. In Module 02: Tokenization and Subword Models, you will discover that the choice of how to split text into units is itself a critical design decision. BPE, WordPiece, and Unigram determine the atoms of the model's world, and those atoms affect everything from multilingual performance to arithmetic reasoning to API cost.