Section 1.4 · Module 01

Contextual Embeddings: ELMo & the Path to Transformers

Same word, different meaning: solving the fatal flaw of static embeddings

A word's meaning depends on context, they said. So does my performance review, but nobody built ELMo for that.

Context Colleen, a bidirectional thinker

Learning Objectives

After completing this section, you will be able to:

  1. Explain the polysemy problem and why static embeddings cannot solve it
  2. Describe how ELMo builds contextual embeddings from a bidirectional language model
  3. Explain why different LSTM layers capture different linguistic information
  4. Describe the pre-train-then-fine-tune paradigm and why it became dominant
  5. Compare ELMo with static embeddings and Transformer-based models

The Polysemy Problem

Word2Vec, GloVe, and FastText share a fundamental limitation: each word gets exactly one vector, regardless of context. But language is deeply contextual.

Consider the word "bank":

  1. "I deposited money at the bank." (financial institution)
  2. "We sat on the river bank." (edge of a river)
  3. "Don't bank on it." (to rely on)

With Word2Vec, all three uses map to the same vector: a compromise that captures none of the meanings well. This is called the polysemy problem.

[Diagram: static embeddings (Word2Vec) map "deposited money at the bank", "sat on the river bank", and "don't bank on it" all to the SAME vector; contextual embeddings (ELMo/BERT) give "bank" a different vector in each context: financial, riverbank, to rely on.]
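The distinction can be sketched in a few lines of Python (a toy illustration with made-up two-dimensional vectors, not a real model):

```python
import numpy as np

# Toy static embedding table: one fixed vector per word (values are made up)
static = {"bank": np.array([0.5, 0.5]), "river": np.array([0.9, 0.1])}

def static_embed(sentence, word):
    """Static lookup: the sentence is ignored entirely."""
    return static[word]

v1 = static_embed("deposited money at the bank", "bank")
v2 = static_embed("sat on the river bank", "bank")
print(np.array_equal(v1, v2))  # True: identical vectors despite different senses
```

A contextual model, by contrast, feeds the whole sentence through the network, so the vector for "bank" depends on its neighbors (the BERT demo later in this section shows this with real embeddings).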

Quick Check: Polysemy Spotting

How many different meanings does the word "run" have in these sentences? Would Word2Vec give them different or identical vectors?

  1. "I went for a run this morning." (exercise)
  2. "There was a run on the bank." (financial panic)
  3. "She has a run in her stockings." (tear in fabric)
  4. "The program takes a long time to run." (execute)
Answer:

Four completely different meanings, but Word2Vec assigns a single vector to all of them. That vector would be a blurry average of all four meanings, capturing none of them well. This is exactly the problem contextual embeddings solve.

ELMo: Embeddings from Language Models (2018)

ELMo (Peters et al., 2018) was the first widely successful contextual embedding model. The key idea: run the entire sentence through a deep bidirectional LSTM, and use the hidden states as word representations. Since the LSTM has seen the whole sentence, each word's representation is influenced by its context.

How it works:

  1. Train a bidirectional language model (forward LSTM + backward LSTM) on a large corpus
  2. For each word in a sentence, extract hidden states from all layers
  3. The ELMo embedding is a learned weighted combination of all layer representations
[Diagram: the sentence "The river bank is …" passes through the embedding layer (Layer 0) and then a forward and a backward LSTM; the final representation is a weighted combination of all layers:]

ELMo("bank") = α₀h₀ + α₁h₁ + α₂h₂
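The weighted combination in step 3 can be computed directly. A minimal NumPy sketch, using random vectors as stand-ins for the three layer states h₀, h₁, h₂ (in the real model the raw scores s and the scalar γ are learned per downstream task):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Stand-ins for the three layer representations of one word position
h = [rng.standard_normal(dim) for _ in range(3)]  # h0, h1, h2

s = np.array([0.2, 1.0, 0.5])        # raw per-layer scores (learned per task)
alpha = np.exp(s) / np.exp(s).sum()  # softmax-normalize to get the α weights
gamma = 1.0                          # task-specific scaling scalar

# ELMo("word") = γ · Σ_j α_j · h_j
elmo = gamma * sum(a * hj for a, hj in zip(alpha, h))
print(elmo.shape)  # (4,)
```

Because the α weights are softmax-normalized, they always sum to 1; the model just learns how to divide attention among the layers.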

The breakthrough: "bank" in "river bank" now gets a different vector than "bank" in "bank account", because the LSTM hidden states are conditioned on the entire sentence.

Why Different Layers Capture Different Information

A remarkable finding from the ELMo paper: different layers of the LSTM capture different types of linguistic information:

Layer 2 (Semantics): word sense, topic, meaning in context
Layer 1 (Syntax): part-of-speech, grammatical role, phrase structure
Layer 0 (Morphology): word identity, character patterns, inflection

This is why ELMo uses a weighted combination of all layers (the α weights in the diagram above): different downstream tasks benefit from different layers. A POS tagger might weight Layer 1 heavily, while a sentiment classifier might rely more on Layer 2. The weights are learned during fine-tuning for each specific task.
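In a fine-tuning setup those mixing weights are ordinary trainable parameters. A minimal PyTorch sketch (the `ScalarMix` name and the shapes are illustrative, not ELMo's actual implementation):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned softmax-weighted sum of layer representations (ELMo-style)."""
    def __init__(self, num_layers):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))  # s_j, learned per task
        self.gamma = nn.Parameter(torch.ones(1))             # task-specific scale

    def forward(self, layers):  # layers: (num_layers, seq_len, dim)
        alpha = torch.softmax(self.scores, dim=0)            # α weights sum to 1
        return self.gamma * (alpha[:, None, None] * layers).sum(dim=0)

mix = ScalarMix(num_layers=3)
layers = torch.randn(3, 5, 8)  # 3 layers, 5 tokens, 8-dim hidden states
out = mix(layers)
print(out.shape)  # torch.Size([5, 8])
```

During fine-tuning, gradients flow into `scores`, so a POS tagger can learn to up-weight the syntax layer while a sentiment classifier up-weights the semantic layer.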

Historical Context

ELMo improved the state of the art on every NLP benchmark it was tested on: question answering, sentiment analysis, NER, coreference resolution, and more. The gains were typically several percentage points of absolute improvement, which was enormous by the standards of 2018. This proved definitively that contextual representations were the future, and set the stage for BERT just months later.

The Paradigm Shift: Pre-train, Then Fine-tune

ELMo introduced what would become the dominant paradigm in NLP: pre-train a model on a large unlabeled corpus (learning general language understanding), then fine-tune or use the representations for specific tasks. This is exactly what BERT, GPT, and all modern LLMs do, just at a much larger scale.
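The paradigm can be sketched in PyTorch with a stand-in encoder: the pre-trained weights are frozen and only a small task-specific head is trained (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder (in practice: ELMo's biLM or BERT)
encoder = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False   # freeze: the pre-trained weights stay fixed

head = nn.Linear(32, 2)       # small task head, trained from scratch

x = torch.randn(4, 10, 16)    # batch of 4 sequences, 10 tokens, 16-dim inputs
with torch.no_grad():
    features, _ = encoder(x)  # contextual features from the frozen encoder
logits = head(features[:, -1])  # classify from the last hidden state

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable head params: {trainable}, frozen encoder params: {frozen}")
```

Only the handful of head parameters need task-specific labels; the bulk of the model's knowledge comes from unlabeled pre-training data.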

Contextual Embeddings in Code

While ELMo itself is rarely used directly today (BERT and transformers have superseded it), we can demonstrate the concept of contextual embeddings using Hugging Face Transformers, which makes it easy to extract hidden states from any model:

# Demonstrating contextual embeddings: same word, different vectors
from transformers import AutoTokenizer, AutoModel
import torch
from scipy.spatial.distance import cosine

# Load a BERT model (the modern successor to ELMo)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_word_embedding(sentence, word):
    """Extract the contextual embedding for a specific word in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the token position for our word
    # (assumes the word survives tokenization as a single token,
    #  which holds for common words like "bank")
    tokens = tokenizer.tokenize(sentence)
    word_idx = tokens.index(word) + 1  # +1 to skip the [CLS] token

    # Return the hidden state at that position
    return outputs.last_hidden_state[0, word_idx].numpy()

# "bank" in two different contexts
bank_river = get_word_embedding("I sat by the river bank", "bank")
bank_money = get_word_embedding("I went to the bank to deposit money", "bank")

# Measure how different the two "bank" vectors are
distance = cosine(bank_river, bank_money)
print(f"Cosine distance between 'bank' in different contexts: {distance:.3f}")
# Output: ~0.35: substantially different vectors for the same word!

# Compare: "bank" (river) is closer to "shore" than to "bank" (money)
shore = get_word_embedding("We walked along the shore", "shore")
print(f"Distance bank(river) to shore:      {cosine(bank_river, shore):.3f}")
print(f"Distance bank(river) to bank(money): {cosine(bank_river, bank_money):.3f}")
# bank(river) is CLOSER to "shore" than to bank(money)!
# This is exactly what contextual embeddings solve.

Why We Used BERT Instead of ELMo

ELMo (2018) proved the concept, but BERT (2018, released just months later) does the same thing better and faster using transformers instead of LSTMs. Both produce contextual embeddings; the code above works identically with either. We use BERT here because it is readily available via Hugging Face and is the tool you would actually use in practice. The concept (same word gets different vectors in different contexts) is ELMo's contribution; the implementation is modern.

From ELMo to Transformers: What Changed

Let us compare the approaches side by side to see the progression clearly:

| Property | Word2Vec / GloVe | ELMo | BERT / GPT (next modules) |
| --- | --- | --- | --- |
| Context-aware? | No (static) | Yes (bi-LSTM) | Yes (self-attention) |
| Pre-trained? | Yes | Yes | Yes (much larger scale) |
| Architecture | Shallow network | Deep bi-LSTM | Transformer |
| Handles polysemy? | No | Yes | Yes (better) |
| Parallelizable? | N/A | No (sequential) | Yes (all at once) |
| Long-range context? | Window only | Limited by LSTM memory | Full sequence via attention |

The key limitation of ELMo was its reliance on LSTMs, which process text sequentially (one word at a time). This made training slow and limited the model's ability to capture very long-range dependencies. The Transformer architecture (Module 4) solves this by processing all words simultaneously using self-attention.
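The difference shows up directly in code: an LSTM-style model must loop over time steps one by one, while self-attention handles all positions in a single matrix multiplication. A minimal sketch with random inputs (the recurrence and attention here are simplified stand-ins, not full implementations):

```python
import torch

seq_len, dim = 6, 8
x = torch.randn(seq_len, dim)

# LSTM-style: a sequential loop; step t cannot start until step t-1 is done
W = torch.randn(dim, dim) * 0.1
h = torch.zeros(dim)
states = []
for t in range(seq_len):
    h = torch.tanh(x[t] + h @ W)   # each state depends on the previous one
    states.append(h)

# Attention-style: every token attends to every other token at once
scores = x @ x.T / dim ** 0.5      # all pairwise similarities in one matmul
weights = torch.softmax(scores, dim=-1)
context = weights @ x              # all positions computed simultaneously
print(context.shape)  # torch.Size([6, 8])
```

The loop has an unavoidable length-`seq_len` dependency chain; the attention path is three matrix operations that a GPU can parallelize across all positions.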

Summary: The Representation Journey

This module traced the evolution of how we represent text for machines. Each step solved a problem that the previous approach could not handle:

Bag-of-Words / TF-IDF (sparse, no semantics, no order)
    → "Need meaning!"
Word2Vec / GloVe (dense, semantic, but static)
    → "Need context!"
ELMo (contextual, but sequential)
    → "Need speed & scale!"
Transformers (BERT, GPT) (contextual, parallel, scalable)
    → "Scale it up!"
Large Language Models (GPT-4, Claude, Llama): billions of parameters, emergent abilities, general intelligence

The Thread That Connects Everything

The entire history of NLP can be read as a quest for better representations of meaning. Each breakthrough, from TF-IDF to Word2Vec to ELMo to Transformers, made the representation denser (fewer dimensions, more information per number), more contextual (same word, different meaning in different contexts), and more general (works across tasks without task-specific engineering). Understanding this trajectory is the key to understanding where the field is heading next.

What is Next

In Module 2, we will explore tokenization, the critical first step that determines how text is broken into pieces before being fed to any model. You will learn the BPE algorithm that powers GPT-4 and Llama, and understand why tokenizer choice affects everything from model quality to API cost.

In Module 3, we will dive deep into attention, the mechanism that solved the sequential bottleneck of RNNs and enabled the Transformer revolution.

And in Module 4, we will build a complete Transformer from scratch in PyTorch, the architecture that underpins every model we will work with for the rest of the course.

* * *

What You Built in This Chapter

You now have hands-on experience with every major text representation technique from the past 30 years. Not bad for one chapter.

Exercises & Self-Check Questions

How to use these exercises: The conceptual questions test your understanding of the why behind each technique. Try answering them in your own words before moving on. The coding exercises are hands-on challenges you should run in a Jupyter notebook.

Conceptual Questions

  1. Representation evolution: In your own words, explain why the transition from sparse vectors (BoW) to dense vectors (Word2Vec) was such a big deal. What specific problems did it solve, and what new capabilities did it unlock?
  2. The distributional hypothesis: The phrase "you shall know a word by the company it keeps" is the foundation of Word2Vec. Can you think of cases where this assumption breaks down? (Hint: think about antonyms. "Hot" and "cold" appear in very similar contexts...)
  3. Static vs. contextual: Give three sentences where the word "play" means different things. Explain why Word2Vec would struggle with these but ELMo would handle them well.
  4. Why pre-train? ELMo was pre-trained on a large corpus, then used for specific tasks. Why is this better than training a model from scratch for each task? What does the pre-training capture that task-specific training would miss?
  5. Trade-offs: A colleague argues "TF-IDF is obsolete; just use embeddings for everything." Give two scenarios where TF-IDF would actually be the better choice, and explain why.

Coding Exercises

  1. Preprocessing exploration: Take a paragraph from a news article. Run it through the preprocessing pipeline from Section 1.2. Then experiment: what happens if you do not remove stop words? What if you use stemming instead of lemmatization? How do the resulting BoW vectors differ?
  2. Analogy hunting: Using the pre-trained word2vec-google-news-300 vectors, find 5 analogies that work well (beyond king/queen) and 3 that fail. Can you explain why the failures happen?
  3. Similarity exploration: Pick 20 words from three different categories (e.g., sports, food, technology). Compute all pairwise cosine similarities. Do words within a category have higher similarity than cross-category pairs? Visualize the similarity matrix as a heatmap.
  4. Word2Vec from scratch (challenge): Implement the Skip-gram model with negative sampling in pure PyTorch (no Gensim). Train it on a small text corpus and verify that similar words end up with similar vectors. Compare your results with Gensim's output.

Further Reading

| Topic | Paper / Resource | Why Read It |
| --- | --- | --- |
| Word2Vec | Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" (2013) | The original paper. Surprisingly short and readable. |
| GloVe | Pennington et al., "GloVe: Global Vectors for Word Representation" (2014) | Elegant math showing why co-occurrence ratios encode meaning. |
| FastText | Bojanowski et al., "Enriching Word Vectors with Subword Information" (2017) | The subword approach that later influenced BPE tokenizers. |
| ELMo | Peters et al., "Deep contextualized word representations" (2018) | The paper that proved contextual embeddings work on every task. |
| Word2Vec explained | Jay Alammar, "The Illustrated Word2Vec" | The best visual explanation of Word2Vec on the internet. |
| Embeddings theory | Levy & Goldberg, "Neural Word Embedding as Implicit Matrix Factorization" (2014) | Shows that Skip-gram with negative sampling implicitly factorizes a word-context PMI matrix. |

Where This Leads Next

You now understand how text becomes numbers: from sparse one-hot vectors to dense Word2Vec embeddings to contextual ELMo representations. But all these methods assumed words were already given to you. In Module 02: Tokenization and Subword Models, you will discover that the choice of how to split text into units is itself a critical design decision. BPE, WordPiece, and Unigram determine the atoms of the model's world, and those atoms affect everything from multilingual performance to arithmetic reasoning to API cost.