Section 1.1 · Module 01

Introduction to NLP & the LLM Revolution

From hand-written rules to machines that write poetry: the four paradigm shifts that shaped modern AI

Before transformers, we parsed sentences with rules and prayers. The prayers had slightly better recall.

Legacy Louise, a retired regex engine

Learning Objectives

After completing this section, you will be able to:

  1. Name the four eras of NLP and explain the representation breakthrough behind each transition
  2. Identify the core NLP task families and distinguish understanding tasks from generation tasks
  3. Explain why language is a uniquely hard domain, citing ambiguity, context-dependence, and compositionality
  4. Trace the "representation thread" from sparse word counts to contextual embeddings

The Story of NLP

Try this thought experiment. Open ChatGPT or Claude and type: "Explain quantum entanglement using only words a five-year-old would understand, but make it scientifically accurate." In two seconds, you will get a response that is creative, coherent, factually grounded, and tailored to an audience you specified. A decade ago, this was science fiction. Today, it runs on your phone.

How did we get here? That is the story of Natural Language Processing (NLP), the field of AI that teaches machines to understand, generate, and reason about human language. This chapter traces that story from its humble beginnings to the present day, and along the way, you will build the foundational skills that everything else in this course rests on.

But here is the thing: language is arguably the hardest problem in AI. While computer vision "solved" object recognition to superhuman levels by 2015, and game-playing AI mastered chess and Go, language understanding remained stubbornly difficult until very recently. The reason is that language requires simultaneously handling multiple layers of complexity. It is ambiguous ("I saw her duck" could mean she lowered her head, or that I saw her pet duck), it is context-dependent ("It's cold" means something different in a weather conversation versus a detective story), and it is infinitely composable (you can construct sentences that have never been written before, and humans will understand them instantly).

The Big Picture

This entire course is a journey through one central question: How do we represent language in a form that machines can work with? Every breakthrough in NLP, from bag-of-words to transformers to ChatGPT, is fundamentally an answer to this question. The better our representation, the more capable our systems become.

The Four Eras of NLP

In Module 00, you built neural networks and trained them with gradient descent. Now we apply those tools to the hardest domain of all: human language. NLP has undergone four major paradigm shifts. Understanding why each transition happened is key to understanding where we are today.

The four eras at a glance (timeline figure, rendered here as a list):

  1. Rule-Based (1950s to 1980s): hand-written rules
  2. Statistical (1990s to 2000s): word counts
  3. Neural (2013 to 2017): dense vectors
  4. LLM Era (2017 to present): contextual vectors

Each era was driven by a representation breakthrough.

Era 1: Rule-Based NLP (1950s to 1980s)

The earliest NLP systems were hand-crafted rules. Linguists would write grammars like S → NP VP (a sentence is a noun phrase followed by a verb phrase) and build parsers to decompose text. ELIZA (1966), the famous chatbot, used pattern matching: if the user says "I feel X", respond with "Why do you feel X?"
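A few lines of Python are enough to sketch the ELIZA idea. The rules and responses below are illustrative stand-ins, not Weizenbaum's actual script:

```python
import re

# Two ELIZA-style rules: a pattern to match and a response template.
# These rules and responses are invented for illustration.
RULES = [
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]

def respond(utterance: str) -> str:
    """Return the first matching rule's response, or a generic fallback."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip("."))
    return "Please tell me more."

print(respond("I feel sad about my code."))  # Why do you feel sad about my code?
```

The fallback line hints at the core weakness: anything the rules do not anticipate gets a canned non-answer, which is exactly why this approach could not scale.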

Why it failed to scale: Language has too many exceptions. You cannot write enough rules to cover the full complexity of natural language. Every new domain (legal, medical, informal chat) required starting over from scratch.

Era 2: Statistical NLP (1990s to 2000s)

Instead of writing rules, let the machine learn patterns from data. Statistical models like Hidden Markov Models (HMMs) for part-of-speech tagging, Naive Bayes for text classification, and phrase-based statistical machine translation (Google Translate circa 2006) dominated this era.

The representation was still shallow: documents were bags of word counts, and features were hand-engineered (bigrams, POS tags, etc.).

Why it hit a ceiling: Feature engineering was labor-intensive and domain-specific. Models could not capture long-range dependencies or deep semantic meaning. "The movie was not bad" was hard to classify correctly because "not" and "bad" are separate features.
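The order-blindness of this representation is easy to demonstrate. Here is a minimal bag-of-words sketch using Python's `collections.Counter` (whitespace tokenization is a deliberate simplification, typical of early pipelines):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase and split on whitespace: crude tokenization.
    return Counter(text.lower().split())

# Opposite meanings, identical representations: word order is discarded.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True

# "not" and "bad" become separate, unrelated counts, so a linear
# classifier struggles to learn that "not bad" flips the sentiment.
print(bag_of_words("the movie was not bad"))
```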

Era 3: Neural NLP (2013 to 2017)

The game changed when Tomas Mikolov and his colleagues published Word2Vec in 2013. Instead of relying on hand-crafted features, neural networks could learn dense vector representations of words directly from data. For the first time, "king" and "queen" were mathematically close in vector space.
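What "mathematically close" means is cosine similarity between vectors. The 3-dimensional vectors below are toy values invented for illustration; real Word2Vec vectors are learned from data and typically 300-dimensional:

```python
import math

# Toy word vectors (illustrative values only, not trained embeddings).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine(vectors["king"], vectors["apple"]))  # much smaller
```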

Recurrent Neural Networks (RNNs, LSTMs) could process entire sequences word by word, maintaining a "memory" of what came before. Sequence-to-sequence models with attention enabled neural machine translation that beat statistical systems. The key advantage: instead of translating phrase by phrase (the statistical approach), neural models could consider the entire source sentence when generating each target word, producing more fluent and coherent translations.

Why it was not enough: RNNs process text sequentially (one word at a time), making them slow to train and bad at capturing very long-range dependencies. A sentence that starts with "The cat, which sat on the mat that was in the house that Jack built, ..." loses information about "The cat" by the time the model reaches the end.
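The fading-memory problem can be seen in a toy single-unit RNN. The weights below are arbitrary illustration values, not trained parameters:

```python
import math

w_in, w_rec = 0.5, 0.9  # input and recurrent weights (arbitrary values)

def rnn(inputs):
    """Fold a sequence into one hidden state, one step at a time."""
    h = 0.0
    for x in inputs:  # strictly sequential: step t needs step t-1's result
        h = math.tanh(w_in * x + w_rec * h)
    return h

# The first input's influence shrinks at every subsequent step, so a
# long tail of later tokens washes out information from the start.
print(abs(rnn([1.0, 0.0, 0.0])))       # retains some trace of x0
print(abs(rnn([1.0] + [0.0] * 50)))    # far less of x0 survives
```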

Era 4: The LLM Era (2017 to Present)

In 2017, the paper "Attention Is All You Need" introduced the Transformer architecture, which processes all words in parallel using self-attention. This removed the sequential bottleneck of RNNs and enabled training on vastly more data.
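A stripped-down sketch of self-attention in NumPy makes the contrast with RNNs concrete. Here the queries, keys, and values are all the raw input; real transformers add learned projection matrices (W_Q, W_K, W_V) and multiple heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Simplified self-attention with queries = keys = values = X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # every token scores every token at once
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X                  # weighted mix over all positions

# Three toy "token embeddings". All pairs interact in one matrix
# multiply, with no sequential loop: that is the RNN bottleneck removed.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(X)
print(out.shape)  # (3, 2): one context-mixed vector per token
```

Because the token-to-token interaction is a single matrix product rather than a loop, distance between positions no longer matters and the whole computation parallelizes on GPUs.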

BERT (2018) showed that pre-training a transformer on massive text data and then fine-tuning it on specific tasks crushed every benchmark. GPT-2 (2019) showed that language models could generate coherent paragraphs. GPT-3 (2020) revealed that scaling up (175B parameters) led to emergent abilities like in-context learning. ChatGPT (2022) and GPT-4 (2023) brought LLMs to the mainstream.

Key Insight

Each era transition was driven by a representation breakthrough: rules, then word counts, then dense vectors, then contextual vectors, then massive pre-trained language models. The quality of the representation determines the ceiling of what NLP systems can do.

Quick Check: Can You Match the Era?

For each approach below, identify which era it belongs to (rule-based, statistical, neural, or LLM):

  1. A grammar that says VERB → "eat" | "run" | "sleep"
  2. Computing P(word | previous 2 words) from a large corpus
  3. Prompting GPT-4 with "Classify this email as spam or not spam"
  4. Training a 300-dimensional vector for each word using context prediction
Reveal answers

  1. Rule-based (hand-written grammar)
  2. Statistical (n-gram language model)
  3. LLM era (in-context learning)
  4. Neural (Word2Vec)

Core NLP Tasks

Before diving deeper, let us map the landscape of problems that NLP solves. These same tasks will reappear throughout the course as we build systems with LLMs.

At the highest level, NLP tasks fall into three families based on the relationship between input and output:

| Task | Family | Input | Output | Example |
|---|---|---|---|---|
| Text Classification | Sequence classification | Document | Category label | Spam detection, topic categorization |
| Sentiment Analysis | Sequence classification | Text | Polarity score | "Great movie!" → Positive (0.95) |
| Natural Language Inference | Sequence classification | Premise + hypothesis | Entailment / contradiction / neutral | "It rained." + "The ground is wet." → Entailment |
| Named Entity Recognition | Token classification | Text | Tagged entities | "Apple [ORG] released iPhone 16 [PRODUCT]" |
| POS Tagging | Token classification | Text | Tags per token | "The/DET cat/NOUN sat/VERB" |
| Machine Translation | Seq2seq | Text in language A | Text in language B | "Hello" → "Bonjour" |
| Summarization | Seq2seq | Long document | Short summary | Condensing a 10-page report to 3 sentences |
| Question Answering | Seq2seq / extraction | Question + context | Answer span or text | "Who wrote Hamlet?" → "Shakespeare" |
| Open-ended Generation | Seq2seq | Prompt | Continuation | "Write a poem about..." → (poem) |
(Figure: task families. Understanding tasks: classification, NER, sentiment, QA. Generation tasks: translation, summarization, and open-ended generation, the hallmark of the LLM era.)

LLMs Unify Everything

Before 2018, each NLP task required a separate model with a custom architecture. Today, a single LLM like GPT-4 or Claude can perform every task in the table above (and hundreds more) with just a text prompt. This unification is one of the defining characteristics of the LLM era and is why understanding the underlying representations matters so much.

Why Language Is Hard

To appreciate why NLP has been one of AI's toughest challenges, consider these phenomena:

(Figure: the layers of language that NLP must handle all at once, from surface form to intent: morphology (word forms), syntax (grammar, structure), semantics (meaning, world knowledge), and pragmatics (social context, intent).)

Why This Matters for the Course

Every technique we will study in this course is an attempt to solve these problems. Bag-of-words ignores word order entirely; Word2Vec captures some semantics but not context; transformers handle long-range context but still struggle with world knowledge. Understanding what each technique can and cannot do is more important than memorizing how it works.

The Representation Thread

Let us step back and connect all four eras through a single lens: representation quality. Every advance in NLP has come from finding a better way to turn words into numbers.

| Era | Representation | What It Captures | What It Misses |
|---|---|---|---|
| Rule-Based | Symbolic parse trees | Grammar structure | Everything else |
| Statistical | Word counts (sparse) | Word frequency, some patterns | Meaning, word order |
| Neural | Dense vectors (~300d) | Semantic similarity | Context, polysemy |
| LLM | Contextual vectors (thousands of dims) | Meaning in context | Perfect reasoning (still improving) |
The Thread That Connects Everything

The progression is clear: representations became denser (more information packed into each dimension), more contextual (the same word gets a different vector in different sentences), and more general (they work across tasks without task-specific engineering). This module walks through each step of that progression, from bag-of-words all the way to contextual embeddings. Modules 2 through 4 will take us the rest of the way to transformers.

✔ Check Your Understanding

1. NLP tasks are broadly grouped into two categories. What are they, and how do their outputs differ?

Reveal Answer

The two broad categories are understanding tasks and generation tasks. Understanding tasks (classification, NER, sentiment analysis, QA) take text as input and produce a label, tag, or extracted span. Generation tasks (translation, summarization, open-ended generation) take text as input and produce new text as output. In the LLM era, a single model can handle both categories through prompting.

2. What is the "representation thread" that connects all four eras of NLP, and why does it matter?

Reveal Answer

The representation thread is the idea that every major NLP advance was driven by a better way of turning words into numbers. Rules gave way to word counts (statistical era), then dense vectors (neural era), then contextual vectors (LLM era). It matters because the quality of the representation sets the ceiling for what NLP systems can achieve. Better representations enable better downstream performance without needing task-specific engineering.

3. Give two specific reasons why natural language is harder for computers to process than, say, images or structured data.

Reveal Answer

First, language is ambiguous: the same sentence can have multiple valid interpretations (e.g., "I saw her duck" has two meanings). Second, language requires world knowledge that is not present in the text itself (e.g., understanding that "pen" means "playpen" in certain contexts). Other valid answers include compositionality (complex negation patterns), coreference resolution (tracking what "it" refers to), and pragmatics (understanding intent beyond literal meaning).

4. How do supervised and unsupervised approaches differ in NLP? Give one example of each.

Reveal Answer

Supervised NLP requires labeled training data where each input has a known correct output. Example: spam detection, where emails are labeled as spam or not-spam. Unsupervised NLP discovers patterns from raw text without labels. Example: Word2Vec learns word representations from unlabeled text by predicting context words. Pre-training large language models is also unsupervised (or self-supervised), since the model learns to predict the next word without human annotations.

5. Why was the Transformer architecture (2017) such a significant breakthrough compared to RNNs and LSTMs?

Reveal Answer

The Transformer replaced sequential processing with parallel self-attention, which brought two key advantages. First, it can process all words in a sequence simultaneously rather than one at a time, making training dramatically faster and enabling the use of much larger datasets. Second, self-attention allows every word to directly attend to every other word regardless of distance, solving the long-range dependency problem that plagued RNNs (where information about early words faded by the end of long sequences). These advantages enabled the massive scale-up that produced BERT, GPT, and modern LLMs.

Key Takeaways

  1. NLP has gone through four eras (rule-based, statistical, neural, LLM), each driven by a representation breakthrough that expanded what machines could do with language.
  2. Language is hard because it is ambiguous, context-dependent, and compositional. A single sentence can require world knowledge, coreference resolution, and pragmatic reasoning to interpret correctly.
  3. The core NLP tasks surveyed here (classification, NER, sentiment, translation, summarization, QA, and open-ended generation) cover most real-world applications and reappear throughout this course.
  4. Representation quality determines the ceiling. The progression from sparse word counts to dense vectors to contextual embeddings is the single most important thread in NLP history.
  5. LLMs unify NLP. Before 2018, each task needed a separate model. Today, a single pre-trained model can handle all tasks through prompting, which is the defining feature of the current era.