Module 04, Section 4.5

Transformer Expressiveness Theory

What can Transformers compute? Universal approximation, computational complexity classes, and how chain-of-thought reasoning extends their fundamental power.

We proved that Transformers can theoretically compute anything. In practice, they mostly compute the next plausible word, which is still pretty good.

Professor Proof, a complexity theorist with realistic expectations
Research-Focused Section

This section surveys theoretical results about the computational power of Transformers. The material is more abstract than the rest of this module and draws on complexity theory. It is included because these results have direct practical implications for understanding why certain tasks are easy or hard for LLMs, and why chain-of-thought prompting helps.

1. Universal Approximation

The classical universal approximation theorem for neural networks states that a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. A natural question: do Transformers have an analogous property for sequence-to-sequence functions?

1.1 Transformers as Universal Approximators of Sequence Functions

Yun et al. (2020) proved that Transformers are universal approximators of continuous sequence-to-sequence functions. Specifically, for any continuous function f : ℝ^(n×d) → ℝ^(n×d) on a compact domain and any ε > 0, there exists a Transformer network that approximates f to within ε. The key requirements are positional encodings (without them, a Transformer can only represent permutation-equivariant functions) and sufficient size: the number of layers needed grows with the desired accuracy.

This result tells us that the Transformer architecture is, in principle, capable of representing any continuous mapping from input sequences to output sequences. Of course, universal approximation says nothing about whether gradient-based training will find a good approximation, or how large the network needs to be. Still, it provides the theoretical foundation: the architecture itself is not the bottleneck.

Note: The Role of the FFN

The proof of universal approximation critically relies on the feed-forward network, not just attention. Attention alone (without the FFN) is a linear operator over the value vectors; it can only compute weighted averages. The FFN provides the nonlinear transformations needed to approximate arbitrary functions. This is one theoretical justification for why the FFN matters so much in practice, as we discussed in Section 4.1.
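To make the note concrete, here is a minimal sketch (plain Python with hypothetical toy matrices) showing that a single attention head outputs convex combinations of its value vectors: the softmax weights are non-negative and sum to one, so every output coordinate stays within the range spanned by the corresponding value coordinates. The FFN is what lets activations leave that span.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head attention. Each output row is a weighted average
    (convex combination) of the rows of V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # non-negative, sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy 3-token sequence: outputs never leave the per-coordinate range of V.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
for row in attention(Q, K, V):
    for j, x in enumerate(row):
        col = [v[j] for v in V]
        assert min(col) <= x <= max(col)
```

However many attention layers are stacked, without FFNs in between this averaging property keeps the computation far from an arbitrary nonlinear map.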

2. Transformers as Computational Models

Beyond approximation, we can ask: what computational problems can Transformers solve? This requires connecting Transformers to formal computational complexity classes.

2.1 Fixed-Depth Transformers and TC0

A foundational result by Merrill and Sabharwal (2023) shows that a fixed-depth Transformer with log-precision arithmetic (O(log n) bits per number, where n is the input length) can be simulated by uniform TC0 circuits. TC0 is the complexity class of problems solvable by constant-depth, polynomial-size threshold circuits; the result places fixed-depth Transformers inside this class, so any problem outside TC0 is beyond them.

This has concrete implications for what fixed-depth Transformers cannot do: problems believed to lie outside TC0, such as Boolean formula evaluation (NC1-complete) and graph connectivity, cannot be solved in a single forward pass unless widely believed complexity conjectures fail.

[Figure: nested complexity classes AC0 ⊆ TC0 ⊆ NC1 ⊆ P, with fixed-depth log-precision Transformers placed at TC0; example problems marked above TC0 include Boolean formula evaluation (in NC1, believed not in TC0) and graph connectivity]
Figure 4.8: The complexity class hierarchy. Fixed-depth Transformers with log-precision arithmetic are contained in TC0. Problems requiring sequential computation (NC1 and above) are believed to be beyond their reach without additional mechanisms.

2.2 The Precision Assumption Matters

The TC0 characterization assumes log-precision arithmetic: each number in the Transformer uses O(log n) bits. With unbounded precision (arbitrary-precision real numbers), known constructions let a Transformer simulate any Turing machine, making it trivially universal. This highlights that the precision of computation is a critical variable in the theoretical analysis.

In practice, Transformers use 16-bit or 32-bit floating point: a fixed precision that does not grow with input length, and for very long sequences is if anything more restrictive than the O(log n) bits the theory assumes. The practical implications of the theoretical results therefore hold approximately: tasks requiring deep sequential reasoning are indeed harder for Transformers, even though the formal impossibility results rely on specific precision assumptions.

3. Limitations: What Transformers Struggle With

The theoretical results predict specific failure modes, and these predictions are borne out empirically. Here are the key categories of problems that fixed-depth Transformers find difficult:

3.1 Inherently Sequential Computation

Problems that require a long chain of dependent computation steps are hard for Transformers because each layer can only perform a constant amount of sequential work. The total sequential depth is the number of layers, typically 32 to 96. Tasks requiring more sequential steps than this (relative to the model's precision) become problematic.

Example: multi-step arithmetic. Computing ((3 + 7) * 2 - 4) / 3 requires evaluating operations in order. Each operation depends on the result of the previous one. A fixed-depth Transformer must either learn to parallelize this computation (difficult) or represent the intermediate results within a single forward pass (limited by depth).
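As a minimal illustration (plain Python, not tied to any model), the expression above unfolds into a chain where every step consumes the previous result, which is exactly the dependency structure a constant-depth network cannot shortcut:

```python
# ((3 + 7) * 2 - 4) / 3 as a chain of dependent steps.
ops = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
    "div": lambda a, b: a / b,
}
steps = [("add", 7), ("mul", 2), ("sub", 4), ("div", 3)]

acc = 3
trace = [acc]
for name, operand in steps:
    acc = ops[name](acc, operand)  # cannot start until the previous step is done
    trace.append(acc)

print(trace)  # [3, 10, 20, 16, 5.333...]
```

Four dependent operations need four units of sequential depth, however they are expressed; a network with fewer available sequential steps must somehow compress the chain.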

3.2 Counting and Tracking State

Exact counting (e.g., "does this string have exactly 47 opening parentheses?") is difficult because log-precision attention scores cannot represent arbitrary integers precisely for long sequences. The model can approximate counting for small numbers but degrades as counts grow large.
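A small sketch (plain Python, using `struct` to round values through IEEE-754 half and single precision) shows the underlying numeric effect: once a count exceeds the significand's capacity, distinct integers collide, which mirrors why exact counting degrades at scale.

```python
import struct

def round_to(fmt, x):
    """Round a Python float through a lower-precision IEEE-754 format
    ('e' = 16-bit half precision, 'f' = 32-bit single precision)."""
    return struct.unpack(fmt, struct.pack(fmt, float(x)))[0]

# Half precision (11 significand bits): integers above 2048 start colliding.
assert round_to("e", 2048) == 2048.0
assert round_to("e", 2049) == 2048.0  # 2049 is not representable; rounds down

# Single precision (24 significand bits): collisions begin above 2**24.
assert round_to("f", 2**24) == 16777216.0
assert round_to("f", 2**24 + 1) == 16777216.0  # 16777217 rounds down
```

A model whose internal counter lives in limited-precision activations faces the same wall: beyond some magnitude, "n parentheses" and "n+1 parentheses" become indistinguishable.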

3.3 Graph Reasoning

Problems like graph connectivity, shortest paths, and cycle detection require iterative algorithms that propagate information through the graph structure. A constant-depth Transformer can only propagate information a constant number of "hops" per layer, limiting its ability to reason about graph structures that require many hops.
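The hop-limit argument can be sketched in a few lines (plain Python, toy path graph). This is a simplified intuition: attention itself is global, but following the edges of an input graph still requires composing roughly one relational hop per layer, so we model each layer as one round of neighbor-to-neighbor propagation.

```python
def reachable_within(adj, start, rounds):
    """Nodes reachable from `start` in at most `rounds` propagation steps,
    where each round (one 'layer') extends information by one hop."""
    seen = {start}
    frontier = {start}
    for _ in range(rounds):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen

# Path graph 0-1-2-...-9: connectivity needs as many rounds as the path is long.
n = 10
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
assert 9 not in reachable_within(adj, 0, 5)  # 5 rounds cannot span 9 hops
assert 9 in reachable_within(adj, 0, 9)      # 9 rounds suffice
```

With the number of rounds fixed in advance, there is always a graph with a shortest path too long to traverse, which is the constant-depth limitation in miniature.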

Practical Caveat

These theoretical limitations apply to fixed-depth, fixed-precision Transformers. Real LLMs are very large (96+ layers), have moderately high precision (BF16/FP32), and are trained on massive datasets that may help them learn approximate solutions. The theory tells us which tasks will be hard and where to expect failure modes, not that the model will always fail. Large models often "mostly work" on these tasks but exhibit characteristic errors at the boundary of their capability.

4. Chain-of-Thought Extends Computational Power

This is perhaps the most practically important theoretical result. Chain-of-thought (CoT) reasoning, where the model generates intermediate reasoning steps before producing a final answer, fundamentally changes the computational model. Instead of computing the answer in a single forward pass (constant depth), the model now performs computation proportional to the number of generated tokens.

4.1 CoT as Additional Computation Steps

When a Transformer generates T tokens of chain-of-thought reasoning, each token's generation involves a full forward pass through the model. The information from each token is fed back as input for the next, creating an effective sequential computation depth of T × N (tokens times layers). This is a fundamentally different computation than a single forward pass.

Key Insight: CoT Makes Transformers Turing-Complete

Feng et al. (2023) and others have shown that a fixed-size Transformer with chain-of-thought (unbounded generation length) can simulate any Turing machine. The generated tokens serve as the Turing machine's tape, and each forward pass computes one step of the Turing machine's transition function. This means CoT is not just a prompting trick; it provably extends the computational class of the model from TC0 to Turing-complete.
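The construction can be caricatured in a few lines. The sketch below (plain Python, a hypothetical one-rule Turing machine for binary increment, not the actual Feng et al. construction) runs one machine transition per loop iteration: the loop plays the role of autoregressive generation, and the tape plays the role of the emitted tokens.

```python
def tm_step(tape, head, state):
    """One transition of a tiny Turing machine: binary increment,
    least-significant bit first. '1' -> '0', move right; '0'/blank -> '1', halt."""
    if head == len(tape):
        tape = tape + ["_"]  # extend the tape with a blank cell
    if state == "inc":
        if tape[head] == "1":
            tape[head] = "0"
            return tape, head + 1, "inc"
        tape[head] = "1"
        return tape, head, "halt"
    return tape, head, state

def run_like_cot(bits):
    """Each iteration stands in for one 'generated token': one forward
    pass computes one transition, and the tape carries the state forward."""
    tape, head, state = list(bits), 0, "inc"
    passes = 0
    while state != "halt":
        tape, head, state = tm_step(tape, head, state)
        passes += 1
    return "".join(tape), passes

# "1011" (LSB first) encodes 13; incrementing yields 14 = "0111" (LSB first).
result, passes = run_like_cot("1011")
assert result == "0111" and passes == 2
```

Because the loop can run for as many iterations as the input demands, the overall computation is no longer bounded by a constant depth, which is the essence of the Turing-completeness result.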

4.2 Why CoT Helps on Hard Problems

Consider the task of multiplying two large numbers, say 347 × 892. A single forward pass through a Transformer must compute this in constant depth. (Multi-digit multiplication is technically in TC0, but the known constant-depth circuits are intricate, and in practice models rarely learn reliable single-pass multiplication.) With CoT, the model can instead:

  1. Break the multiplication into partial products (347 × 2, 347 × 90, 347 × 800)
  2. Compute each partial product (one or two tokens of intermediate computation each)
  3. Add the partial products (again, with intermediate steps)
  4. Produce the final answer

Each individual step is simple enough for a single forward pass. The chain of thought provides the sequential depth that a single pass cannot.
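The four steps above can be sketched directly (plain Python; the helper name `cot_multiply` is illustrative, not from any library). Each appended string stands in for a generated reasoning token carrying an intermediate result:

```python
def cot_multiply(a, b):
    """Multiply by emitting explicit intermediate steps, CoT-style:
    partial products first, then a running sum."""
    steps, partials = [], []
    for i, digit in enumerate(reversed(str(b))):
        p = a * int(digit) * 10**i
        partials.append(p)
        steps.append(f"{a} x {int(digit) * 10**i} = {p}")
    total = 0
    for p in partials:
        total += p  # each running sum builds on the previous one
        steps.append(f"running sum = {total}")
    steps.append(f"answer = {total}")
    return total, steps

total, steps = cot_multiply(347, 892)
assert total == 309524
assert steps[0] == "347 x 2 = 694"
assert steps[1] == "347 x 90 = 31230"
assert steps[2] == "347 x 800 = 277600"
```

The number of emitted steps grows with the number of digits, so the sequential budget scales with the problem instead of being fixed by the architecture.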

4.3 Implications for LLM Design and Prompting

This theoretical framework has direct practical consequences:

  1. Prompting for step-by-step reasoning is not a superficial trick; it supplies the sequential depth the architecture lacks in a single pass.
  2. The tasks identified above as hard (multi-step arithmetic, exact counting, graph reasoning) are exactly the tasks where chain-of-thought should be tried first.
  3. Harder problems need longer chains: the sequential computation budget scales with the number of generated tokens, not just with model size.

5. Open Questions

Several fundamental questions remain open in this area, most notably whether gradient-based training can actually find the solutions whose existence these expressiveness results guarantee.

Key Takeaways

  1. Transformers are universal approximators of continuous sequence-to-sequence functions, but universality says nothing about trainability or required size.
  2. Fixed-depth Transformers with log-precision arithmetic are contained in the circuit class TC0, which predicts difficulty with inherently sequential computation, exact counting, and multi-hop graph reasoning.
  3. These results depend on precision assumptions; real models use fixed 16- or 32-bit floats and approximately match the predicted failure modes.
  4. Chain-of-thought generation adds sequential depth proportional to the number of generated tokens, provably lifting the model from TC0 to Turing-complete.

Check Your Understanding

1. What does the universal approximation theorem for Transformers say, and what does it not say?

Show Answer
It says that a sufficiently large Transformer can approximate any continuous sequence-to-sequence function to arbitrary accuracy. It does not say that gradient-based training will find this approximation, how large the network needs to be, or that the approximation generalizes beyond the training distribution.

2. Why can a fixed-depth Transformer not solve arbitrary graph connectivity problems?

Show Answer
Graph connectivity requires propagating information across potentially long paths in the graph. A fixed-depth Transformer can only propagate information a constant number of steps (proportional to its depth). Directed graph connectivity is complete for NL (nondeterministic log-space), a class widely believed, though not proven, to contain problems outside TC0 (the class corresponding to fixed-depth Transformers with log-precision). For graphs with long shortest paths, the model cannot reach all nodes from a given starting point in a constant number of layers.

3. How does chain-of-thought reasoning change the computational class of a Transformer?

Show Answer
Without CoT, a fixed-depth Transformer is limited to TC0. With CoT (unbounded generation length), the model performs T forward passes sequentially (one per generated token), with information flowing from each step to the next through the generated tokens. This gives the model T × N sequential depth (tokens times layers), making it capable of simulating any Turing machine. CoT elevates the model from TC0 to Turing-complete.

4. Why does the precision assumption matter in the theoretical analysis?

Show Answer
With log-precision (O(log n) bits), a Transformer maps to TC0 and has clear limitations. With unbounded precision (arbitrary real numbers), activations can carry unboundedly many bits, and known constructions make the model trivially universal, so the limitation analysis collapses. Practical models use 16- or 32-bit precision, which is fixed rather than growing with input length. The log-precision results are more informative for understanding real limitations, even though the exact correspondence is approximate.

5. Give a practical example where CoT helps and explain why using the theory.

Show Answer
Multi-digit multiplication (e.g., 347 × 892) requires a sequence of dependent arithmetic operations: computing partial products, carrying digits, and summing results. Learning to perform all of this in a single constant-depth forward pass is empirically very difficult, and the theory predicts that such inherently sequential procedures are exactly where fixed-depth models struggle. With CoT, the model breaks the computation into simple steps (each easily within a single pass), using the generated tokens to carry intermediate results from one step to the next.

What Comes Next

You now understand the Transformer architecture: how it processes tokens, what each component contributes, and what theoretical limits it faces. In Module 05, you will learn what to do with the logits that come out of the final layer. How do we turn a probability distribution over 50,000 tokens into actual generated text? The answer is more subtle than you might expect.