Module 10 · Section 10.2

Chain-of-Thought & Reasoning Techniques

Teaching models to think step by step before answering
★ Big Picture

Why reasoning techniques matter. Standard prompting asks a model to jump directly from question to answer. For simple factual lookups, this works fine. But for multi-step reasoning, math problems, logic puzzles, and complex analysis, direct answering leads to frequent errors. Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022), showed that simply asking the model to "think step by step" before answering can dramatically improve accuracy on reasoning tasks. This section covers CoT and its successors: self-consistency (sample multiple reasoning paths and vote), Tree-of-Thought (structured exploration with backtracking), step-back prompting, and the ReAct framework that interleaves reasoning with tool use.

1. Chain-of-Thought Prompting

📚 Paper Spotlight: Wei et al. (2022)

"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" demonstrated that providing step-by-step reasoning examples in the prompt dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks. The key contribution was showing that reasoning ability is an emergent property: it appears in models above ~100B parameters and is absent in smaller models. This paper launched an entire subfield of reasoning-oriented prompt engineering.

Chain-of-Thought prompting works by encouraging the model to generate intermediate reasoning steps before producing a final answer. The mechanism is straightforward: each reasoning step conditions the generation of subsequent steps, allowing the model to "carry" information forward through the computation. Without CoT, the model must compress all reasoning into a single forward pass through the network. With CoT, the model effectively uses its own generated text as a scratchpad, offloading intermediate computation into the token sequence.

🎯 Aha Moment: CoT as Working Memory

Without CoT, the model must fit all reasoning into a single forward pass. With CoT, the model's own output tokens become working memory. Each generated reasoning step is literally fed back into the model as input for the next step, just as you would use a scratchpad to carry intermediate results. This is why CoT helps on multi-step problems: it gives the model a place to store intermediate results that its fixed-size hidden state cannot hold all at once.

1.1 Zero-Shot CoT

The simplest form of CoT requires no examples at all. Kojima et al. (2022) discovered that appending the phrase "Let's think step by step" to a prompt is sufficient to trigger reasoning behavior in large models. This zero-shot CoT approach is remarkably effective: on the MultiArith arithmetic benchmark, it improved InstructGPT's (text-davinci-002) accuracy from 17.7% to 78.7%.

import openai

client = openai.OpenAI()

def solve_with_cot(problem: str) -> str:
    """Solve a problem using zero-shot Chain-of-Thought."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": """Solve the problem step by step.
Show your reasoning clearly, then provide the final answer
on a separate line starting with "ANSWER: "."""},
            {"role": "user",
             "content": problem}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content

problem = """A store sells notebooks for $4 each. If you buy 5 or more,
you get a 20% discount on the entire purchase. Sarah buys 7 notebooks
and pays with a $50 bill. How much change does she receive?"""

print(solve_with_cot(problem))
Let me work through this step by step.
Step 1: Calculate the base price. 7 notebooks at $4 each = 7 x $4 = $28
Step 2: Check if the discount applies. Sarah is buying 7 notebooks, which is 5 or more, so the 20% discount applies.
Step 3: Calculate the discount amount. 20% of $28 = 0.20 x $28 = $5.60
Step 4: Calculate the discounted total. $28 - $5.60 = $22.40
Step 5: Calculate the change from $50. $50 - $22.40 = $27.60
ANSWER: $27.60

1.2 Few-Shot CoT

For tasks where zero-shot CoT is insufficient, providing examples of step-by-step reasoning significantly improves performance. The key insight is that the examples teach the model not just the answer format, but the style of reasoning to apply. Different reasoning styles suit different tasks: arithmetic problems benefit from sequential calculation steps, logic problems benefit from listing premises and drawing inferences, and coding problems benefit from planning before implementing.

import openai

client = openai.OpenAI()

# Few-shot CoT: provide exemplar reasoning chains
FEW_SHOT_EXAMPLES = """
Example 1:
Q: A train travels 120 miles in 2 hours. It then travels 90 miles in 1.5 hours. What is the average speed for the entire trip?
A: Let me work through this step by step.
Total distance = 120 + 90 = 210 miles
Total time = 2 + 1.5 = 3.5 hours
Average speed = total distance / total time = 210 / 3.5 = 60 mph
ANSWER: 60 mph

Example 2:
Q: A store has a "buy 2, get 1 free" deal on $6 shirts. How much do 7 shirts cost?
A: Let me work through this step by step.
For every 3 shirts, you pay for 2: that is 2 x $6 = $12 per group of 3.
7 shirts = 2 complete groups (6 shirts) + 1 remaining shirt.
Cost for 6 shirts = 2 x $12 = $24
Cost for 1 extra shirt = $6
Total = $24 + $6 = $30
ANSWER: $30
"""

def solve_with_few_shot_cot(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"""Solve problems step by step, following the style shown in these examples:
{FEW_SHOT_EXAMPLES}
Show your reasoning clearly, then provide the final answer on a line starting with "ANSWER: "."""},
            {"role": "user", "content": problem}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content

problem = """A bakery sells cupcakes for $3 each. If you buy a dozen (12),
you get a 25% discount. Maria buys 15 cupcakes. How much does she pay?"""

print(solve_with_few_shot_cot(problem))
Let me work through this step by step.
First, I need to figure out the pricing structure.
Individual price: $3 per cupcake. Dozen deal: 12 cupcakes at 25% off.
Maria buys 15 cupcakes. She can buy 1 dozen (12) at the discount, plus 3 at full price.
Cost for the dozen: 12 x $3 = $36, then 25% off: $36 x 0.75 = $27
Cost for 3 extra: 3 x $3 = $9
Total: $27 + $9 = $36
ANSWER: $36
★ Key Insight

CoT prompting improves accuracy primarily on tasks that require multi-step reasoning. For single-step tasks (simple factual recall, sentiment classification), CoT adds unnecessary tokens without improving quality and may even slightly reduce accuracy due to the additional opportunity for the model to "talk itself into" a wrong answer. The rule of thumb: if a human needs a scratchpad to solve the problem, CoT will help. If a human can answer instantly, CoT is unnecessary overhead.

[Figure: side-by-side comparison, "Direct Prompting vs. Chain-of-Thought." Direct prompting panel: the question "What is 17 x 23 + 45?" answered in a single step, which often yields an incorrect result. Chain-of-Thought panel: Step 1: 17 x 23 = 391; Step 2: 391 + 45 = 436; Answer: 436 (correct).]
Figure 10.3: Direct prompting compresses all reasoning into one step; CoT decomposes it into verifiable intermediate steps.

2. Self-Consistency

📚 Paper Spotlight: Wang et al. (2022)

"Self-Consistency Improves Chain of Thought Reasoning in Language Models" showed that sampling multiple reasoning paths and taking a majority vote substantially outperforms greedy single-path CoT. On GSM8K, self-consistency raised PaLM 540B's accuracy from 56.5% to 74.4%. The key insight: correct reasoning paths tend to converge on the same answer, while errors are scattered across many different wrong answers.

Self-consistency, introduced by Wang et al. (2022), builds on CoT by sampling multiple reasoning paths and selecting the answer that appears most frequently. The intuition is that while any single reasoning chain might contain errors, different chains are likely to make different errors. When multiple independent chains converge on the same answer, confidence in that answer increases.

2.1 Implementation

import openai
from collections import Counter

client = openai.OpenAI()

def solve_with_self_consistency(problem: str, n_samples: int = 5) -> dict:
    """Sample multiple CoT paths and take majority vote."""
    answers = []

    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": """Solve step by step. End with ANSWER: <number>"""},
                {"role": "user", "content": problem}
            ],
            temperature=0.7  # Higher temperature for diverse paths
        )
        text = response.choices[0].message.content
        # Extract the final answer
        if "ANSWER:" in text:
            answer = text.split("ANSWER:")[-1].strip()
            answers.append(answer)

    # Majority vote (guard against the case where no ANSWER line was found)
    if not answers:
        return {"answer": None, "confidence": 0.0, "all_answers": {}}
    vote_counts = Counter(answers)
    best_answer, count = vote_counts.most_common(1)[0]

    return {
        "answer": best_answer,
        "confidence": count / len(answers),
        "all_answers": dict(vote_counts)
    }

result = solve_with_self_consistency(
    "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, "
    "what is the total distance traveled?"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"All votes: {result['all_answers']}")
Answer: 270 miles
Confidence: 100%
All votes: {'270 miles': 5}
📝 Note: Temperature and Self-Consistency

Self-consistency requires temperature > 0 to generate diverse reasoning paths. If temperature is zero, every sample produces identical output, defeating the purpose. A temperature of 0.5 to 0.7 provides a good balance between diversity and quality. Higher temperatures produce more diverse paths but increase the chance of individually nonsensical reasoning chains.
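A practical wrinkle the voting code above glosses over: `Counter` compares answer strings literally, so "270", "270 miles", and "$270.00" split the vote even though they are the same answer. A minimal normalization helper can collapse equivalent forms before voting. This is a sketch; `normalize_answer` is an illustrative helper, not part of the self-consistency paper.

```python
import re

def normalize_answer(raw: str) -> str:
    """Normalize an extracted answer so equivalent forms vote together.

    Strips units and currency symbols and canonicalizes numbers, so that
    "270 miles", "$270.00", and "270" all map to "270".
    """
    # Pull out the first number (handles "$27.60", "270 miles", "60 mph")
    match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    if not match:
        return raw.strip().lower()  # Non-numeric answer: compare case-insensitively
    number = float(match.group())
    # Drop a trailing ".0" so "270.0" and "270" land in the same vote bucket
    return str(int(number)) if number == int(number) else str(number)

# Example: all of these collapse into the same vote bucket
votes = [normalize_answer(a) for a in ["270 miles", "$270.00", "270"]]
# votes == ["270", "270", "270"]
```

Apply `normalize_answer` to each extracted answer before building the `Counter`, and report the raw string of the winning bucket to the user.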

3. Tree-of-Thought (ToT)

Tree-of-Thought (Yao et al., 2023) extends CoT by exploring multiple reasoning paths in a structured tree, with the ability to evaluate and backtrack. While CoT follows a single linear chain and self-consistency samples independent chains, ToT builds a branching exploration tree where the model can evaluate partial solutions, prune unpromising branches, and explore alternatives.

⚠ Misconception Warning: ToT Is Not Practical for Most Production Use

Tree-of-Thought is an elegant research technique, but it requires 10 to 50+ API calls per query (one for each node explored in the tree). At $0.01 per call, a single ToT query can cost $0.10 to $0.50. For production workloads, self-consistency (3 to 5 calls) achieves most of the accuracy benefit at a fraction of the cost. Use ToT only for high-value, low-volume tasks like planning, puzzle solving, or complex code generation where the cost per query is justified.

The ToT framework has three core components:

  1. Thought generation: At each step, the model proposes multiple possible next steps (branches).
  2. Thought evaluation: The model (or a separate evaluator) scores each branch on how promising it looks.
  3. Search strategy: A search algorithm (typically breadth-first or depth-first) navigates the tree, expanding the most promising nodes and pruning dead ends.
[Figure: tree diagram, "Tree-of-Thought: Branching Exploration with Evaluation." The root problem branches into Thought A (score 0.85, promising, expanded), Thought B (score 0.55, marginal, may expand), and Thought C (score 0.30, pruned). Thought A expands into A1 (score 0.92) and A2 (score 0.60); Thought B expands into B1 (score 0.48). The highest-scoring path, through A1, leads to the solution; a backtrack edge shows recovery from dead ends.]
Figure 10.4: Tree-of-Thought generates multiple branches at each step, evaluates them, prunes poor candidates, and backtracks when needed.

3.1 Simplified ToT Implementation

import openai

client = openai.OpenAI()

def generate_thoughts(problem: str, context: str, n: int = 3) -> list[str]:
    """Generate n candidate next steps."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
            "content": f"""Problem: {problem}
Progress so far: {context}
Generate {n} different possible next steps. Number them 1, 2, 3.
Each should be a single reasoning step."""}],
        temperature=0.8
    )
    text = response.choices[0].message.content
    # Keep only the numbered candidate lines, dropping any preamble text
    return [line.strip() for line in text.split("\n")
            if line.strip() and line.strip()[0].isdigit()]

def evaluate_thought(problem: str, thought: str) -> float:
    """Score a thought from 0 to 1."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
            "content": f"""Problem: {problem}
Proposed reasoning step: {thought}
Rate this step from 0.0 to 1.0 based on correctness and
usefulness. Respond with only a number."""}],
        temperature=0.0
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.5

# Example: solve with tree exploration
problem = "Find two numbers that multiply to 36 and add to 13."
thoughts = generate_thoughts(problem, "(starting)")
for t in thoughts[:3]:
    score = evaluate_thought(problem, t)
    print(f"  [{score:.2f}] {t}")
  [0.90] 1. List factor pairs of 36: (1,36), (2,18), (3,12), (4,9), (6,6)
  [0.70] 2. Set up equations: x * y = 36 and x + y = 13, solve quadratic
  [0.50] 3. Try guess and check starting with numbers close to sqrt(36)
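The snippet above covers only the first two ToT components, thought generation and evaluation. The third component, the search strategy, ties them together. The following is a minimal breadth-first (beam search) sketch, written generically over `generate` and `evaluate` callables so the LLM-backed helpers above can be plugged in; the stub functions at the bottom are deterministic stand-ins that illustrate the control flow without API calls.

```python
def tot_beam_search(problem, generate, evaluate, depth=3, beam_width=2, branch=3):
    """Breadth-first Tree-of-Thought search.

    At each level, expand every surviving partial solution into `branch`
    candidate next steps, score each candidate, and keep only the
    `beam_width` best. `generate(problem, context, n)` and
    `evaluate(problem, thought)` can be the LLM-backed helpers above.
    """
    beam = [("(starting)", 0.0)]  # (reasoning-so-far, cumulative score)
    for _ in range(depth):
        candidates = []
        for context, score in beam:
            for thought in generate(problem, context, branch):
                step_score = evaluate(problem, thought)
                candidates.append((context + "\n" + thought, score + step_score))
        # Prune: keep only the highest-scoring partial solutions
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0][0]  # Best reasoning chain found

# Deterministic stubs to illustrate the control flow without API calls
def stub_generate(problem, context, n):
    return [f"step option {i}" for i in range(n)]

def stub_evaluate(problem, thought):
    return 1.0 if thought.endswith("0") else 0.3  # Prefer option 0 at each level

best = tot_beam_search("demo problem", stub_generate, stub_evaluate)
```

With `depth=3` and `branch=3`, this sketch makes up to 9 generation calls and 18 evaluation calls per query, which is where ToT's cost multiplier comes from.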

4. Step-Back Prompting

Step-back prompting (Zheng et al., 2023) takes the opposite approach from diving into details. Instead of reasoning through the specific problem, it first asks the model to identify the abstract principle or high-level concept, then applies that principle to solve the specific problem. This is particularly effective for science, math, and policy questions where the correct reasoning requires recalling a general rule before applying it.

The two-phase approach works as follows. First, generate a "step-back" question: "What general principle or concept is relevant to this problem?" Then, use the model's answer to that abstract question as context for solving the original specific problem. This prevents the model from getting lost in surface-level details and grounds its reasoning in correct foundational knowledge.

import openai

client = openai.OpenAI()

def step_back_solve(question: str) -> str:
    """Two-phase step-back prompting: abstract first, then solve."""

    # Phase 1: Generate the step-back (abstract) question
    abstraction = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Given a specific question, identify the general principle or foundational concept needed to answer it. Respond with ONLY the principle, not the answer to the original question."},
            {"role": "user", "content": question}
        ],
        temperature=0.0
    )
    principle = abstraction.choices[0].message.content
    print(f"Step-back principle: {principle[:120]}...")

    # Phase 2: Solve using the principle as context
    solution = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Use this foundational principle to answer the question:\n\n{principle}"},
            {"role": "user", "content": question}
        ],
        temperature=0.0
    )
    return solution.choices[0].message.content

answer = step_back_solve(
    "If the temperature of an ideal gas is doubled while keeping "
    "the volume constant, what happens to its pressure?"
)
print(f"\nAnswer: {answer}")
Step-back principle: Gay-Lussac's Law (Pressure-Temperature Law): For an ideal gas at constant volume, pressure is directly...

Answer: According to Gay-Lussac's Law, pressure is directly proportional to absolute temperature when volume is held constant (P/T = constant). If the temperature is doubled (from T to 2T), the pressure also doubles (from P to 2P).
ⓘ When to Use Step-Back Prompting

Step-back prompting excels on questions where domain knowledge recall is the bottleneck, not multi-step computation. Physics, chemistry, law, and policy questions often benefit because the model needs to recall the correct rule before applying it. For pure arithmetic or logic problems, standard CoT is usually sufficient. Step-back prompting adds one extra LLM call per question, so it doubles latency and cost; use it selectively for high-value queries where accuracy matters more than speed.

5. The ReAct Framework

ReAct (Yao et al., 2022) combines Reasoning and Acting in an interleaved loop. The model alternates between generating reasoning traces (thinking about what to do) and taking actions (calling tools, searching databases, executing code). This pattern is fundamental to modern LLM agents and represents the bridge between prompt engineering and agentic AI.

5.1 The ReAct Loop

import openai, json

client = openai.OpenAI()

TOOLS = [
    {"type": "function",
     "function": {
         "name": "search",
         "description": "Search for information on a topic",
         "parameters": {
             "type": "object",
             "properties": {
                 "query": {"type": "string", "description": "Search query"}
             },
             "required": ["query"]
         }
     }}
]

def execute_search(query: str) -> str:
    """Stub search backend used by the loop below.
    Replace with a real search API in production."""
    return f"(simulated search results for: {query})"

REACT_SYSTEM = """You are a research assistant. For each question:
1. THINK: Reason about what information you need
2. ACT: Use the search tool to find information
3. OBSERVE: Analyze the search results
4. Repeat THINK/ACT/OBSERVE until you have enough information
5. Provide a final, well-sourced answer"""

def react_agent(question: str) -> str:
    messages = [
        {"role": "system", "content": REACT_SYSTEM},
        {"role": "user", "content": question}
    ]

    for step in range(5):  # Max 5 iterations
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        msg = response.choices[0].message

        if msg.tool_calls:
            # Model wants to use a tool (ACT)
            messages.append(msg)
            for call in msg.tool_calls:
                result = execute_search(
                    json.loads(call.function.arguments)["query"]
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": result
                })
        else:
            # Model has enough info, return final answer
            return msg.content

    return "Max iterations reached without final answer."
⚠ Cost Awareness

Advanced reasoning techniques like self-consistency and ToT multiply the number of API calls per query. Self-consistency with 5 samples costs 5x a single CoT call. ToT can require 10 to 50 calls depending on the branching factor and depth. Always consider the cost-accuracy tradeoff. For a batch of 10,000 queries, self-consistency with n=5 at $0.01 per call adds up to $500 compared to $100 for single CoT. The improvement from 90% to 95% accuracy must be worth the 5x cost increase for your use case.
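The arithmetic in the box above generalizes to a quick back-of-envelope helper. The flat $0.01 per call is an illustrative assumption; real costs depend on the model, prompt length, and output length.

```python
def reasoning_cost(n_queries: int, calls_per_query: int,
                   cost_per_call: float = 0.01) -> float:
    """Estimate the API cost of a reasoning technique over a batch.

    `cost_per_call` is an illustrative flat price, not a real quote.
    """
    return n_queries * calls_per_query * cost_per_call

single_cot = reasoning_cost(10_000, 1)        # 100.0
self_consistency = reasoning_cost(10_000, 5)  # 500.0
tot_low_end = reasoning_cost(10_000, 10)      # 1000.0
print(f"CoT: ${single_cot:.0f}, SC(n=5): ${self_consistency:.0f}, "
      f"ToT: ${tot_low_end:.0f}+")
```

Running this comparison before committing to a technique makes the cost-accuracy tradeoff explicit for your batch size.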

6. Comparison of Reasoning Techniques

Technique           | API Calls             | Best For                                    | Limitation
--------------------|-----------------------|---------------------------------------------|----------------------------------------------
Zero-shot CoT       | 1                     | Simple multi-step reasoning                 | Single path can still err
Few-shot CoT        | 1                     | Domain-specific reasoning style             | Requires good examples
Self-consistency    | n (typically 5 to 20) | Math, logic, factual questions              | High cost; assumes closed-form answer
Tree-of-Thought     | 10 to 50+             | Complex planning, puzzles                   | Very expensive; complex to implement
Step-back prompting | 2                     | Science, policy, principle-based questions  | Adds latency; not useful for procedural tasks
ReAct               | 2 to 10+              | Questions requiring external information    | Requires tool infrastructure

7. Choosing a Reasoning Technique: Decision Flowchart

With multiple reasoning techniques available, selecting the right one for a given task is itself a decision problem. The following flowchart provides a practical starting point. Start at the top and follow the questions to arrive at a recommended technique.

Prompt Technique Decision Flowchart

  1. Does the task require reasoning? No → Zero-Shot Prompting. Yes → continue.
  2. Is domain knowledge the bottleneck? Yes → Step-Back Prompting. No → continue.
  3. Does the task need external tools or data? Yes → ReAct Framework. No → continue.
  4. Is high reliability critical (can you accept 3x+ cost)? Yes → Self-Consistency. No → continue.
  5. Do you have good reasoning examples? Yes → Few-Shot CoT. No → Zero-Shot CoT.

Tip: Consider reasoning models (o3, o4-mini, DeepSeek R1) as alternatives to CoT prompting. These models perform chain-of-thought internally, so explicit CoT prompting may be unnecessary.
Figure 10.5: Decision flowchart for selecting a reasoning technique. Start at the top and follow the decision path to find the best technique for your task.
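The flowchart can be encoded directly as a cascade of checks. This is a sketch for illustration; the boolean flags are assumptions about how you would characterize a task, and real routing logic would likely be fuzzier.

```python
def choose_technique(requires_reasoning: bool,
                     knowledge_bottleneck: bool = False,
                     needs_tools: bool = False,
                     high_reliability: bool = False,
                     has_examples: bool = False) -> str:
    """Encode the decision flowchart as ordered checks, top to bottom."""
    if not requires_reasoning:
        return "Zero-shot prompting"
    if knowledge_bottleneck:
        return "Step-back prompting"
    if needs_tools:
        return "ReAct"
    if high_reliability:
        return "Self-consistency"
    return "Few-shot CoT" if has_examples else "Zero-shot CoT"

print(choose_technique(True, needs_tools=True))        # ReAct
print(choose_technique(True, high_reliability=True))   # Self-consistency
print(choose_technique(False))                         # Zero-shot prompting
```

Note that check order matters: the flowchart resolves the knowledge-bottleneck question before the tooling question, so a task that hits both routes to step-back prompting.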
ⓘ Reasoning Models vs. Prompting for Reasoning

Providers now offer reasoning models with built-in chain-of-thought: OpenAI's o3 and o4-mini, DeepSeek R1, and Claude's extended thinking mode. These models perform multi-step reasoning internally without needing explicit CoT prompting. When using a reasoning model, adding "think step by step" to your prompt is redundant and may even degrade quality. The decision is: pay for a more expensive reasoning model that handles reasoning natively, or use a standard model with explicit CoT prompting at lower per-token cost but more prompt engineering effort.

📝 Section Quiz

1. Why does CoT prompting improve accuracy on math problems but not on simple sentiment classification?

Show Answer
Math problems require multi-step reasoning where each step depends on the previous result. CoT lets the model "show its work," carrying intermediate values forward in the generated text. Sentiment classification is typically a single-step pattern matching task that does not benefit from intermediate reasoning. Adding CoT to classification wastes tokens and can sometimes reduce accuracy by giving the model opportunities to overthink and second-guess its initial (correct) judgment.

2. Self-consistency uses temperature > 0 while standard CoT often uses temperature = 0. Explain why.

Show Answer
Self-consistency relies on generating diverse reasoning paths, then using majority voting to select the most common answer. With temperature = 0, every sample would produce identical output, making multiple samples pointless. A temperature of 0.5 to 0.7 introduces enough randomness to explore different reasoning approaches while keeping each individual chain mostly coherent. Standard CoT uses temperature = 0 because it only generates a single chain and wants it to be as reliable as possible.

3. How does Tree-of-Thought differ from running self-consistency multiple times?

Show Answer
Self-consistency generates complete, independent reasoning chains from start to finish, then votes on the final answer. Tree-of-Thought generates and evaluates partial reasoning steps, building branches incrementally. It can detect and prune bad paths early (before they waste tokens on a full chain) and it can backtrack to explore alternatives. ToT is more structured and efficient for problems where early decisions strongly constrain later steps, but it is more complex to implement and requires more API calls for the evaluation steps.

4. In the ReAct framework, what happens if the model enters an infinite loop of searching without producing a final answer?

Show Answer
This is a real risk with agentic loops. The standard defense is a maximum iteration limit (e.g., 5 to 10 tool-use rounds). If the model reaches the limit without converging on an answer, the system returns either the best partial answer available or an explicit "could not determine" response. Additional mitigations include tracking which searches have already been performed (to avoid redundant queries) and adding a system prompt instruction like "If you have searched for the same information twice, synthesize what you have and provide your best answer."

5. When would you choose step-back prompting over standard CoT?

Show Answer
Step-back prompting excels when the problem requires applying a general principle or rule that the model might forget if it dives directly into specifics. For example, physics problems (where recalling the relevant law is crucial), legal questions (where identifying the applicable statute comes first), and medical questions (where differential diagnosis starts with identifying the relevant body system). Standard CoT is better for straightforward arithmetic and logic where the steps are procedural rather than principle-based.
🛠 Modify and Observe

Experiment with the CoT examples from this section:

  1. Take the zero-shot CoT example and remove "step by step" from the system prompt. Compare the accuracy on a multi-step math problem. How many problems does it get wrong without the CoT trigger?
  2. In the self-consistency example, change the number of samples from 5 to 1, 3, 10, and 20. Plot the accuracy at each sample count. At what point do additional samples stop helping?
  3. Try adding CoT to a simple yes/no factual question (e.g., "Is Python a compiled language?"). Observe whether the model "overthinks" and produces a less confident or incorrect answer. This demonstrates why CoT can hurt on simple tasks.
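Experiment 2 above can be scaffolded with a small sweep harness. The solver is injected as a callable so the control flow can be followed without API calls; `noisy_solver` is a simulated stand-in, and for the real experiment you would plug in an API-backed function that returns one sampled answer per call.

```python
import random
from collections import Counter

def accuracy_at_n(solve_once, problems, n_samples):
    """Run self-consistency with n_samples votes per problem and
    return the fraction of problems answered correctly.

    `solve_once(problem)` returns a single sampled answer string.
    `problems` is a list of (problem, ground_truth) pairs.
    """
    correct = 0
    for problem, truth in problems:
        votes = Counter(solve_once(problem) for _ in range(n_samples))
        if votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / len(problems)

# Simulated solver: correct 60% of the time, otherwise a random wrong answer
def noisy_solver(problem):
    return "42" if random.random() < 0.6 else random.choice(["41", "43", "44"])

random.seed(0)
problems = [("dummy problem", "42")] * 200
for n in (1, 3, 5, 10):
    print(f"n={n:2d}: accuracy = {accuracy_at_n(noisy_solver, problems, n):.2f}")
```

With a solver that is right more often than any single wrong answer, accuracy climbs quickly with n and then plateaus, which is the diminishing-returns curve the experiment asks you to find.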

Key Takeaways