Why reasoning techniques matter. Standard prompting asks a model to jump directly from question to answer. For simple factual lookups, this works fine. But for multi-step reasoning, math problems, logic puzzles, and complex analysis, direct answering leads to frequent errors. Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022), showed that eliciting step-by-step reasoning before the answer can dramatically improve accuracy on reasoning tasks. This section covers CoT and its successors: self-consistency (sample multiple reasoning paths and vote), Tree-of-Thought (structured exploration with backtracking), step-back prompting, and the ReAct framework that interleaves reasoning with tool use.
1. Chain-of-Thought Prompting
"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" demonstrated that providing step-by-step reasoning examples in the prompt dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks. The key contribution was showing that reasoning ability is an emergent property: it appears in models above ~100B parameters and is absent in smaller models. This paper launched an entire subfield of reasoning-oriented prompt engineering.
Chain-of-Thought prompting works by encouraging the model to generate intermediate reasoning steps before producing a final answer. The mechanism is straightforward: each reasoning step conditions the generation of subsequent steps, so the model's own output tokens become working memory. Without CoT, the model must compress all reasoning into a single forward pass through the network. With CoT, each generated step is fed back into the model as input for the next, just as you would use a scratchpad to carry intermediate results. This is why CoT helps on multi-step problems: it gives the model a place to store intermediate results that its fixed-size hidden state cannot hold all at once.
1.1 Zero-Shot CoT
The simplest form of CoT requires no examples at all. Kojima et al. (2022) discovered that appending the phrase "Let's think step by step" to a prompt is sufficient to trigger reasoning behavior in large models. This zero-shot CoT approach is remarkably effective, improving accuracy on the MultiArith arithmetic benchmark from 17.7% to 78.7% with InstructGPT (text-davinci-002).
```python
import openai

client = openai.OpenAI()

def solve_with_cot(problem: str) -> str:
    """Solve a problem using zero-shot Chain-of-Thought."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Solve the problem step by step.
Show your reasoning clearly, then provide the final answer on a separate
line starting with "ANSWER: "."""},
            {"role": "user", "content": problem},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

problem = """A store sells notebooks for $4 each. If you buy 5 or more, you
get a 20% discount on the entire purchase. Sarah buys 7 notebooks and pays
with a $50 bill. How much change does she receive?"""

print(solve_with_cot(problem))
```
1.2 Few-Shot CoT
For tasks where zero-shot CoT is insufficient, providing examples of step-by-step reasoning significantly improves performance. The key insight is that the examples teach the model not just the answer format, but the style of reasoning to apply. Different reasoning styles suit different tasks: arithmetic problems benefit from sequential calculation steps, logic problems benefit from listing premises and drawing inferences, and coding problems benefit from planning before implementing.
```python
import openai

client = openai.OpenAI()

# Few-shot CoT: provide exemplar reasoning chains
FEW_SHOT_EXAMPLES = """
Example 1:
Q: A train travels 120 miles in 2 hours. It then travels 90 miles in 1.5
hours. What is the average speed for the entire trip?
A: Let me work through this step by step.
Total distance = 120 + 90 = 210 miles
Total time = 2 + 1.5 = 3.5 hours
Average speed = total distance / total time = 210 / 3.5 = 60 mph
ANSWER: 60 mph

Example 2:
Q: A store has a "buy 2, get 1 free" deal on $6 shirts. How much do 7 shirts cost?
A: Let me work through this step by step.
For every 3 shirts, you pay for 2: that is 2 x $6 = $12 per group of 3.
7 shirts = 2 complete groups (6 shirts) + 1 remaining shirt.
Cost for 6 shirts = 2 x $12 = $24
Cost for 1 extra shirt = $6
Total = $24 + $6 = $30
ANSWER: $30
"""

def solve_with_few_shot_cot(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""Solve problems step by step,
following the style shown in these examples:

{FEW_SHOT_EXAMPLES}

Show your reasoning clearly, then provide the final answer on a line
starting with "ANSWER: "."""},
            {"role": "user", "content": problem},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content

problem = """A bakery sells cupcakes for $3 each. If you buy a dozen (12),
you get a 25% discount. Maria buys 15 cupcakes. How much does she pay?"""

print(solve_with_few_shot_cot(problem))
```
CoT prompting improves accuracy primarily on tasks that require multi-step reasoning. For single-step tasks (simple factual recall, sentiment classification), CoT adds unnecessary tokens without improving quality and may even slightly reduce accuracy due to the additional opportunity for the model to "talk itself into" a wrong answer. The rule of thumb: if a human needs a scratchpad to solve the problem, CoT will help. If a human can answer instantly, CoT is unnecessary overhead.
2. Self-Consistency
"Self-Consistency Improves Chain of Thought Reasoning in Language Models" showed that sampling multiple reasoning paths and taking a majority vote substantially outperforms greedy single-path CoT. On GSM8K, self-consistency lifted PaLM 540B from 56.5% (greedy CoT) to 74.4%. The key insight: correct reasoning paths tend to converge on the same answer, while errors are scattered across many different wrong answers.
Self-consistency, introduced by Wang et al. (2022), builds on CoT by sampling multiple reasoning paths and selecting the answer that appears most frequently. The intuition is that while any single reasoning chain might contain errors, different chains are likely to make different errors. When multiple independent chains converge on the same answer, confidence in that answer increases.
2.1 Implementation
```python
import openai
from collections import Counter

client = openai.OpenAI()

def solve_with_self_consistency(problem: str, n_samples: int = 5) -> dict:
    """Sample multiple CoT paths and take majority vote."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Solve step by step. End with ANSWER: <number>"},
                {"role": "user", "content": problem},
            ],
            temperature=0.7,  # Higher temperature for diverse paths
        )
        text = response.choices[0].message.content
        # Extract the final answer
        if "ANSWER:" in text:
            answer = text.split("ANSWER:")[-1].strip()
            answers.append(answer)

    # Majority vote
    vote_counts = Counter(answers)
    best_answer, count = vote_counts.most_common(1)[0]
    return {
        "answer": best_answer,
        "confidence": count / len(answers),
        "all_answers": dict(vote_counts),
    }

result = solve_with_self_consistency(
    "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, "
    "what is the total distance traveled?"
)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"All votes: {result['all_answers']}")
```
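One practical pitfall in the extraction step above: raw extracted strings like "60 mph", "$60", and "60.0" count as different answers and split the vote. A small normalization helper makes the majority vote robust to formatting noise; this is a sketch, and `normalize_answer` is a name introduced here for illustration, not part of any library:

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Normalize an extracted answer so formatting variants vote together.

    Strips currency symbols, commas, units, and trailing punctuation, and
    canonicalizes numbers so "60.0", "$60", and "60 mph" all become "60".
    """
    text = raw.strip().lower()
    # Pull out the first number if one is present
    match = re.search(r"-?\d[\d,]*\.?\d*", text)
    if match:
        value = float(match.group().replace(",", ""))
        # Represent whole numbers without a decimal point
        return str(int(value)) if value == int(value) else str(value)
    # Fall back to the cleaned-up text for non-numeric answers
    return text.rstrip(".!")

votes = Counter(normalize_answer(a) for a in ["60 mph", "$60", "60.0", "75 mph"])
print(votes.most_common(1)[0])  # ('60', 3)
```

As a cost note, the OpenAI Chat Completions API also accepts an `n` parameter, which can return several samples from a single request instead of looping over separate calls.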
Self-consistency requires temperature > 0 to generate diverse reasoning paths. If temperature is zero, every sample produces identical output, defeating the purpose. A temperature of 0.5 to 0.7 provides a good balance between diversity and quality. Higher temperatures produce more diverse paths but increase the chance of individually nonsensical reasoning chains.
3. Tree-of-Thought (ToT)
Tree-of-Thought (Yao et al., 2023) extends CoT by exploring multiple reasoning paths in a structured tree, with the ability to evaluate and backtrack. While CoT follows a single linear chain and self-consistency samples independent chains, ToT builds a branching exploration tree where the model can evaluate partial solutions, prune unpromising branches, and explore alternatives.
Tree-of-Thought is an elegant research technique, but it requires 10 to 50+ API calls per query (one for each node explored in the tree). At $0.01 per call, a single ToT query can cost $0.10 to $0.50. For production workloads, self-consistency (3 to 5 calls) achieves most of the accuracy benefit at a fraction of the cost. Use ToT only for high-value, low-volume tasks like planning, puzzle solving, or complex code generation where the cost per query is justified.
The ToT framework has three core components:
- Thought generation: At each step, the model proposes multiple possible next steps (branches).
- Thought evaluation: The model (or a separate evaluator) scores each branch on how promising it looks.
- Search strategy: A search algorithm (typically breadth-first or depth-first) navigates the tree, expanding the most promising nodes and pruning dead ends.
3.1 Simplified ToT Implementation
```python
import openai

client = openai.OpenAI()

def generate_thoughts(problem: str, context: str, n: int = 3) -> list[str]:
    """Generate n candidate next steps."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Problem: {problem}

Progress so far: {context}

Generate {n} different possible next steps. Number them 1, 2, 3.
Each should be a single reasoning step."""}],
        temperature=0.8,
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.split("\n") if line.strip()]

def evaluate_thought(problem: str, thought: str) -> float:
    """Score a thought from 0 to 1."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Problem: {problem}

Proposed reasoning step: {thought}

Rate this step from 0.0 to 1.0 based on correctness and usefulness.
Respond with only a number."""}],
        temperature=0.0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.5

# Example: solve with tree exploration
problem = "Find two numbers that multiply to 36 and add to 13."
thoughts = generate_thoughts(problem, "(starting)")
for t in thoughts[:3]:
    score = evaluate_thought(problem, t)
    print(f"  [{score:.2f}] {t}")
```
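The snippet above implements thought generation and evaluation but leaves out the third component, the search strategy. The sketch below shows a breadth-first beam search that could sit on top of those two functions; `tot_beam_search` is a hypothetical helper, not code from the ToT paper. Generation and evaluation are passed in as callables, so the same loop works with the API-backed versions above or with cheap local stubs:

```python
def tot_beam_search(problem, generate, evaluate, depth=3, beam_width=2):
    """Breadth-first beam search over reasoning paths.

    generate(problem, context) -> list of candidate next steps
    evaluate(problem, thought) -> score in [0, 1]
    Keeps the `beam_width` best partial paths at each depth and
    returns the highest-scoring path found.
    """
    beam = [("(starting)", 0.0)]  # (context so far, cumulative score)
    for _ in range(depth):
        candidates = []
        for context, score in beam:
            for thought in generate(problem, context):
                step_score = evaluate(problem, thought)
                # Prune clearly bad branches early
                if step_score < 0.3:
                    continue
                candidates.append((context + "\n" + thought, score + step_score))
        if not candidates:
            break  # every branch was pruned; keep the current beam
        # Keep only the most promising partial paths
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beam, key=lambda b: b[1])[0]

# Offline demo with stub generate/evaluate functions
def stub_generate(problem, context):
    return ["guess 4 and 9", "guess 6 and 6", "guess 3 and 12"]

def stub_evaluate(problem, thought):
    return 0.9 if "4 and 9" in thought else 0.4

best = tot_beam_search("multiply to 36, add to 13",
                       stub_generate, stub_evaluate, depth=2)
print(best.splitlines()[-1])  # guess 4 and 9
```

Passing the functions in also makes the search logic testable without any API calls, which is useful given how many calls a real ToT run consumes.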
4. Step-Back Prompting
Step-back prompting (Zheng et al., 2023) inverts the usual approach of diving straight into details. Instead of reasoning through the specific problem immediately, it first asks the model to identify the abstract principle or high-level concept involved, then applies that principle to solve the specific problem. This is particularly effective for science, math, and policy questions where correct reasoning requires recalling a general rule before applying it.
The two-phase approach works as follows. First, generate a "step-back" question: "What general principle or concept is relevant to this problem?" Then, use the model's answer to that abstract question as context for solving the original specific problem. This prevents the model from getting lost in surface-level details and grounds its reasoning in correct foundational knowledge.
```python
import openai

client = openai.OpenAI()

def step_back_solve(question: str) -> str:
    """Two-phase step-back prompting: abstract first, then solve."""
    # Phase 1: Generate the step-back (abstract) question
    abstraction = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Given a specific question, identify "
                "the general principle or foundational concept needed to answer it. "
                "Respond with ONLY the principle, not the answer to the original question."},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    principle = abstraction.choices[0].message.content
    print(f"Step-back principle: {principle[:120]}...")

    # Phase 2: Solve using the principle as context
    solution = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Use this foundational principle "
                f"to answer the question:\n\n{principle}"},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return solution.choices[0].message.content

answer = step_back_solve(
    "If the temperature of an ideal gas is doubled while keeping "
    "the volume constant, what happens to its pressure?"
)
print(f"\nAnswer: {answer}")
```
Step-back prompting excels on questions where domain knowledge recall is the bottleneck, not multi-step computation. Physics, chemistry, law, and policy questions often benefit because the model needs to recall the correct rule before applying it. For pure arithmetic or logic problems, standard CoT is usually sufficient. Step-back prompting adds one extra LLM call per question, so it doubles latency and cost; use it selectively for high-value queries where accuracy matters more than speed.
5. The ReAct Framework
ReAct (Yao et al., 2022) combines Reasoning and Acting in an interleaved loop. The model alternates between generating reasoning traces (thinking about what to do) and taking actions (calling tools, searching databases, executing code). This pattern is fundamental to modern LLM agents and represents the bridge between prompt engineering and agentic AI.
5.1 The ReAct Loop
```python
import json
import openai

client = openai.OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "search",
        "description": "Search for information on a topic",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }}
]

REACT_SYSTEM = """You are a research assistant. For each question:
1. THINK: Reason about what information you need
2. ACT: Use the search tool to find information
3. OBSERVE: Analyze the search results
4. Repeat THINK/ACT/OBSERVE until you have enough information
5. Provide a final, well-sourced answer"""

def execute_search(query: str) -> str:
    """Placeholder search backend. In production, wire this to a real
    search API or retrieval system; here it returns a stub string."""
    return f"[no results available for: {query}]"

def react_agent(question: str) -> str:
    messages = [
        {"role": "system", "content": REACT_SYSTEM},
        {"role": "user", "content": question},
    ]
    for step in range(5):  # Max 5 iterations
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
        )
        msg = response.choices[0].message
        if msg.tool_calls:
            # Model wants to use a tool (ACT)
            messages.append(msg)
            for call in msg.tool_calls:
                result = execute_search(
                    json.loads(call.function.arguments)["query"]
                )
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": result,
                })
        else:
            # Model has enough info, return final answer
            return msg.content
    return "Max iterations reached without final answer."
```
Advanced reasoning techniques like self-consistency and ToT multiply the number of API calls per query. Self-consistency with 5 samples costs 5x a single CoT call. ToT can require 10 to 50 calls depending on the branching factor and depth. Always consider the cost-accuracy tradeoff. For a batch of 10,000 queries, self-consistency with n=5 at $0.01 per call adds up to $500 compared to $100 for single CoT. The improvement from 90% to 95% accuracy must be worth the 5x cost increase for your use case.
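The arithmetic behind this tradeoff is simple enough to encode directly. The helper below is illustrative only; `cost_per_extra_correct` is a name introduced here, and the prices and accuracies are the example figures from the text, not measured benchmark numbers:

```python
def cost_per_extra_correct(n_queries, cost_per_call, base_calls, base_acc,
                           new_calls, new_acc):
    """Cost of each additional correct answer gained by a pricier technique."""
    base_cost = n_queries * base_calls * cost_per_call
    new_cost = n_queries * new_calls * cost_per_call
    extra_correct = n_queries * (new_acc - base_acc)
    return (new_cost - base_cost) / extra_correct

# Example figures from the text: 10,000 queries at $0.01/call,
# single-path CoT at 90% accuracy vs. 5-sample self-consistency at 95%
price = cost_per_extra_correct(10_000, 0.01, base_calls=1, base_acc=0.90,
                               new_calls=5, new_acc=0.95)
print(f"${price:.2f} per extra correct answer")  # $0.80
```

Framing the decision as dollars per additional correct answer, rather than total spend, makes it easier to compare against the business cost of a wrong answer.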
6. Comparison of Reasoning Techniques
| Technique | API Calls | Best For | Limitation |
|---|---|---|---|
| Zero-shot CoT | 1 | Simple multi-step reasoning | Single path can still err |
| Few-shot CoT | 1 | Domain-specific reasoning style | Requires good examples |
| Self-consistency | n (typically 5 to 20) | Math, logic, factual questions | High cost; assumes closed-form answer |
| Tree-of-Thought | 10 to 50+ | Complex planning, puzzles | Very expensive; complex to implement |
| Step-back prompting | 2 | Science, policy, principle-based questions | Adds latency; not useful for procedural tasks |
| ReAct | 2 to 10+ | Questions requiring external information | Requires tool infrastructure |
7. Choosing a Reasoning Technique: Decision Flowchart
With multiple reasoning techniques available, selecting the right one for a given task is itself a decision problem. The following questions, applied in order, provide a practical starting point: start at the top and take the first branch that applies.
- Does the task require external information (search, databases, code execution)? If yes, use ReAct.
- Is recalling the right principle the hard part (science, law, policy)? If yes, use step-back prompting.
- Is it a complex planning or puzzle task where partial solutions must be evaluated and abandoned, and is the cost per query justified? If yes, use Tree-of-Thought.
- Is it multi-step math or logic with a short, checkable answer, and is extra cost acceptable? If yes, use self-consistency.
- Otherwise, start with zero-shot CoT, and add few-shot examples if the model needs a specific reasoning style.
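A decision sequence like this can also be captured in a small routing function that mirrors the tradeoffs in the comparison table. This is a sketch with made-up attribute names (`needs_external_info`, `principle_recall_heavy`, and so on), intended as a starting point for a production router rather than a fixed rule set:

```python
def choose_technique(needs_external_info: bool = False,
                     principle_recall_heavy: bool = False,
                     complex_planning: bool = False,
                     multi_step: bool = False,
                     budget_for_extra_calls: bool = False) -> str:
    """Route a task to a reasoning technique; the first matching branch wins."""
    if needs_external_info:
        return "ReAct"
    if principle_recall_heavy:
        return "step-back prompting"
    if complex_planning and budget_for_extra_calls:
        return "Tree-of-Thought"
    if multi_step and budget_for_extra_calls:
        return "self-consistency"
    if multi_step:
        return "zero-shot CoT"
    return "direct prompting"

print(choose_technique(multi_step=True, budget_for_extra_calls=True))
# self-consistency
```

In practice the boolean attributes would come from a task classifier or from per-route configuration, but making the precedence explicit in code keeps the routing auditable.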
Providers now offer reasoning models with built-in chain-of-thought: OpenAI's o3 and o4-mini, DeepSeek R1, and Claude's extended thinking mode. These models perform multi-step reasoning internally without needing explicit CoT prompting. When using a reasoning model, adding "think step by step" to your prompt is redundant and may even degrade quality. The decision is: pay for a more expensive reasoning model that handles reasoning natively, or use a standard model with explicit CoT prompting at lower per-token cost but more prompt engineering effort.
📝 Section Quiz
1. Why does CoT prompting improve accuracy on math problems but not on simple sentiment classification?
2. Self-consistency uses temperature > 0 while standard CoT often uses temperature = 0. Explain why.
3. How does Tree-of-Thought differ from running self-consistency multiple times?
4. In the ReAct framework, what happens if the model enters an infinite loop of searching without producing a final answer?
5. When would you choose step-back prompting over standard CoT?
Experiment with the CoT examples from this section:
- Take the zero-shot CoT example and remove "step by step" from the system prompt. Compare the accuracy on a multi-step math problem. How many problems does it get wrong without the CoT trigger?
- In the self-consistency example, change the number of samples from 5 to 1, 3, 10, and 20. Plot the accuracy at each sample count. At what point do additional samples stop helping?
- Try adding CoT to a simple yes/no factual question (e.g., "Is Python a compiled language?"). Observe whether the model "overthinks" and produces a less confident or incorrect answer. This demonstrates why CoT can hurt on simple tasks.
Key Takeaways
- Chain-of-Thought is the single most impactful prompting technique for reasoning tasks. "Think step by step" can improve math accuracy from under 20% to over 78%, and it costs nothing extra in implementation complexity.
- Self-consistency trades cost for accuracy. Sampling 5 to 10 reasoning paths and voting raises accuracy further, but multiplies API cost linearly. Use it when correctness is worth the expense.
- Tree-of-Thought is powerful but expensive. Reserve it for complex planning problems where early pruning of bad paths saves overall computation compared to exhaustive self-consistency sampling.
- Step-back prompting grounds reasoning in principles. For science, law, and domain-specific reasoning, abstracting before solving prevents the model from getting lost in surface-level details.
- ReAct bridges prompting and agents. By interleaving reasoning with tool use, ReAct enables the model to access external information and take actions, forming the foundation of modern AI agent architectures.
- Always match the technique to the task. Simple tasks need simple prompts. Complex tasks need complex reasoning strategies. Overengineering wastes cost; underengineering wastes accuracy.