Module 21 · Section 21.3

Planning & Agentic Reasoning

Plan-and-execute architectures, agentic reflection loops, tree search strategies, parallel function calling, and human-in-the-loop design
★ Big Picture

Planning is what separates a simple tool-calling chatbot from a genuine agent. Without planning, an agent reacts to each user message in isolation, choosing one tool at a time. With planning, an agent can decompose complex goals into subtasks, reason about dependencies between steps, execute them in the right order, and recover when things go wrong. This section covers the major planning architectures: plan-and-execute, reflection loops, tree search, and the LLM Compiler pattern for parallel execution.

1. Why Planning Matters

Consider a request like "Research the top 3 competitors of Acme Corp, compare their pricing, and create a summary report." A reactive agent would attempt to handle this in a single LLM call, likely producing shallow results. A planning agent decomposes this into discrete steps: identify competitors, research each one's pricing, structure the comparison, and generate the report. Each step can use different tools, and the agent can verify intermediate results before proceeding.

Planning provides several concrete advantages. First, it enables task decomposition, breaking complex goals into manageable subtasks. Second, it supports dependency management, ensuring steps execute in the correct order. Third, it allows error recovery, because a failed step can be re-planned without restarting from scratch. Fourth, it creates transparency, since users and developers can inspect the plan to understand what the agent intends to do before it acts.

Figure 1: Reactive agents respond in a single pass; planning agents decompose, execute, and verify step by step.

2. Plan-and-Execute Architecture

The plan-and-execute pattern (introduced by the LangGraph team and inspired by the BabyAGI project) separates the agent into two distinct roles: a planner that generates a sequence of steps, and an executor that carries out each step using tools. After each step completes, the planner can revise the remaining plan based on what was learned. This separation keeps planning logic clean and allows different models or prompts for each role.

from openai import OpenAI
from pydantic import BaseModel
import json

client = OpenAI()

class PlanStep(BaseModel):
    step_id: int
    description: str
    tool: str          # Tool to use: "search", "analyze", "write", etc.
    depends_on: list[int] = []  # IDs of steps this depends on

class Plan(BaseModel):
    goal: str
    steps: list[PlanStep]
    reasoning: str

def create_plan(user_goal: str) -> Plan:
    """Use the LLM as a planner to decompose a goal into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are a planning agent. Decompose the user's goal into "
                "a sequence of concrete, executable steps. Each step should "
                "use exactly one tool. Specify dependencies between steps."
            )},
            {"role": "user", "content": user_goal}
        ],
        response_format=Plan
    )
    return response.choices[0].message.parsed

def execute_step(step: PlanStep, context: dict) -> str:
    """Execute a single plan step using the appropriate tool."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are an execution agent. Carry out the assigned step "
                "using the provided context from previous steps."
            )},
            {"role": "user", "content": (
                f"Step: {step.description}\n"
                f"Tool: {step.tool}\n"
                f"Context from previous steps:\n{json.dumps(context, indent=2)}"
            )}
        ],
        tools=get_tools_for(step.tool)  # Load relevant tool definitions
    )
    return process_response(response)

# Main plan-and-execute loop
plan = create_plan("Research top 3 Acme Corp competitors and compare pricing")
results = {}

for step in plan.steps:
    # Gather context from completed dependencies
    context = {sid: results[sid] for sid in step.depends_on if sid in results}
    results[step.step_id] = execute_step(step, context)
    print(f"Completed step {step.step_id}: {step.description}")
⚙ Key Insight

Structured outputs are essential for reliable planning. Using Pydantic models (or equivalent JSON schemas) with the LLM's structured output mode ensures the plan is always valid, parseable, and contains the required fields. Without structured outputs, you must parse free-text plans, which is brittle and error-prone. OpenAI's response_format parameter and Anthropic's tool use both support constrained output generation.
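To see what the structured-output machinery actually receives, you can inspect the JSON schema that Pydantic v2 derives from a model via `model_json_schema()`. This sketch mirrors the `PlanStep` model defined earlier:

```python
from pydantic import BaseModel

class PlanStep(BaseModel):
    step_id: int
    description: str
    tool: str
    depends_on: list[int] = []

# The API's structured-output mode constrains generation to this schema,
# so every required field is guaranteed to be present and correctly typed.
schema = PlanStep.model_json_schema()
print(sorted(schema["properties"]))   # ['depends_on', 'description', 'step_id', 'tool']
print(schema["required"])             # ['step_id', 'description', 'tool']
```

Note that `depends_on` has a default, so it is not in the `required` list; the model may omit it, and Pydantic fills in the empty list.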

3. Agentic Reflection Loops

Reflection is the process by which an agent evaluates its own outputs and decides whether to revise them. Instead of accepting the first result, a reflection loop adds a "critic" step that examines the output for errors, missing information, or quality issues. If the critic identifies problems, the agent re-executes the step with targeted feedback. This pattern dramatically improves output quality for tasks like code generation, report writing, and data analysis.

Figure 2: The reflection loop pattern. Output is generated, critiqued, and either accepted or sent back for revision with specific feedback.
from openai import OpenAI

client = OpenAI()

def generate_with_reflection(task: str, max_iterations: int = 3) -> str:
    """Generate output with iterative self-reflection."""
    draft = ""
    feedback = ""  # Populated by the critic after the first iteration

    for iteration in range(max_iterations):
        # Step 1: Generate (or revise) the output
        gen_messages = [
            {"role": "system", "content": "You are a skilled writer. Produce high-quality output."},
            {"role": "user", "content": task}
        ]
        if draft:
            gen_messages.append({
                "role": "user",
                "content": f"Previous draft:\n{draft}\n\nRevise based on this feedback:\n{feedback}"
            })

        gen_response = client.chat.completions.create(
            model="gpt-4o", messages=gen_messages
        )
        draft = gen_response.choices[0].message.content

        # Step 2: Reflect / critique the output
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a strict quality reviewer. Evaluate the draft for "
                    "accuracy, completeness, and clarity. If the draft is excellent, "
                    "respond with exactly 'APPROVED'. Otherwise, provide specific "
                    "feedback on what needs improvement."
                )},
                {"role": "user", "content": f"Task: {task}\n\nDraft to review:\n{draft}"}
            ]
        )
        feedback = critique.choices[0].message.content

        if "APPROVED" in feedback:
            print(f"Approved after {iteration + 1} iteration(s)")
            return draft

    print(f"Returning best draft after {max_iterations} iterations")
    return draft

4. LATS: Language Agent Tree Search

Language Agent Tree Search (LATS) combines the planning capabilities of LLMs with Monte Carlo Tree Search (MCTS), a technique originally developed for game-playing AI. Instead of committing to a single plan, LATS explores multiple possible action sequences as branches of a tree. It uses the LLM both as a policy (to propose actions) and as a value function (to evaluate how promising each branch looks). This allows the agent to explore alternatives when one path fails and backtrack to more promising branches.

How LATS Works

  1. Selection: Starting from the root node (current state), traverse the tree by selecting the most promising child at each level using UCT (Upper Confidence Bound for Trees).
  2. Expansion: At a leaf node, use the LLM to generate multiple possible next actions. Each action becomes a new child node.
  3. Simulation: Use the LLM to evaluate each expanded node, estimating how likely it is to lead to success.
  4. Backpropagation: Update the value estimates of all ancestor nodes based on the simulation results.
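The UCT rule used in the selection step balances exploitation (a node's average value) against exploration (an uncertainty bonus that shrinks as the node is visited more). Writing $\bar{v}(n)$ for the node's average value, $N(n)$ for its visit count, and $c$ for the exploration constant (commonly $\sqrt{2} \approx 1.41$):

```latex
\mathrm{UCT}(n) = \bar{v}(n) + c \sqrt{\frac{\ln N(\mathrm{parent}(n))}{N(n)}}
```

Unvisited nodes ($N(n) = 0$) are assigned an infinite score so they are always explored at least once, which is exactly what the `uct_score` method below implements.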
Figure 3: LATS explores multiple action branches, evaluating and backpropagating scores to find the optimal path.
import math
from dataclasses import dataclass, field

@dataclass
class LATSNode:
    """A node in the LATS search tree."""
    state: str                          # Current state description
    action: str = ""                    # Action that led to this state
    parent: "LATSNode | None" = None
    children: list["LATSNode"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

    @property
    def avg_value(self) -> float:
        return self.total_value / self.visits if self.visits > 0 else 0.0

    def uct_score(self, exploration_weight: float = 1.41) -> float:
        """Upper Confidence Bound for Trees (UCT) selection score."""
        if self.visits == 0:
            return float("inf")  # Always explore unvisited nodes
        exploitation = self.avg_value
        exploration = exploration_weight * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

def lats_search(goal: str, n_iterations: int = 10) -> list[str]:
    """Run LATS to find the best action sequence for a goal."""
    root = LATSNode(state=f"Goal: {goal}")

    for _ in range(n_iterations):
        # 1. Selection: traverse tree using UCT
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct_score())

        # 2. Expansion: generate candidate actions via LLM
        candidate_actions = llm_propose_actions(node.state, n=3)
        for action in candidate_actions:
            new_state = llm_simulate_action(node.state, action)
            child = LATSNode(state=new_state, action=action, parent=node)
            node.children.append(child)

        # 3. Simulation: evaluate each new child
        for child in node.children:
            if child.visits == 0:
                value = llm_evaluate_state(child.state, goal)
                child.visits = 1
                child.total_value = value

                # 4. Backpropagation: update ancestor values
                ancestor = child.parent
                while ancestor:
                    ancestor.visits += 1
                    ancestor.total_value += value
                    ancestor = ancestor.parent

    # Extract best path
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda n: n.avg_value)
        path.append(node.action)
    return path

5. LLM Compiler: Parallel Function Calling

The LLM Compiler pattern (from UC Berkeley, 2023) addresses a major inefficiency in sequential plan-and-execute: many steps are independent and could run in parallel. The LLM Compiler analyzes a plan's dependency graph and identifies which steps can execute concurrently. This is particularly valuable for tasks that involve multiple independent API calls, database queries, or search operations.
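The dependency analysis can be sketched as a topological "wave" scheduler: each wave contains every task whose dependencies were satisfied by earlier waves, so all tasks within a wave can run concurrently. This is a minimal sketch; the integer task IDs and dependency map are illustrative:

```python
def schedule_waves(deps: dict[int, list[int]]) -> list[list[int]]:
    """Group tasks into waves of concurrently runnable work.

    `deps` maps each task ID to the IDs it depends on. Every task in a
    wave has all its dependencies completed by earlier waves.
    """
    done: set[int] = set()
    remaining = set(deps)
    waves: list[list[int]] = []
    while remaining:
        wave = sorted(t for t in remaining if all(d in done for d in deps[t]))
        if not wave:
            raise ValueError("Cyclic dependency detected")
        waves.append(wave)
        done.update(wave)
        remaining.difference_update(wave)
    return waves

# Three independent searches, then an analysis, then a report:
deps = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [4]}
print(schedule_waves(deps))  # [[1, 2, 3], [4], [5]]
```

Total latency drops from five sequential steps to three waves: the length of the longest dependency chain.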

| Pattern | Execution | Best For | Latency |
| --- | --- | --- | --- |
| Sequential Plan-and-Execute | One step at a time | Tightly dependent steps | High (sum of all steps) |
| LLM Compiler (parallel) | Independent steps run simultaneously | Multiple independent data fetches | Low (longest parallel chain) |
| LATS (tree search) | Explores multiple branches | Uncertain tasks, many possible paths | Variable (depends on search depth) |
| Reflection Loop | Generate, critique, revise | Quality-critical outputs (writing, code) | Moderate (2x to 4x single pass) |
import asyncio
from dataclasses import dataclass

@dataclass
class Task:
    task_id: int
    description: str
    tool: str
    depends_on: list[int]
    result: str | None = None

async def execute_task(task: Task, results: dict) -> str:
    """Execute a single task, waiting for dependencies first."""
    # Wait for all dependencies to complete
    while not all(dep_id in results for dep_id in task.depends_on):
        await asyncio.sleep(0.1)

    context = {did: results[did] for did in task.depends_on}
    result = await call_tool_async(task.tool, task.description, context)
    results[task.task_id] = result
    return result

async def llm_compiler_execute(tasks: list[Task]) -> dict:
    """Execute tasks with maximum parallelism based on dependencies."""
    results = {}

    # Launch all tasks concurrently; each waits for its own dependencies
    coroutines = [execute_task(task, results) for task in tasks]
    await asyncio.gather(*coroutines)
    return results

# Example: research three competitors in parallel, then compare
tasks = [
    Task(1, "Search for Competitor A pricing", "web_search", []),
    Task(2, "Search for Competitor B pricing", "web_search", []),
    Task(3, "Search for Competitor C pricing", "web_search", []),
    Task(4, "Compare all pricing data", "analyze", [1, 2, 3]),  # Depends on 1,2,3
    Task(5, "Generate summary report", "write", [4]),          # Depends on 4
]
# Tasks 1, 2, 3 run in parallel; task 4 waits for all three; task 5 waits for 4
results = asyncio.run(llm_compiler_execute(tasks))
Note

OpenAI and Anthropic both support parallel tool calls at the API level, where the model can request multiple tool executions in a single response. The LLM Compiler pattern builds on this by analyzing the entire plan upfront and scheduling all parallelizable work across multiple turns, not just within a single turn.
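Within a single turn, handling parallel tool calls amounts to executing every call the model requested and returning one tool-role message per call. The sketch below uses plain dicts with `id`, `name`, and `arguments` fields as a simplification of the SDK's tool-call objects, and a hypothetical `registry` mapping tool names to local functions:

```python
import json

def dispatch_parallel_tool_calls(tool_calls: list[dict], registry: dict) -> list[dict]:
    """Run every tool call from one model response, returning tool-role
    messages to append to the conversation before the next turn."""
    messages = []
    for call in tool_calls:
        result = registry[call["name"]](**json.loads(call["arguments"]))
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # Ties the result back to the request
            "content": str(result),
        })
    return messages

# Example: two independent searches requested in a single model turn
registry = {"web_search": lambda query: f"results for {query}"}
calls = [
    {"id": "c1", "name": "web_search", "arguments": '{"query": "Competitor A pricing"}'},
    {"id": "c2", "name": "web_search", "arguments": '{"query": "Competitor B pricing"}'},
]
print(dispatch_parallel_tool_calls(calls, registry))
```

In production you would run the calls concurrently (e.g. with `asyncio.gather`) rather than in a loop; the message shape is what matters here.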

6. Human-in-the-Loop Design

Not every agent should run autonomously. For high-stakes tasks (financial transactions, sending emails, modifying databases), a human-in-the-loop checkpoint lets users review and approve actions before execution. The key design challenge is deciding where to insert human checkpoints: too few and the agent may take harmful actions; too many and the agent becomes tedious to use.

Checkpoint Strategies

from enum import Enum
import json

class ToolRisk(Enum):
    LOW = "low"       # Read-only: search, fetch, analyze
    MEDIUM = "medium" # Reversible writes: create draft, stage changes
    HIGH = "high"     # Irreversible: send email, delete, purchase

TOOL_RISK_MAP = {
    "web_search": ToolRisk.LOW,
    "read_file": ToolRisk.LOW,
    "analyze_data": ToolRisk.LOW,
    "write_file": ToolRisk.MEDIUM,
    "create_draft_email": ToolRisk.MEDIUM,
    "send_email": ToolRisk.HIGH,
    "delete_record": ToolRisk.HIGH,
    "execute_transaction": ToolRisk.HIGH,
}

def execute_with_human_gate(tool_name: str, arguments: dict) -> str:
    """Execute a tool call with risk-appropriate human oversight."""
    risk = TOOL_RISK_MAP.get(tool_name, ToolRisk.HIGH)  # Default to HIGH for unknown tools

    if risk == ToolRisk.LOW:
        # Auto-execute safe, read-only operations
        return execute_tool(tool_name, arguments)

    elif risk == ToolRisk.MEDIUM:
        # Log and execute, but allow rollback
        print(f"[INFO] Executing: {tool_name}({arguments})")
        result = execute_tool(tool_name, arguments)
        log_action(tool_name, arguments, result)  # For audit trail
        return result

    else:  # ToolRisk.HIGH
        # Require explicit human approval
        print(f"\n{'='*50}")
        print(f"APPROVAL REQUIRED: {tool_name}")
        print(f"Arguments: {json.dumps(arguments, indent=2)}")
        print(f"{'='*50}")
        approval = input("Approve? (yes/no): ").strip().lower()

        if approval == "yes":
            return execute_tool(tool_name, arguments)
        else:
            return "Action rejected by user."
Warning

Default to HIGH risk for unknown tools. When an agent encounters a tool not in your risk map, always require human approval. It is far better to ask for unnecessary approval than to execute an irreversible action without consent. This principle applies doubly when the agent is interacting with external APIs or third-party services.

7. Combining Patterns: A Complete Planning Agent

Real-world agents combine multiple patterns. A production planning agent might use plan-and-execute for overall task decomposition, the LLM Compiler for parallelizing independent subtasks, reflection loops for quality-critical steps like report generation, and human-in-the-loop gating for sensitive actions. The architecture becomes a pipeline where each pattern handles what it does best.

| Concern | Pattern | When to Apply |
| --- | --- | --- |
| Task decomposition | Plan-and-Execute | Complex multi-step goals |
| Maximizing throughput | LLM Compiler | Independent data fetches, parallel API calls |
| Output quality | Reflection Loop | Writing, analysis, code generation |
| Exploring alternatives | LATS | Uncertain tasks, puzzle-like problems |
| Safety and trust | Human-in-the-Loop | Irreversible or high-stakes actions |
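One way to make this routing operational is a small dispatcher that inspects each plan step's properties. The step fields here (`risk`, `quality_critical`, `depends_on`) are hypothetical, not part of the earlier `PlanStep` model:

```python
def choose_pattern(step: dict) -> str:
    """Route a plan step to the execution pattern that fits it best."""
    if step.get("risk") == "high":
        return "human_in_the_loop"   # Irreversible actions need approval
    if step.get("quality_critical"):
        return "reflection_loop"     # Writing and code benefit from critique
    if not step.get("depends_on"):
        return "parallel"            # Independent steps can run concurrently
    return "sequential"              # Dependent steps run in plan order

print(choose_pattern({"risk": "high"}))            # human_in_the_loop
print(choose_pattern({"quality_critical": True}))  # reflection_loop
print(choose_pattern({"depends_on": []}))          # parallel
print(choose_pattern({"depends_on": [1]}))         # sequential
```

The check order encodes the priorities: safety gating overrides everything, and quality review overrides throughput optimization.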
⚙ Key Insight

Planning quality scales with model capability. Weaker models tend to produce overly granular plans with too many steps, or overly vague plans that skip critical details. For the planner role, use the most capable model available (GPT-4o, Claude Sonnet/Opus). For the executor role, a smaller model may suffice since each step is well-defined. This asymmetric approach balances cost and quality.

Lab: Plan-and-Execute Agent with Self-Correction

In this lab, you will build a complete plan-and-execute agent that decomposes tasks, executes each step, evaluates results, and re-plans when something goes wrong. The agent uses structured outputs for planning, tool execution for each step, and a reflection loop to verify results before moving on.

from openai import OpenAI
from pydantic import BaseModel
import json

client = OpenAI()

# --- Data Models ---
class Step(BaseModel):
    id: int
    description: str
    tool: str
    expected_output: str

class AgentPlan(BaseModel):
    steps: list[Step]

class StepEvaluation(BaseModel):
    success: bool
    issues: list[str]
    suggestion: str

# --- Core Functions ---
def plan(goal: str, context: str = "") -> AgentPlan:
    """Generate or revise a plan based on the goal and context."""
    messages = [
        {"role": "system", "content": (
            "Create a concrete plan with 3 to 6 steps. Each step uses one "
            "tool: search, calculate, analyze, or write. Include what output "
            "you expect from each step."
        )},
        {"role": "user", "content": f"Goal: {goal}\n{context}"}
    ]
    resp = client.beta.chat.completions.parse(
        model="gpt-4o", messages=messages, response_format=AgentPlan
    )
    return resp.choices[0].message.parsed

def evaluate_step(step: Step, result: str) -> StepEvaluation:
    """Evaluate whether a step's result meets expectations."""
    resp = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Step: {step.description}\n"
            f"Expected: {step.expected_output}\n"
            f"Actual result:\n{result}\n\n"
            "Did this step succeed? List any issues."
        )}],
        response_format=StepEvaluation
    )
    return resp.choices[0].message.parsed

# --- Main Agent Loop ---
def run_agent(goal: str, max_replans: int = 2):
    agent_plan = plan(goal)
    results = {}
    replan_count = 0
    i = 0

    while i < len(agent_plan.steps):
        step = agent_plan.steps[i]
        print(f"\nExecuting step {step.id}: {step.description}")
        result = execute_tool(step.tool, step.description, results)
        evaluation = evaluate_step(step, result)

        if evaluation.success:
            results[step.id] = result
            print(f"  Step {step.id} passed.")
            i += 1
        elif replan_count < max_replans:
            replan_count += 1
            print(f"  Step {step.id} failed: {evaluation.issues}")
            context = (
                f"Step {step.id} failed.\n"
                f"Issues: {evaluation.issues}\n"
                f"Suggestion: {evaluation.suggestion}\n"
                f"Completed so far: {json.dumps(results, indent=2)}"
            )
            # Re-plan with failure context and restart from the new plan's
            # first step; completed results are preserved, so the planner
            # can skip work that is already done.
            agent_plan = plan(goal, context)
            i = 0
            print(f"  Re-planned. New plan has {len(agent_plan.steps)} steps.")
        else:
            print(f"  Step {step.id} failed with no replans remaining; stopping.")
            break

    return results

run_agent("Analyze Q4 2024 sales trends and create an executive summary")
Executing step 1: Search for Q4 2024 sales data
  Step 1 passed.
Executing step 2: Calculate quarter-over-quarter growth rates
  Step 2 passed.
Executing step 3: Identify top performing product categories
  Step 3 failed: ['Missing regional breakdown data']
  Re-planned. New plan has 4 steps.
Executing step 3: Fetch regional sales breakdown for Q4 2024
  Step 3 passed.
Executing step 4: Analyze regional and category performance trends
  Step 4 passed.
Executing step 5: Generate executive summary with key findings
  Step 5 passed.

Knowledge Check

1. What are the two distinct roles in the plan-and-execute architecture, and why is separation beneficial?
Show Answer
The two roles are the planner (decomposes goals into steps with dependencies) and the executor (carries out each step using tools). Separation is beneficial because: (1) each role can use a different model or prompt optimized for its task, (2) the plan can be inspected and modified before execution, (3) failure in one step does not corrupt the planning logic, and (4) the planner can revise remaining steps based on execution results.
2. How does LATS differ from standard plan-and-execute, and when would you choose it?
Show Answer
LATS explores multiple possible action paths as a search tree, using UCT to balance exploration and exploitation. Standard plan-and-execute commits to a single sequence of steps. Choose LATS when the task is uncertain (many possible approaches), when early failures are likely and backtracking is needed, or when the optimal strategy is not obvious from the goal description alone. LATS trades higher computational cost for better exploration of the solution space.
3. What problem does the LLM Compiler pattern solve, and how does it determine which tasks can run in parallel?
Show Answer
The LLM Compiler solves the latency problem of sequential execution by identifying independent tasks that can run concurrently. It analyzes the dependency graph of a plan: tasks with no unmet dependencies can execute simultaneously. For example, if three search tasks have no dependencies on each other, they run in parallel. A downstream analysis task that depends on all three waits until they complete. This reduces total latency from the sum of all tasks to the length of the longest dependency chain.
4. Name three strategies for deciding when to insert human-in-the-loop checkpoints.
Show Answer
Three strategies are: (1) Plan approval, where the user reviews the entire plan before execution begins. (2) Action-level gating, where tools are classified by risk level (low/medium/high) and only sensitive operations require approval. (3) Confidence-based gating, where the agent self-assesses its confidence and requests human review when confidence is low. A fourth option is budget-based gating, where autonomous execution is allowed up to a spending threshold.
5. In the reflection loop pattern, what happens if the critic never approves the output?
Show Answer
The max_iterations parameter prevents infinite loops. If the critic does not approve after the maximum number of iterations, the agent returns the best draft produced so far. This is essential for production systems because: (1) some tasks may not have a "perfect" answer, (2) the critic may be overly strict, and (3) unbounded iteration wastes tokens and time. In practice, most outputs are approved within 2 to 3 iterations, and the quality improvement from additional iterations shows diminishing returns.

Key Takeaways