Module 10 · Section 10.3

Advanced Prompt Patterns

Self-reflection, meta-prompting, prompt chaining, and programmatic optimization
★ Big Picture

From manual craft to automated optimization. Sections 10.1 and 10.2 covered techniques you write by hand: few-shot examples, role prompts, chain-of-thought templates. This section moves beyond manual prompt design into patterns that let LLMs improve their own outputs (reflection), generate new prompts (meta-prompting), and optimize prompt pipelines programmatically (DSPy, OPRO). These techniques represent the frontier where prompt engineering becomes prompt programming, shifting from artisanal tuning to systematic, reproducible optimization.

1. Self-Reflection and Iterative Refinement

Reflection is one of the most powerful patterns in modern prompt engineering. The core idea is simple: after generating an initial output, ask the model to critique its own work and then produce a revised version. This generate-critique-revise loop mirrors how skilled human writers and programmers work: they draft, review, and edit rather than producing a final version in a single pass.

Andrew Ng has identified reflection as a first-class agentic design pattern, noting that it provides outsized improvement relative to its implementation complexity. A single reflection pass often lets a weaker model match the output quality of a stronger model run without reflection.

1.1 Basic Reflection Loop

The simplest reflection pattern uses two sequential calls. The first call generates an initial output. The second call receives both the original task and the initial output, then critiques the output and produces an improved version.

import openai

client = openai.OpenAI()

def reflect_and_refine(task: str, max_rounds: int = 2) -> dict:
    """Generate, critique, and revise in a loop."""

    # Step 1: Initial generation
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
        temperature=0.7
    ).choices[0].message.content

    history = [{"round": 0, "output": draft}]

    for i in range(max_rounds):
        # Step 2: Critique the current draft
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": """You are a rigorous reviewer. Analyze the draft for:
1. Correctness: Are there factual or logical errors?
2. Completeness: Is anything important missing?
3. Clarity: Is the writing clear and well-structured?
List specific issues, then rate overall quality 1-10."""},
                {"role": "user",
                 "content": f"Task: {task}\n\nDraft:\n{draft}"}
            ],
            temperature=0.3
        ).choices[0].message.content

        # Step 3: Revise based on critique
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Revise the draft to address every issue raised."},
                {"role": "user",
                 "content": f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}"}
            ],
            temperature=0.4
        ).choices[0].message.content

        history.append({"round": i + 1, "critique": critique, "output": draft})

    return {"final": draft, "history": history}
Figure 10.5: The reflection loop generates a draft, critiques it for specific flaws, revises to address each flaw, and optionally repeats.

1.2 Reflexion: Memory-Augmented Self-Improvement

Reflexion (Shinn et al., 2023) extends basic reflection by adding persistent memory across attempts. When a task involves multiple trials (such as coding challenges or multi-step reasoning), Reflexion stores natural-language "lessons learned" from each failed attempt. On subsequent attempts, the agent reads these lessons before trying again, avoiding previously encountered pitfalls. This is analogous to how a human programmer keeps a mental note of bugs they have already debugged.

The Reflexion architecture has three components:

  1. Actor: Generates actions or outputs for the task.
  2. Evaluator: Provides a binary or scalar signal (e.g., did the code pass all tests?).
  3. Self-reflection: Given the failure signal and the trajectory, generates a natural-language reflection that is stored in a memory buffer.
import openai

client = openai.OpenAI()

def reflexion_code_solver(
    task: str, tests: list[str], max_attempts: int = 3
) -> dict:
    """Solve a coding task with Reflexion-style memory."""
    memory = []  # Persistent lessons from past failures

    for attempt in range(max_attempts):
        # Build context with accumulated lessons
        memory_str = "\n".join(
            f"Lesson {i+1}: {m}" for i, m in enumerate(memory)
        )
        lessons = (
            f"Lessons from previous attempts:\n{memory_str}\n" if memory else ""
        )

        prompt = f"""Solve this coding task.
Task: {task}

{lessons}
Return ONLY the Python function."""

        # Generate code
        code = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2
        ).choices[0].message.content

        # Run tests (evaluator); run_tests is a user-supplied test harness
        results = run_tests(code, tests)

        if all(r["passed"] for r in results):
            return {"code": code, "attempts": attempt + 1}

        # Reflect on failures and add to memory
        failures = [r for r in results if not r["passed"]]
        reflection = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                "content": f"""My code failed these tests: {failures}
My code was:
{code}
In one sentence, what key insight did I miss?"""}],
            temperature=0.3
        ).choices[0].message.content

        memory.append(reflection)

    return {"code": code, "attempts": max_attempts, "failed": True}
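The evaluator `run_tests` is left undefined above. A minimal sketch, assuming each test is a Python assert statement to run against the generated code (note: executing model output with exec is unsafe outside a sandboxed environment):

```python
def run_tests(code: str, tests: list[str]) -> list[dict]:
    """Minimal evaluator: exec the generated code, then each assert-style test.
    WARNING: running model-generated code is unsafe outside a sandbox."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # defines the candidate function(s)
    except Exception as e:
        return [{"test": t, "passed": False, "error": str(e)} for t in tests]
    results = []
    for t in tests:
        try:
            exec(t, namespace)  # e.g., "assert add(1, 2) == 3"
            results.append({"test": t, "passed": True})
        except Exception as e:
            results.append({"test": t, "passed": False, "error": str(e)})
    return results
```

A production version would run each test in a subprocess with a timeout rather than in-process.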
★ Key Insight

Reflexion's power comes from converting failure signals into natural-language lessons that persist across attempts. Unlike simple retry loops (which just re-run the same prompt), Reflexion accumulates structured knowledge about what went wrong. On the HumanEval coding benchmark, Reflexion improved pass@1 from 80% to 91% using GPT-4, with most problems solved within two to three attempts.

1.3 When Reflection Helps vs. When It Wastes Compute

Reflection is not universally beneficial. It adds latency (two to three extra API calls per round) and cost. Use it when the task has objectively verifiable quality criteria (tests pass, facts are correct, format matches spec). Avoid it for subjective tasks where the model may "revise" a perfectly good answer into something worse, or for simple tasks where the first-pass accuracy is already above 95%.

| Scenario | Reflection Helps? | Why |
| --- | --- | --- |
| Code generation with unit tests | Yes | Tests provide a clear pass/fail signal |
| Fact-checked report writing | Yes | Verifiable claims can be audited |
| Creative fiction | Rarely | Quality is subjective; revisions may flatten style |
| Simple classification | No | Single-pass accuracy is already high |
| Structured data extraction | Yes | Schema validation provides concrete error signals |
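The tasks where reflection helps share one trait: a machine-checkable error signal. For structured data extraction, a schema check can serve as the critique trigger; a minimal sketch (the required_keys convention is a hypothetical example, not a library API):

```python
import json

def schema_errors(output: str, required_keys: list[str]) -> list[str]:
    """Return concrete error messages a revision prompt can address.
    An empty list means the draft passes and no reflection round is needed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return [f"output is not valid JSON: {e}"]
    return [f"missing required key: {k}" for k in required_keys if k not in data]
```

Feeding these messages into the revise step gives the model specific, objective issues to fix instead of an open-ended self-critique.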

2. Meta-Prompting: Prompts That Generate Prompts

Meta-prompting uses one LLM call to write the prompt for a subsequent call. Instead of manually crafting a system prompt, you describe what you need the prompt to accomplish and let the model generate it. This is particularly useful when you need domain-specific prompts for areas where you lack expertise, or when you want to rapidly prototype multiple prompt variants for testing.

import openai

client = openai.OpenAI()

def generate_expert_prompt(task_description: str, audience: str) -> str:
    """Use an LLM to generate a specialized system prompt."""
    meta_prompt = f"""You are a prompt engineering expert. Create a detailed
system prompt for an LLM that will perform the following task:

Task: {task_description}
Target audience: {audience}

Your system prompt should include:
1. A clear role definition
2. Specific output format instructions
3. Quality criteria the LLM should follow
4. Edge cases to handle
5. Two concise examples of ideal output

Return ONLY the system prompt text, no commentary."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content

# Generate a prompt for medical triage
prompt = generate_expert_prompt(
    task_description="Classify patient symptoms into urgency levels",
    audience="Emergency department nurses"
)
print(prompt[:300])
You are a clinical triage assistant. Your role is to analyze patient-reported symptoms and classify them into urgency levels following the Emergency Severity Index (ESI) framework.

OUTPUT FORMAT:
- Urgency Level: [1-Critical, 2-Emergent, 3-Urgent, 4-Less Urgent, 5-Non-Urgent]
- Key Symptoms: [bullet list]
- Recommended Action: [one sentence]
...
📝 Note: Meta-Prompting and Iteration

Meta-prompting is most powerful when combined with evaluation. Generate several candidate prompts, test each on a validation set, and select the best performer. This converts prompt engineering from a creative exercise into a measurable optimization problem. The next sections on DSPy and automatic prompt engineering formalize this approach.
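The generate-then-select workflow described above can be sketched as a small helper. Here, evaluate is an assumed user-supplied function scoring one candidate prompt on one validation example; the candidates themselves could come from repeated calls to generate_expert_prompt:

```python
def select_best_prompt(candidates, val_set, evaluate):
    """Score each candidate system prompt on a validation set.
    Returns the (mean_score, prompt) pair with the highest mean score."""
    scored = []
    for prompt in candidates:
        mean = sum(evaluate(prompt, example) for example in val_set) / len(val_set)
        scored.append((mean, prompt))
    return max(scored, key=lambda pair: pair[0])
```

This turns the subjective question "which prompt is better?" into a measurable comparison, at the cost of one evaluation call per candidate per validation example.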

3. Prompt Chaining and Decomposition

Prompt chaining breaks complex tasks into a pipeline of smaller, focused LLM calls. Each call in the chain handles one well-defined subtask, and the output of one call becomes the input for the next. This is the LLM equivalent of Unix pipes or functional composition: each stage does one thing well.

Decomposition offers several advantages over monolithic prompts: each focused call has higher per-step accuracy; intermediate outputs can be inspected and validated between stages; and cheap models can handle simple stages while expensive models are reserved for reasoning-heavy ones.

[Figure: four-stage pipeline. Stage 1, Extract (gpt-4o-mini): parse raw input into structured entities. Stage 2, Analyze (gpt-4o): reason about entities and relationships. Stage 3, Validate (gpt-4o-mini): check constraints and consistency. Stage 4, Format (gpt-4o-mini): produce the final structured output. Cost comparison: monolithic single gpt-4o call (~2,000 tokens in, ~800 out) costs about $0.034; the chained version (stages 1, 3, 4 on mini) costs about $0.019, a 44% cost reduction with better reliability.]
Figure 10.6: A four-stage prompt chain uses cheaper models for simple stages and reserves expensive models for reasoning-heavy stages.
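Structurally, the pipeline in Figure 10.6 is just function composition. A minimal sketch, where in practice each stage would wrap one focused LLM call (plain string functions stand in here):

```python
def chain(*stages):
    """Compose stages left to right: each stage's output feeds the next.
    In a real pipeline, each stage wraps a single focused LLM call."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

# Stand-in stages; real ones would route to gpt-4o-mini / gpt-4o as in Figure 10.6
pipeline = chain(
    lambda text: text.strip(),        # Stage 1: extract / normalize (cheap model)
    lambda text: text.upper(),        # Stage 2: analyze (expensive model)
    lambda text: f"RESULT: {text}",   # Stage 4: format (cheap model)
)
```

Because each stage is an ordinary function, intermediate outputs can be logged, validated, or cached between calls.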

4. DSPy: Programmatic Prompt Optimization

📚 Paper Spotlight: Khattab et al. (2023)

"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines" introduced the idea that prompts should be compiled, not hand-written. By defining module signatures and letting an optimizer search over prompt variants, DSPy consistently outperforms expert-crafted prompts on multi-step tasks. The paradigm shift: prompts are parameters to be learned, not strings to be crafted.

DSPy (Declarative Self-improving Language Programs, Khattab et al., 2023) is a framework that replaces hand-written prompts with compiled, optimizable programs. Instead of manually tweaking prompt text, you declare what each module should do using signatures, compose modules into a pipeline, and let an optimizer automatically discover the best prompts, few-shot examples, and configurations.

DSPy treats prompt engineering the way PyTorch treats neural network training: you define the architecture (modules and signatures), provide training data, and let the optimizer handle the rest.

⚠ Version Sensitivity

The DSPy API has changed significantly between major versions. The examples in this section use DSPy v2+ syntax (pip install "dspy>=2.5"). If you encounter import errors or unfamiliar class names, check your installed version. The DSPy documentation at dspy.ai tracks the current API surface.

4.1 Core Concepts

DSPy is built around three abstractions: signatures (typed declarations of a module's inputs and outputs), modules (prompting strategies such as ChainOfThought that implement a signature), and optimizers (which compile a program against data and a metric).

import dspy

# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a signature: what the module should do
class FactCheck(dspy.Signature):
    """Verify whether a claim is supported by the given context."""
    context: str = dspy.InputField(desc="Reference text with known facts")
    claim: str = dspy.InputField(desc="The claim to verify")
    reasoning: str = dspy.OutputField(desc="Step-by-step verification")
    verdict: str = dspy.OutputField(desc="SUPPORTED, REFUTED, or NOT ENOUGH INFO")

# Create a module that implements the signature with CoT
fact_checker = dspy.ChainOfThought(FactCheck)

# Use it (DSPy generates the prompt automatically)
result = fact_checker(
    context="The Eiffel Tower is 330 meters tall and was built in 1889.",
    claim="The Eiffel Tower is the tallest structure in Paris at over 400m."
)
print(f"Verdict: {result.verdict}")
print(f"Reasoning: {result.reasoning}")
Verdict: REFUTED
Reasoning: The context states the Eiffel Tower is 330 meters tall, not over 400 meters. While it may be the tallest structure in Paris, the claim about 400m is incorrect based on the provided context.

4.2 Optimizing with DSPy

The real power of DSPy is optimization. Given a training set and a metric function, the optimizer automatically discovers the best instructions and few-shot examples for your pipeline.

import dspy
from dspy.evaluate import Evaluate

# Define your program (multi-hop question answering)
class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.find_evidence = dspy.ChainOfThought(
            "question -> search_queries: list[str]"
        )
        self.answer = dspy.ChainOfThought(
            "question, evidence -> answer"
        )

    def forward(self, question):
        queries = self.find_evidence(question=question)
        # search() is a user-supplied retrieval function (e.g., web or vector search)
        evidence = [search(q) for q in queries.search_queries]
        return self.answer(
            question=question,
            evidence="\n".join(evidence)
        )

# Prepare training data
trainset = [
    dspy.Example(
        question="Who directed the highest-grossing film of 2023?",
        answer="Christopher Nolan"
    ).with_inputs("question"),
    # ... more examples
]

# Define a metric
def answer_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# Optimize: automatically find best prompts and examples
optimizer = dspy.MIPROv2(metric=answer_match, auto="medium")
optimized_qa = optimizer.compile(MultiHopQA(), trainset=trainset)

# The optimized program has better prompts baked in
result = optimized_qa(question="What country hosted the 2024 Olympics?")
print(result.answer)
⚠ DSPy Optimization Cost

DSPy optimizers like MIPRO explore many prompt variants during compilation. A typical optimization run for a two-module pipeline might make 200 to 500 LLM calls to find optimal configurations. This upfront cost is amortized over all future queries. For a pipeline serving 10,000 queries per day, spending $5 on optimization to improve accuracy by 5% is a strong investment. For a one-off analysis, manual prompt tuning is more practical.

5. Automatic Prompt Engineering (APE and OPRO)

While DSPy optimizes entire programs, other approaches focus specifically on optimizing the prompt text itself. Two landmark methods illustrate this direction.

5.1 APE: Automatic Prompt Engineer

APE (Zhou et al., 2022) uses one LLM to generate candidate instructions for another LLM, then evaluates each candidate on a validation set and selects the winner. The process is straightforward: given a few input-output examples of the desired behavior, ask a model to "generate an instruction that would produce these outputs from these inputs." Generate many candidates, score them, and keep the best.
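The candidate-generation step can be sketched as building an instruction-induction prompt from the example pairs. The template wording below is an approximation of the style used in the APE paper, not a verbatim quote:

```python
def induction_prompt(examples: list[tuple[str, str]]) -> str:
    """Build an APE-style prompt asking a model to infer the instruction
    that would map each input to its output."""
    demos = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return (
        "I gave a friend an instruction. Based on the instruction, they produced "
        "the following input-output pairs:\n\n"
        f"{demos}\n\n"
        "The instruction was:"
    )
```

Sampling this prompt many times at nonzero temperature yields the candidate pool, which is then scored on held-out examples.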

5.2 OPRO: Optimization by Prompting

OPRO (Yang et al., 2023) takes an iterative approach. It maintains a running log of previously tried prompts along with their scores. At each iteration, the optimizer LLM sees this history and generates new prompt candidates that attempt to improve on past results. This turns prompt optimization into an LLM-driven search process.
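The history-conditioned optimizer prompt can be sketched as follows. OPRO presents past solutions sorted by score so the best appear last; the exact template here is a paraphrase, not the paper's verbatim text:

```python
def opro_meta_prompt(history: list[tuple[str, float]], task: str) -> str:
    """Build an OPRO-style optimizer prompt from (instruction, score) history."""
    ordered = sorted(history, key=lambda pair: pair[1])  # ascending: best last
    trajectory = "\n".join(f"text: {p}\nscore: {s:.0%}" for p, s in ordered)
    return (
        f"Task: {task}\n"
        "Below are previous instructions with their scores, from worst to best.\n\n"
        f"{trajectory}\n\n"
        "Write a new instruction that is different from all of the above "
        "and achieves a higher score."
    )
```

Each iteration sends this prompt to the optimizer LLM, evaluates the proposed instruction on the validation set, appends the result to the history, and repeats.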

[Figure: optimization history with scores trending upward over iterations: "Classify the text" 62%; "You are an expert..." 74%; "Analyze sentiment..." 79%; "Rate the emotional..." 85%. The optimizer LLM, given these past prompts and scores, generates a better candidate; the candidate is evaluated on the validation set (scoring 88% here), added to the history, and the loop repeats.]
Figure 10.7: OPRO maintains a history of prompts and scores. The optimizer LLM uses this history to propose improved prompts iteratively.

6. Comparison of Optimization Approaches

| Approach | What It Optimizes | Requires | Best For |
| --- | --- | --- | --- |
| Manual tuning | Prompt text (by hand) | Human intuition | One-off tasks, prototyping |
| Meta-prompting | Prompt generation | Task description only | Rapid prompt drafts |
| APE | Instruction text | Input-output examples | Instruction discovery |
| OPRO | Instruction text (iterative) | Validation set + metric | Iterative refinement |
| DSPy | Full pipeline (prompts + examples + modules) | Training set + metric | Production multi-step pipelines |

📝 Section Quiz

1. How does Reflexion differ from a simple "retry on failure" loop?

Answer:
A simple retry loop re-runs the same prompt, hoping randomness produces a better result. Reflexion adds a reflection step after each failure that generates a natural-language lesson (e.g., "I forgot to handle the edge case where the list is empty"). These lessons are stored in memory and prepended to the prompt on subsequent attempts, so the model learns from its mistakes rather than repeating them. This makes each retry substantively different from the last.

2. Why does prompt chaining often produce more reliable results than a single complex prompt?

Answer:
A single complex prompt requires the model to handle multiple subtasks simultaneously: extraction, reasoning, validation, and formatting all in one pass. This overloads the model's attention and increases the probability of error at each subtask. Prompt chaining decomposes the problem so each call handles exactly one subtask. Simpler prompts have higher per-step accuracy, and intermediate outputs can be inspected and validated before passing to the next stage. The total pipeline reliability often exceeds that of a monolithic prompt, even though it involves more API calls.

3. In DSPy, what is the role of a "signature" and how does it differ from a traditional prompt?

Answer:
A DSPy signature is a typed declaration of inputs and outputs (e.g., "context, question -> reasoning, answer") that specifies what a module should do without specifying how. A traditional prompt is a specific piece of text that tells the model how to behave. DSPy separates the "what" from the "how" so the optimizer can automatically discover the best prompt text, few-shot examples, and configuration. This means the developer declares intent and the framework handles implementation.

4. When would you choose OPRO over DSPy for prompt optimization?

Answer:
OPRO is simpler and optimizes a single prompt instruction iteratively. It works well when you have one prompt to optimize and a clear metric. DSPy is designed for multi-module pipelines where you need to optimize prompts, few-shot examples, and inter-module flow jointly. If your task is a single-stage classification or generation task, OPRO is lighter-weight and sufficient. If you have a multi-step pipeline (extract, then reason, then validate), DSPy's module composition and joint optimization provide more benefit.

5. What is the risk of reflection on subjective tasks like creative writing?

Answer:
On subjective tasks, the critique step lacks a clear ground truth. The model's "critique" may impose arbitrary preferences, simplify vivid language, or average out distinctive stylistic choices. This tends to make outputs more generic and homogeneous with each revision round. Reflection works best when there are objective success criteria (tests pass, facts are correct, schema is valid). For creative tasks, a single well-prompted generation often outperforms multiple rounds of self-critique.

Key Takeaways