From manual craft to automated optimization. Sections 10.1 and 10.2 covered techniques you write by hand: few-shot examples, role prompts, chain-of-thought templates. This section moves beyond manual prompt design into patterns that let LLMs improve their own outputs (reflection), generate new prompts (meta-prompting), and optimize prompt pipelines programmatically (DSPy, OPRO). These techniques represent the frontier where prompt engineering becomes prompt programming, shifting from artisanal tuning to systematic, reproducible optimization.
1. Self-Reflection and Iterative Refinement
Reflection is one of the most powerful patterns in modern prompt engineering. The core idea is simple: after generating an initial output, ask the model to critique its own work and then produce a revised version. This generate-critique-revise loop mirrors how skilled human writers and programmers work: they draft, review, and edit rather than producing a final version in a single pass.
Andrew Ng has identified reflection as a first-class agentic design pattern, noting that it provides outsized improvement relative to its implementation complexity. A single reflection pass can often let a weaker model match the output quality of a stronger model run without reflection.
1.1 Basic Reflection Loop
The simplest reflection pattern uses two sequential calls. The first call generates an initial output. The second call receives both the original task and the initial output, then critiques the output and produces an improved version.
```python
import openai

client = openai.OpenAI()

def reflect_and_refine(task: str, max_rounds: int = 2) -> dict:
    """Generate, critique, and revise in a loop."""
    # Step 1: Initial generation
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
        temperature=0.7,
    ).choices[0].message.content
    history = [{"round": 0, "output": draft}]

    for i in range(max_rounds):
        # Step 2: Critique the current draft
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": """You are a rigorous reviewer. Analyze the draft for:
1. Correctness: Are there factual or logical errors?
2. Completeness: Is anything important missing?
3. Clarity: Is the writing clear and well-structured?
List specific issues, then rate overall quality 1-10."""},
                {"role": "user", "content": f"Task: {task}\n\nDraft:\n{draft}"},
            ],
            temperature=0.3,
        ).choices[0].message.content

        # Step 3: Revise based on critique
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Revise the draft to address every issue raised."},
                {"role": "user", "content": f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}"},
            ],
            temperature=0.4,
        ).choices[0].message.content
        history.append({"round": i + 1, "critique": critique, "output": draft})

    return {"final": draft, "history": history}
```
1.2 Reflexion: Memory-Augmented Self-Improvement
Reflexion (Shinn et al., 2023) extends basic reflection by adding persistent memory across attempts. When a task involves multiple trials (such as coding challenges or multi-step reasoning), Reflexion stores natural-language "lessons learned" from each failed attempt. On subsequent attempts, the agent reads these lessons before trying again, avoiding previously encountered pitfalls. This is analogous to how a human programmer keeps a mental note of bugs they have already debugged.
The Reflexion architecture has three components:
- Actor: Generates actions or outputs for the task.
- Evaluator: Provides a binary or scalar signal (e.g., did the code pass all tests?).
- Self-reflection: Given the failure signal and the trajectory, generates a natural-language reflection that is stored in a memory buffer.
```python
import openai

client = openai.OpenAI()

def reflexion_code_solver(
    task: str,
    tests: list[str],
    max_attempts: int = 3,
) -> dict:
    """Solve a coding task with Reflexion-style memory."""
    memory = []  # Persistent lessons from past failures

    for attempt in range(max_attempts):
        # Build context with accumulated lessons
        memory_str = "\n".join(
            f"Lesson {i+1}: {m}" for i, m in enumerate(memory)
        )
        lessons = f"Lessons from previous attempts:\n{memory_str}" if memory else ""
        prompt = f"""Solve this coding task.

Task: {task}

{lessons}

Return ONLY the Python function."""

        # Generate code
        code = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        ).choices[0].message.content

        # Run tests (evaluator). run_tests is a user-supplied helper that
        # executes the code against each test and returns pass/fail records.
        results = run_tests(code, tests)
        if all(r["passed"] for r in results):
            return {"code": code, "attempts": attempt + 1}

        # Reflect on failures and add to memory
        failures = [r for r in results if not r["passed"]]
        reflection = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"""My code failed these tests:
{failures}

My code was:
{code}

In one sentence, what key insight did I miss?"""}],
            temperature=0.3,
        ).choices[0].message.content
        memory.append(reflection)

    return {"code": code, "attempts": max_attempts, "failed": True}
```
Reflexion's power comes from converting failure signals into natural-language lessons that persist across attempts. Unlike simple retry loops (which just re-run the same prompt), Reflexion accumulates structured knowledge about what went wrong. On the HumanEval coding benchmark, Reflexion improved pass@1 from 80% to 91% using GPT-4, with most problems solved within two to three attempts.
1.3 When Reflection Helps vs. When It Wastes Compute
Reflection is not universally beneficial. It adds latency (two to three extra API calls per round) and cost. Use it when the task has objectively verifiable quality criteria (tests pass, facts are correct, format matches spec). Avoid it for subjective tasks where the model may "revise" a perfectly good answer into something worse, or for simple tasks where the first-pass accuracy is already above 95%.
| Scenario | Reflection Helps? | Why |
|---|---|---|
| Code generation with unit tests | Yes | Tests provide clear pass/fail signal |
| Fact-checked report writing | Yes | Verifiable claims can be audited |
| Creative fiction | Rarely | Quality is subjective; revisions may flatten style |
| Simple classification | No | Single-pass accuracy is already high |
| Structured data extraction | Yes | Schema validation provides concrete error signals |
2. Meta-Prompting: Prompts That Generate Prompts
Meta-prompting uses one LLM call to write the prompt for a subsequent call. Instead of manually crafting a system prompt, you describe what you need the prompt to accomplish and let the model generate it. This is particularly useful when you need domain-specific prompts for areas where you lack expertise, or when you want to rapidly prototype multiple prompt variants for testing.
```python
import openai

client = openai.OpenAI()

def generate_expert_prompt(task_description: str, audience: str) -> str:
    """Use an LLM to generate a specialized system prompt."""
    meta_prompt = f"""You are a prompt engineering expert. Create a detailed
system prompt for an LLM that will perform the following task:

Task: {task_description}
Target audience: {audience}

Your system prompt should include:
1. A clear role definition
2. Specific output format instructions
3. Quality criteria the LLM should follow
4. Edge cases to handle
5. Two concise examples of ideal output

Return ONLY the system prompt text, no commentary."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": meta_prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Generate a prompt for medical triage
prompt = generate_expert_prompt(
    task_description="Classify patient symptoms into urgency levels",
    audience="Emergency department nurses",
)
print(prompt[:300])
```
Meta-prompting is most powerful when combined with evaluation. Generate several candidate prompts, test each on a validation set, and select the best performer. This converts prompt engineering from a creative exercise into a measurable optimization problem. The next sections on DSPy and automatic prompt engineering formalize this approach.
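A minimal sketch of that generate-then-evaluate step, independent of any provider: `select_best_prompt` is a hypothetical helper, the stub model below stands in for real LLM calls, and exact-match accuracy stands in for whatever task metric you use.

```python
def select_best_prompt(candidates, valset, run_with_prompt):
    """Score each candidate prompt by mean exact-match accuracy on the
    validation set and return (best_prompt, best_score).
    run_with_prompt(prompt, input) -> output is any callable, typically
    one LLM call per example."""
    scored = []
    for prompt in candidates:
        hits = sum(run_with_prompt(prompt, x) == y for x, y in valset)
        scored.append((hits / len(valset), prompt))
    best_score, best_prompt = max(scored)
    return best_prompt, best_score

# Toy demo with a stub "model": prompt B happens to match more labels.
valset = [("2+2", "4"), ("3+3", "6")]
stub = lambda prompt, x: "4" if prompt == "B" else "0"
best, score = select_best_prompt(["A", "B"], valset, stub)
print(best, score)  # → B 0.5
```

With real candidates from `generate_expert_prompt`, the same loop makes prompt selection a measurable choice rather than a judgment call.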
3. Prompt Chaining and Decomposition
Prompt chaining breaks complex tasks into a pipeline of smaller, focused LLM calls. Each call in the chain handles one well-defined subtask, and the output of one call becomes the input for the next. This is the LLM equivalent of Unix pipes or functional composition: each stage does one thing well.
Decomposition offers several advantages over monolithic prompts:
- Reliability: Each stage is simpler, so each individual call is more likely to succeed.
- Debuggability: When something fails, you can inspect intermediate outputs to identify exactly which stage broke.
- Modularity: Individual stages can be swapped, tuned, or replaced independently.
- Cost optimization: Simple stages can use cheaper, faster models while only complex stages use expensive ones.
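The idea can be sketched as a tiny chaining harness (illustrative, not a real library): each stage is a plain function standing in for an LLM call, and every intermediate output is recorded so a failing stage can be inspected in isolation.

```python
def run_chain(stages, initial_input):
    """Run (name, fn) stages in sequence, keeping every intermediate
    output for debugging."""
    trace = [initial_input]
    value = initial_input
    for name, fn in stages:
        value = fn(value)
        trace.append((name, value))
    return value, trace

# Example pipeline: summarize -> extract keywords -> format, with stub
# stages in place of model calls.
stages = [
    ("summarize", lambda t: t.split(".")[0]),               # first sentence
    ("keywords",  lambda s: [w for w in s.split() if len(w) > 4]),
    ("format",    lambda ks: ", ".join(ks)),
]
final, trace = run_chain(stages, "Prompt chaining works well. Extra text.")
print(final)  # → Prompt, chaining, works
```

In a production chain, each `fn` would be an LLM call, and cheap stages could be routed to a smaller model while only the hard stage uses an expensive one.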
4. DSPy: Programmatic Prompt Optimization
"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines" introduced the idea that prompts should be compiled, not hand-written. By defining module signatures and letting an optimizer search over prompt variants, DSPy consistently outperforms expert-crafted prompts on multi-step tasks. The paradigm shift: prompts are parameters to be learned, not strings to be crafted.
DSPy (Declarative Self-improving Language Programs, Khattab et al., 2023) is a framework that replaces hand-written prompts with compiled, optimizable programs. Instead of manually tweaking prompt text, you declare what each module should do using signatures, compose modules into a pipeline, and let an optimizer automatically discover the best prompts, few-shot examples, and configurations.
DSPy treats prompt engineering the way PyTorch treats neural network training: you define the architecture (modules and signatures), provide training data, and let the optimizer handle the rest.
The DSPy API has changed significantly between major versions. The examples in this section use DSPy v2+ syntax (`pip install "dspy>=2.5"`). If you encounter import errors or unfamiliar class names, check your installed version. The DSPy documentation at dspy.ai tracks the current API surface.
4.1 Core Concepts
DSPy is built around three abstractions:
- Signatures: Typed input/output specifications like `"question -> answer"` or `"context, question -> reasoning, answer"`. These declare what a module does without specifying how.
- Modules: Building blocks that implement signatures. `dspy.Predict` does a single call. `dspy.ChainOfThought` adds reasoning. `dspy.ReAct` adds tool use. You compose modules into programs.
- Optimizers (Teleprompters): Algorithms that tune the program. `BootstrapFewShot` selects optimal few-shot examples. `MIPRO` optimizes instructions and examples jointly. `BootstrapFinetune` generates training data for model finetuning.
```python
import dspy

# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a signature: what the module should do
class FactCheck(dspy.Signature):
    """Verify whether a claim is supported by the given context."""
    context: str = dspy.InputField(desc="Reference text with known facts")
    claim: str = dspy.InputField(desc="The claim to verify")
    reasoning: str = dspy.OutputField(desc="Step-by-step verification")
    verdict: str = dspy.OutputField(desc="SUPPORTED, REFUTED, or NOT ENOUGH INFO")

# Create a module that implements the signature with CoT
fact_checker = dspy.ChainOfThought(FactCheck)

# Use it (DSPy generates the prompt automatically)
result = fact_checker(
    context="The Eiffel Tower is 330 meters tall and was built in 1889.",
    claim="The Eiffel Tower is the tallest structure in Paris at over 400m.",
)
print(f"Verdict: {result.verdict}")
print(f"Reasoning: {result.reasoning}")
```
4.2 Optimizing with DSPy
The real power of DSPy is optimization. Given a training set and a metric function, the optimizer automatically discovers the best instructions and few-shot examples for your pipeline.
```python
import dspy

# Define your program (multi-hop question answering)
class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.find_evidence = dspy.ChainOfThought(
            "question -> search_queries: list[str]"
        )
        self.answer = dspy.ChainOfThought(
            "question, evidence -> answer"
        )

    def forward(self, question):
        queries = self.find_evidence(question=question)
        # search() is a user-supplied retrieval function, e.g. a web or
        # vector-store lookup returning passage text for a query.
        evidence = [search(q) for q in queries.search_queries]
        return self.answer(
            question=question,
            evidence="\n".join(evidence),
        )

# Prepare training data
trainset = [
    dspy.Example(
        question="Who directed the highest-grossing film of 2023?",
        answer="Greta Gerwig",
    ).with_inputs("question"),
    # ... more examples
]

# Define a metric
def answer_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

# Optimize: automatically find best prompts and examples
optimizer = dspy.MIPROv2(metric=answer_match, auto="medium")
optimized_qa = optimizer.compile(MultiHopQA(), trainset=trainset)

# The optimized program has better prompts baked in
result = optimized_qa(question="What country hosted the 2024 Olympics?")
print(result.answer)
```
DSPy optimizers like MIPRO explore many prompt variants during compilation. A typical optimization run for a two-module pipeline might make 200 to 500 LLM calls to find optimal configurations. This upfront cost is amortized over all future queries. For a pipeline serving 10,000 queries per day, spending $5 on optimization to improve accuracy by 5% is a strong investment. For a one-off analysis, manual prompt tuning is more practical.
5. Automatic Prompt Engineering (APE and OPRO)
While DSPy optimizes entire programs, other approaches focus specifically on optimizing the prompt text itself. Two landmark methods illustrate this direction.
5.1 APE: Automatic Prompt Engineer
APE (Zhou et al., 2022) uses one LLM to generate candidate instructions for another LLM, then evaluates each candidate on a validation set and selects the winner. The process is straightforward: given a few input-output examples of the desired behavior, ask a model to "generate an instruction that would produce these outputs from these inputs." Generate many candidates, score them, and keep the best.
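The instruction-induction step can be sketched as a prompt template; `induction_prompt` is a hypothetical helper and the template wording is paraphrased from the paper's setup rather than quoted exactly.

```python
def induction_prompt(pairs):
    """Build a meta-prompt asking a model to infer the instruction that
    would map each input to its output."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)
    return (
        "I gave a friend an instruction. Based on the instruction, "
        "they produced the following input-output pairs:\n\n"
        f"{demos}\n\nThe instruction was:"
    )

# Example demonstrations implying an "uppercase the input" instruction.
pairs = [("cat", "CAT"), ("dog", "DOG")]
print(induction_prompt(pairs))
```

Sampling this prompt many times yields a pool of candidate instructions, which are then scored on a validation set and filtered down to the best performer.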
5.2 OPRO: Optimization by Prompting
OPRO (Yang et al., 2023) takes an iterative approach. It maintains a running log of previously tried prompts along with their scores. At each iteration, the optimizer LLM sees this history and generates new prompt candidates that attempt to improve on past results. This turns prompt optimization into an LLM-driven search process.
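The loop can be sketched with stub components; `propose` and `score` below are placeholders for the optimizer LLM (which sees the scored history) and the validation-set scorer, and the toy demo uses prompt length as a stand-in metric.

```python
def opro_search(propose, score, seed_prompt, rounds=3):
    """OPRO-style search: keep a score-sorted history of tried prompts,
    show it to the proposer, and ask for a candidate that beats the best.
    propose(history) -> new prompt; score(prompt) -> float."""
    history = [(score(seed_prompt), seed_prompt)]
    for _ in range(rounds):
        history.sort()                  # ascending: best prompt is last
        candidate = propose(history)
        history.append((score(candidate), candidate))
    history.sort()
    return history[-1]                  # (best_score, best_prompt)

# Toy demo: the "proposer" extends the current best prompt, and the
# "scorer" rewards length, so each round improves the score.
best = opro_search(
    propose=lambda h: h[-1][1] + " step-by-step",
    score=lambda p: len(p),
    seed_prompt="Solve the problem",
)
print(best)
```

With a real optimizer LLM, `propose` would serialize the history into the meta-prompt ("here are prompts and their scores; write a better one"), which is what turns the search into an LLM-driven process.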
6. Comparison of Optimization Approaches
| Approach | What It Optimizes | Requires | Best For |
|---|---|---|---|
| Manual tuning | Prompt text (by hand) | Human intuition | One-off tasks, prototyping |
| Meta-prompting | Prompt generation | Task description only | Rapid prompt drafts |
| APE | Instruction text | Input-output examples | Instruction discovery |
| OPRO | Instruction text (iterative) | Validation set + metric | Iterative refinement |
| DSPy | Full pipeline (prompts + examples + modules) | Training set + metric | Production multi-step pipelines |
📝 Section Quiz
1. How does Reflexion differ from a simple "retry on failure" loop?
Show Answer
2. Why does prompt chaining often produce more reliable results than a single complex prompt?
Show Answer
3. In DSPy, what is the role of a "signature" and how does it differ from a traditional prompt?
Show Answer
"context, question -> reasoning, answer") that specifies what a module should do without specifying how. A traditional prompt is a specific piece of text that tells the model how to behave. DSPy separates the "what" from the "how" so the optimizer can automatically discover the best prompt text, few-shot examples, and configuration. This means the developer declares intent and the framework handles implementation.4. When would you choose OPRO over DSPy for prompt optimization?
Show Answer
5. What is the risk of reflection on subjective tasks like creative writing?
Show Answer
Key Takeaways
- Reflection is the highest-value advanced pattern. A simple generate-critique-revise loop can close the gap between weak and strong models. Use it whenever you have verifiable quality criteria.
- Reflexion adds memory to reflection. By storing natural-language lessons from failures, Reflexion makes each retry attempt meaningfully different. It improved HumanEval pass@1 from 80% to 91%.
- Meta-prompting automates prompt drafting. Use an LLM to generate specialized prompts rather than writing them from scratch, especially for domains outside your expertise.
- Prompt chaining trades latency for reliability. Decomposing a complex task into focused stages makes each stage simpler and more likely to succeed. Route simple stages to cheaper models to offset the additional API calls.
- DSPy turns prompt engineering into programming. Declare signatures, compose modules, and let optimizers discover the best prompts and examples. This is the most scalable approach for production pipelines.
- Automatic optimization (APE, OPRO) treats prompt text as a search problem. Given a validation set and metric, algorithms can discover prompts that outperform human-written ones.