Why does prompting matter? A large language model is a powerful text completion engine, but its output is entirely shaped by its input. The same model that produces vague, rambling answers to a poorly worded question can deliver precise, structured, expert-level responses when given a well-designed prompt. Prompt engineering is the discipline of systematically crafting inputs to maximize output quality. This section covers the foundational techniques that every practitioner needs: zero-shot prompting, few-shot learning, role assignment, system prompt architecture, and template design. These techniques form the building blocks for every advanced strategy covered later in this chapter.
1. The Anatomy of a Prompt
Before diving into specific techniques, it helps to understand the structural components that every prompt can contain. A well-designed prompt typically includes some combination of the following elements: an instruction that tells the model what to do, context that provides background information, input data that the model should process, output format specification that constrains the response shape, and examples that demonstrate the desired behavior. Not every prompt needs all five components, but being intentional about which components to include (and which to omit) is the first step toward reliable results.
In the chat API format used by most modern LLMs, these components are distributed across different message roles. The system message typically carries the instruction, context, and format specification. The user message carries the input data. And when using few-shot examples, pairs of user and assistant messages demonstrate the expected behavior before the actual query arrives.
2. Zero-Shot Prompting
Zero-shot prompting is the simplest approach: you give the model an instruction and input with no examples. The model relies entirely on its pre-trained knowledge to produce the answer. This works surprisingly well for tasks the model has seen during training, such as summarization, translation, classification of common categories, and question answering. The key to effective zero-shot prompting is specificity. Vague instructions produce vague results.
2.1 The Specificity Principle
Consider the difference between a vague prompt and a specific one for the same task:
# Vague zero-shot prompt vague_prompt = "Summarize this article." # Specific zero-shot prompt specific_prompt = """Summarize the following article in exactly 3 bullet points. Each bullet point should be one sentence of at most 20 words. Focus on the key findings, not background information. Use present tense throughout. Article: {article_text}"""
The specific prompt constrains the output length (3 bullet points), format (one sentence each), content focus (key findings), and style (present tense). These constraints dramatically reduce the space of possible outputs, making the model far more likely to produce exactly what you need. This principle applies universally: the more precisely you define success, the more reliably the model achieves it.
2.2 Zero-Shot Classification Example
import openai client = openai.OpenAI() def classify_sentiment(text: str) -> str: response = client.chat.completions.create( model="gpt-4o-mini", messages=[ {"role": "system", "content": """Classify the sentiment of the given text. Respond with exactly one word: positive, negative, or neutral. Do not include any explanation or punctuation."""}, {"role": "user", "content": text} ], temperature=0.0, max_tokens=5 ) return response.choices[0].message.content.strip().lower() # Test examples texts = [ "This product exceeded all my expectations!", "The delivery was late and the item arrived damaged.", "The package arrived on Tuesday as scheduled." ] for t in texts: print(f"{t[:50]:50s} => {classify_sentiment(t)}")
Setting temperature=0.0 for classification tasks ensures deterministic outputs. Since we want a single correct label (not creative variation), zero temperature eliminates randomness in sampling. Combined with max_tokens=5, this constrains the model to output only the label.
3. Few-Shot Prompting
Few-shot prompting provides examples of the desired input-output mapping before the actual query. This technique was popularized by the GPT-3 paper (Brown et al., 2020), which showed that providing as few as two or three examples can dramatically improve performance on tasks the model has not been explicitly fine-tuned for. Few-shot examples serve multiple purposes: they demonstrate the expected output format, clarify ambiguous instructions, establish the decision boundary for classification tasks, and prime the model's internal representations toward the target behavior.
3.1 Designing Effective Few-Shot Examples
Not all examples are equally useful. Research and practice have identified several principles for selecting few-shot examples:
- Cover the label space: Include at least one example per category. If you have three sentiment labels, show at least one positive, one negative, and one neutral example.
- Include edge cases: Show examples near the decision boundary. A mildly negative review is more informative than an obviously negative one.
- Match the distribution: If 80% of your real inputs are neutral, your examples should reflect that proportion (or slightly oversample rare classes).
- Keep formatting consistent: Every example should follow exactly the same structure. Inconsistency in formatting confuses the model.
- Order matters: Recent research shows that example order affects performance. Placing the most relevant example last (closest to the query) often helps.
3.2 Few-Shot Entity Extraction
import openai, json client = openai.OpenAI() def extract_entities(text: str) -> dict: response = client.chat.completions.create( model="gpt-4o-mini", messages=[ {"role": "system", "content": """Extract named entities from the text. Return a JSON object with keys: persons, organizations, locations. Each value is a list of strings. If no entities are found for a category, return an empty list."""}, # Few-shot example 1 {"role": "user", "content": "Tim Cook announced the new iPhone at Apple Park in Cupertino."}, {"role": "assistant", "content": '{"persons": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Apple Park", "Cupertino"]}'}, # Few-shot example 2 (edge case: no persons) {"role": "user", "content": "The European Central Bank raised interest rates in Frankfurt."}, {"role": "assistant", "content": '{"persons": [], "organizations": ["European Central Bank"], "locations": ["Frankfurt"]}'}, # Actual query {"role": "user", "content": text} ], temperature=0.0 ) return json.loads(response.choices[0].message.content) result = extract_entities( "Satya Nadella confirmed that Microsoft will open a new office in Tokyo." ) print(json.dumps(result, indent=2))
Notice how the second few-shot example deliberately shows an edge case where no persons are present. This teaches the model to return an empty list rather than hallucinating a name. Without this example, models sometimes invent a person associated with the European Central Bank.
4. System Prompts and Role Assignment
The system prompt is the most powerful lever in prompt design. It sets the model's persona, establishes behavioral constraints, defines the output format, and provides persistent context that applies to every turn of the conversation. A well-crafted system prompt transforms a general-purpose model into a specialized tool.
4.1 Role Prompting
Assigning a role to the model activates domain-specific knowledge and adjusts the style, vocabulary, and reasoning patterns of the response. When you tell a model to act as a "senior data scientist," it tends to provide more technically rigorous answers with appropriate caveats. When you assign the role of "elementary school teacher," it simplifies language and uses analogies. This works because the model has seen text written by people in these roles during pre-training, and the role assignment shifts the probability distribution toward that style of text.
Role prompting is not magic. The model does not actually become an expert. Instead, it shifts its output distribution toward the kind of text that a person in that role would write. This means role prompting works best for roles that are well-represented in the training data (e.g., "Python developer," "medical doctor") and less well for niche or fictional roles. Always validate the output against ground truth rather than trusting the persona.
4.2 System Prompt Architecture
Production system prompts follow a layered structure. Each layer serves a distinct purpose, and the order matters because models pay more attention to instructions that appear early in the prompt and those that appear at the very end.
SYSTEM_PROMPT = """
## Role
You are a medical coding assistant specializing in ICD-10 classification.
You have 15 years of experience in health information management.
## Task
Given a clinical note, extract the primary diagnosis and assign the
correct ICD-10 code. If the note is ambiguous, list the top 3 most
likely codes with confidence scores.
## Constraints
- Only use ICD-10-CM codes (not ICD-10-PCS procedure codes)
- Never fabricate codes; if unsure, say "requires manual review"
- Do not provide medical advice or treatment recommendations
- Flag any note that mentions patient safety concerns
## Output Format
Return a JSON object:
{
"primary_diagnosis": "description",
"icd10_code": "X00.0",
"confidence": 0.95,
"alternatives": [{"code": "X00.1", "confidence": 0.80}],
"flags": ["safety_concern"] or []
}
## Examples
[Include 2-3 few-shot examples here]
"""
5. Prompt Templates and Variable Injection
In production applications, prompts are rarely static strings. They are templates with placeholders that get filled at runtime with user input, retrieved context, configuration parameters, and dynamic metadata. A prompt template separates the static instruction logic from the dynamic data, making prompts reusable, testable, and version-controllable.
5.1 Building a Prompt Template System
from string import Template from dataclasses import dataclass from typing import Optional @dataclass class PromptTemplate: """A reusable prompt template with variable injection.""" name: str version: str system_template: str user_template: str def render(self, **kwargs) -> list[dict]: """Render the template with provided variables.""" return [ {"role": "system", "content": self.system_template.format(**kwargs)}, {"role": "user", "content": self.user_template.format(**kwargs)} ] # Define a reusable classification template classifier = PromptTemplate( name="intent_classifier", version="1.2.0", system_template="""You are a customer support intent classifier. Classify the customer message into one of these categories: {categories} Respond with only the category name, nothing else.""", user_template="{message}" ) # Render at runtime messages = classifier.render( categories="billing, technical_support, account, general_inquiry", message="I can't log into my account and I need to reset my password" ) print(messages[0]["content"]) print(messages[1]["content"])
When injecting user-provided variables into prompt templates, be aware of prompt injection risks. A malicious user could provide input like "Ignore all previous instructions and..." as their message. Section 10.4 covers defense strategies in detail. As a basic precaution, always validate and sanitize user inputs before template injection, and consider wrapping user content in delimiters like XML tags (<user_input>...</user_input>) to help the model distinguish instructions from data.
6. Handling Edge Cases
Even well-designed prompts encounter edge cases. Understanding the common failure modes and building defenses into your prompts is essential for production reliability.
6.1 Hallucinations
Models sometimes generate plausible-sounding but factually incorrect information. To mitigate this, explicitly instruct the model to acknowledge uncertainty:
- Add instructions like: "If you are not certain, say 'I don't have enough information to answer this.'"
- Request citations or sources, then verify them independently.
- Use constrained output formats (like selection from a fixed list) to prevent open-ended fabrication.
- Lower the temperature to reduce creative (and potentially inaccurate) responses.
6.2 Refusals
Models sometimes refuse to answer legitimate queries because they misidentify them as harmful. When building prompts for applications where false refusals are costly, include explicit permission statements in the system prompt. For example: "You are a medical education tool. You may discuss medical conditions, symptoms, and treatments in an educational context." This helps calibrate the model's safety filters without disabling them entirely.
6.3 Verbosity Control
Without explicit length constraints, models tend to over-explain. Several techniques help control output length:
- Word/sentence limits: "Answer in at most 2 sentences."
- Format constraints: "Respond with only the JSON object, no explanation."
- Negative instructions: "Do not include any caveats, disclaimers, or preamble."
- max_tokens parameter: Set a hard cap at the API level as a safety net.
7. Iterative Prompt Refinement: A Practical Workflow
Prompt engineering is inherently iterative. The most effective workflow follows a build-measure-improve cycle. Start with a simple prompt, evaluate it against a test set, identify failure patterns, refine the prompt to address those failures, and re-evaluate. Each iteration should change only one aspect of the prompt so you can attribute improvements (or regressions) to specific modifications.
| Iteration | Change | Accuracy | Observation |
|---|---|---|---|
| v1 | Basic zero-shot: "Classify sentiment" | 72% | Many neutral texts misclassified as positive |
| v2 | Added output constraint: "positive, negative, or neutral" | 81% | Fewer hallucinated labels, still struggles with sarcasm |
| v3 | Added 3 few-shot examples (including sarcastic review) | 89% | Sarcasm handling improved; some edge cases with mixed sentiment |
| v4 | Added instruction: "Focus on the overall tone, not individual phrases" | 93% | Mixed-sentiment cases resolved; close to human baseline |
This iterative table is not hypothetical. The pattern of 15-20 percentage point improvements from basic zero-shot to well-engineered prompts is consistently observed in practice. The lesson is clear: prompt quality is not binary. It exists on a spectrum, and systematic refinement delivers measurable gains at each step.
📝 Section Quiz
1. What is the primary advantage of few-shot prompting over zero-shot prompting?
Show Answer
2. Why is setting temperature=0.0 recommended for classification tasks?
Show Answer
3. In the system prompt architecture, why does the order of layers matter?
Show Answer
4. A zero-shot classifier achieves 72% accuracy. After adding few-shot examples and output constraints, it reaches 93%. What should you try next?
Show Answer
5. What is the risk of injecting user-provided text directly into a prompt template without sanitization?
Show Answer
Key Takeaways
- Specificity is the foundation of prompt quality. Vague instructions produce vague outputs. Constrain the format, length, style, and content focus explicitly.
- Few-shot examples are the most reliable way to improve accuracy. Include examples that cover the label space, demonstrate edge cases, and maintain consistent formatting.
- System prompts should follow a layered architecture: role, task, constraints, output format, then examples. This structure is reusable across applications.
- Prompt templates separate logic from data. Use parameterized templates for production applications to enable version control, testing, and reuse.
- Prompt engineering is iterative. Start simple, measure against a test set, identify failure patterns, refine one element at a time, and re-measure. Each cycle yields measurable improvements.
- Build defenses into your prompts from the start. Handle hallucinations with uncertainty acknowledgment, control verbosity with explicit constraints, and protect against injection with input sanitization.