LLMOps extends MLOps with practices specific to language model applications. Prompts are code that must be versioned. Model behavior must be tested in production through A/B experiments with statistical rigor. User feedback must flow back into evaluation datasets, fine-tuning data, and prompt improvements to create a continuously improving system. This section covers the operational practices that separate prototype LLM apps from production-grade systems that improve over time.
1. Prompt Versioning
```python
import json
import hashlib
from datetime import datetime
from pathlib import Path


class PromptRegistry:
    """Version and manage prompts with content-addressable storage."""

    def __init__(self, store_path: str = "prompts/"):
        self.store = Path(store_path)
        self.store.mkdir(exist_ok=True)

    def register(self, name: str, template: str, metadata: dict = None):
        content_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
        version = {
            "name": name,
            "hash": content_hash,
            "template": template,
            "metadata": metadata or {},
            "created_at": datetime.utcnow().isoformat(),
        }
        path = self.store / f"{name}_{content_hash}.json"
        path.write_text(json.dumps(version, indent=2))
        return content_hash

    def get(self, name: str, version_hash: str = None):
        if version_hash:
            path = self.store / f"{name}_{version_hash}.json"
            return json.loads(path.read_text())
        # Return the most recently written version. Sort by modification
        # time, not filename: hashes are not chronological.
        versions = sorted(
            self.store.glob(f"{name}_*.json"),
            key=lambda p: p.stat().st_mtime,
        )
        return json.loads(versions[-1].read_text()) if versions else None


registry = PromptRegistry()
v1 = registry.register("summarizer", "Summarize: {text}")
v2 = registry.register("summarizer", "Provide a concise summary of: {text}")
print(f"v1={v1}, v2={v2}")
```
2. A/B Testing Framework
```python
import hashlib
from dataclasses import dataclass


@dataclass
class ABExperiment:
    """Simple A/B test for prompt variants."""

    name: str
    variant_a: str
    variant_b: str
    traffic_split: float = 0.5  # fraction of traffic going to variant B

    def assign(self, user_id: str) -> str:
        """Deterministic assignment based on user ID hash."""
        h = hashlib.md5(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) / 0xFFFFFFFF
        return "B" if bucket < self.traffic_split else "A"

    def get_prompt(self, user_id: str) -> str:
        variant = self.assign(user_id)
        return self.variant_a if variant == "A" else self.variant_b


exp = ABExperiment(
    name="summarizer_prompt",
    variant_a="Summarize the following text:\n{text}",
    variant_b="Write a 2-sentence summary:\n{text}",
)
for uid in ["user_101", "user_202", "user_303"]:
    print(f"{uid} -> variant {exp.assign(uid)}")
```
3. Online Evaluation and Feedback Loops
```python
import statistics
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FeedbackCollector:
    """Collect and aggregate user feedback for LLM outputs."""

    records: list = field(default_factory=list)

    def log(self, request_id: str, variant: str, rating: int,
            feedback_text: str = "", latency_ms: float = 0):
        self.records.append({
            "request_id": request_id,
            "variant": variant,
            "rating": rating,
            "feedback": feedback_text,
            "latency_ms": latency_ms,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def summary(self):
        by_variant = {}
        for r in self.records:
            by_variant.setdefault(r["variant"], []).append(r["rating"])
        return {
            v: {"mean": statistics.mean(ratings), "n": len(ratings)}
            for v, ratings in by_variant.items()
        }
```
4. Model Registry
| Registry Feature | MLflow | W&B | HuggingFace Hub |
|---|---|---|---|
| Model versioning | Yes (stages) | Yes (aliases) | Yes (revisions) |
| Prompt versioning | Via artifacts | Via artifacts | Via model card |
| A/B experiment tracking | Native | Native | Limited |
| Deployment integration | SageMaker, Azure ML | Launch | Inference Endpoints |
| Self-hosted option | Yes (open source) | Enterprise | Yes (enterprise) |
```python
import mlflow

# Log a prompt experiment to MLflow
with mlflow.start_run(run_name="prompt_v2.1_test"):
    mlflow.log_param("prompt_version", "v2.1")
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)

    # Log evaluation metrics
    mlflow.log_metric("mean_rating", 4.2)
    mlflow.log_metric("hallucination_rate", 0.03)
    mlflow.log_metric("p50_latency_ms", 820)
    mlflow.log_metric("cost_per_request", 0.0023)

    # Log the prompt template as an artifact
    mlflow.log_text(
        "Write a 2-sentence summary of:\n{text}",
        "prompt_template.txt",
    )
```
Prompt versioning should capture not just the template text but also the model name, temperature, max tokens, system prompt, and any few-shot examples. A prompt that works well with GPT-4o may fail with Claude or Llama, so the model is part of the prompt's identity.
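One way to treat the model and sampling parameters as part of the prompt's identity is to hash the entire configuration rather than just the template. This is a minimal sketch; the config values shown are hypothetical examples:

```python
import hashlib
import json

# Full prompt configuration: the template alone is not enough for
# reproducibility, so the whole config is hashed together.
config = {
    "template": "Write a 2-sentence summary of:\n{text}",
    "system_prompt": "You are a concise technical writer.",
    "model": "gpt-4o-mini",
    "temperature": 0.7,
    "max_tokens": 256,
    "few_shot_examples": [],
}

# Canonical JSON (sorted keys) so the same config always hashes identically
canonical = json.dumps(config, sort_keys=True)
version_id = hashlib.sha256(canonical.encode()).hexdigest()[:12]
print(version_id)
```

With this scheme, changing the model or temperature produces a new version ID even if the template text is unchanged, which matches the idea that the model is part of the prompt's identity.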
A/B tests on LLM outputs require larger sample sizes than traditional web experiments because LLM quality metrics (like human ratings or LLM-as-Judge scores) have high variance. Plan for at least 200 to 500 samples per variant before drawing conclusions, and always compute confidence intervals rather than relying on point estimates.
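A minimal sketch of the confidence-interval comparison, using a normal approximation and hypothetical 1-to-5 ratings (the data and the `mean_ci` helper are illustrative, not part of any library):

```python
import math
import statistics


def mean_ci(samples, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(samples)
    m = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(n)
    return m - z * se, m + z * se


# Hypothetical 1-5 user ratings for two prompt variants, 300 samples each
ratings_a = [4, 3, 5, 4, 2, 4, 3, 5, 4, 3] * 30
ratings_b = [4, 4, 5, 4, 3, 5, 4, 5, 4, 4] * 30

lo_a, hi_a = mean_ci(ratings_a)
lo_b, hi_b = mean_ci(ratings_b)
print(f"A: [{lo_a:.2f}, {hi_a:.2f}], B: [{lo_b:.2f}, {hi_b:.2f}]")

# Only call a winner when the intervals are clearly separated
print("B wins" if lo_b > hi_a else "inconclusive, collect more samples")
```

Comparing intervals rather than point estimates guards against declaring a winner from noise; with small samples the intervals widen and the comparison stays inconclusive.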
The data flywheel is the most powerful long-term advantage of a production LLM system. Every user interaction generates data that can improve evaluation sets, fine-tuning corpora, and retrieval indices. Teams that invest in feedback collection infrastructure early will compound improvements over time, while teams that skip it remain stuck with static prompts and models.
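One simple way the flywheel can be wired up is to route feedback records into evaluation data by rating: high-rated outputs become golden examples (and fine-tuning candidates), low-rated ones become hard cases for the eval set. A minimal sketch with hypothetical records and thresholds:

```python
# Hypothetical production feedback records (inputs/outputs abbreviated)
records = [
    {"input": "long article", "output": "Accurate summary.", "rating": 5},
    {"input": "press release", "output": "Missed the key point.", "rating": 1},
    {"input": "blog post", "output": "Decent summary.", "rating": 4},
]

# Route by rating: >=4 becomes a golden example, <=2 becomes a hard case
golden = [r for r in records if r["rating"] >= 4]
hard_cases = [r for r in records if r["rating"] <= 2]

print(f"{len(golden)} golden examples, {len(hard_cases)} hard cases")
```

The thresholds here are arbitrary; in practice teams also sample mid-rated and unrated traffic so the eval set does not skew toward extreme responses.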
Knowledge Check
1. Why should prompt versioning use content-addressable hashing rather than sequential version numbers?
2. Why is hash-based traffic splitting preferred over random assignment in A/B tests?
3. What is a data flywheel and why is it important for LLM applications?
4. What metadata should be stored alongside a prompt version for full reproducibility?
5. Why do LLM A/B tests require larger sample sizes than traditional web experiments?
Key Takeaways
- Version prompts with content-addressable hashing and store the complete configuration (model, temperature, system prompt, few-shot examples) alongside the template.
- Use hash-based traffic splitting for deterministic A/B assignment that remains consistent across user sessions.
- Collect structured feedback (thumbs up/down, ratings, corrections) on every production response to fuel the data flywheel.
- Plan for large sample sizes (200 to 500 per variant) and compute confidence intervals for LLM A/B tests due to high output variance.
- Track experiments in a model registry (MLflow, W&B) that captures prompts, metrics, and model configurations together.
- The data flywheel is a production LLM system's most valuable long-term asset; invest in feedback infrastructure early.