Module 26 · Section 26.4

LLMOps & Continuous Improvement

Prompt versioning, A/B testing, online evaluation, feedback loops, data flywheels, and model registries for production LLM systems
★ Big Picture

LLMOps extends MLOps with practices specific to language model applications. Prompts are code that must be versioned. Model behavior must be tested in production through A/B experiments with statistical rigor. User feedback must flow back into evaluation datasets, fine-tuning data, and prompt improvements to create a continuously improving system. This section covers the operational practices that separate prototype LLM apps from production-grade systems that improve over time.

1. Prompt Versioning

import json, hashlib
from datetime import datetime, timezone
from pathlib import Path

class PromptRegistry:
    """Version and manage prompts with content-addressable storage."""

    def __init__(self, store_path: str = "prompts/"):
        self.store = Path(store_path)
        self.store.mkdir(parents=True, exist_ok=True)

    def register(self, name: str, template: str, metadata: dict = None):
        # The version ID is derived from the prompt content itself.
        content_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
        version = {
            "name": name,
            "hash": content_hash,
            "template": template,
            "metadata": metadata or {},
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        path = self.store / f"{name}_{content_hash}.json"
        path.write_text(json.dumps(version, indent=2))
        return content_hash

    def get(self, name: str, version_hash: str = None):
        if version_hash:
            path = self.store / f"{name}_{version_hash}.json"
            return json.loads(path.read_text())
        # Return the most recently registered version. Sorting by filename
        # would order by hash, so sort by modification time instead.
        versions = sorted(self.store.glob(f"{name}_*.json"),
                          key=lambda p: p.stat().st_mtime)
        return json.loads(versions[-1].read_text()) if versions else None

registry = PromptRegistry()
v1 = registry.register("summarizer", "Summarize: {text}")
v2 = registry.register("summarizer", "Provide a concise summary of: {text}")
print(f"v1={v1}, v2={v2}")
The printed values are the first 12 hex characters of each template's SHA-256 digest: identical templates always produce the same hash, and any edit to the template produces a new one.

2. A/B Testing Framework

Figure 26.4.1: A/B testing pipeline for LLM prompt variants with hash-based traffic splitting and online metric collection. Incoming requests are split deterministically (here 50/50) between variant A and variant B, and metrics flow into online evaluation.
import hashlib
from dataclasses import dataclass

@dataclass
class ABExperiment:
    """Simple A/B test for prompt variants."""
    name: str
    variant_a: str
    variant_b: str
    traffic_split: float = 0.5   # fraction going to variant B

    def assign(self, user_id: str) -> str:
        """Deterministic assignment based on user ID hash."""
        h = hashlib.md5(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) / 0xFFFFFFFF
        if bucket < self.traffic_split:
            return "B"
        return "A"

    def get_prompt(self, user_id: str) -> str:
        variant = self.assign(user_id)
        return self.variant_a if variant == "A" else self.variant_b

exp = ABExperiment(
    name="summarizer_prompt",
    variant_a="Summarize the following text:\n{text}",
    variant_b="Write a 2-sentence summary:\n{text}",
)
for uid in ["user_101", "user_202", "user_303"]:
    print(f"{uid} -> variant {exp.assign(uid)}")
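Two properties make this bucketing scheme trustworthy: assignment is stable across repeated requests, and over many users the observed split converges to the configured fraction. A minimal standalone check (the `assign` helper below mirrors `ABExperiment.assign`; the user IDs are illustrative):

```python
import hashlib
from collections import Counter

def assign(experiment_name: str, user_id: str, split: float = 0.5) -> str:
    """Same hash-based bucketing as ABExperiment.assign."""
    h = hashlib.md5(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF
    return "B" if bucket < split else "A"

# Determinism: repeated calls for the same user always agree.
assert all(
    assign("summarizer_prompt", f"user_{i}") == assign("summarizer_prompt", f"user_{i}")
    for i in range(1_000)
)

# Balance: over many users the split approaches the configured fraction.
counts = Counter(assign("summarizer_prompt", f"user_{i}") for i in range(10_000))
print(counts["B"] / 10_000)  # close to 0.5
```

Including the experiment name in the hash input matters: it decorrelates assignments across experiments, so a user bucketed into B here is not systematically bucketed into B everywhere.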

3. Online Evaluation and Feedback Loops

from dataclasses import dataclass, field
from datetime import datetime, timezone
import statistics

@dataclass
class FeedbackCollector:
    """Collect and aggregate user feedback for LLM outputs."""
    records: list = field(default_factory=list)

    def log(self, request_id: str, variant: str, rating: int,
            feedback_text: str = "", latency_ms: float = 0):
        self.records.append({
            "request_id": request_id, "variant": variant,
            "rating": rating, "feedback": feedback_text,
            "latency_ms": latency_ms,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def summary(self):
        """Mean rating and sample count per variant."""
        by_variant = {}
        for r in self.records:
            v = r["variant"]
            by_variant.setdefault(v, []).append(r["rating"])
        return {
            v: {"mean": statistics.mean(ratings), "n": len(ratings)}
            for v, ratings in by_variant.items()
        }
Figure 26.4.2: The data flywheel turns production usage into training data, creating a self-improving cycle: user interactions produce feedback and logs, which are curated into evaluation data, which yields an improved model.
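The curation step of the flywheel can be as simple as filtering highly rated interactions into an evaluation set. A minimal sketch, with illustrative in-memory records standing in for a real feedback store (the `eval_candidates.jsonl` filename and the rating threshold are assumptions):

```python
import json

# Illustrative production logs; in practice these come from FeedbackCollector
# or a logging pipeline.
records = [
    {"prompt": "Summarize: ...", "output": "A short, accurate summary.", "rating": 5},
    {"prompt": "Summarize: ...", "output": "An off-topic reply.", "rating": 1},
    {"prompt": "Summarize: ...", "output": "Another good summary.", "rating": 4},
]

# Curate: keep highly rated interactions as candidate eval / fine-tuning examples.
curated = [r for r in records if r["rating"] >= 4]

with open("eval_candidates.jsonl", "w") as f:
    for r in curated:
        f.write(json.dumps({"input": r["prompt"], "expected": r["output"]}) + "\n")

print(f"kept {len(curated)} of {len(records)} records")
```

Real pipelines add deduplication, PII scrubbing, and human review before anything reaches a training corpus, but the shape of the loop is the same.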

4. Model Registry

| Registry Feature | MLflow | W&B | HuggingFace Hub |
| --- | --- | --- | --- |
| Model versioning | Yes (stages) | Yes (aliases) | Yes (revisions) |
| Prompt versioning | Via artifacts | Via artifacts | Via model card |
| A/B experiment tracking | Native | Native | Limited |
| Deployment integration | SageMaker, Azure ML | Launch | Inference Endpoints |
| Self-hosted option | Yes (open source) | Enterprise | Yes (enterprise) |
import mlflow

# Log a prompt experiment to MLflow
with mlflow.start_run(run_name="prompt_v2.1_test"):
    mlflow.log_param("prompt_version", "v2.1")
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.7)

    # Log evaluation metrics
    mlflow.log_metric("mean_rating", 4.2)
    mlflow.log_metric("hallucination_rate", 0.03)
    mlflow.log_metric("p50_latency_ms", 820)
    mlflow.log_metric("cost_per_request", 0.0023)

    # Log the prompt template as an artifact
    mlflow.log_text(
        "Write a 2-sentence summary of:\n{text}",
        "prompt_template.txt"
    )
📝 Note

Prompt versioning should capture not just the template text but also the model name, temperature, max tokens, system prompt, and any few-shot examples. A prompt that works well with GPT-4o may fail with Claude or Llama, so the model is part of the prompt's identity.
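One way to act on this is to hash the full generation config rather than the template alone, so that changing the model or sampling parameters produces a new version identity. A minimal sketch (the `config_hash` helper and its parameter set are illustrative, not a standard API):

```python
import hashlib, json

def config_hash(template: str, model: str, temperature: float,
                system_prompt: str = "", few_shot: list = None) -> str:
    """Hash the full generation config, not just the template text."""
    payload = json.dumps({
        "template": template,
        "model": model,
        "temperature": temperature,
        "system_prompt": system_prompt,
        "few_shot": few_shot or [],
    }, sort_keys=True)  # sort_keys makes the serialization canonical
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Same template, different model -> different version identity.
a = config_hash("Summarize: {text}", "gpt-4o-mini", 0.7)
b = config_hash("Summarize: {text}", "claude-sonnet", 0.7)
print(a != b)  # True
```

Canonical serialization (`sort_keys=True`) matters: without it, semantically identical configs could hash differently depending on key order.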

⚠ Warning

A/B tests on LLM outputs require larger sample sizes than traditional web experiments because LLM quality metrics (like human ratings or LLM-as-Judge scores) have high variance. Plan for at least 200 to 500 samples per variant before drawing conclusions, and always compute confidence intervals rather than relying on point estimates.
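A confidence interval on the difference in mean ratings can be computed with the standard library alone. A minimal sketch using a normal-approximation interval with a Welch-style standard error (the ratings below are illustrative, not real experiment data):

```python
import math, statistics

def mean_diff_ci(a, b, z: float = 1.96):
    """~95% normal-approximation CI for mean(b) - mean(a), Welch-style SE."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return diff - z * se, diff + z * se

# Illustrative 1-5 ratings, n=300 per variant: B looks better,
# but the interval is what justifies the conclusion.
ratings_a = [3, 4, 2, 4, 3, 3, 4, 2, 3, 4] * 30
ratings_b = [4, 4, 3, 5, 4, 3, 4, 4, 3, 5] * 30
lo, hi = mean_diff_ci(ratings_a, ratings_b)
print(f"mean diff CI: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the difference is significant at roughly the 95% level; if it straddles zero, keep collecting data rather than shipping the "winning" variant.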

★ Key Insight

The data flywheel is the most powerful long-term advantage of a production LLM system. Every user interaction generates data that can improve evaluation sets, fine-tuning corpora, and retrieval indices. Teams that invest in feedback collection infrastructure early will compound improvements over time, while teams that skip it remain stuck with static prompts and models.

Knowledge Check

1. Why should prompt versioning use content-addressable hashing rather than sequential version numbers?

Show Answer
Content-addressable hashing ensures that the version ID is derived from the prompt content itself, making it impossible to accidentally assign the same version number to different content or to have two different systems disagree on what "v3" means. It also makes deduplication trivial: identical prompts always produce the same hash.

2. Why is hash-based traffic splitting preferred over random assignment in A/B tests?

Show Answer
Hash-based splitting is deterministic: the same user always sees the same variant across sessions. Random assignment could show different variants to the same user on different requests, contaminating the experiment and making it impossible to measure the effect of a variant on user behavior over time.

3. What is a data flywheel and why is it important for LLM applications?

Show Answer
A data flywheel is a virtuous cycle where production usage generates feedback data, which is curated into evaluation and training sets, which improves the model, which generates better interactions, producing more valuable data. It is important because LLM applications that leverage this cycle compound their quality improvements over time, creating a durable competitive advantage.

4. What metadata should be stored alongside a prompt version for full reproducibility?

Show Answer
Full reproducibility requires storing the prompt template, model name and version, temperature, max tokens, top-p, system prompt, few-shot examples, stop sequences, and any post-processing logic. The model is part of the prompt's identity because the same template can produce very different results with different models.

5. Why do LLM A/B tests require larger sample sizes than traditional web experiments?

Show Answer
LLM quality metrics (human ratings, LLM-as-Judge scores, task success rates) have much higher variance than binary click/conversion metrics. The probabilistic nature of LLM outputs means even identical inputs produce different results across runs. This high variance requires more samples to achieve statistical significance and reliable effect size estimates.

Key Takeaways