Reproducibility in LLM experiments is harder than in traditional ML, and also more important. Traditional ML experiments depend on data, code, and hyperparameters. LLM experiments add several new dimensions: prompt templates, provider API versions, retrieval configurations, tool definitions, and external service behaviors. When you cannot reproduce an experiment, you cannot trust its results, compare it fairly against alternatives, or debug regressions. This section covers the tools and practices that make LLM experiments reproducible, from prompt versioning through containerized execution environments.
1. Why LLM Reproducibility Is Hard
LLM experiments face reproducibility challenges that do not exist in traditional machine learning. Even with identical code, data, and configuration, you may get different results because of factors outside your control: provider-side model updates, non-deterministic GPU computation, changing API behaviors, and evolving safety filters.
The LLM Reproducibility Stack
To fully reproduce an LLM experiment, you must version and capture every layer of the stack:
- Prompt layer: System prompts, user prompt templates, few-shot examples, tool descriptions
- Model layer: Provider, model name, version (date suffix), temperature, seed, max tokens
- Data layer: Evaluation dataset, knowledge base contents, embedding model version
- Code layer: Application code, library versions, configuration files
- Infrastructure layer: API endpoint, region, hardware (for local models), container image
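The layers above can be collapsed into a single run manifest that travels with each experiment's results. The sketch below is illustrative (the helper names `fingerprint` and `build_run_manifest` are ours, not from any library); it hashes each layer's description so two runs can be compared field by field:

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Short, stable digest of any JSON-serializable layer description."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def build_run_manifest(prompt_template: str, model_cfg: dict,
                       data_version: str, git_sha: str, image_tag: str) -> dict:
    """One record per run, covering every layer of the stack."""
    return {
        "prompt_fingerprint": fingerprint(prompt_template),  # prompt layer
        "model": model_cfg,                                  # model layer
        "data_version": data_version,                        # data layer
        "code_git_sha": git_sha,                             # code layer
        "container_image": image_tag,                        # infra layer
    }


manifest = build_run_manifest(
    "Answer using only the provided context:\n{context}\n\nQ: {question}",
    {"provider": "openai", "model_name": "gpt-4o-2024-08-06",
     "temperature": 0.0, "seed": 42},
    data_version="eval_dataset_v3",
    git_sha="abc1234",
    image_tag="llm-experiment:v1.2.0",
)
```

Storing a manifest like this next to every results file means a diff between two manifests immediately shows which layer changed between runs.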
2. Configuration Management with Hydra
Hydra is a configuration framework that enables composable, hierarchical configuration with overrides from the command line. It is particularly useful for LLM experiments because it can manage the many interacting parameters (model settings, prompt templates, retrieval parameters, evaluation settings) in a structured, versioned way.
```yaml
# config/experiment.yaml
defaults:
  - model: gpt4o
  - prompt: rag_v2
  - retrieval: dense_rerank
  - eval: standard

experiment:
  name: rag_ablation_cot
  seed: 42
  num_eval_seeds: 5
```

```yaml
# config/model/gpt4o.yaml
provider: openai
model_name: gpt-4o-2024-08-06
temperature: 0.0
max_tokens: 1024
seed: 42
```

```yaml
# config/retrieval/dense_rerank.yaml
embedding_model: text-embedding-3-small
top_k: 10
reranker: cohere-rerank-v3
rerank_top_n: 3
chunk_size: 512
chunk_overlap: 50
```
```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="experiment")
def run_experiment(cfg: DictConfig):
    """Run an LLM experiment with full configuration tracking."""
    # Hydra automatically saves the full resolved config
    print(OmegaConf.to_yaml(cfg))

    # Access nested config values
    model_name = cfg.model.model_name
    temperature = cfg.model.temperature
    top_k = cfg.retrieval.top_k

    # Run evaluation with tracked configuration
    results = evaluate_pipeline(cfg)
    save_results(results, cfg)


def evaluate_pipeline(cfg: DictConfig) -> dict:
    """Run the evaluation pipeline with the given config."""
    # Pipeline implementation using config values...
    return {"accuracy": 0.847, "faithfulness": 0.91}


def save_results(results: dict, cfg: DictConfig):
    """Save results alongside the full config for reproducibility."""
    import json

    output = {
        "config": OmegaConf.to_container(cfg, resolve=True),
        "results": results,
    }
    with open("experiment_results.json", "w") as f:
        json.dump(output, f, indent=2)


if __name__ == "__main__":
    run_experiment()

# Run with overrides from the command line:
# python experiment.py model=gpt4o_mini retrieval.top_k=20
```
Hydra automatically creates a timestamped output directory for each experiment run, containing the fully resolved configuration, any output files, and logs. This makes every run self-documenting. Combined with git commit tracking, this gives you everything needed to reproduce any past experiment: the code (git SHA), the configuration (Hydra output), and the results.
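By default, Hydra nests these run directories under outputs/&lt;date&gt;/&lt;time&gt;, with the resolved configuration archived in a .hydra subdirectory. A typical layout (paths illustrative) looks like:

```
outputs/2024-08-15/14-32-07/
├── .hydra/
│   ├── config.yaml      # fully resolved config for this run
│   ├── hydra.yaml       # Hydra's own runtime settings
│   └── overrides.yaml   # command-line overrides that were applied
├── experiment.log
└── experiment_results.json
```

To reconstruct a past run, pair the archived config.yaml and overrides.yaml with the git SHA recorded for that experiment.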
3. Dataset Versioning with DVC
DVC (Data Version Control) extends git to handle large files and datasets. For LLM experiments, DVC tracks evaluation datasets, knowledge base snapshots, and embedding indexes. By storing lightweight pointer files in git and the actual data in cloud storage (S3, GCS, Azure Blob), DVC provides versioning without bloating the git repository.
```shell
# Initialize DVC in your project
$ dvc init
$ dvc remote add -d storage s3://my-bucket/llm-experiments

# Track evaluation dataset
$ dvc add data/eval_dataset_v3.jsonl
$ git add data/eval_dataset_v3.jsonl.dvc data/.gitignore
$ git commit -m "Track eval dataset v3"

# Track knowledge base snapshot
$ dvc add data/knowledge_base/
$ git add data/knowledge_base.dvc
$ git commit -m "Track knowledge base snapshot 2024-Q3"

# Push data to remote storage
$ dvc push

# Reproduce exact data state from any git commit
$ git checkout experiment-2024-08-15
$ dvc checkout  # restores data files to match that commit
```
4. Experiment Tracking with MLflow and W&B
Experiment tracking platforms record every run with its configuration, metrics, artifacts, and metadata. They provide dashboards for comparing runs, visualizing trends, and identifying the best configurations. Both MLflow (open source, self-hostable) and Weights & Biases (cloud-based, with a free tier) are widely used in the LLM community.
```python
import subprocess

import mlflow
from omegaconf import DictConfig, OmegaConf


def get_git_sha() -> str:
    """Return the current git commit SHA for run tagging."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def track_experiment_mlflow(cfg: DictConfig, results: dict):
    """Track an LLM experiment run with MLflow."""
    mlflow.set_experiment(cfg.experiment.name)

    with mlflow.start_run():
        # Log the key configuration parameters
        mlflow.log_params({
            "model": cfg.model.model_name,
            "temperature": cfg.model.temperature,
            "top_k": cfg.retrieval.top_k,
            "reranker": cfg.retrieval.reranker,
            "seed": cfg.experiment.seed,
        })

        # Log evaluation metrics
        mlflow.log_metrics({
            "accuracy": results["accuracy"],
            "faithfulness": results["faithfulness"],
            "latency_p50_ms": results["latency_p50"],
            "cost_per_query_usd": results["cost_per_query"],
        })

        # Log prompt template as an artifact
        mlflow.log_text(cfg.prompt.template, "prompt_template.txt")

        # Log the full resolved config
        mlflow.log_text(OmegaConf.to_yaml(cfg), "full_config.yaml")

        # Tag the run for easy filtering
        mlflow.set_tags({
            "experiment_type": "ablation",
            "git_sha": get_git_sha(),
        })
```
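MLflow's log_params expects a flat mapping of parameter names to values, while Hydra configs are nested. A small helper can flatten the nested structure into dotted keys before logging; this sketch (the `flatten_config` name is ours) uses only the standard library:

```python
def flatten_config(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested config dict into dotted-key strings for param logging."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix
            flat.update(flatten_config(value, dotted))
        else:
            flat[dotted] = str(value)
    return flat


# A nested config becomes {"model.model_name": ..., "retrieval.top_k": ...}
cfg = {
    "model": {"model_name": "gpt-4o-2024-08-06", "temperature": 0.0},
    "retrieval": {"top_k": 10},
}
flat = flatten_config(cfg)
```

With OmegaConf, pass `OmegaConf.to_container(cfg, resolve=True)` through this helper and hand the result to `mlflow.log_params` to log every parameter at once instead of enumerating them by hand.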
Experiment Tracking Platform Comparison
| Feature | MLflow | Weights & Biases | DVC (with Studio) |
|---|---|---|---|
| Open source | Yes (fully) | No (cloud service) | Yes (core), paid UI |
| Self-hosting | Yes | Enterprise only | Yes |
| Prompt tracking | As artifacts | Tables + artifacts | Via params/files |
| Comparison UI | Good | Excellent | Good (Studio) |
| Data versioning | Artifacts (limited) | Artifacts + tables | Native (core strength) |
| Cost tracking | Custom metrics | Custom metrics | Custom metrics |
The choice between MLflow and W&B often comes down to infrastructure preferences. Choose MLflow when self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already using the MLflow ecosystem. Choose W&B when you want the best visualization and collaboration UI, when cloud hosting is acceptable, or when you value features like report generation and team dashboards. Both integrate well with Hydra and DVC.
5. Containerized Reproducibility with Docker
Docker containers provide the ultimate reproducibility guarantee for the code and infrastructure layers. A Dockerfile that pins every dependency version, combined with versioned data (DVC) and configuration (Hydra), enables exact reproduction of any experiment on any machine.
```dockerfile
# Dockerfile for reproducible LLM experiments
FROM python:3.11-slim

# Pin system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install pinned Python dependencies first (cache-friendly)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Record build metadata for traceability
ARG GIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_DATE=${BUILD_DATE}

# Default command: run experiment with Hydra
ENTRYPOINT ["python", "experiment.py"]
```
```shell
# Build with metadata
$ docker build \
    --build-arg GIT_SHA=$(git rev-parse HEAD) \
    --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
    -t llm-experiment:v1.2.0 .

# Run experiment with config overrides
$ docker run \
    -e OPENAI_API_KEY=$OPENAI_API_KEY \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/outputs:/app/outputs \
    llm-experiment:v1.2.0 \
    model=gpt4o retrieval.top_k=5 experiment.seed=42
```
When using API-based models (OpenAI, Anthropic, Google), you can never achieve full bit-level reproducibility because the provider controls the inference hardware and may change it between requests. The best you can do is pin the model version, set temperature to 0, provide a seed, and log the system_fingerprint. For experiments requiring guaranteed reproducibility, use locally hosted open-weight models where you control the entire stack.
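Logging the fingerprint makes provider-side drift detectable after the fact. The sketch below is illustrative (`capture_call_metadata` and `detect_drift` are our own helpers, not part of any provider SDK): it records the fields that should stay constant across runs and reports any that changed.

```python
import hashlib


def capture_call_metadata(model: str, seed: int, system_fingerprint: str,
                          output_text: str) -> dict:
    """Record the fields needed to detect provider-side drift between runs."""
    return {
        "model": model,
        "seed": seed,
        "system_fingerprint": system_fingerprint,
        # Hash of the raw output, so drift in the response itself also shows up
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }


def detect_drift(old: dict, new: dict) -> list:
    """Return the fields that differ between two call records. A changed
    system_fingerprint suggests the provider changed its backend."""
    return [k for k in old if old[k] != new[k]]


# Two runs with identical settings but different provider fingerprints
a = capture_call_metadata("gpt-4o-2024-08-06", 42, "fp_abc123", "Paris")
b = capture_call_metadata("gpt-4o-2024-08-06", 42, "fp_def456", "Paris")
print(detect_drift(a, b))  # → ['system_fingerprint']
```

Storing one such record per evaluation run lets you distinguish "our change regressed the metric" from "the provider changed something under us" when results shift.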
Key Takeaways
- Version every layer of the stack. Prompts, model configuration, data, code, and infrastructure all need explicit versioning. Missing any one layer makes the experiment only partially reproducible.
- Use Hydra for configuration management. Hydra provides composable, hierarchical configuration with automatic archiving of every experiment's complete parameter set. This eliminates the most common source of "what settings did I use?" confusion.
- Version large data with DVC. Evaluation datasets, knowledge bases, and embedding indexes should be tracked with DVC to maintain full data lineage alongside code changes.
- Track experiments with MLflow or W&B. Record every run's configuration, metrics, and artifacts in a central platform. This enables comparison across runs and makes it easy to identify the best configuration.
- Containerize for environment reproducibility. Docker containers pin all dependencies and eliminate "works on my machine" problems. Combined with Hydra configs and DVC data, containers complete the reproducibility chain.
- Accept partial reproducibility with API models. When using hosted APIs, pin versions, set seeds, and log system fingerprints. For full control, use locally hosted open-weight models.