Reproducibility in LLM experiments is harder than in traditional ML, and also more important. Traditional ML experiments depend on data, code, and hyperparameters. LLM experiments add several new dimensions: prompt templates, provider API versions, retrieval configurations, tool definitions, and external service behaviors. When you cannot reproduce an experiment, you cannot trust its results, compare it fairly against alternatives, or debug regressions. This section covers the tools and practices that make LLM experiments reproducible, from prompt versioning through containerized execution environments.
1. Why LLM Reproducibility Is Hard
LLM experiments face reproducibility challenges that do not exist in traditional machine learning. Even with identical code, data, and configuration, you may get different results because of factors outside your control: provider-side model updates, non-deterministic GPU computation, changing API behaviors, and evolving safety filters.
The LLM Reproducibility Stack
To fully reproduce an LLM experiment, you must version and capture every layer of the stack:
- Prompt layer: System prompts, user prompt templates, few-shot examples, tool descriptions
- Model layer: Provider, model name, version (date suffix), temperature, seed, max tokens
- Data layer: Evaluation dataset, knowledge base contents, embedding model version
- Code layer: Application code, library versions, configuration files
- Infrastructure layer: API endpoint, region, hardware (for local models), container image
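The layers above can be collapsed into a single run manifest that travels with each experiment's results. The sketch below is illustrative (the helper names `fingerprint` and `build_run_manifest` are ours, not from any library); it hashes each layer's description so two runs can be compared field by field:

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Short, stable digest of any JSON-serializable layer description."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def build_run_manifest(prompt_template: str, model_cfg: dict,
                       data_version: str, git_sha: str, image_tag: str) -> dict:
    """One record per run, covering every layer of the stack."""
    return {
        "prompt_fingerprint": fingerprint(prompt_template),  # prompt layer
        "model": model_cfg,                                  # model layer
        "data_version": data_version,                        # data layer
        "code_git_sha": git_sha,                             # code layer
        "container_image": image_tag,                        # infra layer
    }


manifest = build_run_manifest(
    "Answer using only the provided context:\n{context}\n\nQ: {question}",
    {"provider": "openai", "model_name": "gpt-4o-2024-08-06",
     "temperature": 0.0, "seed": 42},
    data_version="eval_dataset_v3",
    git_sha="abc1234",
    image_tag="llm-experiment:v1.2.0",
)
```

Storing a manifest like this next to every results file means a diff between two manifests immediately shows which layer changed between runs.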
2. Configuration Management with Hydra
Hydra is a configuration framework that enables composable, hierarchical configuration with overrides from the command line. It is particularly useful for LLM experiments because it can manage the many interacting parameters (model settings, prompt templates, retrieval parameters, evaluation settings) in a structured, versioned way.
```yaml
# config/experiment.yaml
defaults:
  - model: gpt4o
  - prompt: rag_v2
  - retrieval: dense_rerank
  - eval: standard

experiment:
  name: rag_ablation_cot
  seed: 42
  num_eval_seeds: 5
```

```yaml
# config/model/gpt4o.yaml
provider: openai
model_name: gpt-4o-2024-08-06
temperature: 0.0
max_tokens: 1024
seed: 42
```

```yaml
# config/retrieval/dense_rerank.yaml
embedding_model: text-embedding-3-small
top_k: 10
reranker: cohere-rerank-v3
rerank_top_n: 3
chunk_size: 512
chunk_overlap: 50
```
```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="config", config_name="experiment")
def run_experiment(cfg: DictConfig):
    """Run an LLM experiment with full configuration tracking."""
    # Hydra automatically saves the full resolved config
    print(OmegaConf.to_yaml(cfg))

    # Access nested config values
    model_name = cfg.model.model_name
    temperature = cfg.model.temperature
    top_k = cfg.retrieval.top_k

    # Run evaluation with tracked configuration
    results = evaluate_pipeline(cfg)
    save_results(results, cfg)


def evaluate_pipeline(cfg: DictConfig) -> dict:
    """Run the evaluation pipeline with the given config."""
    # Pipeline implementation using config values...
    return {"accuracy": 0.847, "faithfulness": 0.91}


def save_results(results: dict, cfg: DictConfig):
    """Save results alongside the full config for reproducibility."""
    import json

    output = {
        "config": OmegaConf.to_container(cfg, resolve=True),
        "results": results,
    }
    with open("experiment_results.json", "w") as f:
        json.dump(output, f, indent=2)


if __name__ == "__main__":
    run_experiment()

# Run with overrides from the command line:
# python experiment.py model=gpt4o_mini retrieval.top_k=20
```
Hydra automatically creates a timestamped output directory for each experiment run, containing the fully resolved configuration, any output files, and logs. This makes every run self-documenting. Combined with git commit tracking, this gives you everything needed to reproduce any past experiment: the code (git SHA), the configuration (Hydra output), and the results.
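By default, Hydra nests these run directories under outputs/&lt;date&gt;/&lt;time&gt;, with the resolved configuration archived in a .hydra subdirectory. A typical layout (paths illustrative) looks like:

```
outputs/2024-08-15/14-32-07/
├── .hydra/
│   ├── config.yaml      # fully resolved config for this run
│   ├── hydra.yaml       # Hydra's own runtime settings
│   └── overrides.yaml   # command-line overrides that were applied
├── experiment.log
└── experiment_results.json
```

To reconstruct a past run, pair the archived config.yaml and overrides.yaml with the git SHA recorded for that experiment.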
3. Dataset Versioning with DVC
DVC (Data Version Control) extends git to handle large files and datasets. For LLM experiments, DVC tracks evaluation datasets, knowledge base snapshots, and embedding indexes. By storing lightweight pointer files in git and the actual data in cloud storage (S3, GCS, Azure Blob), DVC provides versioning without bloating the git repository.
```shell
# Initialize DVC in your project
$ dvc init
$ dvc remote add -d storage s3://my-bucket/llm-experiments

# Track evaluation dataset
$ dvc add data/eval_dataset_v3.jsonl
$ git add data/eval_dataset_v3.jsonl.dvc data/.gitignore
$ git commit -m "Track eval dataset v3"

# Track knowledge base snapshot
$ dvc add data/knowledge_base/
$ git add data/knowledge_base.dvc
$ git commit -m "Track knowledge base snapshot 2024-Q3"

# Push data to remote storage
$ dvc push

# Reproduce exact data state from any git commit
$ git checkout experiment-2024-08-15
$ dvc checkout  # restores data files to match that commit
```
4. Experiment Tracking with MLflow and W&B
Experiment tracking platforms record every run with its configuration, metrics, artifacts, and metadata. They provide dashboards for comparing runs, visualizing trends, and identifying the best configurations. Both MLflow (open source, self-hostable) and Weights & Biases (cloud-based, with a free tier) are widely used in the LLM community.
```python
import subprocess

import mlflow
from omegaconf import DictConfig, OmegaConf


def get_git_sha() -> str:
    """Return the current git commit SHA for run tagging."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


def track_experiment_mlflow(cfg: DictConfig, results: dict):
    """Track an LLM experiment run with MLflow."""
    mlflow.set_experiment(cfg.experiment.name)

    with mlflow.start_run():
        # Log the key configuration parameters
        mlflow.log_params({
            "model": cfg.model.model_name,
            "temperature": cfg.model.temperature,
            "top_k": cfg.retrieval.top_k,
            "reranker": cfg.retrieval.reranker,
            "seed": cfg.experiment.seed,
        })

        # Log evaluation metrics
        mlflow.log_metrics({
            "accuracy": results["accuracy"],
            "faithfulness": results["faithfulness"],
            "latency_p50_ms": results["latency_p50"],
            "cost_per_query_usd": results["cost_per_query"],
        })

        # Log prompt template as an artifact
        mlflow.log_text(cfg.prompt.template, "prompt_template.txt")

        # Log the full resolved config
        mlflow.log_text(OmegaConf.to_yaml(cfg), "full_config.yaml")

        # Tag the run for easy filtering
        mlflow.set_tags({
            "experiment_type": "ablation",
            "git_sha": get_git_sha(),
        })
```
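MLflow's log_params expects a flat mapping of parameter names to values, while Hydra configs are nested. A small helper can flatten the nested structure into dotted keys before logging; this sketch (the `flatten_config` name is ours) uses only the standard library:

```python
def flatten_config(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested config dict into dotted-key strings for param logging."""
    flat = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix
            flat.update(flatten_config(value, dotted))
        else:
            flat[dotted] = str(value)
    return flat


# A nested config becomes {"model.model_name": ..., "retrieval.top_k": ...}
cfg = {
    "model": {"model_name": "gpt-4o-2024-08-06", "temperature": 0.0},
    "retrieval": {"top_k": 10},
}
flat = flatten_config(cfg)
```

With OmegaConf, pass `OmegaConf.to_container(cfg, resolve=True)` through this helper and hand the result to `mlflow.log_params` to log every parameter at once instead of enumerating them by hand.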
Experiment Tracking Platform Comparison
| Feature | MLflow | Weights & Biases | DVC (with Studio) |
|---|---|---|---|
| Open source | Yes (fully) | No (cloud service) | Yes (core), paid UI |
| Self-hosting | Yes | Enterprise only | Yes |
| Prompt tracking | As artifacts | Tables + artifacts | Via params/files |
| Comparison UI | Good | Excellent | Good (Studio) |
| Data versioning | Artifacts (limited) | Artifacts + tables | Native (core strength) |
| Cost tracking | Custom metrics | Custom metrics | Custom metrics |
The choice between MLflow and W&B often comes down to infrastructure preferences. Choose MLflow when self-hosting and data sovereignty are requirements, when you want a fully open-source stack, or when you are already using the MLflow ecosystem. Choose W&B when you want the best visualization and collaboration UI, when cloud hosting is acceptable, or when you value features like report generation and team dashboards. Both integrate well with Hydra and DVC.
5. Containerized Reproducibility with Docker
Docker containers provide the ultimate reproducibility guarantee for the code and infrastructure layers. A Dockerfile that pins every dependency version, combined with versioned data (DVC) and configuration (Hydra), enables exact reproduction of any experiment on any machine.
```dockerfile
# Dockerfile for reproducible LLM experiments
FROM python:3.11-slim

# Pin system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install pinned Python dependencies first (cache-friendly)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Record build metadata for traceability
ARG GIT_SHA=unknown
ARG BUILD_DATE=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_DATE=${BUILD_DATE}

# Default command: run experiment with Hydra
ENTRYPOINT ["python", "experiment.py"]
```
```shell
# Build with metadata
$ docker build \
    --build-arg GIT_SHA=$(git rev-parse HEAD) \
    --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
    -t llm-experiment:v1.2.0 .

# Run experiment with config overrides
$ docker run \
    -e OPENAI_API_KEY=$OPENAI_API_KEY \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/outputs:/app/outputs \
    llm-experiment:v1.2.0 \
    model=gpt4o retrieval.top_k=5 experiment.seed=42
```
When using API-based models (OpenAI, Anthropic, Google), you can never achieve full bit-level reproducibility because the provider controls the inference hardware and may change it between requests. The best you can do is pin the model version, set temperature to 0, provide a seed, and log the system_fingerprint. For experiments requiring guaranteed reproducibility, use locally hosted open-weight models where you control the entire stack.
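Logging the fingerprint makes provider-side drift detectable after the fact. The sketch below is illustrative (`capture_call_metadata` and `detect_drift` are our own helpers, not part of any provider SDK): it records the fields that should stay constant across runs and reports any that changed.

```python
import hashlib


def capture_call_metadata(model: str, seed: int, system_fingerprint: str,
                          output_text: str) -> dict:
    """Record the fields needed to detect provider-side drift between runs."""
    return {
        "model": model,
        "seed": seed,
        "system_fingerprint": system_fingerprint,
        # Hash of the raw output, so drift in the response itself also shows up
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }


def detect_drift(old: dict, new: dict) -> list:
    """Return the fields that differ between two call records. A changed
    system_fingerprint suggests the provider changed its backend."""
    return [k for k in old if old[k] != new[k]]


# Two runs with identical settings but different provider fingerprints
a = capture_call_metadata("gpt-4o-2024-08-06", 42, "fp_abc123", "Paris")
b = capture_call_metadata("gpt-4o-2024-08-06", 42, "fp_def456", "Paris")
print(detect_drift(a, b))  # → ['system_fingerprint']
```

Storing one such record per evaluation run lets you distinguish "our change regressed the metric" from "the provider changed something under us" when results shift.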
Key Takeaways
- Version every layer of the stack. Prompts, model configuration, data, code, and infrastructure all need explicit versioning. Missing any one layer makes the experiment only partially reproducible.
- Use Hydra for configuration management. Hydra provides composable, hierarchical configuration with automatic archiving of every experiment's complete parameter set. This eliminates the most common source of "what settings did I use?" confusion.
- Version large data with DVC. Evaluation datasets, knowledge bases, and embedding indexes should be tracked with DVC to maintain full data lineage alongside code changes.
- Track experiments with MLflow or W&B. Record every run's configuration, metrics, and artifacts in a central platform. This enables comparison across runs and makes it easy to identify the best configuration.
- Containerize for environment reproducibility. Docker containers pin all dependencies and eliminate "works on my machine" problems. Combined with Hydra configs and DVC data, containers complete the reproducibility chain.
- Accept partial reproducibility with API models. When using hosted APIs, pin versions, set seeds, and log system fingerprints. For full control, use locally hosted open-weight models.