Compute is the single largest variable cost in LLM operations, and poor planning can result in either wasted capacity or service outages. Organizations running LLMs at scale must make strategic decisions about GPU selection, cloud provider allocation, and the breakeven point between API-based inference and self-hosted models. This section provides the quantitative frameworks for making these decisions with data rather than intuition.
1. GPU Selection for LLM Workloads
GPU selection depends on the workload type (training vs. inference), model size, and budget. The four most common GPU tiers for LLM work in 2024/2025 are the NVIDIA A100, H100, L40S, and A10G. Each has distinct cost-performance characteristics.
| GPU | VRAM | FP16 TFLOPS | Memory BW | Cloud Cost/hr | Best For |
|---|---|---|---|---|---|
| A100 80GB | 80 GB HBM2e | 312 | 2.0 TB/s | $2.00 to $3.50 | Training; large model inference |
| H100 80GB | 80 GB HBM3 | 990 | 3.35 TB/s | $3.50 to $5.50 | Training at scale; high-throughput inference |
| L40S 48GB | 48 GB GDDR6 | 362 | 864 GB/s | $1.20 to $2.00 | Inference; fine-tuning small models |
| A10G 24GB | 24 GB GDDR6 | 125 | 600 GB/s | $0.75 to $1.20 | Small model inference; embeddings |
```python
from dataclasses import dataclass

@dataclass
class GPUConfig:
    """GPU configuration for LLM workload planning."""
    name: str
    vram_gb: int
    fp16_tflops: float
    memory_bandwidth_tb: float
    cost_per_hour: float

    def can_serve_model(self, model_params_b: float, precision: str = "fp16") -> bool:
        """Check if model fits in VRAM (rough estimate)."""
        bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
        model_gb = model_params_b * bytes_per_param[precision]
        overhead_gb = model_gb * 0.20  # KV cache + runtime overhead
        return (model_gb + overhead_gb) <= self.vram_gb

    def estimated_tokens_per_second(self, model_params_b: float) -> float:
        """Rough estimate of inference throughput (single request)."""
        # Bottleneck is memory bandwidth for autoregressive generation
        bytes_per_token = model_params_b * 2  # fp16: 2 bytes/param read per token
        return (self.memory_bandwidth_tb * 1000) / bytes_per_token


gpus = [
    GPUConfig("A100-80GB", 80, 312, 2.0, 2.75),
    GPUConfig("H100-80GB", 80, 990, 3.35, 4.50),
    GPUConfig("L40S-48GB", 48, 362, 0.864, 1.60),
    GPUConfig("A10G-24GB", 24, 125, 0.600, 0.95),
]

# Check which GPUs can serve Llama 3.1 8B in different precisions
model_size = 8.0  # 8 billion parameters
print(f"Model: Llama 3.1 8B ({model_size}B params)\n")
for gpu in gpus:
    fits_fp16 = gpu.can_serve_model(model_size, "fp16")
    fits_int8 = gpu.can_serve_model(model_size, "int8")
    tps = gpu.estimated_tokens_per_second(model_size)
    print(f"{gpu.name:12s} FP16: {'Yes' if fits_fp16 else 'No':3s} "
          f"INT8: {'Yes' if fits_int8 else 'No':3s} "
          f"~{tps:.0f} tok/s  ${gpu.cost_per_hour:.2f}/hr")
```
2. Self-Hosted vs. API Breakeven Analysis
The choice between API-based inference and self-hosted models depends on request volume. At low volumes, API pricing is more economical because you pay only for what you use. At high volumes, self-hosted inference becomes cheaper because the fixed GPU cost is amortized across many requests.
```python
from dataclasses import dataclass

@dataclass
class BreakevenAnalysis:
    """Calculate breakeven between API and self-hosted inference."""
    # API costs
    api_input_per_million: float   # $/1M input tokens
    api_output_per_million: float  # $/1M output tokens
    avg_input_tokens: int
    avg_output_tokens: int
    # Self-hosted costs
    gpu_cost_per_hour: float
    gpu_count: int
    throughput_requests_per_hour: float  # per GPU, with batching
    ops_overhead_monthly: float          # monitoring, on-call, etc.

    def api_cost_per_request(self) -> float:
        input_cost = self.avg_input_tokens / 1_000_000 * self.api_input_per_million
        output_cost = self.avg_output_tokens / 1_000_000 * self.api_output_per_million
        return input_cost + output_cost

    def self_hosted_cost_per_request(self, monthly_requests: int) -> float:
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count  # 730 hrs/month
        total_monthly = gpu_monthly + self.ops_overhead_monthly
        return total_monthly / monthly_requests if monthly_requests > 0 else float("inf")

    def breakeven_monthly_requests(self) -> float:
        """Find the request volume where self-hosted = API cost."""
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count
        total_fixed = gpu_monthly + self.ops_overhead_monthly
        cost_per_api = self.api_cost_per_request()
        if cost_per_api <= 0:
            return float("inf")  # API is free; self-hosting never breaks even
        return int(total_fixed / cost_per_api)


# Scenario: Llama 3.1 8B self-hosted vs. GPT-4o-mini API
analysis = BreakevenAnalysis(
    api_input_per_million=0.15,   # GPT-4o-mini input
    api_output_per_million=0.60,  # GPT-4o-mini output
    avg_input_tokens=800,
    avg_output_tokens=300,
    gpu_cost_per_hour=1.60,  # L40S
    gpu_count=1,
    throughput_requests_per_hour=1800,  # with vLLM batching
    ops_overhead_monthly=500,
)

breakeven = analysis.breakeven_monthly_requests()
print(f"API cost per request: ${analysis.api_cost_per_request():.5f}")
print(f"Self-hosted (at 500K/mo): ${analysis.self_hosted_cost_per_request(500_000):.5f}")
print(f"Breakeven at: {breakeven:,} requests/month")
print(f"  = ~{breakeven/30:,.0f} requests/day")
```
With GPT-4o-mini pricing at $0.15/$0.60 per million tokens, the API is extremely cost-competitive. Self-hosting a single L40S only becomes cheaper at over 5 million requests per month. For most organizations, API-based inference is more economical until request volumes are very high. The calculus changes dramatically when using larger models like GPT-4o ($2.50/$10.00) where the breakeven drops to under 500K requests per month.
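The GPT-4o comparison can be checked with the same arithmetic in standalone form. This is a sketch reusing the assumptions above (800 input / 300 output tokens per request, one L40S at $1.60/hr, $500/mo ops overhead):

```python
# Same assumptions as the breakeven example above.
avg_input_tokens = 800
avg_output_tokens = 300
fixed_monthly = 1.60 * 730 + 500  # GPU ($1,168) + ops overhead = $1,668/mo

def breakeven(input_per_million: float, output_per_million: float) -> int:
    """Monthly request volume at which self-hosted fixed cost equals API spend."""
    cost_per_request = (avg_input_tokens / 1_000_000 * input_per_million
                        + avg_output_tokens / 1_000_000 * output_per_million)
    return int(fixed_monthly / cost_per_request)

print(f"GPT-4o-mini ($0.15/$0.60):  breakeven at {breakeven(0.15, 0.60):,} req/mo")
print(f"GPT-4o      ($2.50/$10.00): breakeven at {breakeven(2.50, 10.00):,} req/mo")
```

The first figure lands near 5.6M requests/month and the second near 334K, which is where the "over 5 million" and "under 500K" claims come from.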
3. Compute Budgeting
A compute budget for LLM operations must account for four workload categories, each with different capacity patterns: training (bursty, high-GPU), fine-tuning (periodic, medium-GPU), inference (steady, variable GPU), and experimentation (low-priority, opportunistic).
```python
from dataclasses import dataclass
from typing import List

@dataclass
class ComputeWorkload:
    name: str
    gpu_type: str
    gpu_count: int
    hours_per_month: float
    cost_per_gpu_hour: float

    def monthly_cost(self) -> float:
        return self.gpu_count * self.hours_per_month * self.cost_per_gpu_hour


def compute_budget(workloads: List[ComputeWorkload]) -> dict:
    """Generate a monthly compute budget summary."""
    total = sum(w.monthly_cost() for w in workloads)
    breakdown = {}
    for w in workloads:
        cost = w.monthly_cost()
        breakdown[w.name] = {
            "monthly_cost": round(cost),
            "pct_of_total": round(cost / total * 100, 1),
            "gpu_spec": f"{w.gpu_count}x {w.gpu_type}",
        }
    return {"total_monthly": round(total), "workloads": breakdown}


workloads = [
    ComputeWorkload("Inference (prod)", "L40S", 2, 730, 1.60),
    ComputeWorkload("Fine-tuning", "A100", 4, 40, 2.75),
    ComputeWorkload("Embeddings", "A10G", 1, 730, 0.95),
    ComputeWorkload("Experimentation", "A100", 2, 80, 2.75),
]

budget = compute_budget(workloads)
print(f"Total monthly compute: ${budget['total_monthly']:,}\n")
for name, info in budget["workloads"].items():
    print(f"  {name:20s} {info['gpu_spec']:12s} "
          f"${info['monthly_cost']:>6,}/mo ({info['pct_of_total']:>5.1f}%)")
```
4. Multi-Cloud Inference Architecture
Production LLM applications should not depend on a single cloud provider or a single model provider. Multi-cloud and multi-model architectures provide resilience against outages, rate limits, and pricing changes.
```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InferenceProvider:
    name: str
    weight: float             # routing weight (0-1)
    cost_per_request: float   # approximate $/request, used by the "cost" strategy
    is_healthy: bool = True
    current_rps: float = 0    # current requests per second
    max_rps: float = 100      # rate limit


class InferenceRouter:
    """Route requests across multiple inference providers."""

    def __init__(self, providers: List[InferenceProvider]):
        self.providers = providers

    def select_provider(self, priority: str = "balanced") -> Optional[InferenceProvider]:
        """Select provider based on routing strategy."""
        available = [p for p in self.providers
                     if p.is_healthy and p.current_rps < p.max_rps]
        if not available:
            return None  # all providers down or rate-limited
        if priority == "cost":
            # Prefer the cheapest available provider
            return min(available, key=lambda p: p.cost_per_request)
        elif priority == "quality":
            # Always use the primary (first-listed) provider
            return available[0]
        else:
            # Weighted random selection
            weights = [p.weight for p in available]
            return random.choices(available, weights=weights, k=1)[0]


router = InferenceRouter([
    # cost_per_request figures: GPT-4o-mini from the breakeven example ($0.0003);
    # self-hosted = $1.60/hr / 1800 req/hr at full utilization; Anthropic assumed
    InferenceProvider("OpenAI", weight=0.60, cost_per_request=0.0003, max_rps=500),
    InferenceProvider("Anthropic", weight=0.25, cost_per_request=0.0006, max_rps=200),
    InferenceProvider("Self-hosted", weight=0.15, cost_per_request=0.0009, max_rps=50),
])

# Simulate 1000 routing decisions
counts = {}
for _ in range(1000):
    p = router.select_provider("balanced")
    counts[p.name] = counts.get(p.name, 0) + 1

for name, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {name:15s} {count:>4d} requests ({count/10:.1f}%)")
```
Multi-provider architectures introduce complexity in prompt management (each provider may handle the same prompt differently), response format consistency, and error handling. Ensure your routing layer includes response format normalization and provider-specific prompt adaptation. Without these, switching between providers will produce inconsistent user experiences.
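A minimal sketch of the response-normalization layer, using simplified stand-in response shapes for each provider (loosely modeled on the OpenAI and Anthropic response formats, not the exact SDK payloads):

```python
from typing import Any

def normalize_response(provider: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Map provider-specific response shapes onto one internal format.
    The raw shapes below are simplified stand-ins for illustration."""
    if provider == "openai":
        choice = raw["choices"][0]
        return {"text": choice["message"]["content"],
                "stop_reason": choice["finish_reason"],
                "provider": provider}
    if provider == "anthropic":
        return {"text": raw["content"][0]["text"],
                "stop_reason": raw["stop_reason"],
                "provider": provider}
    if provider == "self-hosted":
        return {"text": raw["generated_text"],
                "stop_reason": raw.get("finish_reason", "stop"),
                "provider": provider}
    raise ValueError(f"unknown provider: {provider}")

# Downstream code sees one shape regardless of which provider answered
print(normalize_response("openai", {
    "choices": [{"message": {"content": "hi"}, "finish_reason": "stop"}]}))
```

The same pattern applies in reverse on the way in: a per-provider prompt adapter that maps one internal prompt representation to each provider's expected request format.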
Spot instances and preemptible GPUs can reduce self-hosted inference costs by 60 to 70% but are not suitable for latency-sensitive production workloads. Use them for batch processing (embedding generation, offline evaluation) and experimentation, while reserving on-demand or reserved instances for real-time inference.
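As a rough illustration of the spot savings, assuming a 65% spot discount and 10% of GPU-hours re-run after preemptions (both assumed figures for the sketch, not provider quotes):

```python
def batch_job_cost(gpu_hours: float, on_demand_rate: float,
                   spot_discount: float = 0.65, retry_overhead: float = 0.10) -> dict:
    """Compare on-demand vs. spot cost for an interruptible batch job.
    spot_discount: assumed discount off the on-demand rate.
    retry_overhead: assumed fraction of GPU-hours re-run after preemptions."""
    on_demand = gpu_hours * on_demand_rate
    spot = gpu_hours * (1 + retry_overhead) * on_demand_rate * (1 - spot_discount)
    return {"on_demand": round(on_demand, 2),
            "spot": round(spot, 2),
            "savings_pct": round((1 - spot / on_demand) * 100, 1)}

# 200 GPU-hours of offline embedding generation on an L40S at $1.60/hr
print(batch_job_cost(200, 1.60))
```

Even after paying for re-run work, the net savings stay in the 60 to 70% range quoted above, which is why interruption-tolerant batch jobs are the natural home for spot capacity.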
✔ Knowledge Check
1. According to the VRAM fit estimate, can the A10G serve Llama 3.1 8B in FP16, and how much headroom does it leave for the KV cache and runtime overhead?
2. At what monthly request volume does self-hosted inference become cheaper than GPT-4o-mini API calls?
3. Which workload category consumes the largest share of the example compute budget, and why?
4. What are the three routing strategies implemented for multi-provider inference?
5. Why should spot instances not be used for real-time LLM inference?
🎯 Key Takeaways
- Match GPU to workload: H100 for training at scale, A100 for general-purpose, L40S for cost-efficient inference, A10G for small models and embeddings.
- API wins at low volume: With GPT-4o-mini pricing, self-hosting only breaks even above roughly 5M requests per month. Most organizations should start with APIs.
- Inference dominates budgets: Always-on production inference typically accounts for 50 to 70% of total compute spend. Optimize this workload first.
- Multi-provider for resilience: Route across multiple LLM providers with weighted selection, automatic failover, and rate-limit awareness.
- Use spot for batch, reserved for prod: Spot instances cut batch processing costs by 60 to 70% but must not be used for latency-sensitive production inference.
- Budget by workload category: Separate training, fine-tuning, inference, and experimentation to track spend and optimize each independently.