Module 27 · Section 27.5

LLM Compute Planning & Infrastructure

Compute budgeting, cloud strategy, GPU selection (A100, H100, L40S), self-hosted vs. API breakeven, inference infrastructure, and multi-cloud architecture
★ Big Picture

Compute is the single largest variable cost in LLM operations, and poor planning can result in either wasted capacity or service outages. Organizations running LLMs at scale must make strategic decisions about GPU selection, cloud provider allocation, and the breakeven point between API-based inference and self-hosted models. This section provides the quantitative frameworks for making these decisions with data rather than intuition.

1. GPU Selection for LLM Workloads

GPU selection depends on the workload type (training vs. inference), model size, and budget. The three most common GPU tiers for LLM work in 2024/2025 are the NVIDIA A100, H100, and L40S. Each has distinct cost-performance characteristics.

GPU          VRAM            FP16 TFLOPS   Memory BW    Cloud Cost/hr   Best For
A100 80GB    80 GB HBM2e     312           2.0 TB/s     $2.00-$3.50     Training; large-model inference
H100 80GB    80 GB HBM3      990           3.35 TB/s    $3.50-$5.50     Training at scale; high-throughput inference
L40S 48GB    48 GB GDDR6X    362           864 GB/s     $1.20-$2.00     Inference; fine-tuning small models
A10G 24GB    24 GB GDDR6X    125           600 GB/s     $0.75-$1.20     Small-model inference; embeddings
from dataclasses import dataclass

@dataclass
class GPUConfig:
    """GPU configuration for LLM workload planning."""
    name: str
    vram_gb: int
    fp16_tflops: float
    memory_bandwidth_tb: float
    cost_per_hour: float

    def can_serve_model(self, model_params_b: float, precision: str = "fp16") -> bool:
        """Check if model fits in VRAM with production headroom (rough estimate)."""
        bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
        model_gb = model_params_b * bytes_per_param[precision]
        kv_cache_gb = model_gb * 0.25  # KV cache at moderate batch sizes
        runtime_gb = 2.0               # CUDA context + framework overhead
        usable_vram = self.vram_gb * 0.90  # keep a 10% safety margin
        return (model_gb + kv_cache_gb + runtime_gb) <= usable_vram

    def estimated_tokens_per_second(self, model_params_b: float) -> float:
        """Rough estimate of inference throughput (single request)."""
        # Bottleneck is memory bandwidth for autoregressive generation
        bytes_per_token = model_params_b * 2  # fp16: 2 bytes/param read per token
        return (self.memory_bandwidth_tb * 1000) / bytes_per_token

gpus = [
    GPUConfig("A100-80GB", 80, 312, 2.0, 2.75),
    GPUConfig("H100-80GB", 80, 990, 3.35, 4.50),
    GPUConfig("L40S-48GB", 48, 362, 0.864, 1.60),
    GPUConfig("A10G-24GB", 24, 125, 0.600, 0.95),
]

# Check which GPUs can serve Llama 3.1 8B in different precisions
model_size = 8.0  # 8 billion parameters
print(f"Model: Llama 3.1 8B ({model_size}B params)\n")
for gpu in gpus:
    fits_fp16 = gpu.can_serve_model(model_size, "fp16")
    fits_int8 = gpu.can_serve_model(model_size, "int8")
    tps = gpu.estimated_tokens_per_second(model_size)
    print(f"{gpu.name:12s}  FP16: {'Yes' if fits_fp16 else 'No ':3s}  "
          f"INT8: {'Yes' if fits_int8 else 'No ':3s}  "
          f"~{tps:.0f} tok/s  ${gpu.cost_per_hour:.2f}/hr")
Model: Llama 3.1 8B (8.0B params)

A100-80GB     FP16: Yes  INT8: Yes  ~125 tok/s  $2.75/hr
H100-80GB     FP16: Yes  INT8: Yes  ~209 tok/s  $4.50/hr
L40S-48GB     FP16: Yes  INT8: Yes  ~54 tok/s  $1.60/hr
A10G-24GB     FP16: No   INT8: Yes  ~38 tok/s  $0.95/hr

2. Self-Hosted vs. API Breakeven Analysis

The choice between API-based inference and self-hosted models depends on request volume. At low volumes, API pricing is more economical because you pay only for what you use. At high volumes, self-hosted inference becomes cheaper because the fixed GPU cost is amortized across many requests.

from dataclasses import dataclass

@dataclass
class BreakevenAnalysis:
    """Calculate breakeven between API and self-hosted inference."""
    # API costs
    api_input_per_million: float     # $/1M input tokens
    api_output_per_million: float    # $/1M output tokens
    avg_input_tokens: int
    avg_output_tokens: int

    # Self-hosted costs
    gpu_cost_per_hour: float
    gpu_count: int
    throughput_requests_per_hour: float  # per GPU, with batching
    ops_overhead_monthly: float         # monitoring, on-call, etc.

    def api_cost_per_request(self) -> float:
        input_cost = self.avg_input_tokens / 1_000_000 * self.api_input_per_million
        output_cost = self.avg_output_tokens / 1_000_000 * self.api_output_per_million
        return input_cost + output_cost

    def self_hosted_cost_per_request(self, monthly_requests: int) -> float:
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count  # 730 hrs/month
        total_monthly = gpu_monthly + self.ops_overhead_monthly
        return total_monthly / monthly_requests if monthly_requests > 0 else float("inf")

    def breakeven_monthly_requests(self) -> int:
        """Find the request volume where self-hosted cost equals API cost."""
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count
        total_fixed = gpu_monthly + self.ops_overhead_monthly
        cost_per_api = self.api_cost_per_request()
        if cost_per_api <= 0:
            raise ValueError("API cost per request must be positive")
        return round(total_fixed / cost_per_api)

# Scenario: Llama 3.1 8B self-hosted vs. GPT-4o-mini API
analysis = BreakevenAnalysis(
    api_input_per_million=0.15,       # GPT-4o-mini input
    api_output_per_million=0.60,      # GPT-4o-mini output
    avg_input_tokens=800,
    avg_output_tokens=300,
    gpu_cost_per_hour=1.60,           # L40S
    gpu_count=1,
    throughput_requests_per_hour=1800, # with vLLM batching
    ops_overhead_monthly=500,
)

breakeven = analysis.breakeven_monthly_requests()
print(f"API cost per request:       ${analysis.api_cost_per_request():.5f}")
print(f"Self-hosted (at 500K/mo):   ${analysis.self_hosted_cost_per_request(500_000):.5f}")
print(f"Breakeven at:               {breakeven:,} requests/month")
print(f"                            = ~{breakeven/30:,.0f} requests/day")
API cost per request:       $0.00030
Self-hosted (at 500K/mo):   $0.00334
Breakeven at:               5,560,000 requests/month
                            = ~185,333 requests/day
⚡ Key Insight

With GPT-4o-mini pricing at $0.15/$0.60 per million tokens, the API is extremely cost-competitive: self-hosting a single L40S only becomes cheaper at roughly 5.6 million requests per month. For most organizations, API-based inference remains more economical until request volumes are very high. The calculus changes dramatically with larger models such as GPT-4o ($2.50/$10.00), where the breakeven drops to under 500K requests per month.
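The GPT-4o figure can be checked with the same formula. A quick sketch, reusing the scenario's assumed numbers (800 input / 300 output tokens per request, one L40S plus $500/month ops overhead):

```python
# Re-run the breakeven with GPT-4o pricing ($2.50 input / $10.00 output
# per 1M tokens), keeping the same request profile and fixed costs as above.
fixed_monthly = 1.60 * 730 * 1 + 500                      # L40S + ops = $1,668/mo
cost_per_request = 800 / 1e6 * 2.50 + 300 / 1e6 * 10.00   # = $0.005 per request
breakeven = fixed_monthly / cost_per_request
print(f"GPT-4o breakeven: {breakeven:,.0f} requests/month")  # 333,600
```

At roughly 334K requests/month, a team already serving moderate traffic clears the GPT-4o breakeven easily, which is why self-hosting tends to pay off only against the larger frontier models.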

[Figure: monthly cost vs. request volume (0-7M) for API and self-hosted inference; the API curve grows linearly while self-hosted cost is fixed, so the API wins below the crossover and self-hosting wins above it]
Figure 27.9: API vs. self-hosted cost curves showing the breakeven at approximately 5.6M monthly requests

3. Compute Budgeting

A compute budget for LLM operations must account for four workload categories, each with different capacity patterns: training (bursty, high-GPU), fine-tuning (periodic, medium-GPU), inference (steady, variable GPU), and experimentation (low-priority, opportunistic).

from dataclasses import dataclass
from typing import List

@dataclass
class ComputeWorkload:
    name: str
    gpu_type: str
    gpu_count: int
    hours_per_month: float
    cost_per_gpu_hour: float

    def monthly_cost(self) -> float:
        return self.gpu_count * self.hours_per_month * self.cost_per_gpu_hour

def compute_budget(workloads: List[ComputeWorkload]) -> dict:
    """Generate a monthly compute budget summary."""
    total = sum(w.monthly_cost() for w in workloads)
    breakdown = {}
    for w in workloads:
        cost = w.monthly_cost()
        breakdown[w.name] = {
            "monthly_cost": round(cost),
            "pct_of_total": round(cost / total * 100, 1),
            "gpu_spec": f"{w.gpu_count}x {w.gpu_type}",
        }
    return {"total_monthly": round(total), "workloads": breakdown}

workloads = [
    ComputeWorkload("Inference (prod)",  "L40S",  2,  730,  1.60),
    ComputeWorkload("Fine-tuning",       "A100",  4,  40,   2.75),
    ComputeWorkload("Embeddings",        "A10G",  1,  730,  0.95),
    ComputeWorkload("Experimentation",   "A100",  2,  80,   2.75),
]

budget = compute_budget(workloads)
print(f"Total monthly compute: ${budget['total_monthly']:,}\n")
for name, info in budget["workloads"].items():
    print(f"  {name:20s}  {info['gpu_spec']:12s}  "
          f"${info['monthly_cost']:>6,}/mo  ({info['pct_of_total']:>5.1f}%)")
Total monthly compute: $3,910

  Inference (prod)      2x L40S       $ 2,336/mo  ( 59.8%)
  Fine-tuning           4x A100       $   440/mo  ( 11.3%)
  Embeddings            1x A10G       $   694/mo  ( 17.7%)
  Experimentation       2x A100       $   440/mo  ( 11.3%)

4. Multi-Cloud Inference Architecture

Production LLM applications should not depend on a single cloud provider or a single model provider. Multi-cloud and multi-model architectures provide resilience against outages, rate limits, and pricing changes.

[Figure: an inference router fronting three providers. Primary: OpenAI GPT-4o / GPT-4o-mini (quality-critical, weight 60%). Secondary: Anthropic Claude 3.5 Sonnet / Haiku (failover + long context, weight 25%). Self-hosted: Llama 3.1 8B on vLLM (cost-sensitive, weight 15%). Routing logic: cost-based, latency-based, quality-based, failover, rate-limit aware.]
Figure 27.10: Multi-provider inference architecture with weighted routing
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InferenceProvider:
    name: str
    weight: float               # routing weight (0-1)
    cost_per_request: float     # illustrative $/request for cost-based routing
    is_healthy: bool = True
    current_rps: float = 0      # current requests per second
    max_rps: float = 100        # rate limit

class InferenceRouter:
    """Route requests across multiple inference providers."""

    def __init__(self, providers: List[InferenceProvider]):
        self.providers = providers

    def select_provider(self, priority: str = "balanced") -> Optional[InferenceProvider]:
        """Select provider based on routing strategy."""
        available = [p for p in self.providers
                     if p.is_healthy and p.current_rps < p.max_rps]

        if not available:
            return None  # all providers down or rate-limited

        if priority == "cost":
            # Prefer the cheapest available provider
            return min(available, key=lambda p: p.cost_per_request)
        elif priority == "quality":
            # Highest-priority available provider (list order = quality order)
            return available[0]
        else:
            # Weighted random selection across available providers
            weights = [p.weight for p in available]
            return random.choices(available, weights=weights, k=1)[0]

router = InferenceRouter([
    InferenceProvider("OpenAI",      weight=0.60, cost_per_request=0.00030, max_rps=500),
    InferenceProvider("Anthropic",   weight=0.25, cost_per_request=0.00058, max_rps=200),
    InferenceProvider("Self-hosted", weight=0.15, cost_per_request=0.00089, max_rps=50),
])

# Simulate 1000 routing decisions
counts = {}
for _ in range(1000):
    p = router.select_provider("balanced")
    counts[p.name] = counts.get(p.name, 0) + 1

for name, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {name:15s}  {count:>4d} requests  ({count/10:.1f}%)")
  OpenAI            601 requests  (60.1%)
  Anthropic         249 requests  (24.9%)
  Self-hosted       150 requests  (15.0%)
⚠ Warning

Multi-provider architectures introduce complexity in prompt management (each provider may handle the same prompt differently), response format consistency, and error handling. Ensure your routing layer includes response format normalization and provider-specific prompt adaptation. Without these, switching between providers will produce inconsistent user experiences.
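As an illustration of response-format normalization, the routing layer can map each provider's raw payload into one internal shape before it reaches application code. A minimal sketch: the field paths follow the OpenAI Chat Completions and Anthropic Messages JSON response formats, while the `NormalizedResponse` type and `normalize` helper are our own illustrative names.

```python
from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    """Provider-agnostic response shape used by application code."""
    text: str
    model: str
    input_tokens: int
    output_tokens: int

def normalize(provider: str, raw: dict) -> NormalizedResponse:
    """Map a provider's raw response dict into the common shape."""
    if provider == "openai":
        # OpenAI chat completions: choices[0].message.content, usage.prompt_tokens/...
        return NormalizedResponse(
            text=raw["choices"][0]["message"]["content"],
            model=raw["model"],
            input_tokens=raw["usage"]["prompt_tokens"],
            output_tokens=raw["usage"]["completion_tokens"],
        )
    if provider == "anthropic":
        # Anthropic messages: content[0].text, usage.input_tokens/output_tokens
        return NormalizedResponse(
            text=raw["content"][0]["text"],
            model=raw["model"],
            input_tokens=raw["usage"]["input_tokens"],
            output_tokens=raw["usage"]["output_tokens"],
        )
    raise ValueError(f"unknown provider: {provider}")
```

With a layer like this in place, downstream code never branches on which provider served the request; adding a third backend means adding one more mapping, not touching every call site.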

📝 Note

Spot instances and preemptible GPUs can reduce self-hosted inference costs by 60 to 70% but are not suitable for latency-sensitive production workloads. Use them for batch processing (embedding generation, offline evaluation) and experimentation, while reserving on-demand or reserved instances for real-time inference.
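Applied to the earlier budget, moving just the batch-tolerant lines (embeddings and experimentation) to spot capacity illustrates the size of the win. The 65% discount below is an assumed figure within the 60-70% range quoted above, not a quoted price:

```python
# Hypothetical 65% spot discount on the batch-tolerant workloads from the
# budget above; real-time production inference stays on-demand.
spot_discount = 0.65
batch_workloads = {
    "Embeddings":      1 * 730 * 0.95,  # $693.50/mo on-demand (1x A10G, 24/7)
    "Experimentation": 2 * 80 * 2.75,   # $440.00/mo on-demand (2x A100, 80 hrs)
}
total_saved = 0.0
for name, on_demand in batch_workloads.items():
    spot = on_demand * (1 - spot_discount)
    total_saved += on_demand - spot
    print(f"{name:15s}  ${on_demand:,.0f}/mo -> ${spot:,.0f}/mo")
print(f"Monthly savings: ~${total_saved:,.0f}")
```

Under these assumptions the two lines together drop by roughly $737/month, a meaningful cut to a ~$3,900 budget, in exchange for tolerating occasional preemption and retries.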

✔ Knowledge Check

1. Why is the A10G listed as unable to serve Llama 3.1 8B in FP16?

Show Answer
The A10G has only 24 GB of VRAM. Llama 3.1 8B in FP16 needs about 16 GB for the weights alone (8B params x 2 bytes), and once several more gigabytes of KV cache and runtime overhead are added, the total approaches the card's full capacity. The weights technically fit, but the remaining margin is too thin for reliable production serving, so the sizing check rejects it. In INT8 quantization (about 8 GB for weights plus overhead), it fits comfortably.

2. At what monthly request volume does self-hosted inference become cheaper than GPT-4o-mini API calls?

Show Answer
The breakeven point is approximately 5.56 million requests per month (about 185,000 per day). Below this volume, GPT-4o-mini's per-token pricing is more economical because you pay nothing when idle. Above this volume, the fixed cost of a self-hosted L40S GPU is amortized across enough requests to beat the per-token API pricing.

3. Which workload category consumes the largest share of the example compute budget, and why?

Show Answer
Production inference consumes 59.8% of the budget because it runs 24/7 (730 hours per month) on 2 GPUs. Even though it uses cheaper L40S GPUs ($1.60/hr) compared to A100s ($2.75/hr), the always-on nature of production inference makes it the dominant cost. Fine-tuning and experimentation use more expensive GPUs but run for far fewer hours (40 and 80 hours per month, respectively).

4. What are the five routing strategies mentioned for multi-provider inference?

Show Answer
The five routing strategies are: cost-based (prefer cheapest available provider), latency-based (prefer fastest provider), quality-based (prefer highest-quality model), failover (switch to backup when primary is down), and rate-limit aware (avoid providers approaching their rate limits). The "balanced" strategy uses weighted random selection to distribute load according to configured weights.

5. Why should spot instances not be used for real-time LLM inference?

Show Answer
Spot instances can be preempted (taken away) by the cloud provider with little notice when demand increases. For real-time inference serving user-facing requests, a preemption would cause request failures and degraded user experience. Spot instances are appropriate for batch workloads (embedding generation, offline evaluation, experimentation) where interruption can be tolerated and retried without affecting users.

🎯 Key Takeaways