Module 27 · Section 27.5

LLM Compute Planning & Infrastructure

Compute budgeting, cloud strategy, GPU selection (A100, H100, L40S), self-hosted vs. API breakeven, inference infrastructure, and multi-cloud architecture
★ Big Picture

Compute is the single largest variable cost in LLM operations, and poor planning can result in either wasted capacity or service outages. Organizations running LLMs at scale must make strategic decisions about GPU selection, cloud provider allocation, and the breakeven point between API-based inference and self-hosted models. This section provides the quantitative frameworks for making these decisions with data rather than intuition.

1. GPU Selection for LLM Workloads

GPU selection depends on the workload type (training vs. inference), model size, and budget. The three most common GPU tiers for LLM work in 2024/2025 are the NVIDIA A100, H100, and L40S. Each has distinct cost-performance characteristics.

GPU          VRAM            FP16 TFLOPS   Memory BW    Cloud Cost/hr   Best For
A100 80GB    80 GB HBM2e     312           2.0 TB/s     $2.00-$3.50     Training; large-model inference
H100 80GB    80 GB HBM3      990           3.35 TB/s    $3.50-$5.50     Training at scale; high-throughput inference
L40S 48GB    48 GB GDDR6X    362           864 GB/s     $1.20-$2.00     Inference; fine-tuning small models
A10G 24GB    24 GB GDDR6X    125           600 GB/s     $0.75-$1.20     Small-model inference; embeddings
from dataclasses import dataclass

@dataclass
class GPUConfig:
    """GPU configuration for LLM workload planning."""
    name: str
    vram_gb: int
    fp16_tflops: float
    memory_bandwidth_tb: float
    cost_per_hour: float

    def can_serve_model(self, model_params_b: float, precision: str = "fp16") -> bool:
        """Check if model fits in VRAM with production headroom (rough estimate)."""
        bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
        model_gb = model_params_b * bytes_per_param[precision]
        kv_cache_gb = model_gb * 0.25  # KV cache at moderate batch sizes
        runtime_gb = 2.0               # CUDA context + framework overhead
        usable_vram = self.vram_gb * 0.90  # keep a 10% safety margin
        return (model_gb + kv_cache_gb + runtime_gb) <= usable_vram

    def estimated_tokens_per_second(self, model_params_b: float) -> float:
        """Rough estimate of inference throughput (single request)."""
        # Bottleneck is memory bandwidth for autoregressive generation
        bytes_per_token = model_params_b * 2  # fp16: 2 bytes/param read per token
        return (self.memory_bandwidth_tb * 1000) / bytes_per_token

gpus = [
    GPUConfig("A100-80GB", 80, 312, 2.0, 2.75),
    GPUConfig("H100-80GB", 80, 990, 3.35, 4.50),
    GPUConfig("L40S-48GB", 48, 362, 0.864, 1.60),
    GPUConfig("A10G-24GB", 24, 125, 0.600, 0.95),
]

# Check which GPUs can serve Llama 3.1 8B in different precisions
model_size = 8.0  # 8 billion parameters
print(f"Model: Llama 3.1 8B ({model_size}B params)\n")
for gpu in gpus:
    fits_fp16 = gpu.can_serve_model(model_size, "fp16")
    fits_int8 = gpu.can_serve_model(model_size, "int8")
    tps = gpu.estimated_tokens_per_second(model_size)
    print(f"{gpu.name:12s}  FP16: {'Yes' if fits_fp16 else 'No ':3s}  "
          f"INT8: {'Yes' if fits_int8 else 'No ':3s}  "
          f"~{tps:.0f} tok/s  ${gpu.cost_per_hour:.2f}/hr")
Model: Llama 3.1 8B (8.0B params)

A100-80GB     FP16: Yes  INT8: Yes  ~125 tok/s  $2.75/hr
H100-80GB     FP16: Yes  INT8: Yes  ~209 tok/s  $4.50/hr
L40S-48GB     FP16: Yes  INT8: Yes  ~54 tok/s  $1.60/hr
A10G-24GB     FP16: No   INT8: Yes  ~38 tok/s  $0.95/hr

2. Self-Hosted vs. API Breakeven Analysis

The choice between API-based inference and self-hosted models depends on request volume. At low volumes, API pricing is more economical because you pay only for what you use. At high volumes, self-hosted inference becomes cheaper because the fixed GPU cost is amortized across many requests.

from dataclasses import dataclass

@dataclass
class BreakevenAnalysis:
    """Calculate breakeven between API and self-hosted inference."""
    # API costs
    api_input_per_million: float     # $/1M input tokens
    api_output_per_million: float    # $/1M output tokens
    avg_input_tokens: int
    avg_output_tokens: int

    # Self-hosted costs
    gpu_cost_per_hour: float
    gpu_count: int
    throughput_requests_per_hour: float  # per GPU, with batching
    ops_overhead_monthly: float         # monitoring, on-call, etc.

    def api_cost_per_request(self) -> float:
        input_cost = self.avg_input_tokens / 1_000_000 * self.api_input_per_million
        output_cost = self.avg_output_tokens / 1_000_000 * self.api_output_per_million
        return input_cost + output_cost

    def self_hosted_cost_per_request(self, monthly_requests: int) -> float:
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count  # 730 hrs/month
        total_monthly = gpu_monthly + self.ops_overhead_monthly
        return total_monthly / monthly_requests if monthly_requests > 0 else float("inf")

    def breakeven_monthly_requests(self) -> int:
        """Find the request volume where self-hosted cost equals API cost."""
        gpu_monthly = self.gpu_cost_per_hour * 730 * self.gpu_count
        total_fixed = gpu_monthly + self.ops_overhead_monthly
        cost_per_api = self.api_cost_per_request()
        if cost_per_api <= 0:
            raise ValueError("API cost per request must be positive")
        return round(total_fixed / cost_per_api)

# Scenario: Llama 3.1 8B self-hosted vs. GPT-4o-mini API
analysis = BreakevenAnalysis(
    api_input_per_million=0.15,       # GPT-4o-mini input
    api_output_per_million=0.60,      # GPT-4o-mini output
    avg_input_tokens=800,
    avg_output_tokens=300,
    gpu_cost_per_hour=1.60,           # L40S
    gpu_count=1,
    throughput_requests_per_hour=1800, # with vLLM batching
    ops_overhead_monthly=500,
)

breakeven = analysis.breakeven_monthly_requests()
print(f"API cost per request:       ${analysis.api_cost_per_request():.5f}")
print(f"Self-hosted (at 500K/mo):   ${analysis.self_hosted_cost_per_request(500_000):.5f}")
print(f"Breakeven at:               {breakeven:,} requests/month")
print(f"                            = ~{breakeven/30:,.0f} requests/day")
API cost per request:       $0.00030
Self-hosted (at 500K/mo):   $0.00334
Breakeven at:               5,560,000 requests/month
                            = ~185,333 requests/day
⚡ Key Insight

With GPT-4o-mini pricing at $0.15/$0.60 per million tokens, the API is extremely cost-competitive: self-hosting a single L40S only becomes cheaper at roughly 5.6 million requests per month. For most organizations, API-based inference remains more economical until request volumes are very high. The calculus changes dramatically with larger models such as GPT-4o ($2.50/$10.00), where the breakeven drops to under 500K requests per month.
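The GPT-4o figure can be checked with the same formula. A quick sketch, reusing the scenario's assumed numbers (800 input / 300 output tokens per request, one L40S plus $500/month ops overhead):

```python
# Re-run the breakeven with GPT-4o pricing ($2.50 input / $10.00 output
# per 1M tokens), keeping the same request profile and fixed costs as above.
fixed_monthly = 1.60 * 730 * 1 + 500                      # L40S + ops = $1,668/mo
cost_per_request = 800 / 1e6 * 2.50 + 300 / 1e6 * 10.00   # = $0.005 per request
breakeven = fixed_monthly / cost_per_request
print(f"GPT-4o breakeven: {breakeven:,.0f} requests/month")  # 333,600
```

At roughly 334K requests/month, a team already serving moderate traffic clears the GPT-4o breakeven easily, which is why self-hosting tends to pay off only against the larger frontier models.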

[Figure: monthly cost vs. request volume (0-7M) for API and self-hosted inference; the API curve grows linearly while self-hosted cost is fixed, so the API wins below the crossover and self-hosting wins above it]
Figure 27.9: API vs. self-hosted cost curves showing the breakeven at approximately 5.6M monthly requests

3. Compute Budgeting

A compute budget for LLM operations must account for four workload categories, each with different capacity patterns: training (bursty, high-GPU), fine-tuning (periodic, medium-GPU), inference (steady, variable GPU), and experimentation (low-priority, opportunistic).

from dataclasses import dataclass
from typing import List

@dataclass
class ComputeWorkload:
    name: str
    gpu_type: str
    gpu_count: int
    hours_per_month: float
    cost_per_gpu_hour: float

    def monthly_cost(self) -> float:
        return self.gpu_count * self.hours_per_month * self.cost_per_gpu_hour

def compute_budget(workloads: List[ComputeWorkload]) -> dict:
    """Generate a monthly compute budget summary."""
    total = sum(w.monthly_cost() for w in workloads)
    breakdown = {}
    for w in workloads:
        cost = w.monthly_cost()
        breakdown[w.name] = {
            "monthly_cost": round(cost),
            "pct_of_total": round(cost / total * 100, 1),
            "gpu_spec": f"{w.gpu_count}x {w.gpu_type}",
        }
    return {"total_monthly": round(total), "workloads": breakdown}

workloads = [
    ComputeWorkload("Inference (prod)",  "L40S",  2,  730,  1.60),
    ComputeWorkload("Fine-tuning",       "A100",  4,  40,   2.75),
    ComputeWorkload("Embeddings",        "A10G",  1,  730,  0.95),
    ComputeWorkload("Experimentation",   "A100",  2,  80,   2.75),
]

budget = compute_budget(workloads)
print(f"Total monthly compute: ${budget['total_monthly']:,}\n")
for name, info in budget["workloads"].items():
    print(f"  {name:20s}  {info['gpu_spec']:12s}  "
          f"${info['monthly_cost']:>6,}/mo  ({info['pct_of_total']:>5.1f}%)")
Total monthly compute: $3,910

  Inference (prod)      2x L40S       $ 2,336/mo  ( 59.8%)
  Fine-tuning           4x A100       $   440/mo  ( 11.3%)
  Embeddings            1x A10G       $   694/mo  ( 17.7%)
  Experimentation       2x A100       $   440/mo  ( 11.3%)

4. Multi-Cloud Inference Architecture

Production LLM applications should not depend on a single cloud provider or a single model provider. Multi-cloud and multi-model architectures provide resilience against outages, rate limits, and pricing changes.

[Figure: an inference router fronting three providers. Primary: OpenAI GPT-4o / GPT-4o-mini (quality-critical, weight 60%). Secondary: Anthropic Claude 3.5 Sonnet / Haiku (failover + long context, weight 25%). Self-hosted: Llama 3.1 8B on vLLM (cost-sensitive, weight 15%). Routing logic: cost-based, latency-based, quality-based, failover, rate-limit aware.]
Figure 27.10: Multi-provider inference architecture with weighted routing
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class InferenceProvider:
    name: str
    weight: float               # routing weight (0-1)
    cost_per_request: float     # illustrative $/request for cost-based routing
    is_healthy: bool = True
    current_rps: float = 0      # current requests per second
    max_rps: float = 100        # rate limit

class InferenceRouter:
    """Route requests across multiple inference providers."""

    def __init__(self, providers: List[InferenceProvider]):
        self.providers = providers

    def select_provider(self, priority: str = "balanced") -> Optional[InferenceProvider]:
        """Select provider based on routing strategy."""
        available = [p for p in self.providers
                     if p.is_healthy and p.current_rps < p.max_rps]

        if not available:
            return None  # all providers down or rate-limited

        if priority == "cost":
            # Prefer the cheapest available provider
            return min(available, key=lambda p: p.cost_per_request)
        elif priority == "quality":
            # Highest-priority available provider (list order = quality order)
            return available[0]
        else:
            # Weighted random selection across available providers
            weights = [p.weight for p in available]
            return random.choices(available, weights=weights, k=1)[0]

router = InferenceRouter([
    InferenceProvider("OpenAI",      weight=0.60, cost_per_request=0.00030, max_rps=500),
    InferenceProvider("Anthropic",   weight=0.25, cost_per_request=0.00058, max_rps=200),
    InferenceProvider("Self-hosted", weight=0.15, cost_per_request=0.00089, max_rps=50),
])

# Simulate 1000 routing decisions
counts = {}
for _ in range(1000):
    p = router.select_provider("balanced")
    counts[p.name] = counts.get(p.name, 0) + 1

for name, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {name:15s}  {count:>4d} requests  ({count/10:.1f}%)")
  OpenAI            601 requests  (60.1%)
  Anthropic         249 requests  (24.9%)
  Self-hosted       150 requests  (15.0%)
⚠ Warning

Multi-provider architectures introduce complexity in prompt management (each provider may handle the same prompt differently), response format consistency, and error handling. Ensure your routing layer includes response format normalization and provider-specific prompt adaptation. Without these, switching between providers will produce inconsistent user experiences.
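As an illustration of response-format normalization, the routing layer can map each provider's raw payload into one internal shape before it reaches application code. A minimal sketch: the field paths follow the OpenAI Chat Completions and Anthropic Messages JSON response formats, while the `NormalizedResponse` type and `normalize` helper are our own illustrative names.

```python
from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    """Provider-agnostic response shape used by application code."""
    text: str
    model: str
    input_tokens: int
    output_tokens: int

def normalize(provider: str, raw: dict) -> NormalizedResponse:
    """Map a provider's raw response dict into the common shape."""
    if provider == "openai":
        # OpenAI chat completions: choices[0].message.content, usage.prompt_tokens/...
        return NormalizedResponse(
            text=raw["choices"][0]["message"]["content"],
            model=raw["model"],
            input_tokens=raw["usage"]["prompt_tokens"],
            output_tokens=raw["usage"]["completion_tokens"],
        )
    if provider == "anthropic":
        # Anthropic messages: content[0].text, usage.input_tokens/output_tokens
        return NormalizedResponse(
            text=raw["content"][0]["text"],
            model=raw["model"],
            input_tokens=raw["usage"]["input_tokens"],
            output_tokens=raw["usage"]["output_tokens"],
        )
    raise ValueError(f"unknown provider: {provider}")
```

With a layer like this in place, downstream code never branches on which provider served the request; adding a third backend means adding one more mapping, not touching every call site.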

📝 Note

Spot instances and preemptible GPUs can reduce self-hosted inference costs by 60 to 70% but are not suitable for latency-sensitive production workloads. Use them for batch processing (embedding generation, offline evaluation) and experimentation, while reserving on-demand or reserved instances for real-time inference.
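Applied to the earlier budget, moving just the batch-tolerant lines (embeddings and experimentation) to spot capacity illustrates the size of the win. The 65% discount below is an assumed figure within the 60-70% range quoted above, not a quoted price:

```python
# Hypothetical 65% spot discount on the batch-tolerant workloads from the
# budget above; real-time production inference stays on-demand.
spot_discount = 0.65
batch_workloads = {
    "Embeddings":      1 * 730 * 0.95,  # $693.50/mo on-demand (1x A10G, 24/7)
    "Experimentation": 2 * 80 * 2.75,   # $440.00/mo on-demand (2x A100, 80 hrs)
}
total_saved = 0.0
for name, on_demand in batch_workloads.items():
    spot = on_demand * (1 - spot_discount)
    total_saved += on_demand - spot
    print(f"{name:15s}  ${on_demand:,.0f}/mo -> ${spot:,.0f}/mo")
print(f"Monthly savings: ~${total_saved:,.0f}")
```

Under these assumptions the two lines together drop by roughly $737/month, a meaningful cut to a ~$3,900 budget, in exchange for tolerating occasional preemption and retries.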

✔ Knowledge Check

1. Why is the A10G listed as unable to serve Llama 3.1 8B in FP16?

Show Answer
The A10G has only 24 GB of VRAM. Llama 3.1 8B in FP16 needs about 16 GB for the weights alone (8B params x 2 bytes), and once several more gigabytes of KV cache and runtime overhead are added, the total approaches the card's full capacity. The weights technically fit, but the remaining margin is too thin for reliable production serving, so the sizing check rejects it. In INT8 quantization (about 8 GB for weights plus overhead), it fits comfortably.

2. At what monthly request volume does self-hosted inference become cheaper than GPT-4o-mini API calls?

Show Answer
The breakeven point is approximately 5.56 million requests per month (about 185,000 per day). Below this volume, GPT-4o-mini's per-token pricing is more economical because you pay nothing when idle. Above this volume, the fixed cost of a self-hosted L40S GPU is amortized across enough requests to beat the per-token API pricing.

3. Which workload category consumes the largest share of the example compute budget, and why?

Show Answer
Production inference consumes 59.8% of the budget because it runs 24/7 (730 hours per month) on 2 GPUs. Even though it uses cheaper L40S GPUs ($1.60/hr) compared to A100s ($2.75/hr), the always-on nature of production inference makes it the dominant cost. Fine-tuning and experimentation use more expensive GPUs but run for far fewer hours (40 and 80 hours per month, respectively).

4. What are the five routing strategies mentioned for multi-provider inference?

Show Answer
The five routing strategies are: cost-based (prefer cheapest available provider), latency-based (prefer fastest provider), quality-based (prefer highest-quality model), failover (switch to backup when primary is down), and rate-limit aware (avoid providers approaching their rate limits). The "balanced" strategy uses weighted random selection to distribute load according to configured weights.

5. Why should spot instances not be used for real-time LLM inference?

Show Answer
Spot instances can be preempted (taken away) by the cloud provider with little notice when demand increases. For real-time inference serving user-facing requests, a preemption would cause request failures and degraded user experience. Spot instances are appropriate for batch workloads (embedding generation, offline evaluation, experimentation) where interruption can be tolerated and retried without affecting users.

🎯 Key Takeaways