The LLM ecosystem is expanding so rapidly that vendor selection has become a strategic capability in itself. New model providers, vector databases, agent frameworks, and evaluation platforms launch weekly. Choosing the wrong vendor can lock you into an inferior solution, while building everything in-house can drain engineering resources that should be spent on differentiation. This section provides structured evaluation frameworks for every major category in the LLM stack and a decision tree for the build-versus-buy question.
## 1. LLM Provider Evaluation
Evaluating LLM providers requires balancing quality, cost, latency, privacy, and reliability. The scoring rubric below covers the dimensions that matter most for production deployments. Each dimension is weighted based on its importance for the specific use case.
```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProviderEvaluation:
    """Weighted scoring rubric for LLM provider evaluation."""
    name: str
    scores: Dict[str, float]  # dimension -> score (1-5)

    # Default weights (adjust per use case)
    WEIGHTS: Dict[str, float] = field(default_factory=lambda: {
        "quality": 0.25,      # benchmark performance, instruction following
        "pricing": 0.20,      # cost per million tokens, volume discounts
        "latency": 0.15,      # TTFT, tokens/sec, P99 response time
        "privacy": 0.15,      # data retention, SOC2, HIPAA, GDPR
        "reliability": 0.10,  # uptime SLA, rate limits, error rates
        "ecosystem": 0.10,    # SDKs, integrations, documentation
        "flexibility": 0.05,  # fine-tuning, custom models, function calling
    })

    def weighted_score(self) -> float:
        return sum(self.scores.get(dim, 0) * weight
                   for dim, weight in self.WEIGHTS.items())


# Evaluate three providers for a customer support use case
providers = [
    ProviderEvaluation("OpenAI", {
        "quality": 4.8, "pricing": 3.5, "latency": 4.2, "privacy": 3.8,
        "reliability": 4.0, "ecosystem": 4.8, "flexibility": 4.5,
    }),
    ProviderEvaluation("Anthropic", {
        "quality": 4.7, "pricing": 3.8, "latency": 4.0, "privacy": 4.5,
        "reliability": 4.2, "ecosystem": 3.8, "flexibility": 3.5,
    }),
    ProviderEvaluation("Google (Gemini)", {
        "quality": 4.5, "pricing": 4.2, "latency": 4.3, "privacy": 4.0,
        "reliability": 4.5, "ecosystem": 4.2, "flexibility": 4.0,
    }),
]

ranked = sorted(providers, key=lambda p: p.weighted_score(), reverse=True)
for p in ranked:
    print(f"{p.name:20s} Weighted Score: {p.weighted_score():.2f}/5.00")
```
These scores are illustrative and will change as providers update their offerings. The important takeaway is the framework, not the specific scores. Run your own evaluation with your actual workload: send 500 representative prompts to each provider and measure quality, latency, and cost on your data. Benchmark results on public datasets do not reliably predict performance on domain-specific tasks.
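A minimal harness for that kind of head-to-head test might look like the sketch below. The `call_model`, `score_fn`, and `cost_per_call` parameters are stand-ins: in practice you would wire `call_model` to each provider's SDK and `score_fn` to your own quality judge (labeled reference answers or an LLM-as-judge).

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_provider(
    call_model: Callable[[str], str],       # wraps one provider's API call (stand-in)
    prompts: List[str],                     # representative prompts from your workload
    score_fn: Callable[[str, str], float],  # quality judge: (prompt, response) -> 0..1
    cost_per_call: float,                   # estimated cost per request for this provider
) -> Dict[str, float]:
    """Run the prompt set through one provider and aggregate quality, latency, cost."""
    latencies, scores = [], []
    for prompt in prompts:
        start = time.perf_counter()
        response = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(score_fn(prompt, response))
    return {
        "mean_quality": statistics.mean(scores),
        "mean_latency_s": statistics.mean(latencies),
        "total_cost": cost_per_call * len(prompts),
    }
```

Running the same harness against each candidate provider gives you comparable numbers on your own data, which can then feed the `scores` dictionary in the rubric above.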
## 2. Vector Database Evaluation
Vector databases are critical infrastructure for RAG systems. The choice of vector database affects query latency, recall accuracy, operational complexity, and total cost of ownership.
| Database | Type | Managed Option | Filtering | Max Vectors | Best For |
|---|---|---|---|---|---|
| Pinecone | Purpose-built | Fully managed | Metadata + hybrid | Billions | Quick start, no ops team |
| Weaviate | Purpose-built | Cloud + self-host | GraphQL + hybrid | Billions | Complex queries, multi-modal |
| Qdrant | Purpose-built | Cloud + self-host | Rich filtering | Billions | Performance-critical, Rust-based |
| pgvector | Extension | Any Postgres host | Full SQL | Millions | Existing Postgres; small to medium scale |
| Chroma | Purpose-built | Cloud + embedded | Metadata | Millions | Prototyping, embedded use |
```python
from dataclasses import dataclass


@dataclass
class VectorDBEval:
    """Evaluation scorecard for vector databases."""
    name: str
    query_latency_ms: float  # P95 for 1M vectors
    recall_at_10: float      # recall@10 on standard benchmark
    ops_complexity: int      # 1-5 (1=fully managed, 5=complex)
    monthly_cost_1m_vectors: float
    has_hybrid_search: bool

    def value_score(self) -> float:
        """Higher is better: recall and latency matter most."""
        latency_score = max(0, 5 - self.query_latency_ms / 10)
        recall_score = self.recall_at_10 * 5
        ops_score = 6 - self.ops_complexity  # invert: lower complexity = higher score
        hybrid_score = 5 if self.has_hybrid_search else 0
        return (latency_score * 0.3 + recall_score * 0.4
                + ops_score * 0.2 + hybrid_score * 0.1)


dbs = [
    VectorDBEval("Pinecone", 12, 0.95, 1, 70, True),
    VectorDBEval("Qdrant", 8, 0.96, 3, 45, True),
    VectorDBEval("pgvector", 25, 0.91, 2, 20, False),
    VectorDBEval("Weaviate", 15, 0.94, 2, 55, True),
    VectorDBEval("Chroma", 20, 0.90, 1, 15, False),
]

ranked = sorted(dbs, key=lambda d: d.value_score(), reverse=True)
for db in ranked:
    print(f"{db.name:12s} Score: {db.value_score():.2f}  "
          f"Latency: {db.query_latency_ms:>4.0f}ms  "
          f"Cost: ${db.monthly_cost_1m_vectors}/mo")
```
## 3. Agent Framework Evaluation
Agent frameworks provide the orchestration layer for multi-step LLM applications. The choice of framework affects development speed, debugging experience, production reliability, and vendor lock-in risk.
| Framework | Abstraction Level | Observability | Streaming | Production Ready | Learning Curve |
|---|---|---|---|---|---|
| LangChain | High (chains, agents) | LangSmith integration | Yes | Moderate | Medium |
| LlamaIndex | High (data connectors) | Built-in tracing | Yes | Moderate | Medium |
| Semantic Kernel | Medium (plugins) | Azure integration | Yes | High (.NET/Java) | Medium |
| OpenAI SDK (native) | Low (direct API) | Manual | Yes | High | Low |
| Custom (no framework) | None | Manual | Manual | Depends on team | Highest initial |
The trend in production LLM applications is moving toward thinner frameworks or no framework at all. Many teams that started with high-abstraction frameworks like LangChain have migrated to direct API calls with custom orchestration because they needed more control over retry logic, token management, and error handling. Start with a framework for prototyping, but plan for the possibility that production code may be simpler without one.
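The kind of control teams want from custom orchestration can be illustrated with a retry wrapper. This is a generic sketch, not any framework's API: `with_retries` accepts any zero-argument callable (in practice, a closure around a provider SDK call) and the retryable exception types are placeholders you would replace with your provider's transient error classes.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    retryable: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (0.5x-1.5x jitter) to avoid thundering herds
            time.sleep(base_delay * (2 ** (attempt - 1)) * (0.5 + random.random()))
    raise RuntimeError("unreachable")
```

Writing this yourself is roughly 20 lines; the payoff is that backoff, jitter, and which errors count as transient are all explicit and tunable, rather than buried in a framework's defaults.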
## 4. Build vs. Buy Decision Tree
The build-versus-buy decision for LLM components depends on whether the capability is a competitive differentiator, the team's ability to maintain it, and the total cost of ownership over a 12 to 36 month horizon.
```python
def tco_comparison(
    # Build costs
    build_dev_months: float,
    build_engineer_monthly: float,
    build_infra_monthly: float,
    build_maintenance_fte: float,
    # Buy costs
    buy_license_monthly: float,
    buy_integration_months: float,
    buy_integration_engineer_monthly: float,
    # Horizon
    horizon_months: int = 36,
) -> dict:
    """Compare total cost of ownership for build vs. buy over N months."""
    # Build TCO
    build_dev = build_dev_months * build_engineer_monthly * 2  # 2 engineers
    build_infra = build_infra_monthly * horizon_months
    build_maint = build_maintenance_fte * build_engineer_monthly * horizon_months
    build_total = build_dev + build_infra + build_maint

    # Buy TCO
    buy_integration = buy_integration_months * buy_integration_engineer_monthly
    buy_license = buy_license_monthly * horizon_months
    buy_total = buy_integration + buy_license

    return {
        "build_tco": round(build_total),
        "buy_tco": round(buy_total),
        "recommendation": "BUILD" if build_total < buy_total else "BUY",
        "savings": round(abs(build_total - buy_total)),
        "savings_percent": round(
            abs(build_total - buy_total) / max(build_total, buy_total) * 100, 1
        ),
    }


# Example: RAG pipeline (build custom vs. use managed platform)
result = tco_comparison(
    build_dev_months=3,
    build_engineer_monthly=15_000,
    build_infra_monthly=2_000,
    build_maintenance_fte=0.25,
    buy_license_monthly=5_000,
    buy_integration_months=1,
    buy_integration_engineer_monthly=15_000,
    horizon_months=36,
)
for k, v in result.items():
    print(f"  {k}: {v}")
```
TCO calculations often underestimate build costs by 30 to 50% because they exclude opportunity cost (what else could the engineers be building?), recruitment time for specialized talent, and the maintenance burden that grows as the system ages. When in doubt, multiply your build estimate by 1.5x and compare again. If the decision flips, the choice is closer than it appears and deserves deeper analysis.
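That sanity check can be expressed as a tiny helper. This is a sketch: it just asks whether inflating the build-side total changes which option is cheaper. The dollar figures in the usage lines are illustrative, taken from the RAG-pipeline example above (build totals roughly $297k and buy roughly $195k over 36 months).

```python
def decision_flips(build_tco: float, buy_tco: float, factor: float = 1.5) -> bool:
    """True if inflating the build estimate by `factor` changes the recommendation."""
    return (build_tco < buy_tco) != (build_tco * factor < buy_tco)


# Hypothetical cheaper build: BUILD at 1x, BUY at 1.5x -> decision is fragile
print(decision_flips(150_000, 195_000))  # True

# The RAG example above: BUY either way -> decision is robust
print(decision_flips(297_000, 195_000))  # False
```

A `True` result means the comparison is sensitive to estimation error and deserves the deeper analysis described above; a `False` result means the recommendation holds even under pessimistic build assumptions.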
## ✔ Knowledge Check
1. What are the seven dimensions in the LLM provider evaluation rubric?
2. When should you consider using pgvector instead of a purpose-built vector database?
3. Why is the trend in production LLM applications moving toward thinner or no frameworks?
4. In the build vs. buy decision tree, what is the first question and why?
5. Why should build cost estimates be multiplied by 1.5x in TCO comparisons?
## 🎯 Key Takeaways
- Use weighted scoring rubrics: Evaluate providers and tools across multiple dimensions with weights that reflect your specific use case priorities.
- Test on your data: Benchmark results and vendor claims do not predict performance on domain-specific tasks. Always run 500+ representative queries through each candidate.
- Match vector DB to scale: pgvector works for millions of vectors; purpose-built databases (Qdrant, Pinecone) are needed for billions with sub-10ms latency requirements.
- Start with frameworks, prepare to outgrow them: Use LangChain or LlamaIndex for prototyping but plan for the possibility that production code may be simpler without a framework.
- Build differentiators, buy commodities: The decision tree starts with whether the capability provides competitive advantage. Non-differentiating infrastructure should be bought.
- Add 50% to build estimates: TCO comparisons routinely underestimate build costs due to opportunity cost, talent acquisition, and maintenance growth.