Module 27 · Section 27.4

LLM Vendor Evaluation & Build vs. Buy

Provider evaluation criteria, vector database selection, agent framework comparison, and build-versus-buy decision trees
★ Big Picture

The LLM ecosystem is expanding so rapidly that vendor selection has become a strategic capability in itself. New model providers, vector databases, agent frameworks, and evaluation platforms launch weekly. Choosing the wrong vendor can lock you into an inferior solution, while building everything in-house can drain engineering resources that should be spent on differentiation. This section provides structured evaluation frameworks for every major category in the LLM stack and a decision tree for the build-versus-buy question.

1. LLM Provider Evaluation

Evaluating LLM providers requires balancing quality, cost, latency, privacy, and reliability. The scoring rubric below covers the dimensions that matter most for production deployments. Each dimension is weighted based on its importance for the specific use case.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ProviderEvaluation:
    """Weighted scoring rubric for LLM provider evaluation."""
    name: str
    scores: Dict[str, float]  # dimension -> score (1-5)

    # Default weights (adjust per use case)
    WEIGHTS: Dict[str, float] = field(default_factory=lambda: {
        "quality": 0.25,       # benchmark performance, instruction following
        "pricing": 0.20,       # cost per million tokens, volume discounts
        "latency": 0.15,       # TTFT, tokens/sec, P99 response time
        "privacy": 0.15,        # data retention, SOC2, HIPAA, GDPR
        "reliability": 0.10,    # uptime SLA, rate limits, error rates
        "ecosystem": 0.10,      # SDKs, integrations, documentation
        "flexibility": 0.05,    # fine-tuning, custom models, function calling
    })

    def weighted_score(self) -> float:
        return sum(self.scores.get(dim, 0) * weight
                   for dim, weight in self.WEIGHTS.items())

# Evaluate three providers for a customer support use case
providers = [
    ProviderEvaluation("OpenAI", {
        "quality": 4.8, "pricing": 3.5, "latency": 4.2,
        "privacy": 3.8, "reliability": 4.0, "ecosystem": 4.8,
        "flexibility": 4.5,
    }),
    ProviderEvaluation("Anthropic", {
        "quality": 4.7, "pricing": 3.8, "latency": 4.0,
        "privacy": 4.5, "reliability": 4.2, "ecosystem": 3.8,
        "flexibility": 3.5,
    }),
    ProviderEvaluation("Google (Gemini)", {
        "quality": 4.5, "pricing": 4.2, "latency": 4.3,
        "privacy": 4.0, "reliability": 4.5, "ecosystem": 4.2,
        "flexibility": 4.0,
    }),
]

ranked = sorted(providers, key=lambda p: p.weighted_score(), reverse=True)
for p in ranked:
    print(f"{p.name:20s}  Weighted Score: {p.weighted_score():.2f}/5.00")
Google (Gemini)       Weighted Score: 4.24/5.00
OpenAI                Weighted Score: 4.19/5.00
Anthropic             Weighted Score: 4.17/5.00
📝 Note

These scores are illustrative and will change as providers update their offerings. The important takeaway is the framework, not the specific scores. Run your own evaluation with your actual workload: send 500 representative prompts to each provider and measure quality, latency, and cost on your data. Benchmark results on public datasets do not reliably predict performance on domain-specific tasks.
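The "run your own evaluation" advice above can be sketched as a small harness. This is a minimal sketch, not a complete tool: `call_model` is a stand-in for whatever provider SDK you use, and `score_quality` and `cost_per_call` are hypothetical hooks you supply (e.g. an LLM-as-judge scorer and the provider's published per-token pricing).

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

@dataclass
class PromptResult:
    latency_s: float
    cost_usd: float
    quality: float  # 0-1, from your own scorer

def benchmark_provider(
    prompts: List[str],
    call_model: Callable[[str], str],            # stand-in for a provider SDK call
    score_quality: Callable[[str, str], float],  # (prompt, answer) -> 0-1
    cost_per_call: Callable[[str, str], float],  # (prompt, answer) -> USD
) -> Dict[str, float]:
    """Send each prompt, then summarize quality, tail latency, and total cost."""
    results: List[PromptResult] = []
    for prompt in prompts:
        start = time.perf_counter()
        answer = call_model(prompt)
        latency = time.perf_counter() - start
        results.append(PromptResult(latency, cost_per_call(prompt, answer),
                                    score_quality(prompt, answer)))
    latencies = sorted(r.latency_s for r in results)
    return {
        "mean_quality": mean(r.quality for r in results),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "total_cost_usd": sum(r.cost_usd for r in results),
    }

# Demo with a stub model; swap in real SDK calls and your 500 representative prompts
stats = benchmark_provider(
    prompts=["What is your refund policy?", "How do I reset my password?"],
    call_model=lambda p: f"Answer to: {p}",
    score_quality=lambda p, a: 1.0 if p in a else 0.0,
    cost_per_call=lambda p, a: (len(p) + len(a)) / 4 * 3e-6,  # crude token estimate
)
print(stats)
```

Run the same harness against each candidate provider with identical prompts; the resulting quality, latency, and cost numbers feed directly into the scoring rubric above.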

2. Vector Database Evaluation

Vector databases are critical infrastructure for RAG systems. The choice of vector database affects query latency, recall accuracy, operational complexity, and total cost of ownership.
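Recall is worth measuring on your own corpus rather than trusting published benchmarks. A minimal sketch of recall@k, assuming you have ground-truth neighbors from a brute-force exact search to compare against the database's approximate results:

```python
from typing import Dict, List

def recall_at_k(
    ann_results: Dict[str, List[str]],    # query id -> ids returned by the vector DB
    exact_results: Dict[str, List[str]],  # query id -> ground truth from brute force
    k: int = 10,
) -> float:
    """Fraction of true top-k neighbors that the approximate index returned."""
    hits, total = 0, 0
    for query, truth in exact_results.items():
        approx = set(ann_results.get(query, [])[:k])
        hits += sum(1 for doc_id in truth[:k] if doc_id in approx)
        total += min(k, len(truth))
    return hits / total if total else 0.0

# Toy example: the approximate index misses one true neighbor for q2
exact = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
approx = {"q1": ["a", "b", "c"], "q2": ["d", "e", "x"]}
print(recall_at_k(approx, exact, k=3))  # 5 of 6 true neighbors found
```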

| Database | Type | Managed Option | Filtering | Max Vectors | Best For |
|---|---|---|---|---|---|
| Pinecone | Purpose-built | Fully managed | Metadata + hybrid | Billions | Quick start, no ops team |
| Weaviate | Purpose-built | Cloud + self-host | GraphQL + hybrid | Billions | Complex queries, multi-modal |
| Qdrant | Purpose-built | Cloud + self-host | Rich filtering | Billions | Performance-critical, Rust-based |
| pgvector | Extension | Any Postgres host | Full SQL | Millions | Existing Postgres; small to medium scale |
| Chroma | Purpose-built | Cloud + embedded | Metadata | Millions | Prototyping, embedded use |
from dataclasses import dataclass

@dataclass
class VectorDBEval:
    """Evaluation scorecard for vector databases."""
    name: str
    query_latency_ms: float     # P95 for 1M vectors
    recall_at_10: float         # recall@10 on standard benchmark
    ops_complexity: int         # 1-5 (1=fully managed, 5=complex)
    monthly_cost_1m_vectors: float
    has_hybrid_search: bool

    def value_score(self) -> float:
        """Higher is better: recall and latency matter most."""
        latency_score = max(0, 5 - self.query_latency_ms / 10)
        recall_score = self.recall_at_10 * 5
        ops_score = (6 - self.ops_complexity)  # invert: lower complexity = higher score
        return (latency_score * 0.3 + recall_score * 0.4
                + ops_score * 0.2 + (1 if self.has_hybrid_search else 0) * 0.1 * 5)

dbs = [
    VectorDBEval("Pinecone",   12, 0.95, 1, 70,  True),
    VectorDBEval("Qdrant",     8,  0.96, 3, 45,  True),
    VectorDBEval("pgvector",   25, 0.91, 2, 20,  False),
    VectorDBEval("Weaviate",   15, 0.94, 2, 55,  True),
    VectorDBEval("Chroma",     20, 0.90, 1, 15,  False),
]

ranked = sorted(dbs, key=lambda d: d.value_score(), reverse=True)
for db in ranked:
    print(f"{db.name:12s}  Score: {db.value_score():.2f}  "
          f"Latency: {db.query_latency_ms:>4.0f}ms  "
          f"Cost: ${db.monthly_cost_1m_vectors}/mo")
Pinecone      Score: 4.54  Latency:   12ms  Cost: $70/mo
Qdrant        Score: 4.28  Latency:    8ms  Cost: $45/mo
Weaviate      Score: 4.23  Latency:   15ms  Cost: $55/mo
Chroma        Score: 3.70  Latency:   20ms  Cost: $15/mo
pgvector      Score: 3.37  Latency:   25ms  Cost: $20/mo

3. Agent Framework Evaluation

Agent frameworks provide the orchestration layer for multi-step LLM applications. The choice of framework affects development speed, debugging experience, production reliability, and vendor lock-in risk.

| Framework | Abstraction Level | Observability | Streaming | Production Ready | Learning Curve |
|---|---|---|---|---|---|
| LangChain | High (chains, agents) | LangSmith integration | Yes | Moderate | Medium |
| LlamaIndex | High (data connectors) | Built-in tracing | Yes | Moderate | Medium |
| Semantic Kernel | Medium (plugins) | Azure integration | Yes | High (.NET/Java) | Medium |
| OpenAI SDK (native) | Low (direct API) | Manual | Yes | High | Low |
| Custom (no framework) | None | Manual | Manual | Depends on team | Highest initial |
⚡ Key Insight

The trend in production LLM applications is moving toward thinner frameworks or no framework at all. Many teams that started with high-abstraction frameworks like LangChain have migrated to direct API calls with custom orchestration because they needed more control over retry logic, token management, and error handling. Start with a framework for prototyping, but plan for the possibility that production code may be simpler without one.
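What "no framework" looks like in practice is direct API calls wrapped in explicit control logic. This sketch shows the retry-with-backoff piece; `call_llm` is a placeholder for your provider SDK call, and `LLMCallError` stands in for whatever transient errors (rate limits, timeouts) that SDK raises.

```python
import random
import time
from typing import Callable

class LLMCallError(Exception):
    """Stand-in for a provider's transient error (rate limit, timeout)."""

def call_with_retries(
    call_llm: Callable[[str], str],  # placeholder for a provider SDK call
    prompt: str,
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
) -> str:
    """Direct-API orchestration: exponential backoff with jitter, no framework."""
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt)
        except LLMCallError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter so concurrent clients don't retry in lockstep
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
    raise LLMCallError("unreachable")

# Demo: a flaky stub that fails twice, then succeeds on the third attempt
attempts = {"n": 0}
def flaky(prompt: str) -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LLMCallError("rate limited")
    return f"ok: {prompt}"

result = call_with_retries(flaky, "summarize this ticket", base_delay_s=0.01)
print(result)
```

The point is not that frameworks cannot retry; it is that twenty lines of code you own are often easier to debug and tune than a framework's retry policy buried several abstraction layers down.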

4. Build vs. Buy Decision Tree

The build-versus-buy decision for LLM components depends on whether the capability is a competitive differentiator, the team's ability to maintain it, and the total cost of ownership over a 12 to 36 month horizon.

Is this a competitive differentiator?
├─ Yes → Do you have the ML talent?
│   ├─ Yes → BUILD (custom model + infra)
│   └─ No  → BUILD + HIRE (recruit, then build)
└─ No  → Does a vendor solution exist?
    ├─ Yes → Is the vendor's TCO lower than building?
    │   ├─ Yes → BUY (use the vendor)
    │   └─ No  → BUILD LIGHT (minimal custom code)
    └─ No  → BUILD (no alternative exists)

TCO comparison (36 months):
  Build: development + infra + maintenance + opportunity cost
  Buy:   license + integration + vendor lock-in risk
Figure 27.8: Build vs. Buy decision tree for LLM stack components
def tco_comparison(
    # Build costs
    build_dev_months: float,
    build_engineer_monthly: float,
    build_infra_monthly: float,
    build_maintenance_fte: float,
    # Buy costs
    buy_license_monthly: float,
    buy_integration_months: float,
    buy_integration_engineer_monthly: float,
    # Horizon
    horizon_months: int = 36,
) -> dict:
    """Compare total cost of ownership for build vs. buy over N months."""
    # Build TCO
    build_dev = build_dev_months * build_engineer_monthly * 2  # 2 engineers
    build_infra = build_infra_monthly * horizon_months
    build_maint = build_maintenance_fte * build_engineer_monthly * horizon_months
    build_total = build_dev + build_infra + build_maint

    # Buy TCO
    buy_integration = buy_integration_months * buy_integration_engineer_monthly
    buy_license = buy_license_monthly * horizon_months
    buy_total = buy_integration + buy_license

    return {
        "build_tco": round(build_total),
        "buy_tco": round(buy_total),
        "recommendation": "BUILD" if build_total < buy_total else "BUY",
        "savings": round(abs(build_total - buy_total)),
        "savings_percent": round(
            abs(build_total - buy_total) / max(build_total, buy_total) * 100, 1
        ),
    }

# Example: RAG pipeline (build custom vs. use managed platform)
result = tco_comparison(
    build_dev_months=3,
    build_engineer_monthly=15_000,
    build_infra_monthly=2_000,
    build_maintenance_fte=0.25,
    buy_license_monthly=5_000,
    buy_integration_months=1,
    buy_integration_engineer_monthly=15_000,
    horizon_months=36,
)

for k, v in result.items():
    print(f"  {k}: {v}")
  build_tco: 297000
  buy_tco: 195000
  recommendation: BUY
  savings: 102000
  savings_percent: 34.3
⚠ Warning

TCO calculations often underestimate build costs by 30 to 50% because they exclude opportunity cost (what else could the engineers be building?), recruitment time for specialized talent, and the maintenance burden that grows as the system ages. When in doubt, multiply your build estimate by 1.5x and compare again. If the decision flips, the choice is closer than it appears and deserves deeper analysis.
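The 1.5x stress test can be folded directly into the comparison. A minimal sketch using hypothetical totals chosen so the naive comparison favors building, to show the flip the warning describes:

```python
def build_vs_buy_with_stress(build_tco: float, buy_tco: float,
                             stress_multiplier: float = 1.5) -> dict:
    """Re-run the recommendation with build costs inflated for hidden costs."""
    naive = "BUILD" if build_tco < buy_tco else "BUY"
    stressed = "BUILD" if build_tco * stress_multiplier < buy_tco else "BUY"
    return {
        "naive_recommendation": naive,
        "stressed_recommendation": stressed,
        "decision_flips": naive != stressed,  # a flip means the decision is fragile
    }

# Hypothetical totals: building looks $20k cheaper until hidden costs are priced in
check = build_vs_buy_with_stress(build_tco=180_000, buy_tco=200_000)
print(check)
```

When `decision_flips` is true, treat the comparison as too close to call and do the deeper analysis the warning recommends rather than trusting either headline number.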

✔ Knowledge Check

1. What are the seven dimensions in the LLM provider evaluation rubric?

Show Answer
The seven dimensions are: quality (benchmark performance, instruction following), pricing (cost per million tokens, volume discounts), latency (TTFT, tokens/sec, P99 response time), privacy (data retention, SOC2, HIPAA, GDPR), reliability (uptime SLA, rate limits, error rates), ecosystem (SDKs, integrations, documentation), and flexibility (fine-tuning, custom models, function calling).

2. When should you consider using pgvector instead of a purpose-built vector database?

Show Answer
pgvector is best when you already have a PostgreSQL deployment, your vector collection is in the millions (not billions), you need full SQL filtering capabilities, and you want to minimize operational complexity by keeping everything in one database. It has higher query latency than purpose-built alternatives but much lower cost and operational overhead for small to medium scale deployments.

3. Why is the trend in production LLM applications moving toward thinner or no frameworks?

Show Answer
Production teams need fine-grained control over retry logic, token management, error handling, and observability instrumentation. High-abstraction frameworks like LangChain add layers of indirection that make debugging harder and limit customization. Many teams that started with frameworks for prototyping have migrated to direct API calls with custom orchestration code for better control and transparency.

4. In the build vs. buy decision tree, what is the first question and why?

Show Answer
The first question is whether the capability is a competitive differentiator. This is the most important filter because differentiating capabilities should almost always be built in-house (even if expensive) to maintain strategic control, while non-differentiating capabilities should be bought if a suitable vendor solution exists at reasonable cost.

5. Why should build cost estimates be multiplied by 1.5x in TCO comparisons?

Show Answer
Build costs are typically underestimated by 30 to 50% because initial estimates exclude opportunity cost (what else the engineers could build), recruitment time for specialized ML talent, growing maintenance burden as the system ages, scope changes during development, and hidden infrastructure costs. The 1.5x multiplier provides a more realistic comparison against buy alternatives.

🎯 Key Takeaways