The LLM ecosystem is expanding so rapidly that vendor selection has become a strategic capability in itself. New model providers, vector databases, agent frameworks, and evaluation platforms launch weekly. Choosing the wrong vendor can lock you into an inferior solution, while building everything in-house can drain engineering resources that should be spent on differentiation. This section provides structured evaluation frameworks for every major category in the LLM stack and a decision tree for the build-versus-buy question.
## 1. LLM Provider Evaluation
Evaluating LLM providers requires balancing quality, cost, latency, privacy, and reliability. The scoring rubric below covers the dimensions that matter most for production deployments. Each dimension is weighted based on its importance for the specific use case.
```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProviderEvaluation:
    """Weighted scoring rubric for LLM provider evaluation."""
    name: str
    scores: Dict[str, float]  # dimension -> score (1-5)

    # Default weights (adjust per use case)
    WEIGHTS: Dict[str, float] = field(default_factory=lambda: {
        "quality": 0.25,      # benchmark performance, instruction following
        "pricing": 0.20,      # cost per million tokens, volume discounts
        "latency": 0.15,      # TTFT, tokens/sec, P99 response time
        "privacy": 0.15,      # data retention, SOC2, HIPAA, GDPR
        "reliability": 0.10,  # uptime SLA, rate limits, error rates
        "ecosystem": 0.10,    # SDKs, integrations, documentation
        "flexibility": 0.05,  # fine-tuning, custom models, function calling
    })

    def weighted_score(self) -> float:
        return sum(self.scores.get(dim, 0) * weight
                   for dim, weight in self.WEIGHTS.items())


# Evaluate three providers for a customer support use case
providers = [
    ProviderEvaluation("OpenAI", {
        "quality": 4.8, "pricing": 3.5, "latency": 4.2, "privacy": 3.8,
        "reliability": 4.0, "ecosystem": 4.8, "flexibility": 4.5,
    }),
    ProviderEvaluation("Anthropic", {
        "quality": 4.7, "pricing": 3.8, "latency": 4.0, "privacy": 4.5,
        "reliability": 4.2, "ecosystem": 3.8, "flexibility": 3.5,
    }),
    ProviderEvaluation("Google (Gemini)", {
        "quality": 4.5, "pricing": 4.2, "latency": 4.3, "privacy": 4.0,
        "reliability": 4.5, "ecosystem": 4.2, "flexibility": 4.0,
    }),
]

ranked = sorted(providers, key=lambda p: p.weighted_score(), reverse=True)
for p in ranked:
    print(f"{p.name:20s} Weighted Score: {p.weighted_score():.2f}/5.00")
```
These scores are illustrative and will change as providers update their offerings. The important takeaway is the framework, not the specific scores. Run your own evaluation with your actual workload: send 500 representative prompts to each provider and measure quality, latency, and cost on your data. Benchmark results on public datasets do not reliably predict performance on domain-specific tasks.
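A minimal harness for that kind of head-to-head test might look like the sketch below. The `call_model`, `score_fn`, and `cost_per_call` parameters are stand-ins: in practice you would wire `call_model` to each provider's SDK and `score_fn` to your own quality judge (labeled reference answers or an LLM-as-judge).

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_provider(
    call_model: Callable[[str], str],       # wraps one provider's API call (stand-in)
    prompts: List[str],                     # representative prompts from your workload
    score_fn: Callable[[str, str], float],  # quality judge: (prompt, response) -> 0..1
    cost_per_call: float,                   # estimated cost per request for this provider
) -> Dict[str, float]:
    """Run the prompt set through one provider and aggregate quality, latency, cost."""
    latencies, scores = [], []
    for prompt in prompts:
        start = time.perf_counter()
        response = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        scores.append(score_fn(prompt, response))
    return {
        "mean_quality": statistics.mean(scores),
        "mean_latency_s": statistics.mean(latencies),
        "total_cost": cost_per_call * len(prompts),
    }
```

Running the same harness against each candidate provider gives you comparable numbers on your own data, which can then feed the `scores` dictionary in the rubric above.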
## 2. Vector Database Evaluation
Vector databases are critical infrastructure for RAG systems. The choice of vector database affects query latency, recall accuracy, operational complexity, and total cost of ownership.
| Database | Type | Managed Option | Filtering | Max Vectors | Best For |
|---|---|---|---|---|---|
| Pinecone | Purpose-built | Fully managed | Metadata + hybrid | Billions | Quick start, no ops team |
| Weaviate | Purpose-built | Cloud + self-host | GraphQL + hybrid | Billions | Complex queries, multi-modal |
| Qdrant | Purpose-built | Cloud + self-host | Rich filtering | Billions | Performance-critical, Rust-based |
| pgvector | Extension | Any Postgres host | Full SQL | Millions | Existing Postgres; small to medium scale |
| Chroma | Purpose-built | Cloud + embedded | Metadata | Millions | Prototyping, embedded use |
```python
from dataclasses import dataclass


@dataclass
class VectorDBEval:
    """Evaluation scorecard for vector databases."""
    name: str
    query_latency_ms: float  # P95 for 1M vectors
    recall_at_10: float      # recall@10 on standard benchmark
    ops_complexity: int      # 1-5 (1=fully managed, 5=complex)
    monthly_cost_1m_vectors: float
    has_hybrid_search: bool

    def value_score(self) -> float:
        """Higher is better: recall and latency matter most."""
        latency_score = max(0, 5 - self.query_latency_ms / 10)
        recall_score = self.recall_at_10 * 5
        ops_score = 6 - self.ops_complexity  # invert: lower complexity = higher score
        hybrid_score = 5 if self.has_hybrid_search else 0
        return (latency_score * 0.3 + recall_score * 0.4
                + ops_score * 0.2 + hybrid_score * 0.1)


dbs = [
    VectorDBEval("Pinecone", 12, 0.95, 1, 70, True),
    VectorDBEval("Qdrant", 8, 0.96, 3, 45, True),
    VectorDBEval("pgvector", 25, 0.91, 2, 20, False),
    VectorDBEval("Weaviate", 15, 0.94, 2, 55, True),
    VectorDBEval("Chroma", 20, 0.90, 1, 15, False),
]

ranked = sorted(dbs, key=lambda d: d.value_score(), reverse=True)
for db in ranked:
    print(f"{db.name:12s} Score: {db.value_score():.2f}  "
          f"Latency: {db.query_latency_ms:>4.0f}ms  "
          f"Cost: ${db.monthly_cost_1m_vectors}/mo")
```
## 3. Agent Framework Evaluation
Agent frameworks provide the orchestration layer for multi-step LLM applications. The choice of framework affects development speed, debugging experience, production reliability, and vendor lock-in risk.
| Framework | Abstraction Level | Observability | Streaming | Production Ready | Learning Curve |
|---|---|---|---|---|---|
| LangChain | High (chains, agents) | LangSmith integration | Yes | Moderate | Medium |
| LlamaIndex | High (data connectors) | Built-in tracing | Yes | Moderate | Medium |
| Semantic Kernel | Medium (plugins) | Azure integration | Yes | High (.NET/Java) | Medium |
| OpenAI SDK (native) | Low (direct API) | Manual | Yes | High | Low |
| Custom (no framework) | None | Manual | Manual | Depends on team | Highest initial |
The trend in production LLM applications is moving toward thinner frameworks or no framework at all. Many teams that started with high-abstraction frameworks like LangChain have migrated to direct API calls with custom orchestration because they needed more control over retry logic, token management, and error handling. Start with a framework for prototyping, but plan for the possibility that production code may be simpler without one.
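The kind of control teams want from custom orchestration can be illustrated with a retry wrapper. This is a generic sketch, not any framework's API: `with_retries` accepts any zero-argument callable (in practice, a closure around a provider SDK call) and the retryable exception types are placeholders you would replace with your provider's transient error classes.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    retryable: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff (0.5x-1.5x jitter) to avoid thundering herds
            time.sleep(base_delay * (2 ** (attempt - 1)) * (0.5 + random.random()))
    raise RuntimeError("unreachable")
```

Writing this yourself is roughly 20 lines; the payoff is that backoff, jitter, and which errors count as transient are all explicit and tunable, rather than buried in a framework's defaults.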
## 4. Build vs. Buy Decision Tree
The build-versus-buy decision for LLM components depends on whether the capability is a competitive differentiator, the team's ability to maintain it, and the total cost of ownership over a 12 to 36 month horizon.
```python
def tco_comparison(
    # Build costs
    build_dev_months: float,
    build_engineer_monthly: float,
    build_infra_monthly: float,
    build_maintenance_fte: float,
    # Buy costs
    buy_license_monthly: float,
    buy_integration_months: float,
    buy_integration_engineer_monthly: float,
    # Horizon
    horizon_months: int = 36,
) -> dict:
    """Compare total cost of ownership for build vs. buy over N months."""
    # Build TCO
    build_dev = build_dev_months * build_engineer_monthly * 2  # 2 engineers
    build_infra = build_infra_monthly * horizon_months
    build_maint = build_maintenance_fte * build_engineer_monthly * horizon_months
    build_total = build_dev + build_infra + build_maint

    # Buy TCO
    buy_integration = buy_integration_months * buy_integration_engineer_monthly
    buy_license = buy_license_monthly * horizon_months
    buy_total = buy_integration + buy_license

    return {
        "build_tco": round(build_total),
        "buy_tco": round(buy_total),
        "recommendation": "BUILD" if build_total < buy_total else "BUY",
        "savings": round(abs(build_total - buy_total)),
        "savings_percent": round(
            abs(build_total - buy_total) / max(build_total, buy_total) * 100, 1
        ),
    }


# Example: RAG pipeline (build custom vs. use managed platform)
result = tco_comparison(
    build_dev_months=3,
    build_engineer_monthly=15_000,
    build_infra_monthly=2_000,
    build_maintenance_fte=0.25,
    buy_license_monthly=5_000,
    buy_integration_months=1,
    buy_integration_engineer_monthly=15_000,
    horizon_months=36,
)
for k, v in result.items():
    print(f"  {k}: {v}")
```
TCO calculations often underestimate build costs by 30 to 50% because they exclude opportunity cost (what else could the engineers be building?), recruitment time for specialized talent, and the maintenance burden that grows as the system ages. When in doubt, multiply your build estimate by 1.5x and compare again. If the decision flips, the choice is closer than it appears and deserves deeper analysis.
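That sanity check can be expressed as a tiny helper. This is a sketch: it just asks whether inflating the build-side total changes which option is cheaper. The dollar figures in the usage lines are illustrative, taken from the RAG-pipeline example above (build totals roughly $297k and buy roughly $195k over 36 months).

```python
def decision_flips(build_tco: float, buy_tco: float, factor: float = 1.5) -> bool:
    """True if inflating the build estimate by `factor` changes the recommendation."""
    return (build_tco < buy_tco) != (build_tco * factor < buy_tco)


# Hypothetical cheaper build: BUILD at 1x, BUY at 1.5x -> decision is fragile
print(decision_flips(150_000, 195_000))  # True

# The RAG example above: BUY either way -> decision is robust
print(decision_flips(297_000, 195_000))  # False
```

A `True` result means the comparison is sensitive to estimation error and deserves the deeper analysis described above; a `False` result means the recommendation holds even under pessimistic build assumptions.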
## ✔ Knowledge Check
1. What are the seven dimensions in the LLM provider evaluation rubric?
2. When should you consider using pgvector instead of a purpose-built vector database?
3. Why is the trend in production LLM applications moving toward thinner or no frameworks?
4. In the build vs. buy decision tree, what is the first question and why?
5. Why should build cost estimates be multiplied by 1.5x in TCO comparisons?
## 🎯 Key Takeaways
- Use weighted scoring rubrics: Evaluate providers and tools across multiple dimensions with weights that reflect your specific use case priorities.
- Test on your data: Benchmark results and vendor claims do not predict performance on domain-specific tasks. Always run 500+ representative queries through each candidate.
- Match vector DB to scale: pgvector works for millions of vectors; purpose-built databases (Qdrant, Pinecone) are needed for billions with sub-10ms latency requirements.
- Start with frameworks, prepare to outgrow them: Use LangChain or LlamaIndex for prototyping but plan for the possibility that production code may be simpler without a framework.
- Build differentiators, buy commodities: The decision tree starts with whether the capability provides competitive advantage. Non-differentiating infrastructure should be bought.
- Add 50% to build estimates: TCO comparisons routinely underestimate build costs due to opportunity cost, talent acquisition, and maintenance growth.