Module 25 · Section 25.5

Observability & Tracing

LLM tracing concepts, LangSmith, Langfuse, Phoenix, LangWatch, structured logging, and production alerting
★ Big Picture

You cannot debug what you cannot see. LLM applications involve chains of prompts, retrieval steps, tool calls, and generation steps that are invisible without proper instrumentation. When a user reports a bad response, you need to trace back through the entire execution: what was the prompt? What context was retrieved? What tool calls were made? How long did each step take? Observability is the practice of instrumenting your system so that you can answer these questions for any request, at any time. This section covers LLM-specific tracing concepts and the major platforms that provide this capability.

1. LLM Tracing Concepts

LLM tracing extends the concept of distributed tracing to LLM-specific operations. A trace represents a single end-to-end request through your application. Each trace contains spans that represent individual operations: an LLM call, a retrieval query, a tool invocation, or a custom function. Spans capture inputs, outputs, latency, token counts, model parameters, and any metadata you attach.

The trace hierarchy for a typical RAG application looks like this: a root span for the entire request, child spans for embedding the query, searching the vector store, constructing the prompt, calling the LLM, and post-processing the response. Each span records its duration, enabling you to identify performance bottlenecks at a glance.

[Trace timeline: RAG Request (total: 1180ms) → embed_query (85ms) → vector_search (120ms) → rerank (65ms) → llm_generate (870ms) [gpt-4o, 1420 tokens]. Span metadata for llm_generate: model gpt-4o, prompt_tokens 580, completion_tokens 840, temperature 0.0, cost $0.0142. Span metadata for vector_search: collection docs_v2, top_k 5, results_returned 5, similarity range 0.72–0.91.]
Figure 25.12: Anatomy of an LLM trace showing spans, durations, and metadata for a RAG request.

2. Instrumenting with Langfuse

Langfuse is an open-source LLM observability platform that supports tracing, prompt management, evaluation, and cost tracking. It can be self-hosted or used as a managed service. Its Python SDK provides both a decorator-based API for easy instrumentation and a low-level API for custom spans.

from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # drop-in replacement with tracing

# The @observe decorator automatically creates traces and spans
@observe()
def rag_pipeline(query: str) -> str:
    """RAG pipeline with automatic Langfuse tracing."""
    # Each decorated function becomes a span in the trace
    context_docs = retrieve_documents(query)
    answer = generate_answer(query, context_docs)
    return answer

@observe()
def retrieve_documents(query: str) -> list[str]:
    """Retrieve relevant documents from vector store."""
    # Add custom metadata to the current span
    langfuse_context.update_current_observation(
        metadata={"collection": "knowledge_base_v2", "top_k": 5}
    )
    # Retrieval logic here...
    docs = ["Document 1 content...", "Document 2 content..."]
    return docs

@observe()
def generate_answer(query: str, docs: list[str]) -> str:
    """Generate answer using retrieved context."""
    context = "\n".join(docs)
    # The Langfuse OpenAI wrapper auto-traces this call
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on context:\n{context}"},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Traces appear automatically in the Langfuse dashboard
result = rag_pipeline("What are the benefits of RAG?")
📝 Langfuse OpenAI Wrapper

Importing openai from langfuse.openai gives you a drop-in replacement for the standard OpenAI client. It automatically captures all LLM call details (model, tokens, latency, cost) without any other code changes. This is the easiest way to add tracing to an existing application. Similar wrappers exist for other providers and frameworks.

3. Tracing with LangSmith

LangSmith is the observability platform built by the LangChain team. It provides tracing, evaluation, datasets, and prompt versioning. If you use LangChain or LangGraph, LangSmith tracing integrates automatically. For non-LangChain applications, the @traceable decorator provides similar functionality.

from langsmith import traceable, Client
from openai import OpenAI
import os

# Enable LangSmith tracing via environment variable
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

client = OpenAI()

@traceable(run_type="chain")
def answer_question(question: str) -> dict:
    """Answer a question with LangSmith tracing."""
    docs = search_knowledge_base(question)
    response = call_llm(question, docs)
    return {"answer": response, "sources": docs}

@traceable(run_type="retriever")
def search_knowledge_base(query: str) -> list[str]:
    """Search vector store for relevant documents."""
    # Retrieval logic here...
    return ["Relevant document content..."]

@traceable(run_type="llm")
def call_llm(question: str, context_docs: list[str]) -> str:
    """Call LLM with retrieved context."""
    context = "\n".join(context_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Use this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

4. Platform Comparison

| Platform | Open Source? | Self-Host? | Key Differentiator | Best For |
|---|---|---|---|---|
| LangSmith | No | No (cloud only) | Deep LangChain integration | LangChain/LangGraph users |
| Langfuse | Yes | Yes | Open source; prompt management | Teams wanting full control |
| Phoenix (Arize) | Yes | Yes | Embedding visualization; eval integration | ML teams with embedding analysis needs |
| LangWatch | Partial | Yes | Guardrails integration; safety monitoring | Safety-focused applications |
| TruLens | Yes | Yes | Feedback functions; modular evaluation | Custom evaluation workflows |

5. Structured Logging Patterns

Even with dedicated tracing platforms, structured logging remains essential for debugging, auditing, and compliance. LLM-specific logging should capture prompt templates, variable values, model responses, token usage, latency, and any evaluation scores. Use structured (JSON) logging rather than plain text to enable automated parsing and analysis.

import logging
import json
import time
from datetime import datetime, timezone
from functools import wraps

# Configure JSON structured logging
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if hasattr(record, "llm_data"):
            log_data.update(record.llm_data)
        return json.dumps(log_data)

logger = logging.getLogger("llm_app")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_llm_call(func):
    """Decorator to log LLM calls with structured metadata."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = func(*args, **kwargs)
            latency_ms = (time.time() - start) * 1000

            record = logger.makeRecord(
                "llm_app", logging.INFO, "", 0,
                f"LLM call: {func.__name__}", (), None
            )
            record.llm_data = {
                "event": "llm_call",
                "function": func.__name__,
                "latency_ms": round(latency_ms, 1),
                "status": "success",
            }
            logger.handle(record)
            return result

        except Exception as e:
            latency_ms = (time.time() - start) * 1000
            record = logger.makeRecord(
                "llm_app", logging.ERROR, "", 0,
                f"LLM call failed: {func.__name__}", (), None
            )
            record.llm_data = {
                "event": "llm_call_error",
                "function": func.__name__,
                "latency_ms": round(latency_ms, 1),
                "error": str(e),
            }
            logger.handle(record)
            raise
    return wrapper
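Because the formatter above emits one JSON object per line, the resulting logs can be analyzed with nothing more than json.loads. A small sketch (the log lines below are hypothetical examples of the shape the formatter would emit):

```python
import json

# Hypothetical structured log lines, one JSON object per line
log_lines = [
    '{"timestamp": "2024-01-01T00:00:00+00:00", "level": "INFO", '
    '"message": "LLM call: generate", "event": "llm_call", '
    '"function": "generate", "latency_ms": 850.2, "status": "success"}',
    '{"timestamp": "2024-01-01T00:00:05+00:00", "level": "INFO", '
    '"message": "LLM call: generate", "event": "llm_call", '
    '"function": "generate", "latency_ms": 1210.7, "status": "success"}',
    '{"timestamp": "2024-01-01T00:00:09+00:00", "level": "ERROR", '
    '"message": "LLM call failed: generate", "event": "llm_call_error", '
    '"function": "generate", "latency_ms": 30000.0, "error": "timeout"}',
]

# Parse every record, then filter and aggregate by field
records = [json.loads(line) for line in log_lines]
errors = [r for r in records if r["event"] == "llm_call_error"]
latencies = sorted(r["latency_ms"] for r in records if r["event"] == "llm_call")
error_rate = len(errors) / len(records)
```

This kind of field-level filtering is exactly what plain-text logs cannot offer without fragile regex parsing, and it is what log aggregation platforms build their query languages around.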

6. Alerting for LLM Applications

Production LLM applications need alerting on metrics that traditional monitoring does not cover. In addition to standard alerts (error rate, latency p95, availability), LLM-specific alerts should track token usage spikes, cost anomalies, quality score degradation, and safety violations.

[Alert categories: Performance — latency p95 > 5s; error rate > 1%; tokens per request > 2× baseline; timeout rate > 0.5%; queue depth > 100. Cost & Usage — daily spend > $X threshold; cost-per-request spike > 3×; token usage anomaly; rate limit approaching; prompt cache usage drop. Quality & Safety — faithfulness score < 0.7; hallucination rate > 5%; safety filter triggers > N; user thumbs-down spike; prompt injection detected.]
Figure 25.13: Alert categories for production LLM applications covering performance, cost, and quality.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AlertRule:
    """Definition of an alerting rule for LLM monitoring."""
    name: str
    metric: str
    threshold: float
    comparison: str  # "gt" (greater than) or "lt" (less than)
    window_minutes: int = 15
    severity: str = "warning"  # warning, critical
    notify_channel: str = "slack"

class LLMAlertManager:
    """Manages alerting rules for LLM applications."""

    DEFAULT_RULES = [
        AlertRule("High Latency", "latency_p95_ms", 5000, "gt", severity="warning"),
        AlertRule("Error Rate", "error_rate", 0.01, "gt", severity="critical"),
        AlertRule("Cost Spike", "cost_per_request_usd", 0.10, "gt", severity="warning"),
        AlertRule("Low Faithfulness", "faithfulness_score", 0.7, "lt", severity="critical"),
        AlertRule("Hallucination Spike", "hallucination_rate", 0.05, "gt", severity="critical"),
    ]

    def __init__(self, rules: Optional[list[AlertRule]] = None):
        self.rules = rules or self.DEFAULT_RULES

    def check_metrics(self, current_metrics: dict) -> list[dict]:
        """Evaluate all rules against current metrics."""
        fired = []
        for rule in self.rules:
            if rule.metric not in current_metrics:
                continue
            value = current_metrics[rule.metric]
            triggered = (
                (rule.comparison == "gt" and value > rule.threshold)
                or (rule.comparison == "lt" and value < rule.threshold)
            )
            if triggered:
                fired.append({
                    "alert": rule.name,
                    "severity": rule.severity,
                    "metric": rule.metric,
                    "value": value,
                    "threshold": rule.threshold,
                })
        return fired
💡 Key Insight

Start with a small set of high-signal alerts and expand gradually. Alert fatigue is a real problem: if your team receives dozens of alerts per day, they will start ignoring them. Focus on the metrics that directly indicate user-facing problems (error rate, safety violations, severe quality drops) and set thresholds conservatively. Use warning-level alerts for early signals and critical-level alerts for immediate action items.
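One way to act on that advice is to route fired alerts by severity, so critical alerts page immediately while warnings collect into a periodic digest. A minimal sketch (route_alerts and the alert dicts are illustrative; the dict shape matches what a check_metrics-style evaluator might return):

```python
def route_alerts(fired: list[dict]) -> dict[str, list[dict]]:
    """Group fired alerts by severity so each group can go to a
    different channel: critical -> immediate page, warning -> digest."""
    routed: dict[str, list[dict]] = {"critical": [], "warning": []}
    for alert in fired:
        routed.setdefault(alert["severity"], []).append(alert)
    return routed

# Example: two alerts fired in the same evaluation window
fired = [
    {"alert": "Error Rate", "severity": "critical",
     "metric": "error_rate", "value": 0.03, "threshold": 0.01},
    {"alert": "High Latency", "severity": "warning",
     "metric": "latency_p95_ms", "value": 6200, "threshold": 5000},
]
routed = route_alerts(fired)
```

Keeping routing separate from rule evaluation also makes it easy to tighten or loosen notification policy without touching thresholds.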

📝 Knowledge Check

1. What is the difference between a trace and a span in LLM observability?
Show Answer
A trace represents a complete end-to-end request through the application (for example, a single user query through a RAG pipeline). A span represents a single operation within that trace (for example, embedding the query, searching the vector store, or calling the LLM). Traces contain multiple spans arranged in a parent-child hierarchy. Each span records its own duration, inputs, outputs, and metadata, allowing you to drill into individual operations within a request.
2. How does the Langfuse OpenAI wrapper simplify tracing?
Show Answer
The Langfuse OpenAI wrapper (from langfuse.openai import openai) is a drop-in replacement for the standard OpenAI client. It automatically captures all LLM call details (model name, prompt, completion, token counts, latency, cost) without requiring any code changes to existing API calls. When used within a function decorated with @observe(), it automatically attaches the LLM call as a child span in the current trace.
3. Why is structured (JSON) logging preferred over plain text logging for LLM applications?
Show Answer
Structured logging produces machine-parseable records that can be automatically queried, filtered, and aggregated. For LLM applications, this enables filtering logs by model name, cost above a threshold, latency percentiles, or error types. Plain text logs require manual parsing with regex patterns, which is fragile and error-prone. Structured logs also integrate easily with log aggregation platforms (Elasticsearch, Datadog, CloudWatch) for dashboards and alerting.
4. What LLM-specific metrics should trigger production alerts that traditional monitoring would not cover?
Show Answer
LLM-specific alert metrics include: faithfulness score drops (indicating hallucination increases), token usage anomalies (indicating prompt injection or runaway generation), cost per request spikes, safety filter trigger rates, user feedback score degradation, and prompt injection detection rates. Traditional monitoring covers errors and latency but misses these quality and safety dimensions that are critical for LLM applications.
5. When would you choose Langfuse over LangSmith for your project?
Show Answer
Choose Langfuse when you want an open-source solution that you can self-host for data privacy and compliance, when you are not using LangChain and want a framework-agnostic tracing solution, when you need built-in prompt management, or when you want full control over your observability data. Choose LangSmith when you are heavily invested in the LangChain or LangGraph ecosystem and want the deepest possible integration with those tools.

Key Takeaways