Concepts

AI Observability

Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.

#seed #observability #llm #monitoring #tracing #langfuse #production #metrics

What it is

AI observability extends traditional observability practices — logs, metrics, and traces — to the domain of artificial intelligence systems. While conventional software monitors response times and error rates, AI systems additionally need to track token consumption, response quality, per-call costs, and hallucination presence.

The fundamental difference is that LLMs are non-deterministic: the same input can produce different outputs. This makes observability not just operational but also qualitative.

The three pillars applied to AI

Traces

In an AI system, a trace captures the complete journey of a request through the pipeline:

User → Prompt → Retrieval (RAG) → LLM Call → Tool Use → LLM Call → Response
  │       │           │                │           │          │          │
  └─ trace_id: abc-123 ─┴───────────────┴───────────┴──────────┴──────────┘

Each step records: input/output tokens, latency, model used, cost, and result. This is especially critical in agentic workflows where the model may iterate multiple times.

Metrics

| Metric | Description | Why it matters |
|---|---|---|
| TTFT (Time to First Token) | Latency until the first token arrives | User experience |
| Tokens per second | Generation speed | System throughput |
| Cost per request | Tokens × model price | Budget control |
| Hallucination rate | Responses containing fabricated information | Reliability |
| Rejection rate | Requests the model couldn't complete | Functional coverage |
| User satisfaction | Explicit or implicit feedback | Perceived quality |
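The cost and throughput rows of the table reduce to simple arithmetic. A sketch, where the per-1k-token prices and latencies are illustrative placeholders rather than any vendor's real pricing:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost per request = tokens x model price, split by direction,
    since input and output tokens are usually billed at different rates."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

def tokens_per_second(output_tokens: int, total_latency_s: float,
                      ttft_s: float) -> float:
    """Generation speed, measured after the first token (TTFT excluded)."""
    return output_tokens / (total_latency_s - ttft_s)

cost = cost_per_request(1500, 400, price_in_per_1k=0.005, price_out_per_1k=0.015)
tps = tokens_per_second(400, total_latency_s=8.5, ttft_s=0.5)
print(round(cost, 4), round(tps, 1))  # 0.0135 50.0
```

Separating TTFT from steady-state tokens per second matters because the two degrade for different reasons: TTFT reflects queueing and prompt processing, while tokens per second reflects decode throughput.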

Logs

Detailed records of prompts, responses, tool decisions, and errors. Unlike traditional logs, AI logs include the full content of interactions to enable reproduction and debugging.

Ecosystem tools

| Tool | Type | Features |
|---|---|---|
| Langfuse | Open source | Traces, evaluations, prompt management |
| LangSmith | Commercial (LangChain) | Traces, evaluation datasets, playground |
| Arize Phoenix | Open source | Traces, drift detection, evaluations |
| Braintrust | Commercial | Evaluations, logging, model comparison |
| OpenTelemetry + extensions | Open standard | Integration with existing distributed tracing infrastructure |

Production evaluations

AI observability includes continuous evaluations — not just in development but in production:

  • LLM-as-judge: using one model to evaluate another's responses
  • Heuristic evaluations: rules on length, format, source presence
  • Human feedback: thumbs up/down, corrections, escalations
  • Business metrics: resolution rate, session time, conversion
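The heuristic-evaluation bullet, in particular, can run cheaply on every production response. A sketch of rule-based checks; the thresholds, rule names, and the `[Source: ...]` citation convention are illustrative assumptions:

```python
def heuristic_eval(response: str, min_len: int = 20, max_len: int = 2000,
                   require_sources: bool = True) -> dict:
    """Apply simple rules (length, non-emptiness, source presence)
    and return per-rule pass/fail plus an overall verdict."""
    checks = {
        "length_ok": min_len <= len(response) <= max_len,
        "not_empty": bool(response.strip()),
        # Assumes responses cite sources with a "[Source: ...]" marker
        "has_sources": (not require_sources) or ("[source:" in response.lower()),
    }
    checks["passed"] = all(checks.values())
    return checks

result = heuristic_eval("Paris is the capital of France. [Source: Britannica]")
print(result["passed"])  # True
```

Heuristics like these catch gross failures (empty, truncated, or uncited answers) at negligible cost, leaving the more expensive LLM-as-judge and human-feedback signals for the cases that pass.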

Why it matters

Without observability, an AI system in production is an expensive black box. Teams cannot:

  • Detect quality degradation before users report it
  • Optimize costs by identifying unnecessary calls or oversized models
  • Debug why an agent made an incorrect decision
  • Meet audit and compliance requirements
