Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1

AI Observability

Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.

evergreen · #observability #llm #monitoring #tracing #langfuse #production #metrics

What it is

AI observability extends traditional observability practices — logs, metrics, and traces — to the domain of artificial intelligence systems. While conventional software monitoring tracks response times and error rates, AI systems additionally need to track token consumption, response quality, per-call costs, and the presence of hallucinations.

The fundamental difference is that LLMs are non-deterministic: the same input can produce different outputs. This makes observability not just operational but also qualitative — we need to measure not only "did it work?" but "was the response good?".

The three pillars applied to AI

Traces

In an AI system, a trace captures the complete journey of a request through the pipeline:

User → Prompt → Retrieval (RAG) → LLM Call → Tool Use → LLM Call → Response
  │       │           │                │           │          │          │
  └─ trace_id: abc-123 ─┴───────────────┴───────────┴──────────┴──────────┘

Each step records: input/output tokens, latency, model used, cost, and result. This is especially critical in agentic workflows where the model may iterate multiple times.

Metrics

Metric                      Description                            Why it matters
TTFT (Time to First Token)  Latency to first token                 User experience
Tokens per second           Generation speed                       System throughput
Cost per request            Tokens × model price                   Budget control
Hallucination rate          Responses with fabricated information  Reliability
Rejection rate              Requests the model couldn't complete   Functional coverage
User satisfaction           Explicit or implicit feedback          Perceived quality
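Two of these metrics fall straight out of raw call data. A minimal sketch — the per-million-token prices are placeholders, not real provider rates:

```python
# Hypothetical prices in USD per million tokens; real rates vary by provider.
PRICE_PER_MTOK = {
    "model-a": {"input": 2.50, "output": 10.00},
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost per request = tokens × model price, per direction."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def tokens_per_second(output_tokens: int, ttft_s: float, total_s: float) -> float:
    """Generation speed, excluding the wait for the first token (TTFT)."""
    return output_tokens / (total_s - ttft_s)

cost = cost_per_request("model-a", input_tokens=1_200, output_tokens=300)
# (1200 × 2.50 + 300 × 10.00) / 1e6 = 0.006 USD
speed = tokens_per_second(output_tokens=300, ttft_s=0.4, total_s=3.4)
# ≈ 100 tokens/second
```

Separating TTFT from generation speed matters because a user perceives them differently: TTFT is "how long until something happens", tokens per second is "how fast does it stream".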

Logs

Detailed records of prompts, responses, tool decisions, and errors. Unlike traditional logs, AI logs include the full content of interactions to enable reproduction and debugging.
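A structured AI log entry might then look like the following — field names and values are illustrative, not a standard schema:

```python
import json
from datetime import datetime, timezone

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "trace_id": "abc-123",
    "level": "INFO",
    # Unlike traditional logs, the full content is kept for reproduction:
    "prompt": "Summarize the attached support ticket.",
    "response": "The customer reports intermittent login failures...",
    "tool_decisions": [{"tool": "search_tickets", "args": {"query": "login"}}],
    "model": "model-a",
    "error": None,
}
print(json.dumps(log_entry))
```

Because these logs contain full prompts and responses, they typically need stricter retention and access controls than ordinary application logs.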

Ecosystem tools

Tool                        Type                    Features
Langfuse                    Open source             Traces, evaluations, prompt management
LangSmith                   Commercial (LangChain)  Traces, evaluation datasets, playground
Arize Phoenix               Open source             Traces, drift detection, evaluations
Braintrust                  Commercial              Evaluations, logging, model comparison
OpenTelemetry + extensions  Open standard           Integration with existing distributed tracing infrastructure

Production evaluations

AI observability includes continuous evaluations — not just in development but in production:

  • LLM-as-judge: using one model to evaluate another's responses
  • Heuristic evaluations: rules on length, format, source presence
  • Human feedback: thumbs up/down, corrections, escalations
  • Business metrics: resolution rate, session time, conversion
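Heuristic evaluations in particular are cheap enough to run on every production response. A sketch of rules on length, format, and source presence — the thresholds and checks are arbitrary examples, to be tuned per use case:

```python
def heuristic_eval(response: str) -> dict[str, bool]:
    """Cheap per-response checks; each False is a signal to inspect the trace."""
    return {
        # Length rule: very short answers are often rejections or failures
        "not_too_short": len(response.split()) >= 10,
        # Format rule: a response cut off mid-sentence suggests truncation
        "not_truncated": response.rstrip().endswith((".", "!", "?")),
        # Source-presence rule: naive check for a link or citation marker
        "cites_a_source": "http" in response or "[source" in response.lower(),
    }

checks = heuristic_eval(
    "According to the docs (https://example.com), retries are capped at 3."
)
```

These rules never catch everything — that is what LLM-as-judge and human feedback are for — but they flag obvious failures at near-zero cost.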

Cost tracking

The cost of an AI system in production can grow rapidly without visibility. An effective cost dashboard tracks:

  • Cost per user/session: identifies users or flows that consume disproportionately
  • Cost per model: compares spending across providers and models to optimize selection
  • Cost per feature: attributes spending to specific product features
  • Daily/weekly trend: detects anomalies before they become surprise bills
  • Wasted tokens: identifies calls with excessive context or truncated responses that repeat

Combining traces with cost metadata enables answering questions like "how much does it cost on average to resolve a support ticket with the agent?" — critical information for product decisions.
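Given call records that carry cost metadata, the dashboard dimensions above reduce to simple group-bys. An illustrative sketch — the record fields are assumptions, not a fixed schema:

```python
from collections import defaultdict

# Each record joins a trace with its cost metadata (fields illustrative).
calls = [
    {"user": "u1", "feature": "support_agent", "day": "2026-01-05", "cost_usd": 0.006},
    {"user": "u1", "feature": "support_agent", "day": "2026-01-05", "cost_usd": 0.004},
    {"user": "u2", "feature": "search",        "day": "2026-01-06", "cost_usd": 0.002},
]

def cost_by(dimension: str, records: list[dict]) -> dict[str, float]:
    """Total cost grouped by one dimension: user, feature, or day."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[dimension]] += r["cost_usd"]
    return dict(totals)

by_user = cost_by("user", calls)        # u1's total ≈ 0.01 USD
by_feature = cost_by("feature", calls)  # attributes spend to product features
```

The same aggregation over a "day" dimension gives the trend line that surfaces anomalies before they become surprise bills.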

Why it matters

Without observability, an AI system in production is an expensive black box. Teams cannot:

  • Detect quality degradation before users report it
  • Optimize costs by identifying unnecessary calls or oversized models
  • Debug why an agent made an incorrect decision
  • Meet audit and compliance requirements

References

  • OpenLLMetry — Traceloop. OpenTelemetry instrumentation for LLMs.
  • Langfuse Documentation — Langfuse. Open source LLM observability platform.
  • LLM Observability — Arize. Phoenix documentation for traces and evaluations.
  • GenAI Semantic Conventions — OpenTelemetry — OpenTelemetry, 2024. Semantic conventions for instrumenting generative AI systems.
  • Braintrust Documentation — Braintrust, 2024. Evaluation and logging platform for LLMs.

Related content

  • Observability

    Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.

  • Distributed Tracing

    Observability technique tracking requests across multiple services in distributed systems, enabling bottleneck identification and failure diagnosis.

  • AI Evaluation Metrics

    Frameworks and metrics for measuring AI system performance, quality, and safety, from standard benchmarks to domain-specific evaluations.

  • AI Orchestration

    Patterns and frameworks for coordinating multiple AI models, tools, and data sources in production pipelines, managing flow between components, memory, and error recovery.

  • Cost Optimization

    Practices and strategies to minimize cloud spending without sacrificing performance, including right-sizing, reservations, spot instances, and eliminating idle resources.
