AI Observability
Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.
What it is
AI observability extends traditional observability practices — logs, metrics, and traces — to the domain of artificial intelligence systems. While conventional software monitoring tracks response times and error rates, AI systems must also track token consumption, response quality, per-call cost, and the presence of hallucinations.
The fundamental difference is that LLMs are non-deterministic: the same input can produce different outputs. This makes observability not just operational but also qualitative.
The three pillars applied to AI
Traces
In an AI system, a trace captures the complete journey of a request through the pipeline:
User → Prompt → Retrieval (RAG) → LLM Call → Tool Use → LLM Call → Response
  └──────────────── all steps share trace_id: abc-123 ────────────────┘
Each step records: input/output tokens, latency, model used, cost, and result. This is especially critical in agentic workflows where the model may iterate multiple times.
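A minimal sketch of what such a trace could look like in code. The `Step` and `Trace` classes, their field names, and the token/cost figures are all illustrative assumptions, not the schema of any particular tool:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of the pipeline (retrieval, LLM call, tool use...)."""
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0

@dataclass
class Trace:
    """Groups every step of a request under a shared trace_id."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.steps)

# One request: retrieval, a first LLM call, a tool call, a second LLM call
trace = Trace()
trace.record(Step("retrieval", latency_ms=45.0))
trace.record(Step("llm_call", input_tokens=1200, output_tokens=80,
                  latency_ms=900.0, cost_usd=0.0021))
trace.record(Step("tool_use", latency_ms=120.0))
trace.record(Step("llm_call", input_tokens=1400, output_tokens=300,
                  latency_ms=1800.0, cost_usd=0.0048))
```

Aggregating per-step costs and latencies under one `trace_id` is what lets you answer questions like "why did this request cost ten times the average" in agentic workflows with multiple iterations.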
Metrics
| Metric | Description | Why it matters |
|---|---|---|
| TTFT (Time to First Token) | Latency to first token | User experience |
| Tokens per second | Generation speed | System throughput |
| Cost per request | Tokens × model price | Budget control |
| Hallucination rate | Responses with fabricated information | Reliability |
| Rejection rate | Requests the model couldn't complete | Functional coverage |
| User satisfaction | Explicit or implicit feedback | Perceived quality |
Logs
Detailed records of prompts, responses, tool decisions, and errors. Unlike traditional logs, AI logs include the full content of interactions to enable reproduction and debugging.
Ecosystem tools
| Tool | Type | Features |
|---|---|---|
| Langfuse | Open source | Traces, evaluations, prompt management |
| LangSmith | Commercial (LangChain) | Traces, evaluation datasets, playground |
| Arize Phoenix | Open source | Traces, drift detection, evaluations |
| Braintrust | Commercial | Evaluations, logging, model comparison |
| OpenTelemetry + extensions | Open standard | Integration with existing distributed tracing infrastructure |
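When integrating with OpenTelemetry-based tracing, LLM calls are typically annotated with span attributes from the `gen_ai.*` semantic conventions. A minimal sketch of that attribute mapping — note these conventions are still evolving, so names should be checked against the current spec:

```python
def llm_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Map an LLM call's data onto OpenTelemetry GenAI-style span attributes.

    Attribute names follow the gen_ai.* namespace of the OTel semantic
    conventions; verify them against the version your backend supports.
    """
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = llm_span_attributes("example-model", input_tokens=1200, output_tokens=80)
```

Using a shared attribute vocabulary is what lets LLM spans land in the same backend — and the same dashboards — as the rest of your distributed traces.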
Production evaluations
AI observability includes continuous evaluations — not just in development but in production:
- LLM-as-judge: using one model to evaluate another's responses
- Heuristic evaluations: rules on length, format, source presence
- Human feedback: thumbs up/down, corrections, escalations
- Business metrics: resolution rate, session time, conversion
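Of these, heuristic evaluations are the cheapest to run on every response. A sketch with some illustrative rules — the thresholds and the refusal/source checks are assumptions to adapt to your use case:

```python
def heuristic_eval(response: str, min_len: int = 20, max_len: int = 2000) -> dict:
    """Cheap rule-based checks applied to every production response."""
    checks = {
        # Suspiciously short or long answers often signal a failure mode
        "length_ok": min_len <= len(response) <= max_len,
        # Crude source check: a URL or a bracketed citation marker
        "has_source": "http" in response or "[" in response,
        # Crude refusal check; real systems use a richer pattern list
        "not_refusal": not response.lower().startswith("i cannot"),
    }
    checks["passed"] = all(checks.values())
    return checks

result = heuristic_eval(
    "According to the docs [1], TTFT measures latency to the first token."
)
```

Heuristics catch obvious regressions instantly and for free; the more expensive LLM-as-judge and human-feedback evaluations are then reserved for the responses that pass.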
Why it matters
Without observability, an AI system in production is an expensive black box. Teams cannot:
- Detect quality degradation before users report it
- Optimize costs by identifying unnecessary calls or oversized models
- Debug why an agent made an incorrect decision
- Meet audit and compliance requirements
References
- OpenLLMetry — Traceloop. OpenTelemetry instrumentation for LLMs.
- Langfuse Documentation — Langfuse. Open source LLM observability platform.
- LLM Observability — Arize AI. Phoenix documentation for traces and evaluations.