Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.
AI observability extends traditional observability practices (logs, metrics, and traces) to artificial intelligence systems. Where conventional software monitoring tracks response times and error rates, AI systems additionally need to track token consumption, response quality, per-call cost, and the presence of hallucinations.
The fundamental difference is that LLMs are non-deterministic: the same input can produce different outputs. This makes observability not just operational but also qualitative — we need to measure not only "did it work?" but "was the response good?".
In an AI system, a trace captures the complete journey of a request through the pipeline:
```
User → Prompt → Retrieval (RAG) → LLM Call → Tool Use → LLM Call → Response
  │      │             │             │          │          │          │
  └──────┴──── trace_id: abc-123 ────┴──────────┴──────────┴──────────┘
```
Each step records: input/output tokens, latency, model used, cost, and result. This is especially critical in agentic workflows where the model may iterate multiple times.
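To make this concrete, here is a minimal, library-agnostic sketch of the records such a trace might hold. All class and field names are illustrative, not a specific vendor's schema:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in the pipeline: retrieval, LLM call, tool use, ..."""
    name: str              # e.g. "retrieval", "llm_call", "tool_use"
    model: str | None      # model used, when the step is an LLM call
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float

@dataclass
class Trace:
    """The complete journey of one request; trace_id is shared by all steps."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

# One request through the pipeline in the diagram above:
trace = Trace()
trace.spans.append(Span("retrieval", None, 0, 0, 42.0, 0.0))
trace.spans.append(Span("llm_call", "gpt-4o", 1200, 350, 980.0, 0.0065))
trace.spans.append(Span("tool_use", None, 0, 0, 150.0, 0.0))
trace.spans.append(Span("llm_call", "gpt-4o", 1600, 200, 750.0, 0.0060))
print(f"{trace.trace_id}: ${trace.total_cost():.4f}")
```

In a real system these records would be emitted to an observability backend (see the tools table below) rather than held in memory.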
| Metric | Description | Why it matters |
|---|---|---|
| TTFT (Time to First Token) | Latency to first token | User experience |
| Tokens per second | Generation speed | System throughput |
| Cost per request | Tokens consumed × per-token model price | Budget control |
| Hallucination rate | Responses with fabricated information | Reliability |
| Refusal rate | Requests the model declines or fails to complete | Functional coverage |
| User satisfaction | Explicit or implicit feedback | Perceived quality |
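As an illustration of the first three rows, here is a sketch that measures TTFT, tokens per second, and cost from a token stream. The `fake_stream` generator and the per-million-token prices are placeholders; substitute your provider's streaming client and price sheet:

```python
import time

# Hypothetical per-million-token prices; check your provider's price sheet.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def measure_stream(stream, input_tokens: int) -> dict:
    """Collect TTFT, tokens/sec, and cost from an iterable of output tokens."""
    start = time.perf_counter()
    ttft = None
    output_tokens = 0
    for _chunk in stream:  # each item is one generated token or chunk
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        output_tokens += 1
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)  # generation time after the first token
    return {
        "ttft_s": ttft,
        "tokens_per_s": output_tokens / decode_time if decode_time > 0 else None,
        "cost_usd": (input_tokens * PRICE_PER_M_INPUT
                     + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000,
    }

# Demo with a fake stream: 50 tokens arriving every 20 ms.
def fake_stream():
    for _ in range(50):
        time.sleep(0.02)
        yield "tok"

print(measure_stream(fake_stream(), input_tokens=1200))
```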
AI logging keeps detailed records of prompts, responses, tool decisions, and errors. Unlike traditional logs, AI logs include the full content of each interaction so that problematic calls can be reproduced and debugged.
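A hedged sketch of such a log entry as a structured JSON line. The field names are illustrative; note that full prompts and responses may contain PII and should be redacted or access-controlled accordingly:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ai.interactions")

def log_interaction(trace_id: str, model: str, prompt: str, response: str,
                    tool_calls: list[dict], error: str | None = None) -> None:
    """Emit one structured log line with the full content of the interaction."""
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "model": model,
        "prompt": prompt,          # full prompt, so the call can be reproduced
        "response": response,      # full response, for later quality review
        "tool_calls": tool_calls,  # which tools the model decided to invoke
        "error": error,
    }))

log_interaction("abc-123", "gpt-4o",
                prompt="Summarize ticket #4521",
                response="The customer reports a billing error...",
                tool_calls=[{"name": "fetch_ticket", "args": {"id": 4521}}])
```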
| Tool | Type | Features |
|---|---|---|
| Langfuse | Open source | Traces, evaluations, prompt management |
| LangSmith | Commercial (LangChain) | Traces, evaluation datasets, playground |
| Arize Phoenix | Open source | Traces, drift detection, evaluations |
| Braintrust | Commercial | Evaluations, logging, model comparison |
| OpenTelemetry + extensions | Open standard | Integration with existing distributed tracing infrastructure (see the sketch below) |
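For example, the OpenTelemetry route annotates ordinary spans with LLM metadata. This sketch assumes the `opentelemetry-sdk` package is installed; the attribute names follow the incubating GenAI semantic conventions and may still change:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout; in production this would point at your collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-pipeline")

with tracer.start_as_current_span("llm_call") as span:
    # Attribute names follow the incubating gen_ai semantic conventions.
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 1200)
    span.set_attribute("gen_ai.usage.output_tokens", 350)
```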
AI observability includes continuous evaluations, run not just during development but against live production traffic, typically by scoring a sample of real responses.
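As one possible shape for this, here is a sketch of a sampled production evaluator. The word-overlap groundedness check is a deliberately crude placeholder for an LLM-as-judge or NLI-based scorer, and the 5% sample rate is an assumed budget:

```python
import random

SAMPLE_RATE = 0.05  # assumed budget: evaluate ~5% of production traffic

def looks_grounded(response: str, retrieved_docs: list[str]) -> bool:
    """Crude placeholder: does the response reuse material from retrieval?
    Real systems use LLM-as-judge or NLI models for this."""
    words = set(response.lower().split())
    return any(
        len(words & set(doc.lower().split())) >= 0.3 * len(words)
        for doc in retrieved_docs
    )

def maybe_evaluate(trace_id: str, response: str, retrieved_docs: list[str],
                   record_score) -> None:
    """Score a random sample of responses and attach the score to the trace."""
    if random.random() > SAMPLE_RATE:
        return  # skipped: stays within the evaluation budget
    score = 1.0 if looks_grounded(response, retrieved_docs) else 0.0
    record_score(trace_id, name="groundedness", value=score)

# record_score would write to your observability backend (Langfuse, Phoenix, ...);
# with SAMPLE_RATE = 0.05 this prints roughly one run in twenty.
maybe_evaluate("abc-123", "Refunds are accepted within 30 days",
               ["Refunds are accepted within 30 days of purchase."],
               record_score=lambda tid, name, value: print(tid, name, value))
```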
The cost of an AI system in production can grow rapidly without visibility. An effective cost dashboard breaks spend down by model, by feature or endpoint, and over time, and flags when the trend departs from budget.
Combining traces with cost metadata enables answering questions like "how much does it cost on average to resolve a support ticket with the agent?" — critical information for product decisions.
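A sketch of that aggregation over exported trace records; the `ticket` tag and the row format are assumptions about how traces are labeled, not a fixed schema:

```python
from collections import defaultdict

# Hypothetical export: one row per span, tagged with the support ticket
# that the trace resolved.
spans = [
    {"trace_id": "abc-123", "ticket": "T-1", "cost_usd": 0.0065},
    {"trace_id": "abc-123", "ticket": "T-1", "cost_usd": 0.0060},
    {"trace_id": "def-456", "ticket": "T-2", "cost_usd": 0.0110},
]

cost_per_ticket: defaultdict[str, float] = defaultdict(float)
for span in spans:
    cost_per_ticket[span["ticket"]] += span["cost_usd"]

average = sum(cost_per_ticket.values()) / len(cost_per_ticket)
print(f"Average cost per resolved ticket: ${average:.4f}")
```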
Without observability, an AI system in production is an expensive black box. Teams cannot explain why a given response failed, attribute spend to the features that generate it, or catch quality regressions before users report them.
Related concepts:

- **Observability**: the ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct access to the system.
- **Distributed tracing**: an observability technique that tracks requests across multiple services in distributed systems, enabling bottleneck identification and failure diagnosis.
- **AI evaluation**: frameworks and metrics for measuring AI system performance, quality, and safety, from standard benchmarks to domain-specific evaluations.
- **AI orchestration**: patterns and frameworks for coordinating multiple AI models, tools, and data sources in production pipelines, managing flow between components, memory, and error recovery.
- **Cloud cost optimization**: practices and strategies to minimize cloud spending without sacrificing performance, including right-sizing, reservations, spot instances, and eliminating idle resources.