AI Observability
Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.
What it is
AI observability extends traditional observability practices — logs, metrics, and traces — to the domain of artificial intelligence systems. While conventional software monitoring tracks response times and error rates, AI systems must also track token consumption, response quality, per-call cost, and the presence of hallucinations.
The fundamental difference is that LLMs are non-deterministic: the same input can produce different outputs. This makes observability not just operational but also qualitative.
The three pillars applied to AI
Traces
In an AI system, a trace captures the complete journey of a request through the pipeline:
User → Prompt → Retrieval (RAG) → LLM Call → Tool Use → LLM Call → Response
  └──────────────── all steps share trace_id: abc-123 ────────────────┘
Each step records: input/output tokens, latency, model used, cost, and result. This is especially critical in agentic workflows where the model may iterate multiple times.
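A minimal sketch of what such a trace could look like in code. The `Step` and `Trace` classes, their field names, and the token/cost figures are all illustrative assumptions, not the schema of any particular tool:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of the pipeline (retrieval, LLM call, tool use...)."""
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0

@dataclass
class Trace:
    """Groups every step of a request under a shared trace_id."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.steps)

# One request: retrieval, a first LLM call, a tool call, a second LLM call
trace = Trace()
trace.record(Step("retrieval", latency_ms=45.0))
trace.record(Step("llm_call", input_tokens=1200, output_tokens=80,
                  latency_ms=900.0, cost_usd=0.0021))
trace.record(Step("tool_use", latency_ms=120.0))
trace.record(Step("llm_call", input_tokens=1400, output_tokens=300,
                  latency_ms=1800.0, cost_usd=0.0048))
```

Aggregating per-step costs and latencies under one `trace_id` is what lets you answer questions like "why did this request cost ten times the average" in agentic workflows with multiple iterations.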
Metrics
| Metric | Description | Why it matters |
|---|---|---|
| TTFT (Time to First Token) | Latency to first token | User experience |
| Tokens per second | Generation speed | System throughput |
| Cost per request | Tokens × model price | Budget control |
| Hallucination rate | Responses with fabricated information | Reliability |
| Rejection rate | Requests the model couldn't complete | Functional coverage |
| User satisfaction | Explicit or implicit feedback | Perceived quality |
Logs
Detailed records of prompts, responses, tool decisions, and errors. Unlike traditional logs, AI logs include the full content of interactions to enable reproduction and debugging.
Ecosystem tools
| Tool | Type | Features |
|---|---|---|
| Langfuse | Open source | Traces, evaluations, prompt management |
| LangSmith | Commercial (LangChain) | Traces, evaluation datasets, playground |
| Arize Phoenix | Open source | Traces, drift detection, evaluations |
| Braintrust | Commercial | Evaluations, logging, model comparison |
| OpenTelemetry + extensions | Open standard | Integration with existing distributed tracing infrastructure |
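When integrating with OpenTelemetry-based tracing, LLM calls are typically annotated with span attributes from the `gen_ai.*` semantic conventions. A minimal sketch of that attribute mapping — note these conventions are still evolving, so names should be checked against the current spec:

```python
def llm_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Map an LLM call's data onto OpenTelemetry GenAI-style span attributes.

    Attribute names follow the gen_ai.* namespace of the OTel semantic
    conventions; verify them against the version your backend supports.
    """
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = llm_span_attributes("example-model", input_tokens=1200, output_tokens=80)
```

Using a shared attribute vocabulary is what lets LLM spans land in the same backend — and the same dashboards — as the rest of your distributed traces.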
Production evaluations
AI observability includes continuous evaluations — not just in development but in production:
- LLM-as-judge: using one model to evaluate another's responses
- Heuristic evaluations: rules on length, format, source presence
- Human feedback: thumbs up/down, corrections, escalations
- Business metrics: resolution rate, session time, conversion
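Of these, heuristic evaluations are the cheapest to run on every response. A sketch with some illustrative rules — the thresholds and the refusal/source checks are assumptions to adapt to your use case:

```python
def heuristic_eval(response: str, min_len: int = 20, max_len: int = 2000) -> dict:
    """Cheap rule-based checks applied to every production response."""
    checks = {
        # Suspiciously short or long answers often signal a failure mode
        "length_ok": min_len <= len(response) <= max_len,
        # Crude source check: a URL or a bracketed citation marker
        "has_source": "http" in response or "[" in response,
        # Crude refusal check; real systems use a richer pattern list
        "not_refusal": not response.lower().startswith("i cannot"),
    }
    checks["passed"] = all(checks.values())
    return checks

result = heuristic_eval(
    "According to the docs [1], TTFT measures latency to the first token."
)
```

Heuristics catch obvious regressions instantly and for free; the more expensive LLM-as-judge and human-feedback evaluations are then reserved for the responses that pass.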
Why it matters
Without observability, an AI system in production is an expensive black box. Teams cannot:
- Detect quality degradation before users report it
- Optimize costs by identifying unnecessary calls or oversized models
- Debug why an agent made an incorrect decision
- Meet audit and compliance requirements
References
- OpenLLMetry — Traceloop. OpenTelemetry instrumentation for LLMs.
- Langfuse Documentation — Langfuse. Open source LLM observability platform.
- LLM Observability — Arize AI. Phoenix documentation for traces and evaluations.