Observability
The ability to understand a system's internal state from its external outputs (logs, metrics, and traces), so that problems can be diagnosed without direct access to the system.
What it is
Observability is the ability to understand what's happening inside a system based on the data it produces. Unlike monitoring (which checks known conditions), observability enables investigating unknown problems.
The three pillars
Logs
Textual event records:
- Structured logging (JSON) for efficient search
- Levels: DEBUG, INFO, WARN, ERROR
- Correlation with trace IDs
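The three points above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a recommended production setup: the `JsonFormatter` class, the `checkout` logger name, and the `trace_id` field are hypothetical names chosen for the example.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # `trace_id` is a hypothetical field supplied through `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Every event emitted while handling one request carries the same trace ID.
trace_id = uuid.uuid4().hex
log.info("payment accepted", extra={"trace_id": trace_id})
log.warning("retrying inventory call", extra={"trace_id": trace_id})

# Because each line is JSON, filtering by field is a parse plus a comparison.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
same_trace = [r for r in records if r["trace_id"] == trace_id]
```

Note that Python's `logging` module names the WARN level `WARNING`; the level set itself matches the list above.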
Metrics
Numerical measurements aggregated over time:
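- Counters: values that only increase (e.g. total requests served)
- Gauges: values that go up and down (e.g. requests currently in flight)
- Histograms: distribution of observed values across buckets (e.g. request latency)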
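To make the difference between the three metric types concrete, here is a minimal sketch in plain Python. The `Counter`, `Gauge`, and `Histogram` classes are hypothetical and written only for illustration; real metrics libraries add labels, thread safety, and an export format.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass
class Counter:
    """A value that only ever increases."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only increase")
        self.value += amount


@dataclass
class Gauge:
    """A value that can move in either direction."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; `bounds` are ascending upper limits."""
    bounds: List[float]
    counts: List[int] = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One extra slot acts as the overflow bucket above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


requests = Counter()
in_flight = Gauge()
latency = Histogram(bounds=[0.1, 0.5, 1.0])

for seconds in (0.05, 0.3, 0.3, 2.0):
    requests.inc()
    latency.observe(seconds)

in_flight.set(3)
in_flight.set(1)
```

The sketch stores per-bucket counts; some systems, Prometheus among them, expose histogram buckets cumulatively instead, so treat the bucket layout here as an assumption of the example.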
Traces
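Records of a request's path through distributed services:
- Span: a single timed unit of work, such as one HTTP call or database query
- Trace: the set of related spans that share a trace ID and together describe one request
- Context propagation: passing the trace ID and parent span ID between services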
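The relationship between spans, traces, and context propagation can be sketched as follows. The `Span` class and the `start_span`, `inject`, and `extract` helpers are hypothetical names for this example. The header layout follows the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`), but validation is deliberately minimal.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """One timed unit of work belonging to a trace."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span (new trace ID) or a child that inherits the trace ID."""
    if parent is None:
        return Span(name=name, trace_id=secrets.token_hex(16))
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


def inject(span: Span, headers: Dict[str, str]) -> None:
    """Write the span's context into outgoing request headers."""
    # "00" is the format version; "01" marks the trace as sampled.
    headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[Span]:
    """Rebuild the remote parent's context from incoming request headers."""
    value = headers.get("traceparent")
    if value is None:
        return None
    parts = value.split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, parent_span_id, _flags = parts
    return Span(name="remote-parent", trace_id=trace_id, span_id=parent_span_id)


# Service A: start the trace and propagate its context on the outgoing call.
root = start_span("GET /checkout")
outgoing: Dict[str, str] = {}
inject(root, outgoing)

# Service B: continue the same trace using the received headers.
remote = extract(outgoing)
child = start_span("charge-card", parent=remote)
child.finish()
root.finish()
```

Both spans end up with the same trace ID, and the child records the root's span ID as its parent, which is what lets a tracing backend reassemble the request across services.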
OpenTelemetry
OpenTelemetry is a CNCF project that defines a vendor-neutral standard for instrumenting logs, metrics, and traces, with APIs and SDKs for most major languages.
Tools
| Tool | Type |
|---|---|
| Grafana | Dashboards |
| Prometheus | Metrics |
| Jaeger/Tempo | Traces |
| Loki | Logs |
| Datadog | All-in-one |
| AWS CloudWatch | AWS native |
Why it matters
Observability lets you understand a system's behavior in production without having to predict in advance which questions you will need to answer. Traditional monitoring checks known conditions; observability lets you investigate the unknown.
References
- OpenTelemetry — Observability standard.
- Observability Engineering — Charity Majors et al.
- OpenTelemetry Documentation — OpenTelemetry, 2024. Complete standard documentation.