The ability to understand a system's internal state from its external outputs (logs, metrics, and traces), so that problems can be diagnosed without direct access to the system.
Observability is the ability to understand what's happening inside a system based on the data it produces. Unlike monitoring (which checks known conditions), observability enables investigating unknown problems.
Textual event records:
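As a minimal sketch of what such an event record can look like, the snippet below emits one JSON object per log line using only Python's standard `logging` module. The logger name `checkout` and the fields `request_id` and `user_id` are illustrative assumptions, not part of any particular system.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": record.created,  # seconds since the epoch
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        # These two field names are hypothetical examples.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment accepted", extra={"request_id": "req-123", "user_id": 42})

# Each line can be parsed back into a structured event.
event = json.loads(buffer.getvalue())
```

Keeping the record structured, rather than free text, is what lets a log backend filter and correlate on fields such as `request_id`.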
Numerical measurements aggregated over time:
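To make the idea of aggregation concrete, here is a small sketch of a latency histogram with fixed bucket bounds, written with the standard library only. The class name `LatencyHistogram`, the bucket bounds, and the sample values are all assumptions chosen for illustration; real metrics libraries offer richer versions of the same idea.

```python
from bisect import bisect_left


class LatencyHistogram:
    """Count observations into fixed buckets instead of storing each one."""

    def __init__(self, bounds_ms):
        self.bounds_ms = sorted(bounds_ms)
        # One bucket per upper bound, plus a final overflow bucket.
        self.bucket_counts = [0] * (len(self.bounds_ms) + 1)
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, value_ms: float) -> None:
        # bisect_left finds the first bound that is >= value_ms,
        # so a value equal to a bound lands in that bound's bucket.
        index = bisect_left(self.bounds_ms, value_ms)
        self.bucket_counts[index] += 1
        self.total += 1
        self.sum_ms += value_ms

    def mean(self) -> float:
        return self.sum_ms / self.total if self.total else 0.0


hist = LatencyHistogram(bounds_ms=[50, 100, 500])
for latency in (12, 48, 75, 120, 900):  # made-up request latencies
    hist.observe(latency)

# Buckets: <=50 ms, <=100 ms, <=500 ms, >500 ms
print(hist.bucket_counts, hist.mean())
```

The trade-off is deliberate: individual requests are lost, but the aggregate is cheap to store and to query over long time ranges.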
Request tracking through distributed services:
OpenTelemetry is the CNCF project that unifies instrumentation for logs, metrics, and traces, with SDKs for the major programming languages.
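With instrumentation standards such as OpenTelemetry, trace context is by default carried between services in the W3C Trace Context `traceparent` HTTP header. The following sketch builds and parses that header with the standard library only; it does not use the OpenTelemetry SDK, and the function names are hypothetical helpers rather than any library's API.

```python
import re
import secrets
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace and return its header value (version 00)."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_header(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for the next hop."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # no valid context: start a fresh trace
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_header(incoming)
```

Because every hop reuses the trace ID and only replaces the parent ID, a tracing backend can stitch spans from different services into one request tree.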
| Tool | Type |
|---|---|
| Grafana | Dashboards |
| Prometheus | Metrics |
| Jaeger/Tempo | Traces |
| Loki | Logs |
| Datadog | All-in-one |
| AWS CloudWatch | AWS native |
Observability lets you understand a system's behavior in production without having to predict in advance which questions you will need to answer. Traditional monitoring checks known conditions; observability lets you investigate the unknown.
Set of technical and cultural practices that implement DevOps principles, from Infrastructure as Code to blameless post-mortems: the "how" behind the philosophy.
Discipline of designing and building internal self-service platforms so that development teams can deploy and operate applications autonomously.
Practices and tools for monitoring, tracing, and debugging AI systems in production, covering token metrics, latency, response quality, costs, and hallucination detection.
Practices and strategies to minimize cloud spending without sacrificing performance, including right-sizing, reservations, spot instances, and eliminating idle resources.
Discipline that applies software engineering principles to infrastructure operations, with a focus on building scalable and highly reliable systems.
Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.
Practices for implementing effective logging in distributed systems: structured logging, levels, correlation, and centralized aggregation.
Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.
Observability technique that tracks requests across multiple services in a distributed system, making it possible to identify bottlenecks and diagnose failures.
Discipline of experimenting on production systems to discover weaknesses before they cause incidents, by injecting controlled failures.
AWS fully managed message queue service that decouples the components of a distributed application, offering at-least-once delivery on standard queues and throughput that scales with load.
AWS pub/sub messaging service that distributes messages to multiple subscribers simultaneously, enabling fan-out patterns and notifications at scale.
AWS serverless event bus connecting applications using events, enabling decoupled event-driven architectures with rule-based routing.
Principles and practices for designing clear, consistent, and evolvable programming interfaces that facilitate integration between systems.
Practices for configuring effective alerts that flag real problems without causing alert fatigue through excessive notifications.