Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1

Observability

The ability to understand a system's internal state from its external outputs (logs, metrics, and traces), so that problems can be diagnosed without direct access to the system.

seed · #observability #monitoring #logs #metrics #traces #opentelemetry

What it is

Observability is the ability to understand what's happening inside a system based on the data it produces. Unlike monitoring (which checks known conditions), observability enables investigating unknown problems.
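
The difference between checking a known condition and investigating an unknown one can be sketched in a few lines. This is only an illustration: the events, field names (`status`, `region`, `app_version`), threshold, and helper functions are all made up for the example and do not come from any particular tool.

```python
from collections import Counter

# Made-up request events; every field name and value here is illustrative.
events = [
    {"status": 200, "region": "us-east", "app_version": "1.4.0"},
    {"status": 500, "region": "us-east", "app_version": "1.5.0"},
    {"status": 200, "region": "eu-west", "app_version": "1.4.0"},
    {"status": 500, "region": "eu-west", "app_version": "1.5.0"},
    {"status": 200, "region": "us-east", "app_version": "1.4.0"},
    {"status": 500, "region": "us-east", "app_version": "1.5.0"},
]


def error_rate_alert(events, threshold=0.25):
    """Monitoring: a check decided in advance (is the error rate too high?)."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def errors_by(events, field):
    """Observability: slice the same data by any field, chosen after the fact."""
    return Counter(e[field] for e in events if e["status"] >= 500)


print(error_rate_alert(events))          # the alert only says something is wrong
print(errors_by(events, "region"))       # errors are spread across regions
print(errors_by(events, "app_version"))  # errors all come from one version
```

The alert answers the one question it was built for. The second function works only because each event carries enough context to be grouped by a dimension nobody planned to ask about.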

The three pillars

Logs

Textual event records:

  • Structured logging (JSON) for efficient search
  • Levels: DEBUG, INFO, WARN, ERROR
  • Correlation with trace IDs
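
A minimal sketch of the three points above, using only Python's standard `logging` module. The `JsonFormatter` class, the `checkout` logger name, and the example trace ID are assumptions made for illustration; a real service would more likely use an existing structured-logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra=`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # example value, not a real trace
log.info("payment accepted", extra={"trace_id": trace_id})
log.error("inventory lookup failed", extra={"trace_id": trace_id})

# Because every line is JSON, searching is a filter on fields, not a regex.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
errors_for_trace = [
    r for r in records if r["trace_id"] == trace_id and r["level"] == "ERROR"
]
print(errors_for_trace[0]["message"])
```

Carrying the trace ID on every record is what lets a log line be joined to the trace it belongs to.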

Metrics

Numerical measurements aggregated over time:

  • Counters: values that only increment
  • Gauges: values that go up and down
  • Histograms: distribution of observed values across buckets (for example, request latency)
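
The three metric types can be sketched as small Python classes. This is a toy model for illustration, not the API of Prometheus or any other client library; the class names, bucket bounds, and metric names are assumptions. Note that the histogram here keeps per-bucket counts, whereas some systems store cumulative ones.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only increment")
        self.value += amount


@dataclass
class Gauge:
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value  # may move up or down freely


@dataclass
class Histogram:
    # Upper bounds of the buckets, ascending; one overflow bucket follows.
    bounds: tuple = (0.1, 0.5, 1.0)
    counts: list = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


requests_total = Counter()     # hypothetical metric names
in_flight = Gauge()
latency_seconds = Histogram()

for duration in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    latency_seconds.observe(duration)
in_flight.set(3)
in_flight.set(1)

print(requests_total.value, in_flight.value, latency_seconds.counts)
```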

Traces

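Records of a request's path through distributed services:

  • Span: a single unit of work, such as one service call or database query
  • Trace: the set of spans that share a trace ID and together describe one request
  • Context propagation: passing the trace ID and parent span ID between services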
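
How spans, traces, and context propagation fit together can be sketched without any tracing library. The `Span` class and the `start_span`, `inject`, and `extract` helpers below are hypothetical names for this example; the header layout follows the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags), but the sketch skips the validation a real implementation would need.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # 32 hex chars, shared by every span in the trace
    span_id: str               # 16 hex chars, unique per span
    parent_id: Optional[str]   # None for the root span
    name: str


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, secrets.token_hex(8), parent_id, name)


def inject(span: Span, headers: dict) -> None:
    """Write the span's context into outgoing request headers."""
    headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"


def extract(headers: dict) -> Span:
    """Rebuild the caller's context from incoming request headers."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return Span(trace_id, span_id, None, "remote-parent")


# Service A handles a request and calls service B.
root = start_span("GET /checkout")
outgoing_headers = {}
inject(root, outgoing_headers)

# Service B receives the headers and continues the same trace.
remote_parent = extract(outgoing_headers)
child = start_span("charge-card", parent=remote_parent)

print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```

Both spans end up with the same trace ID, and the child records the caller's span ID as its parent, which is what lets a tracing backend reassemble the request across services.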

OpenTelemetry

OpenTelemetry is a CNCF project that provides a vendor-neutral standard for instrumenting logs, metrics, and traces, with APIs and SDKs for most major languages.

Tools

Tool              Type
Grafana           Dashboards
Prometheus        Metrics
Jaeger/Tempo      Traces
Loki              Logs
Datadog           All-in-one
AWS CloudWatch    AWS native

Why it matters

Production systems fail in ways nobody anticipated. Observability lets you understand a system's behavior in production without having to predict in advance which questions you will need to answer, whereas traditional monitoring can only report on the conditions it was configured to check.


