Metrics & Monitoring
Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.
What it is
Metrics are numerical measurements aggregated over time describing system behavior. Monitoring is the process of collecting, storing, visualizing, and alerting on those metrics.
Metric types
| Type | Behavior | Example | When to use |
|---|---|---|---|
| Counter | Only increases (resets to zero when the process restarts) | Total requests, accumulated errors | Rates (requests/s) |
| Gauge | Goes up and down | Memory used, active connections | Current resource state |
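| Histogram | Observations counted into buckets; quantiles estimated at query time (server-side) | Request latency, response size | Latency percentiles (p50/p95/p99) that must be aggregated across instances |
| Summary | Quantiles calculated inside the application (client-side) | Pre-calculated latency quantiles | When accurate per-instance quantiles are needed and aggregation across instances is not |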
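The differences between these types are easier to see in code. The following is a minimal sketch in plain Python, not the API of any real client library: the `Counter`, `Gauge`, and `Histogram` classes and the `rate` helper are hypothetical names chosen for illustration. It assumes a counter that goes backwards has been reset by a restart, and it estimates quantiles by linear interpolation inside a bucket, which is one common approach rather than the only one.

```python
from bisect import bisect_left


class Counter:
    """Illustrative counter: the value only ever increases."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Illustrative gauge: the value can move up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


def rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase between two counter samples.

    Assumption: if the counter went backwards, the process restarted
    and the counter started again from zero.
    """
    increase = curr - prev if curr >= prev else curr
    return increase / seconds


class Histogram:
    """Illustrative histogram with one count per bucket upper bound."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        # One extra slot for observations above the highest bound (+Inf).
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect_left(self.bounds, value)] += 1

    def quantile(self, q: float) -> float:
        """Estimate a quantile by interpolating inside the matching bucket."""
        n = sum(self.counts)
        if n == 0:
            raise ValueError("no observations")
        rank = q * n
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.bounds):
                    # Overflow bucket: the best available answer is the
                    # highest finite bound.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-1]
```

Because the histogram keeps only bucket counts, counts from several instances can be added together before a quantile is estimated. The accuracy of the estimate depends on how well the bucket bounds match the real distribution.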
The Four Golden Signals (Google SRE)
- Latency: time taken to serve a request
- Traffic: demand on the system, such as requests per second
- Errors: rate of requests that fail
- Saturation: how "full" the system is relative to its capacity
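As a rough illustration, the four signals can be derived from a window of request records. The sketch below is one possible reading, not a standard implementation: the `Request` record, the `golden_signals` function, and its parameters are hypothetical, errors are assumed to be HTTP 5xx responses, and saturation is assumed to be in-flight requests divided by a known capacity.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # response time in seconds
    status: int        # HTTP status code


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list, with 0 < p <= 1."""
    ordered = sorted(values)
    return ordered[math.ceil(p * len(ordered)) - 1]


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict:
    """Summarize one observation window as the four golden signals."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "latency_p95_s": percentile(durations, 0.95) if total else None,
        "traffic_rps": total / window_s,
        "error_rate": errors / total if total else 0.0,
        # Assumption: saturation is concurrent load relative to capacity.
        "saturation": in_flight / capacity,
    }
```

In practice the right saturation measure depends on the constrained resource (CPU, memory, queue depth, connection pool), so the ratio used here is only a placeholder.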
Typical stack
Application (exposes metrics) → Prometheus (collection, storage, alert rules) → Grafana (visualization)
Prometheus → Alertmanager (alert routing and notification)
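Prometheus collects metrics by scraping an HTTP endpoint on the application that returns samples in a plain-text exposition format. The sketch below renders that format by hand to show what the application side produces; `render_metric` is a hypothetical helper written for this example, and a real service would normally rely on an official client library instead.

```python
def escape_label(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_metric(name: str, kind: str, help_text: str,
                  samples: list[tuple[dict, float]]) -> str:
    """Render one metric family as text-format exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and labels, for illustration only.
    print(render_metric(
        "http_requests_total", "counter", "Total HTTP requests.",
        [({"method": "get", "status": "200"}, 1027),
         ({"method": "post", "status": "500"}, 3)],
    ))
```

Each sample line carries a metric name, an optional set of labels, and the current value; the scraper adds the timestamp when it collects the sample.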
Best practices
- USE method for resources: Utilization, Saturation, Errors
- RED method for services: Rate, Errors, Duration
- Per-service dashboards built around the four golden signals
- Alerts tied to SLOs (user-visible symptoms), not arbitrary metric thresholds
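One way to make SLO-based alerting concrete is the error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is an illustration under stated assumptions, not a prescribed policy. The function names are hypothetical, and the default threshold of 14.4 is an assumed example value (at that rate, one hour consumes about 2% of a 30-day error budget); suitable windows and thresholds depend on the service.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Speed of error-budget consumption; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    return (errors / total) / budget


def should_page(long_window: tuple[int, int],
                short_window: tuple[int, int],
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    Each window is an (errors, total) pair. Requiring the short window
    as well helps the alert clear quickly once the problem stops.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


if __name__ == "__main__":
    # Hypothetical counts for a 99.9% availability SLO.
    print(should_page(long_window=(20, 1000), short_window=(5, 100), slo=0.999))
```

Compared with a fixed threshold on a raw metric, this ties the alert to how quickly users are being affected relative to the agreed objective.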
Why it matters
What is not measured is not improved. Metrics and monitoring turn intuition into data, making it possible to detect degradations before they affect users and to base capacity decisions on evidence.
References
- Prometheus — CNCF monitoring system.
- Grafana — Visualization platform.
- OpenTelemetry Metrics — OpenTelemetry, 2024. Open standard for metrics.
Related content
- Observability: Ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct system access.
- Site Reliability Engineering: Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
- Alerting Strategies: Practices for configuring effective alerts that notify real problems without generating fatigue from excessive notifications.