Metrics & Monitoring
Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.
seed #metrics #monitoring #prometheus #grafana #dashboards #alerting
What it is
Metrics are numerical measurements aggregated over time describing system behavior. Monitoring is the process of collecting, storing, visualizing, and alerting on those metrics.
Metric types
| Type | Behavior | Example | When to use |
|---|---|---|---|
| Counter | Only increases; resets to zero when the process restarts | Total requests, cumulative errors | Rates (requests/s) |
| Gauge | Goes up and down | Memory used, active connections | Current resource state |
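| Histogram | Counts observations into buckets; quantiles are estimated server-side at query time | Latency p50/p95/p99 | Latency percentiles that must be aggregated across instances |
| Summary | Quantiles are calculated client-side, inside the application; they cannot be meaningfully aggregated across instances | Pre-calculated latency quantiles | When server-side aggregation is not possible |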
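As a rough illustration of why counters are read as rates and how percentiles can be estimated from histogram buckets, here is a minimal pure-Python sketch. The function names `counter_rate` and `histogram_quantile` and the input shapes are assumptions made for this example. The linear interpolation inside a bucket follows the general idea used by server-side quantile estimation, but it is not a copy of any particular implementation.

```python
import math


def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the value seen after the reset counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by bound;
    the last bound may be float("inf"). Observations are assumed to be
    spread evenly inside each bucket (linear interpolation), and the lowest
    bucket is assumed to start at 0.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # No finite upper bound to interpolate towards.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, a counter sampled as 100, 130, 10, 30 at t = 0, 10, 20, 40 seconds contains one reset, and `counter_rate` reports 1.5 per second. The accuracy of `histogram_quantile` depends on how well the bucket boundaries match the real distribution.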
The Four Golden Signals (Google SRE)
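- Latency: time taken to serve a request, ideally tracked separately for successful and failed requests
- Traffic: demand on the system, e.g. requests per second
- Errors: rate of requests that fail
- Saturation: how "full" the system is, i.e. how close its most constrained resource is to capacity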
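The four signals can be derived from raw request records, as in the following sketch. It is illustrative only: the `Request` record, the `golden_signals` function, the choice of HTTP status >= 500 as "error", and in-flight requests over capacity as "saturation" are all assumptions of this example, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window into the four golden signals."""
    n = len(requests)
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    if n:
        # Nearest-rank p95, using integer arithmetic for ceil(0.95 * n).
        rank = (95 * n + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "latency_p95_s": p95,
        "traffic_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "saturation": in_flight / capacity,
    }
```

In a real deployment these values would normally come from counters, histograms, and gauges exported by the service rather than from raw records held in memory.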
Typical stack
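Application (exposes metrics) → Prometheus (collection, storage, alert rule evaluation) → Grafana (visualization). Prometheus sends firing alerts to Alertmanager, which handles grouping, routing, and notification.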
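Prometheus collects metrics by scraping a plain-text HTTP endpoint exposed by the application. The sketch below renders one metric family in that text format using only the standard library. The helper names `render_metric` and `_escape_label` are hypothetical; in practice an official client library would normally generate this output.

```python
def _escape_label(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as text exposition lines.

    `samples` is a list of (labels_dict, value) pairs; an empty dict
    produces a sample line without labels.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{_escape_label(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_metric("http_requests_total", "counter", "Total HTTP requests.", [({"method": "get", "code": "200"}, 1027)])` yields a `# HELP` line, a `# TYPE` line, and one sample line with the labels sorted by name.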
Best practices
- USE method for resources: Utilization, Saturation, Errors
- RED method for services: Rate, Errors, Duration
- Per-service dashboards with the 4 golden signals
- Alerts based on SLOs, not arbitrary metrics
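One common way to base alerts on SLOs is the error-budget burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below is a minimal illustration. The function names, the two-window check, and the default threshold of 14.4 are assumptions chosen for this example and should be tuned per service.

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO period; higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn budget faster than `threshold`.

    Each window is an (errors, total) pair. Requiring the short window as
    well avoids paging for an incident that has already recovered.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

With a 99.9% SLO, 20 errors in 1000 requests is an error ratio of 2% against a budget of 0.1%, a burn rate of roughly 20, which exceeds the example threshold.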
Why it matters
What is not measured is not improved. Metrics and monitoring turn intuition into data, making it possible to detect degradations before they impact users and to base capacity decisions on evidence.
References
- Prometheus — CNCF monitoring system.
- Grafana — Visualization platform.
- OpenTelemetry Metrics — OpenTelemetry, 2024. Open standard for metrics.