Metrics & Monitoring

What it is

Metrics are numerical measurements of system behavior, aggregated over time. Monitoring is the process of collecting, storing, visualizing, and alerting on those metrics.

Metric types

Type	Behavior	Example	When to use
Counter	Only increases (resets to zero on restart)	Total requests, accumulated errors	Rates (requests/s)
Gauge	Goes up and down	Memory used, active connections	Current resource state
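Histogram	Observations counted into buckets; quantiles estimated server-side at query time	Latency p50/p95/p99	Latency percentiles that must be aggregated across instances
Summary	Quantiles calculated client-side, in the application	Pre-calculated latency quantiles	Accurate per-instance quantiles when aggregation across instances is not needed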
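
To make the table more concrete, here is a minimal sketch in plain Python of two of the ideas above: turning counter samples into a rate, and estimating a percentile from histogram buckets. The names counter_rate and Histogram are hypothetical and written for illustration; this is not the Prometheus client API, only an approximation of the same technique (linear interpolation inside a bucket).

```python
import math
from bisect import bisect_left


def counter_rate(samples):
    """Per-second rate from (timestamp_s, value) counter samples, tolerating resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Toy bucketed histogram (illustrative only)."""

    def __init__(self, upper_bounds):
        # Finite upper bounds plus a catch-all +Inf bucket.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1) by interpolating within a bucket."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be between 0 and 1")
        if self.total == 0:
            return math.nan
        rank = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if math.isinf(self.bounds[i]):
                    # No upper edge to interpolate towards: report the last finite bound.
                    return self.bounds[i - 1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                return lower + (self.bounds[i] - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-2]  # not reached for 0 <= q <= 1
```

The estimate is only as precise as the bucket boundaries, which is why bucket choice matters for latency percentiles.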

The Four Golden Signals (Google SRE)

Latency: response time
Traffic: request volume
Errors: error rate
Saturation: how "full" the system is
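
As a rough illustration, the four signals can be derived from one window of request records. The sketch below assumes a hypothetical Request record, treats HTTP 5xx responses as errors, and uses in-flight requests against a configured limit as the saturation measure; real services will pick their own definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: int  # response time in milliseconds
    status: int       # HTTP status code


def golden_signals(requests, window_s, in_flight, max_in_flight):
    """Summarize one observation window into the four golden signals."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n).
    p95_rank = (95 * n + 99) // 100
    return {
        "latency_p95_ms": durations[p95_rank - 1] if n else 0,
        "traffic_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "saturation": in_flight / max_in_flight,
    }
```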

Typical stack

Application → Prometheus (collection, storage, alert rules) → Grafana (visualization)
Prometheus → Alertmanager (alert routing and notifications)
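
On the application side, Prometheus collects metrics by scraping an HTTP endpoint that serves them in its text exposition format. The following is a minimal standard-library sketch of such an endpoint. REGISTRY, render_metrics, MetricsHandler, the /metrics path, and port 8000 are assumptions made for illustration; a real service would normally use an official client library, which also handles label escaping (omitted here).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory registry: name -> (type, help text, [(labels, value), ...]).
REGISTRY = {
    "http_requests_total": (
        "counter",
        "Total HTTP requests.",
        [({"method": "GET", "code": "200"}, 1027)],
    ),
}


def render_metrics(registry):
    """Render the registry as Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, samples) in registry.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            series = f"{name}{{{label_str}}}" if label_str else name
            lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REGISTRY).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus would be configured to scrape this port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Grafana then queries Prometheus, not the application, for the stored series.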

Best practices

USE method for resources: Utilization, Saturation, Errors
RED method for services: Rate, Errors, Duration
Per-service dashboards showing the four golden signals
Alerts based on SLOs, not arbitrary metrics
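
One common way to base alerts on an SLO is to track the error-budget burn rate: the observed error ratio divided by the ratio the SLO allows. The sketch below is an illustration under stated assumptions, not a prescribed policy. The function names are hypothetical, and the default threshold of 14.4 is an example value (it corresponds to spending 2% of a 30-day budget in one hour).

```python
def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both the long and the short window burn faster than the threshold.

    Each window is an (errors, total) tuple. Requiring the short window too
    stops the alert from firing long after the problem has already ended.
    """
    return (
        burn_rate(*long_window, slo_target) > threshold
        and burn_rate(*short_window, slo_target) > threshold
    )
```

For example, with a 99.9% SLO, 20 errors in 1,000 requests is a burn rate of about 20, which would page; 5 errors in 1,000 is a burn rate of about 5, which would not.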

Why it matters

What is not measured is not improved. Metrics and monitoring turn intuition into data, so teams can detect degradations before they affect users and base capacity decisions on evidence.

References

Prometheus — CNCF monitoring system.
Grafana — Visualization platform.
OpenTelemetry Metrics — OpenTelemetry, 2024. Open standard for metrics.
