Practices for configuring effective alerts that surface real problems without generating fatigue from excessive notifications.
Alerting strategies are systematic methodologies for configuring notifications that detect real problems without generating fatigue from excessive noise. An effective strategy balances early incident detection with preserving team attention for problems that truly require human intervention.
Modern alerting goes beyond simple static thresholds. It incorporates business context, historical patterns, and user impact metrics to determine when and how to notify. Mature organizations treat alerts as an internal product, with owners, quality metrics, and continuous improvement processes.
The difference between reactive and proactive alerting lies in the ability to predict problems before they affect users. This requires understanding system failure patterns and configuring alerts that detect early symptoms, not just final effects.
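As an illustration of proactive alerting, one common pattern is to alert on projected exhaustion rather than on the current value. The sketch below uses PromQL's `predict_linear` to warn while there is still time to act; the metric comes from node_exporter, and the windows and threshold are illustrative assumptions, not values from this article:

```yaml
# Hypothetical proactive alert: warn before the disk actually fills,
# based on the trend of the last 6 hours (node_exporter metrics assumed)
- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk projected to fill within 4 hours"
```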
Every alert must have a clear and specific action. If the typical response is "check tomorrow," it's not an alert; it's a dashboard metric. Critical alerts must include enough context to act immediately: a summary of the impact, the first remediation steps, and a link to the relevant runbook (see the connection pool example later in this section).
Alerting on symptoms (high latency, 5xx errors) rather than causes (high CPU, low memory) reduces false positives. A service can run with high CPU without affecting users, but sustained high latency or a rising error rate always reflects a real, user-facing problem.
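As a sketch of the contrast, the first rule below alerts on a user-facing symptom (the 5xx error ratio), while the second alerts on a cause that may or may not matter; metric names and thresholds are illustrative assumptions:

```yaml
# Symptom-based: fires only when users actually see errors
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical

# Cause-based: high CPU alone does not prove user impact;
# usually better suited to a dashboard or a low-priority ticket
- alert: HighCPU
  expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
  labels:
    severity: info
```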
The most effective alerts are based on SLOs and error budgets. When the error budget is consumed faster than expected, it's time to alert. This directly connects alerts with business objectives.
```yaml
# PagerDuty routing example
routing_rules:
  - conditions:
      - field: "service"
        operator: "is"
        value: "payment-api"
      - field: "severity"
        operator: "is"
        value: "critical"
    actions:
      - route:
          type: "escalation_policy"
          target: "payments-team-critical"
  - conditions:
      - field: "component"
        operator: "contains"
        value: "database"
    actions:
      - route:
          type: "escalation_policy"
          target: "infrastructure-team"
      - suppress:
          duration: "PT5M"  # 5 minutes
```

Escalation should be predictable and documented; the OpsGenie configuration later in this section shows one way to encode escalation delays explicitly.
Burn rate measures how fast the error budget is consumed. Multi-window alerts detect both acute and chronic problems:
```yaml
# Prometheus configuration
groups:
  - name: slo.rules
    rules:
      - alert: ErrorBudgetBurn
        expr: |
          (
            slo:sli_error:ratio_rate1h > (14.4 * 0.001)
            and
            slo:sli_error:ratio_rate5m > (14.4 * 0.001)
          )
          or
          (
            slo:sli_error:ratio_rate6h > (6 * 0.001)
            and
            slo:sli_error:ratio_rate30m > (6 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "SLO {{ $labels.slo }} is burning error budget at {{ $value }}x rate"
```
To detect gradual degradation:

```yaml
- alert: LatencyTrend
  expr: |
    (
      rate(http_request_duration_seconds_sum[1h]) /
      rate(http_request_duration_seconds_count[1h])
    ) > 1.5 * (
      rate(http_request_duration_seconds_sum[24h] offset 24h) /
      rate(http_request_duration_seconds_count[24h] offset 24h)
    )
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Latency trending upward"
```

To measure alerting system health:
| Metric | Target | Description |
|---|---|---|
| MTTA (Mean Time to Acknowledge) | < 5 min | Average time to acknowledge alert |
| Alert-to-Incident Ratio | > 0.7 | % of alerts that become real incidents |
| False Positive Rate | < 10% | % of alerts requiring no action |
| Alert Volume per Engineer | < 50/week | Alerts received per on-call engineer |
| Escalation Rate | < 20% | % of alerts requiring escalation |
```yaml
# Prometheus metrics for fatigue
- record: alerting:false_positive_rate
  expr: |
    (
      sum(rate(alertmanager_alerts_received_total[24h])) -
      sum(rate(incident_created_total[24h]))
    ) / sum(rate(alertmanager_alerts_received_total[24h]))
# Median acknowledgment time, used here as a proxy for MTTA
- record: alerting:mtta_seconds
  expr: |
    histogram_quantile(0.5,
      rate(alert_acknowledgment_duration_seconds_bucket[24h])
    )
```
Runbook links and first remediation steps should live in the alert itself, so the on-call engineer doesn't have to search for them:

```yaml
# Alert with integrated runbook
- alert: DatabaseConnectionPool
  expr: db_connection_pool_active / db_connection_pool_max > 0.8
  for: 5m
  labels:
    severity: warning
    team: infrastructure
    runbook: "https://runbooks.company.com/db-connection-pool"
  annotations:
    summary: "Database connection pool utilization high"
    description: |
      Connection pool for {{ $labels.database }} is {{ $value | humanizePercentage }} full.
      Immediate actions:
      1. Check for connection leaks: kubectl logs -f deployment/api-server | grep "connection"
      2. Scale application if needed: kubectl scale deployment/api-server --replicas=6
      3. Monitor pool recovery: grafana.company.com/d/db-pool
      Full runbook: {{ $labels.runbook }}
```

For frequent alerts, automate the first steps:
```yaml
# GitHub Actions triggered by webhook
name: Auto-remediation
on:
  repository_dispatch:
    types: [pagerduty-alert]
jobs:
  auto-scale:
    if: contains(github.event.client_payload.alert.summary, 'High CPU')
    runs-on: ubuntu-latest
    steps:
      - name: Scale deployment
        run: |
          kubectl scale deployment/${{ github.event.client_payload.service }} \
            --replicas=$(( $(kubectl get deployment/${{ github.event.client_payload.service }} -o jsonpath='{.spec.replicas}') + 2 ))
```

Each severity level maps to a response-time expectation and a notification channel:

| Severity | Response time | Channel | Example |
|---|---|---|---|
| P0 - Critical | Immediate (wake up) | PagerDuty + SMS + Call | Service completely down |
| P1 - High | 15 minutes | PagerDuty + Slack | Severe performance degradation |
| P2 - Medium | 2 hours | Slack + Email | Concerning trend |
| P3 - Low | Next business day | Email + Dashboard | Preventive maintenance |
| P4 - Info | When convenient | Dashboard only | Capacity metrics |
```yaml
# OpsGenie configuration
teams:
  - name: "backend-team"
    escalations:
      - type: "user"
        delay: "PT0M"
        users: ["oncall-primary"]
      - type: "user"
        delay: "PT15M"
        users: ["oncall-secondary"]
      - type: "team"
        delay: "PT30M"
        teams: ["engineering-leads"]
integrations:
  - type: "prometheus"
    url: "https://prometheus.company.com"
    routing:
      critical: "immediate"
      warning: "business-hours"
      info: "suppress"
```

A common anti-pattern is the alert that is always ignored but never deleted. Identify these with a query such as:
```sql
-- Query to find ignored alerts
SELECT
  alert_name,
  COUNT(*) AS total_alerts,
  COUNT(CASE WHEN acknowledged = false THEN 1 END) AS ignored_count,
  (COUNT(CASE WHEN acknowledged = false THEN 1 END) * 100.0 / COUNT(*)) AS ignore_rate
FROM alert_history
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
-- Column aliases can't be used in HAVING, so the expression is repeated
HAVING (COUNT(CASE WHEN acknowledged = false THEN 1 END) * 100.0 / COUNT(*)) > 80
ORDER BY ignore_rate DESC;
```

Use historical percentiles instead of "round" numbers:
```yaml
# Bad: arbitrary threshold
- alert: HighLatency
  expr: http_request_duration_seconds > 1.0

# Good: based on historical percentiles
- alert: LatencyAnomaly
  expr: |
    http_request_duration_seconds:p99 >
      1.5 * avg_over_time(http_request_duration_seconds:p99[7d] offset 7d)
```

From a staff+ engineering perspective, alerting strategies are fundamental to organizational scalability. A poorly designed alerting system doesn't just cause fatigue; it erodes trust in observability and creates a reactive culture where engineers learn to ignore important signals.
The real cost of misconfigured alerts manifests in three dimensions: degraded response time during real incidents, on-call team burnout, and loss of context when alerts don't provide actionable information. Organizations that invest in mature alerting strategies can scale their teams without proportionally increasing on-call cognitive load.
The difference between junior and senior teams is evident in how they treat alerts: mature teams see them as an internal product requiring product management, quality metrics, and continuous iteration. This mindset is essential for maintaining operational effectiveness as systems grow in complexity.
Related concepts:

- Metrics: Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.
- Site Reliability Engineering (SRE): Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
- Service Level Objectives: Framework for defining, measuring, and communicating service reliability through service level objectives (SLOs), indicators (SLIs), and agreements (SLAs).
- Observability: Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.
- Incident Management: Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.