Alerting Strategies
Practices for configuring alerts that notify teams of real problems without causing fatigue from excessive notifications.
seed · #alerting #monitoring #pagerduty #on-call #sre #notifications
What it is
Alerting is the process of notifying teams when something requires attention. Poorly configured alerts fail in one of two ways: too many alerts cause fatigue, and too few cause missed incidents.
Principles
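- Actionable: every alert should map to a clear action the responder can take
- Symptom-based: alert on user-visible impact, not on underlying causes
- SLO-based: alert when the error budget is being consumed faster than the SLO allows
- Escalation: if an alert is not acknowledged in time, escalate automatically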
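The SLO-based principle is often implemented as a burn-rate check: how fast the error budget is being spent relative to the rate the SLO allows. The following is a minimal sketch, not a reference implementation. The `Window` type, the function names, and the two-window rule are assumptions made for illustration, and the request counts would come from whatever metrics backend is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts over one lookback window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO period;
    higher values mean it runs out sooner.
    """
    if window.total == 0:
        return 0.0  # no traffic, so nothing is being burned
    error_ratio = window.errors / window.total
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    return error_ratio / budget


def should_alert(long_window: Window, short_window: Window,
                 slo_target: float, threshold: float) -> bool:
    """Fire only when both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)
```

For example, with a 99.9% SLO, `should_alert(Window(100_000, 2_000), Window(5_000, 100), 0.999, 14.4)` evaluates to `True`, because both windows show a 2% error ratio against a 0.1% budget. The threshold of 14.4 is an illustrative value; the right number depends on the SLO period and how much budget a single incident is allowed to consume.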
Severities
| Severity | Response | Example |
|---|---|---|
| Critical | Immediate (wake up) | Service down |
| Warning | Business hours | Performance degradation |
| Info | Review when possible | Concerning trend |
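The severity table can be read as a routing policy. The sketch below shows one way to encode it, assuming hypothetical channel names (`page`, `ticket`, `queue`, `digest`) and an assumed definition of business hours (Monday to Friday, 09:00 to 17:59 local time). Neither comes from a specific tool.

```python
from datetime import datetime
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


def is_business_hours(now: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 to 17:59 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 18


def route(severity: Severity, now: datetime) -> str:
    """Map a severity to a delivery channel (channel names are placeholders)."""
    if severity is Severity.CRITICAL:
        return "page"  # immediate, regardless of time of day
    if severity is Severity.WARNING:
        # Deliver now during business hours, otherwise hold until they start.
        return "ticket" if is_business_hours(now) else "queue"
    return "digest"  # reviewed when possible
```

Keeping the policy in one small function makes it easy to review whether a given alert really deserves to wake someone up.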
Anti-patterns
| Anti-pattern | Consequence | Solution |
|---|---|---|
| Alerts that are always ignored | Alert fatigue, real alerts get missed | Delete or convert to dashboard |
| Alerts without a runbook | Slow response, depends on tribal knowledge | Link a runbook to each alert |
| Alerts on internal metrics | Noise without visible user impact | Alert on symptoms, not causes |
| Arbitrary thresholds | Frequent false positives | Base thresholds on SLOs |
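Several of these anti-patterns can be caught mechanically by reviewing alert definitions. The following is a rough sketch of such a check. The `AlertRule` fields (`runbook_url`, `slo`, and the 30-day fire and action counts) are hypothetical and would need to be mapped to whatever the alerting system actually exposes.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical alert definition with basic usage statistics."""
    name: str
    runbook_url: str = ""
    slo: str = ""               # name of the SLO the threshold derives from
    fired_last_30d: int = 0
    acted_on_last_30d: int = 0  # times a responder took an action


def lint(rule: AlertRule) -> list[str]:
    """Return the anti-patterns found in a single alert rule."""
    findings = []
    if not rule.runbook_url:
        findings.append("missing runbook")
    if not rule.slo:
        findings.append("threshold not tied to an SLO")
    if rule.fired_last_30d > 0 and rule.acted_on_last_30d == 0:
        findings.append("always ignored: delete or move to a dashboard")
    return findings
```

Running `lint(AlertRule("cpu_high", fired_last_30d=40))` reports all three findings, which suggests the rule is a candidate for removal rather than tuning.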
Why it matters
Poorly designed alerts cause fatigue, and fatigue causes real alerts to be ignored. An effective alerting strategy is the difference between detecting an incident in minutes and learning about it from users hours later.
References
- My Philosophy on Alerting — Rob Ewaschuk, Google.
- Monitoring Distributed Systems — Google SRE Book, 2016. Chapter on monitoring and alerting.
- Grafana Alerting — Grafana, 2024. Unified alerting system.