
Alerting Strategies

Practices for configuring effective alerts that flag real problems without causing fatigue from excessive notifications.

seed #alerting #monitoring #pagerduty #on-call #sre #notifications

What it is

Alerting is the process of notifying teams when something requires attention. Poorly configured alerts cause fatigue (too many) or missed incidents (too few).

Principles

  • Actionable: every alert should map to a clear action for the responder
  • Symptom-based: alert on user-visible impact, not on underlying causes
  • SLO-based: alert when the error budget is being consumed faster than the SLO allows
  • Escalation: if an alert is not acknowledged in time, escalate automatically
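
One way to make the SLO-based principle concrete is a burn-rate check. The sketch below is illustrative only: the function names, the 99.9% target, and the 14.4 threshold are assumptions, not values from this note. A burn rate of 1.0 means the error budget would be used up exactly at the end of the SLO period. A burn rate of 14.4 sustained for one hour uses about 2% of a 30-day budget, which is a commonly cited paging threshold.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is spent.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo_target:  availability objective, e.g. 0.999 (assumed example value).
    """
    budget = 1.0 - slo_target  # allowed fraction of failures
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_errors: float,
    short_window_errors: float,
    slo_target: float = 0.999,   # hypothetical SLO
    threshold: float = 14.4,     # hypothetical paging threshold
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )
```

For example, with a 99.9% target, a 2% error ratio in both windows gives a burn rate of about 20 and would page, while a 2% error ratio in the long window with 0.1% in the short window would not, because the problem has already subsided.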

Severities

Severity | Response | Example
---------|----------|--------
Critical | Immediate (wake up) | Service down
Warning | Business hours | Performance degradation
Info | Review when possible | Concerning trend
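
The severity levels above could be encoded as a routing policy along these lines. This is a minimal sketch, not the configuration of any particular paging tool: the channel names, the `Route` type, and the 15-minute escalation timeout are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Route:
    channel: str                          # hypothetical destination name
    page_immediately: bool                # wake someone up?
    escalate_after_minutes: Optional[int]  # None means never escalate


# Assumed mapping that mirrors the severity table.
ROUTES = {
    Severity.CRITICAL: Route("pager", True, 15),
    Severity.WARNING: Route("business-hours-queue", False, None),
    Severity.INFO: Route("review-backlog", False, None),
}


def route_alert(severity: Severity) -> Route:
    """Look up where an alert of the given severity should go."""
    return ROUTES[severity]


def needs_escalation(severity: Severity, minutes_unacknowledged: int) -> bool:
    """True when an unacknowledged alert has exceeded its escalation timeout."""
    timeout = route_alert(severity).escalate_after_minutes
    return timeout is not None and minutes_unacknowledged >= timeout
```

Keeping the mapping in one table makes it easy to review whether each severity really deserves its response, which is the point of defining severities in the first place.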

Anti-patterns

Anti-pattern | Consequence | Solution
-------------|-------------|---------
Alerts that are always ignored | Alert fatigue, real alerts get missed | Delete or convert to dashboard
Alerts without runbook | Slow response, depends on tribal knowledge | Link runbook to each alert
Alerts for internal metrics | Noise without visible user impact | Alert on symptoms, not causes
Arbitrary thresholds | Frequent false positives | Base thresholds on SLOs
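
Two of these anti-patterns can be checked mechanically. The sketch below audits a list of alert rules for missing runbooks and for alerts that fire often but are rarely acted on. The `AlertRule` fields, the 50% action-ratio cutoff, and the example URL are assumptions for illustration; real alerting systems expose this data in their own formats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    name: str
    runbook_url: Optional[str]  # link to the response procedure, if any
    times_fired: int            # firings in the review period
    times_acted_on: int         # firings that led to a human action


def audit(rule: AlertRule, min_action_ratio: float = 0.5) -> List[str]:
    """Return a list of anti-pattern findings for one alert rule."""
    findings = []
    if not rule.runbook_url:
        findings.append("missing runbook")
    if rule.times_fired > 0:
        ratio = rule.times_acted_on / rule.times_fired
        if ratio < min_action_ratio:
            findings.append("mostly ignored: delete or convert to dashboard")
    return findings


if __name__ == "__main__":
    rules = [
        AlertRule("HighCPU", None, times_fired=40, times_acted_on=2),
        AlertRule("CheckoutErrors", "https://runbooks.example/checkout", 4, 4),
    ]
    for rule in rules:
        print(rule.name, audit(rule))
```

Running a check like this during a periodic alert review gives a concrete list of candidates to delete, demote to a dashboard, or document.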

Why it matters

Poorly designed alerts cause fatigue, and fatigue causes real alerts to be ignored. An effective alerting strategy is the difference between detecting an incident in minutes and learning about it from users hours later.
