
Alerting Strategies

Practices for configuring effective alerts that surface real problems without generating fatigue from excessive notifications.

evergreen · #alerting #monitoring #pagerduty #on-call #sre #notifications

What it is

Alerting strategies are systematic methodologies for configuring notifications that detect real problems without generating fatigue from excessive noise. An effective strategy balances early incident detection with preserving team attention for problems that truly require human intervention.

Modern alerting goes beyond simple static thresholds. It incorporates business context, historical patterns, and user impact metrics to determine when and how to notify. Mature organizations treat alerts as an internal product, with owners, quality metrics, and continuous improvement processes.

The difference between reactive and proactive alerting lies in the ability to predict problems before they affect users. This requires understanding system failure patterns and configuring alerts that detect early symptoms, not just final effects.
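
In practice, this can be as simple as alerting on a projection instead of a current value. A minimal sketch using Prometheus's predict_linear, assuming standard node_exporter metrics (the 6h lookback and 4h horizon are illustrative choices, not recommendations):

# Proactive: fire when the disk is predicted to fill, not when it is already full
# (node_filesystem_avail_bytes comes from node_exporter; horizon is illustrative)
- alert: DiskWillFillSoon
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem on {{ $labels.instance }} predicted to fill within 4 hours"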

Fundamental principles

Actionable alerts

Every alert must have a clear and specific action. If the typical response is "check tomorrow," it's not an alert; it's a dashboard metric. Critical alerts must include the following (see the sketch after this list):

  • Specific user or business impact
  • Immediate mitigation steps
  • Direct link to corresponding runbook
  • Sufficient context to make decisions without additional investigation
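
A minimal sketch of those four elements on a single Prometheus alert; the checkout service, the 5% threshold, and the URLs are all hypothetical:

# Hypothetical example: everything needed to act without extra digging
- alert: CheckoutErrorRateHigh
  expr: |
    sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
    runbook: "https://runbooks.example.com/checkout-errors"
  annotations:
    summary: "Checkout 5xx rate above 5%: users cannot complete purchases"
    description: |
      Impact: checkout requests are failing for real users.
      First step: roll back the most recent checkout deploy.
      Runbook: {{ $labels.runbook }}
      Dashboard: https://grafana.example.com/d/checkout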

Symptom-based, not cause-based

Alerting on symptoms (high latency, 5xx errors) instead of causes (high CPU, low memory) reduces false positives. A service can have high CPU without affecting users, but high latency always indicates a real problem.
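
A sketch of the contrast, using common Prometheus metric names; both thresholds are illustrative:

# Cause-based (noisy): CPU can be saturated with zero user impact
- alert: HighCPU
  expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9

# Symptom-based (actionable): p99 latency is what users actually feel
- alert: HighLatencyP99
  expr: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 0.5
  for: 10m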

Aligned with SLOs

The most effective alerts are based on SLOs and error budgets. When the error budget is consumed faster than expected, it's time to alert. This directly connects alerts with business objectives.

Routing and escalation

Intelligent routing configuration

# PagerDuty routing example
routing_rules:
  - conditions:
      - field: "service"
        operator: "is"
        value: "payment-api"
      - field: "severity"
        operator: "is"
        value: "critical"
    actions:
      - route:
          type: "escalation_policy"
          target: "payments-team-critical"
  
  - conditions:
      - field: "component"
        operator: "contains"
        value: "database"
    actions:
      - route:
          type: "escalation_policy"
          target: "infrastructure-team"
      - suppress:
          duration: "PT5M"  # 5 minutes

Automatic escalation

Escalation should be predictable and documented:

  1. Level 1 (0-15 min): Primary on-call engineer
  2. Level 2 (15-30 min): Team technical lead
  3. Level 3 (30+ min): Engineering manager
  4. Level 4 (1+ hour): Executive escalation

SLO-based alerting

Burn rate alerting

Burn rate measures how fast the error budget is consumed. Multi-window alerts detect both acute and chronic problems:

# Prometheus configuration: multi-window, multi-burn-rate for a 99.9% SLO
# (0.001 is the error budget, i.e. 1 - 0.999; the slo:sli_error:* recording
# rules are assumed to exist)
groups:
- name: slo.rules
  rules:
  - alert: ErrorBudgetBurn
    expr: |
      (
        # Fast burn: 14.4x consumes 2% of a 30-day budget in 1h (0.02 * 720h / 1h)
        slo:sli_error:ratio_rate1h > (14.4 * 0.001)
        and
        slo:sli_error:ratio_rate5m > (14.4 * 0.001)
      )
      or
      (
        # Slow burn: 6x consumes 5% of a 30-day budget in 6h (0.05 * 720h / 6h)
        slo:sli_error:ratio_rate6h > (6 * 0.001)
        and
        slo:sli_error:ratio_rate30m > (6 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burning too fast"
      description: "SLO {{ $labels.slo }} is consuming error budget too fast (current error ratio: {{ $value }})"

Trend alerts

To detect gradual degradation:

# Average latency over the last hour vs. 1.5x yesterday's 24h average
- alert: LatencyTrend
  expr: |
    (
      rate(http_request_duration_seconds_sum[1h]) /
      rate(http_request_duration_seconds_count[1h])
    ) > 1.5 * (
      rate(http_request_duration_seconds_sum[24h] offset 24h) /
      rate(http_request_duration_seconds_count[24h] offset 24h)
    )
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Latency trending upward"

Alert fatigue metrics

Key indicators

To measure alerting system health:

| Metric | Target | Description |
|---|---|---|
| MTTA (Mean Time to Acknowledge) | < 5 min | Average time to acknowledge an alert |
| Alert-to-Incident Ratio | > 0.7 | % of alerts that become real incidents |
| False Positive Rate | < 10% | % of alerts requiring no action |
| Alert Volume per Engineer | < 50/week | Alerts received per on-call engineer |
| Escalation Rate | < 20% | % of alerts requiring escalation |

Alert health dashboard

# Prometheus recording rules for fatigue metrics
# (incident_created_total and alert_acknowledgment_duration_seconds_bucket
# assume custom instrumentation; alertmanager_alerts_received_total is built in)
- record: alerting:false_positive_rate
  expr: |
    (
      sum(rate(alertmanager_alerts_received_total[24h])) -
      sum(rate(incident_created_total[24h]))
    ) / sum(rate(alertmanager_alerts_received_total[24h]))

# Median (p50) time-to-acknowledge; a true mean would divide _sum by _count
- record: alerting:mtta_seconds
  expr: |
    histogram_quantile(0.5,
      rate(alert_acknowledgment_duration_seconds_bucket[24h])
    )

Runbook integration

Linked runbook structure

# Alert with integrated runbook
- alert: DatabaseConnectionPool
  expr: db_connection_pool_active / db_connection_pool_max > 0.8
  for: 5m
  labels:
    severity: warning
    team: infrastructure
    runbook: "https://runbooks.company.com/db-connection-pool"
  annotations:
    summary: "Database connection pool utilization high"
    description: |
      Connection pool for {{ $labels.database }} is {{ $value | humanizePercentage }} full.
      
      Immediate actions:
      1. Check for connection leaks: kubectl logs -f deployment/api-server | grep "connection"
      2. Scale application if needed: kubectl scale deployment/api-server --replicas=6
      3. Monitor pool recovery: grafana.company.com/d/db-pool
      
      Full runbook: {{ $labels.runbook }}

Runbook automation

For frequent alerts, automate the first steps:

# GitHub Actions triggered by webhook
# (assumes a PagerDuty webhook emitting repository_dispatch events and a
# runner already authenticated against the cluster for kubectl)
name: Auto-remediation
on:
  repository_dispatch:
    types: [pagerduty-alert]

jobs:
  auto-scale:
    if: contains(github.event.client_payload.alert.summary, 'High CPU')
    runs-on: ubuntu-latest
    steps:
      - name: Scale deployment
        run: |
          # Add two replicas on top of the deployment's current count
          kubectl scale deployment/${{ github.event.client_payload.service }} \
            --replicas=$(( $(kubectl get deployment/${{ github.event.client_payload.service }} -o jsonpath='{.spec.replicas}') + 2 ))

Severities and channels

Severity matrix

| Severity | Response time | Channel | Example |
|---|---|---|---|
| P0 - Critical | Immediate (wake up) | PagerDuty + SMS + Call | Service completely down |
| P1 - High | 15 minutes | PagerDuty + Slack | Severe performance degradation |
| P2 - Medium | 2 hours | Slack + Email | Concerning trend |
| P3 - Low | Next business day | Email + Dashboard | Preventive maintenance |
| P4 - Info | When convenient | Dashboard only | Capacity metrics |
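
One way to wire this matrix into Alertmanager routing, sketched with hypothetical receiver names (the integration details such as pagerduty_configs and slack_configs are omitted):

# Alertmanager routing tree mapping severity labels to the channels above
route:
  receiver: email-dashboard            # default: P3/P4
  routes:
    - matchers:
        - 'severity = "critical"'      # P0/P1
      receiver: pagerduty-oncall
    - matchers:
        - 'severity = "warning"'       # P2
      receiver: slack-team
receivers:
  - name: pagerduty-oncall
  - name: slack-team
  - name: email-dashboard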

Channel configuration

# OpsGenie configuration
teams:
  - name: "backend-team"
    escalations:
      - type: "user"
        delay: "PT0M"
        users: ["oncall-primary"]
      - type: "user" 
        delay: "PT15M"
        users: ["oncall-secondary"]
      - type: "team"
        delay: "PT30M"
        teams: ["engineering-leads"]
 
integrations:
  - type: "prometheus"
    url: "https://prometheus.company.com"
    routing:
      critical: "immediate"
      warning: "business-hours"
      info: "suppress"

Common anti-patterns

Zombie alerts

Alerts that are consistently ignored but never deleted. Identify them with a query like:

-- Query to find ignored alerts (PostgreSQL syntax; assumes an alert_history table)
SELECT
  alert_name,
  COUNT(*) AS total_alerts,
  COUNT(CASE WHEN acknowledged = false THEN 1 END) AS ignored_count,
  (COUNT(CASE WHEN acknowledged = false THEN 1 END) * 100.0 / COUNT(*)) AS ignore_rate
FROM alert_history
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
-- Column aliases are not visible in HAVING, so the expression is repeated
HAVING (COUNT(CASE WHEN acknowledged = false THEN 1 END) * 100.0 / COUNT(*)) > 80
ORDER BY ignore_rate DESC;

Arbitrary thresholds

Use historical percentiles instead of "round" numbers:

# Bad: arbitrary static threshold on a raw metric
- alert: HighLatency
  expr: http_request_duration_seconds > 1.0

# Good: current p99 vs. 1.5x the previous week's average
# (assumes a recording rule http_request_duration_seconds:p99 exists)
- alert: LatencyAnomaly
  expr: |
    http_request_duration_seconds:p99 >
    1.5 * avg_over_time(http_request_duration_seconds:p99[7d] offset 7d)

Why it matters

From a staff+ engineering perspective, alerting strategies are fundamental to organizational scalability. A poorly designed alerting system doesn't just cause fatigue — it erodes trust in observability and creates a reactive culture where engineers learn to ignore important signals.

The real cost of misconfigured alerts manifests in three dimensions: degraded response time during real incidents, on-call team burnout, and loss of context when alerts don't provide actionable information. Organizations that invest in mature alerting strategies can scale their teams without proportionally increasing on-call cognitive load.

The difference between junior and senior teams is evident in how they treat alerts: mature teams see them as an internal product requiring product management, quality metrics, and continuous iteration. This mindset is essential for maintaining operational effectiveness as systems grow in complexity.

References

  • My Philosophy on Alerting — Rob Ewaschuk, Google SRE. Fundamental philosophy on effective alerting.
  • The USE Method — Brendan Gregg, 2012. Methodology for system metrics and alerting.
  • Monitoring Distributed Systems — Google SRE Book, 2016. Complete chapter on monitoring and alerting strategies.
  • PagerDuty Incident Response — PagerDuty, 2024. Complete guide to incident response and alerting.
  • Implementing SLOs — Google SRE Workbook, 2018. Practical implementation of SLO-based alerting.
  • Alerting on SLOs — Google Cloud, 2020. Practical strategies for SLO-based alerting.

Related content

  • Metrics & Monitoring

    Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.

  • Site Reliability Engineering

    Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.

  • SLOs, SLIs & SLAs

    Framework for defining, measuring, and communicating service reliability through service level objectives (SLOs), indicators (SLIs), and agreements (SLAs).

  • Observability

    Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.

  • Incident Management

    Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.
