Incident Management

Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.

seed · #incident-management #on-call #postmortem #sre #response #blameless

What it is

Incident management is the structured process for handling production problems, from detection through resolution to the learning that follows. A good process limits the impact of each incident and makes recurrence less likely.

Phases

  1. Detection: alerts, user reports, monitoring
  2. Triage: evaluate severity and impact
  3. Response: assign roles, communicate, mitigate
  4. Resolution: restore service
  5. Postmortem: analyze and learn
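
The five phases above can be read as a simple linear state machine. The following is a minimal sketch under that assumption, not a prescribed implementation: the names `Phase`, `Incident`, and `advance` are hypothetical, and real incidents may loop back (for example, from Response to Triage when severity changes), which this sketch deliberately does not model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Phase(IntEnum):
    """The five phases, in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    RESPONSE = 3
    RESOLUTION = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    title: str
    phase: Phase = Phase.DETECTION
    # (phase, UTC timestamp) pairs recorded on every transition.
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Record when the incident entered its initial phase.
        self.history.append((self.phase, datetime.now(timezone.utc)))

    def advance(self) -> Phase:
        """Move to the next phase; this sketch does not allow skipping."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase + 1)
        self.history.append((self.phase, datetime.now(timezone.utc)))
        return self.phase


incident = Incident("checkout errors")
incident.advance()  # DETECTION -> TRIAGE
```

Recording a timestamp on every transition means the phase history can later feed the postmortem timeline without anyone having to reconstruct it from memory.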

Roles during an incident

  • Incident Commander: coordinates response
  • Tech Lead: leads technical investigation
  • Communications: updates stakeholders
  • Scribe: documents timeline and actions
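
One practical check at the start of the response is whether every role has an owner. The sketch below illustrates that check; `Role` and `unfilled_roles` are hypothetical names, and the people in the usage example are placeholders.

```python
from enum import Enum


class Role(Enum):
    """The four roles listed above."""
    INCIDENT_COMMANDER = "Incident Commander"
    TECH_LEAD = "Tech Lead"
    COMMUNICATIONS = "Communications"
    SCRIBE = "Scribe"


def unfilled_roles(assignments: dict) -> list:
    """Return the roles nobody holds yet, in declaration order."""
    return [role for role in Role if not assignments.get(role)]


# Placeholder assignments: two roles are still open.
assignments = {Role.INCIDENT_COMMANDER: "ana", Role.SCRIBE: "li"}
missing = unfilled_roles(assignments)
```

In a small team one person may hold several roles; the check only asks that each role is explicitly assigned to someone.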

Blameless postmortems

The goal is to learn, not blame. Key questions:

  • What happened? (timeline)
  • Why did it happen? (5 whys)
  • How do we prevent it from happening again? (action items)
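
The three questions map naturally onto three parts of a postmortem record: a timeline, a chain of "why?" answers, and a list of action items. The following is a rough sketch of such a record, assuming hypothetical names (`Postmortem`, `ActionItem`, `log`, `duration`, `open_items`). Note that `owner` identifies who follows up on an action item, not who caused the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str          # who follows up, not who is at fault
    done: bool = False


@dataclass
class Postmortem:
    title: str
    timeline: list = field(default_factory=list)      # (datetime, event) pairs
    whys: list = field(default_factory=list)          # successive "why?" answers
    action_items: list = field(default_factory=list)  # ActionItem objects

    def log(self, when: datetime, event: str) -> None:
        """Add a timeline entry and keep the timeline in chronological order."""
        self.timeline.append((when, event))
        self.timeline.sort(key=lambda entry: entry[0])

    def duration(self) -> timedelta:
        """Time between the first and last timeline entries."""
        if not self.timeline:
            return timedelta(0)
        return self.timeline[-1][0] - self.timeline[0][0]

    def open_items(self) -> list:
        """Action items that have not been completed yet."""
        return [item for item in self.action_items if not item.done]


# Illustrative data only.
pm = Postmortem("checkout errors")
pm.log(datetime(2024, 1, 1, 10, 0), "alert fired")
pm.log(datetime(2024, 1, 1, 10, 30), "service restored")
pm.action_items.append(ActionItem("add alert on error rate", owner="ana"))
```

Keeping action items in the same record as the timeline makes it easy to see, weeks later, which follow-ups from a given incident are still open.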

Tools

  • PagerDuty, Opsgenie (on-call)
  • Statuspage (communication)
  • Jira, Linear (action item tracking)

Why it matters

How a team responds to incidents reflects its operational maturity. A clear process covering detection, triage, response, resolution, and postmortem reduces recovery time and turns every incident into an opportunity for systemic improvement.
