Incident Management
Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.
seed #incident-management #on-call #postmortem #sre #response #blameless
What it is
Incident management is the structured process for handling production problems, from detection through resolution to the learning that follows. A good process limits the impact of each incident and makes recurrence less likely.
Phases
- Detection: alerts, user reports, monitoring
- Triage: evaluate severity and impact
- Response: assign roles, communicate, mitigate
- Resolution: restore service
- Postmortem: analyze and learn
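The phases above can be read as a simple lifecycle. Below is a minimal sketch in Python, assuming a strictly linear order with no backward transitions (real incidents often loop back, for example from response to triage when severity changes). The names `Phase`, `NEXT_PHASE`, and `Incident` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    RESPONSE = "response"
    RESOLUTION = "resolution"
    POSTMORTEM = "postmortem"


# Assumption: each phase has exactly one successor (strictly linear flow).
NEXT_PHASE = {
    Phase.DETECTION: Phase.TRIAGE,
    Phase.TRIAGE: Phase.RESPONSE,
    Phase.RESPONSE: Phase.RESOLUTION,
    Phase.RESOLUTION: Phase.POSTMORTEM,
}


class Incident:
    """Tracks which phase an incident is in and the phases it has passed."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self):
        """Move to the next phase; raise if the postmortem is already reached."""
        nxt = NEXT_PHASE.get(self.phase)
        if nxt is None:
            raise ValueError(f"{self.title!r} is already in its final phase")
        self.phase = nxt
        self.history.append(nxt)
        return nxt
```

Keeping the transitions in a lookup table rather than in branching logic makes it easy to see, and later to relax, the assumed ordering.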
Roles during an incident
- Incident Commander: coordinates response
- Tech Lead: leads technical investigation
- Communications: updates stakeholders
- Scribe: documents timeline and actions
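One way to make these roles concrete is a small roster that records who holds each role and lets the Scribe append timeline entries. This is a sketch under assumptions: the class `IncidentRoster`, its methods, and the choice to log every assignment to the timeline are illustrative, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "Incident Commander"
    TECH_LEAD = "Tech Lead"
    COMMUNICATIONS = "Communications"
    SCRIBE = "Scribe"


@dataclass
class IncidentRoster:
    assignments: dict = field(default_factory=dict)  # Role -> person name
    timeline: list = field(default_factory=list)     # (timestamp, message)

    def log(self, message, at=None):
        """Append a timeline entry, timestamped in UTC unless 'at' is given."""
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, message))

    def assign(self, role, person):
        """Assign a person to a role and record the assignment."""
        self.assignments[role] = person
        self.log(f"{person} assigned as {role.value}")

    def unfilled(self):
        """Return the roles nobody holds yet, in definition order."""
        return [role for role in Role if role not in self.assignments]
```

Calling `unfilled()` early in the response is a quick check that nobody assumes someone else is coordinating or communicating.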
Blameless postmortems
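The goal is to learn, not to assign blame: the review focuses on the systems and processes that allowed the failure rather than on individual fault. Key questions: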
- What happened? (timeline)
- Why did it happen? (5 whys)
- How do we prevent it from happening again? (action items)
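The three questions map naturally onto the sections of a postmortem record. The sketch below assumes a hypothetical `Postmortem` structure in which the last entry of the "whys" chain is treated as the root cause; that convention, and all field names, are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    summary: str
    timeline: list = field(default_factory=list)      # What happened?
    whys: list = field(default_factory=list)          # Why did it happen?
    action_items: list = field(default_factory=list)  # How do we prevent it?

    def root_cause(self):
        """Assumption: the deepest 'why' recorded is the root cause."""
        return self.whys[-1] if self.whys else None

    def open_items(self):
        """Action items that have not been completed yet."""
        return [item for item in self.action_items if not item.done]

    def is_complete(self):
        """True once all three key questions have at least one answer."""
        return bool(self.timeline and self.whys and self.action_items)
```

A usage example with invented data: a postmortem whose `whys` list runs from "disk filled up" down to "no alert on disk usage" would report the latter as `root_cause()`, and `open_items()` would list the follow-up work still to be tracked.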
Tools
- PagerDuty, Opsgenie (on-call)
- Statuspage (communication)
- Jira, Linear (action item tracking)
Why it matters
How a team responds to incidents is a strong indicator of its operational maturity. A clear process covering detection, triage, communication, resolution, and postmortem reduces recovery time and turns each incident into an opportunity for systemic improvement.