Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
Site Reliability Engineering (SRE) is the discipline created by Google that applies software engineering principles to system operations. The goal: create scalable and reliable systems through automation, not manual work.
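If your SLO is 99.9%, you have a 0.1% error budget (about 43.8 minutes per month). While budget remains, the team can keep shipping changes and taking calculated risks. Once it is exhausted, the usual response is to pause risky releases and prioritize reliability work until the service is back within its SLO.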
| SLO | Error budget/month | Error budget/year | Typical profile |
|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | Internal tools, batch jobs |
| 99.9% | 43.8 minutes | 8.77 hours | Production APIs, web services |
| 99.95% | 21.9 minutes | 4.38 hours | Business-critical services |
| 99.99% | 4.38 minutes | 52.6 minutes | Payment infrastructure, auth |
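The figures in the table follow from one multiplication: allowed downtime = window length × (1 − SLO). A minimal Python sketch of that arithmetic is shown here. The function name `error_budget` and the choice of an average month of 365.25 / 12 days are assumptions made for illustration; the month length was picked because it matches the table's rounding.

```python
from datetime import timedelta

# Assumption: an "average" month of 365.25 / 12 days, which matches the table's rounding.
AVG_MONTH = timedelta(days=365.25 / 12)
YEAR = timedelta(days=365.25)


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime allowed within `window` for an availability SLO in percent."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        minutes_per_month = error_budget(slo, AVG_MONTH).total_seconds() / 60
        hours_per_year = error_budget(slo, YEAR).total_seconds() / 3600
        print(f"{slo}%: {minutes_per_month:.1f} min/month, {hours_per_year:.2f} h/year")
```

A fixed 30-day window, which is common for rolling SLO windows, gives slightly smaller monthly numbers (43.2 minutes at 99.9%).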
SRE applies software engineering principles to operations. Instead of manual processes and heroism, it defines measurable SLOs, automates incident response, and treats reliability as a feature that is designed, not something that just happens.
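To make "measurable SLOs" concrete, the sketch below computes a request-based availability SLI and the share of the error budget already consumed. The names `SloStatus` and `evaluate_slo` are hypothetical. The release rule (allow releases only while budget remains) is one possible policy, assumed here for illustration and not a prescribed standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    sli: float              # observed fraction of good requests
    budget_consumed: float  # fraction of the error budget used; may exceed 1.0
    release_allowed: bool   # hypothetical policy: release only while budget remains


def evaluate_slo(good: int, total: int, slo: float) -> SloStatus:
    """Compare observed request counts against an SLO given as a fraction (e.g. 0.999)."""
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need total > 0 and 0 <= good <= total")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1 - slo) * total   # failures the budget tolerates in this window
    actual_bad = total - good
    consumed = actual_bad / allowed_bad
    return SloStatus(sli=sli, budget_consumed=consumed, release_allowed=consumed < 1.0)


if __name__ == "__main__":
    # 500 failures out of 1,000,000 requests against a 99.9% SLO:
    # roughly half of the budget is used, so releases continue under this policy.
    print(evaluate_slo(good=999_500, total=1_000_000, slo=0.999))
```

Because the decision comes from counted requests and not from judgment calls, the same function can gate a deployment pipeline or feed a dashboard.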
DevOps: a culture and set of practices that unify development (Dev) and operations (Ops) to deliver software with greater speed, quality, and reliability. It is not a role; it is a way of working.
Observability: the ability to understand a system's internal state from its external outputs (logs, metrics, and traces), enabling problem diagnosis without direct system access.
SLOs, SLIs, and SLAs: a framework for defining, measuring, and communicating service reliability through service level objectives (SLOs), indicators (SLIs), and agreements (SLAs).
Metrics: the collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.
Incident management: processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.
Chaos engineering: the discipline of experimenting on production systems to discover weaknesses before they cause incidents, by injecting controlled failures.
Alerting: practices for configuring effective alerts that flag real problems without causing fatigue from excessive notifications.
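One common way to alert on real problems without paging on every blip is to alert on error-budget burn rate, not on raw error counts. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the paging rule (2% of a 30-day budget consumed within one hour, confirmed by a shorter window) is an example threshold, not a universal setting.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return error_ratio / (1 - slo)


def paging_threshold(budget_fraction: float, slo_window_hours: float,
                     alert_window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo: float, threshold: float) -> bool:
    """Page only if both windows burn fast: the long window shows the burn is
    significant, the short window shows it is still happening."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


if __name__ == "__main__":
    # Assumed rule: 2% of a 30-day (720 h) budget burned in 1 hour, i.e. about 14.4x.
    threshold = paging_threshold(0.02, slo_window_hours=720, alert_window_hours=1)
    print(should_page(0.02, 0.02, slo=0.999, threshold=threshold))    # sustained fast burn
    print(should_page(0.02, 0.0005, slo=0.999, threshold=threshold))  # already recovered
```

Requiring the short window to agree means a spike that has already recovered does not page anyone, which reduces notification fatigue.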