Site Reliability Engineering
Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.
seed · #sre #reliability #toil #error-budget #automation #operations
What it is
Site Reliability Engineering (SRE) is a discipline that originated at Google and applies software engineering principles to system operations. The goal: create scalable and reliable systems through automation rather than manual work.
Key concepts
- SLO (Service Level Objective): reliability target (e.g., 99.9% uptime)
- SLI (Service Level Indicator): measured metric of service behavior (e.g., fraction of successful requests) that is evaluated against the SLO
- SLA (Service Level Agreement): contractual commitment with the customer
- Error Budget: allowed error margin (100% - SLO)
- Toil: manual, repetitive, automatable work
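The relationship between SLI, SLO, and error budget can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLI; the function names `availability_sli` and `meets_slo` and the sample request counts are hypothetical, not part of any SRE tooling.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo


slo = 0.999                    # SLO: 99.9% of requests succeed
error_budget = 1 - slo         # error budget: 100% - SLO
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI={sli:.4%}  budget={error_budget:.2%}  meets SLO: {meets_slo(sli, slo)}")
```

The SLI is what you measure, the SLO is the threshold you compare it to, and the error budget is the gap between the SLO and 100%.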
Error Budget
If your SLO is 99.9%, you have a 0.1% error budget (about 43 minutes of downtime in a 30-day month). While you have budget:
- You can deploy new features
- You can take calculated risks
If it's exhausted:
- Deploy freeze
- Focus on reliability
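The policy above can be expressed as a small decision function. The following is a sketch under stated assumptions: the budget is counted in failed requests over a window, and the names `budget_remaining` and `release_policy` are hypothetical. Real policies often add thresholds or burn-rate alerts that are not shown here.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        # Assumption: no traffic means no budget has been spent.
        return 1.0
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    """Map remaining budget to the two states described above."""
    return "freeze" if remaining <= 0 else "deploy"


# 400 failures against roughly 1,000 allowed: budget left, deploys continue.
healthy = budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=400)
# 1,200 failures against roughly 1,000 allowed: budget exhausted, freeze.
exhausted = budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=1_200)

print(release_policy(healthy), release_policy(exhausted))
```

Returning a fraction rather than a boolean leaves room for intermediate rules, such as slowing releases when only a small share of the budget remains.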
SLO and error budget in practice
| SLO | Error budget/month | Error budget/year | Typical profile |
|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | Internal tools, batch jobs |
| 99.9% | 43.8 minutes | 8.77 hours | Production APIs, web services |
| 99.95% | 21.9 minutes | 4.38 hours | Business-critical services |
| 99.99% | 4.38 minutes | 52.6 minutes | Payment infrastructure, auth |
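The figures in the table follow directly from the SLO percentage. As a hedged sketch, assuming a time-based budget and an average year of 365.25 days (an assumption about how the table was derived), the helper below should reproduce the table to within rounding. The name `downtime_budget_minutes` is hypothetical.

```python
# Assumption: average year length including leap days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(slo_percent: float) -> tuple[float, float]:
    """Return allowed downtime as (minutes per month, minutes per year)."""
    budget_fraction = (100 - slo_percent) / 100
    per_year = budget_fraction * MINUTES_PER_YEAR
    return per_year / 12, per_year


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month, per_year = downtime_budget_minutes(slo)
    print(f"{slo}%: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

Each additional nine cuts the budget by a factor of ten, which is why the step from 99.9% to 99.99% is much more expensive than the percentages suggest.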
Practices
- Eliminate toil through automation
- Blameless postmortems after incidents
- Data-driven capacity planning
- Chaos engineering to test resilience
Why it matters
SRE replaces manual processes and heroics with measurable SLOs and automated incident response. It treats reliability as a feature that is designed deliberately, not something that just happens.