Site Reliability Engineering

Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.

seed · #sre #reliability #toil #error-budget #automation #operations

What it is

Site Reliability Engineering (SRE) is a discipline that originated at Google and applies software engineering principles to system operations. The goal is to build scalable, reliable systems through automation rather than manual work.

Key concepts

  • SLO (Service Level Objective): reliability target (e.g., 99.9% uptime)
  • SLI (Service Level Indicator): measured metric that an SLO sets a target for (e.g., fraction of successful requests)
  • SLA (Service Level Agreement): contractual commitment with the customer
  • Error Budget: allowed error margin (100% - SLO)
  • Toil: manual, repetitive, automatable work
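The relationship between SLI, SLO, and error budget can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Window` class and the function names are hypothetical, and it assumes a request-based availability SLI (good requests divided by total requests).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    good: int   # requests that met the success criterion
    total: int  # all valid requests in the window


def availability_sli(window: Window) -> float:
    """SLI: fraction of good requests. An empty window is treated as fully good."""
    if window.total == 0:
        return 1.0
    return window.good / window.total


def error_budget(slo: float) -> float:
    """Error budget as a fraction: 100% minus the SLO."""
    return 1.0 - slo


def meets_slo(window: Window, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo


window = Window(good=999_500, total=1_000_000)
print(f"SLI: {availability_sli(window):.4%}")
print(f"Meets 99.9% SLO: {meets_slo(window, 0.999)}")
```

The SLA does not appear in the sketch because it is a contractual commitment rather than a measurement; in practice it is usually set looser than the internal SLO.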

Error Budget

If your SLO is 99.9%, your error budget is 0.1%, or roughly 43 minutes of unavailability per month. While budget remains:

  • You can deploy new features
  • You can take calculated risks

Once the budget is exhausted:

  • Freeze feature deploys
  • Focus engineering work on reliability until the budget recovers
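The policy above can be expressed as a simple gate on the remaining budget. The sketch below is one possible shape, assuming a request-based SLI; `budget_remaining` and `release_decision` are hypothetical names, and real policies often add thresholds or exceptions that are not shown here.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good          # failures actually observed
    if allowed_bad == 0:
        # A 100% SLO leaves no budget: any failure overspends it.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def release_decision(slo: float, good: int, total: int) -> str:
    """Return 'deploy' while budget remains, otherwise 'freeze'."""
    return "deploy" if budget_remaining(slo, good, total) > 0 else "freeze"


# 500 failures against roughly 1,000 allowed: about half the budget is left.
print(release_decision(0.999, good=999_500, total=1_000_000))
# 1,100 failures against roughly 1,000 allowed: the budget is overspent.
print(release_decision(0.999, good=998_900, total=1_000_000))
```

Returning the remaining fraction rather than a bare boolean makes it possible to alert before the budget reaches zero, for example when less than a quarter remains.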

SLO and error budget in practice

SLO     Error budget/month  Error budget/year  Typical profile
99%     7.3 hours           3.65 days          Internal tools, batch jobs
99.9%   43.8 minutes        8.77 hours         Production APIs, web services
99.95%  21.9 minutes        4.38 hours         Business-critical services
99.99%  4.38 minutes        52.6 minutes       Payment infrastructure, auth
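The figures in the table follow directly from the formula budget = (100% - SLO) × period. The short sketch below recomputes them; it assumes an average year of 365.25 days and a month of one twelfth of that, which appears to be the convention the table uses. With a flat 30-day month, 99.9% would give 43.2 minutes instead. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60        # average year (assumed convention)
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # average month


def budget_minutes(slo_percent: float, period_minutes: float) -> float:
    """Minutes of allowed unavailability in a period for an SLO given in percent."""
    return (100.0 - slo_percent) / 100.0 * period_minutes


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = budget_minutes(slo, MINUTES_PER_MONTH)
    per_year = budget_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo}%  {per_month:8.2f} min/month  {per_year:9.2f} min/year")
```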

Practices

  • Eliminate toil through automation
  • Blameless postmortems after incidents
  • Data-driven capacity planning
  • Chaos engineering to test resilience

Why it matters

SRE replaces manual processes and heroics with engineering practice: it defines measurable SLOs, automates incident response, and treats reliability as a feature that is designed rather than something that just happens.
