Site Reliability Engineering

What it is

Site Reliability Engineering (SRE) is a discipline, created at Google, that applies software engineering principles to operations work. The goal: scalable, reliable systems achieved through automation rather than manual effort.

Key concepts

SLO (Service Level Objective): reliability target (e.g., 99.9% uptime)
SLI (Service Level Indicator): measured metric of service behavior (e.g., fraction of successful requests) that is compared against the SLO
SLA (Service Level Agreement): contractual commitment with the customer
Error Budget: allowed error margin (100% - SLO)
Toil: manual, repetitive, automatable work
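
To make the relationship between SLI, SLO, and error budget concrete, here is a minimal Python sketch for a request-based availability SLI. The names (evaluate_slo, SloReport) and the request counts are hypothetical and chosen only for illustration; in practice the counts would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float              # measured fraction of good requests
    slo: float              # target fraction of good requests
    budget_requests: float  # failed requests the SLO tolerates in this window
    budget_used: float      # share of the budget already spent (1.0 = all of it)

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo


def evaluate_slo(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compare a measured SLI against its SLO and report error budget usage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    failed = total_requests - good_requests
    sli = good_requests / total_requests
    # Error budget = (100% - SLO), expressed here as a number of requests.
    budget_requests = (1.0 - slo) * total_requests
    return SloReport(sli, slo, budget_requests, failed / budget_requests)


# Hypothetical window: 1,000,000 requests, 600 of which failed.
report = evaluate_slo(good_requests=999_400, total_requests=1_000_000, slo=0.999)
print(f"SLI={report.sli:.4%} met={report.slo_met} budget used={report.budget_used:.0%}")
```

In this example the SLO is still met, but more than half of the window's budget is already spent, which is the kind of signal an error budget is meant to surface.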

Error Budget

If your SLO is 99.9%, your error budget is 0.1%, or about 43 minutes of downtime in a 30-day month. While you have budget:

You can deploy new features
You can take calculated risks

Once the budget is exhausted:

Freeze feature deploys
Focus on reliability work until the budget recovers
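
The policy above could be expressed as a simple gate. This sketch assumes a time-based (downtime) SLO measured over a 30-day window. The function names and the binary deploy/freeze outcome are illustrative assumptions, not a standard API; real error budget policies are usually more nuanced.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total downtime the SLO allows over the window: (100% - SLO) * window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def remaining_budget(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after subtracting the downtime already observed."""
    return error_budget(slo, window) - downtime


def deploy_decision(slo: float, window: timedelta, downtime: timedelta) -> str:
    """Return 'deploy' while budget remains, 'freeze' once it is exhausted."""
    if remaining_budget(slo, window, downtime) > timedelta(0):
        return "deploy"
    return "freeze"


WINDOW = timedelta(days=30)  # assumed measurement window

budget = error_budget(0.999, WINDOW)
print("budget:", budget)
print("after 20 min of downtime:", deploy_decision(0.999, WINDOW, timedelta(minutes=20)))
print("after 50 min of downtime:", deploy_decision(0.999, WINDOW, timedelta(minutes=50)))
```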

SLO and error budget in practice

SLO	Error budget/month	Error budget/year	Typical profile
99%	7.3 hours	3.65 days	Internal tools, batch jobs
99.9%	43.8 minutes	8.77 hours	Production APIs, web services
99.95%	21.9 minutes	4.38 hours	Business-critical services
99.99%	4.38 minutes	52.6 minutes	Payment infrastructure, auth
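
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes an average year of 365.25 days and a month of one twelfth of that, which is consistent with the table's values. The constant and function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24           # assumed average year, including leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # assumed average month


def allowed_downtime_hours(slo: float, period_hours: float) -> float:
    """Downtime (in hours) that an SLO permits over a period of the given length."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * period_hours


def humanize(hours: float) -> str:
    """Format a duration in the most readable unit."""
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.1f} minutes"


for slo in (0.99, 0.999, 0.9995, 0.9999):
    per_month = humanize(allowed_downtime_hours(slo, HOURS_PER_MONTH))
    per_year = humanize(allowed_downtime_hours(slo, HOURS_PER_YEAR))
    print(f"{slo:.2%}\t{per_month}\t{per_year}")
```

Each additional nine cuts the budget by a factor of ten, which is why the choice of SLO should follow from what the service's users actually need.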

Practices

Eliminate toil through automation
Blameless postmortems after incidents
Data-driven capacity planning
Chaos engineering to test resilience

Why it matters

SRE replaces manual processes and heroics with engineering: it defines measurable SLOs, automates incident response, and treats reliability as a feature that is designed, not something that just happens.


