Jonatan Mata, jonmatum.com
© 2026 Jonatan Mata. All rights reserved. v2.1.1
Concepts

Site Reliability Engineering

Discipline applying software engineering principles to infrastructure operations, focusing on creating scalable and highly reliable systems.

seed  #sre #reliability #toil #error-budget #automation #operations

What it is

Site Reliability Engineering (SRE) is the discipline created by Google that applies software engineering principles to system operations. The goal: create scalable and reliable systems through automation, not manual work.

Key concepts

  • SLO (Service Level Objective): reliability target (e.g., 99.9% uptime)
  • SLI (Service Level Indicator): metric that measures actual service behavior and is compared against the SLO (e.g., fraction of successful requests)
  • SLA (Service Level Agreement): contractual commitment with the customer
  • Error Budget: allowed error margin (100% - SLO)
  • Toil: manual, repetitive, automatable work that grows with the service and adds no lasting value
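
How the SLI, SLO, and error budget relate can be shown with a small calculation. The following is a minimal sketch, assuming a request-based availability SLI (good events divided by total events). The names `SliReport` and `evaluate_sli` and the event counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliReport:
    sli: float              # measured fraction of good events, 0.0 to 1.0
    slo: float              # target fraction, e.g. 0.999 for 99.9%
    budget_fraction: float  # allowed bad fraction: 1 - SLO
    slo_met: bool           # True when the measured SLI meets the target


def evaluate_sli(good_events: int, total_events: int, slo: float) -> SliReport:
    """Compare a measured SLI with its SLO (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    sli = good_events / total_events
    return SliReport(
        sli=sli,
        slo=slo,
        budget_fraction=1.0 - slo,
        slo_met=sli >= slo,
    )


# Assumed example: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
report = evaluate_sli(good_events=999_500, total_events=1_000_000, slo=0.999)
print(report)
```

In this assumed example the measured SLI is 99.95%, so the 99.9% SLO is met and half of the error budget remains unused.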

Error Budget

If your SLO is 99.9%, your error budget is the remaining 0.1%, which is about 43 minutes of full downtime in a 30-day month. While you have budget:

  • You can deploy new features
  • You can take calculated risks

If it's exhausted:

  • Freeze feature deploys
  • Focus engineering work on reliability until the budget recovers
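
The deploy-or-freeze rule above can be sketched as a simple check. This is an illustration under stated assumptions, not a definitive policy: it assumes a time-based budget over a 30-day window, and `error_budget_status` and its parameters are hypothetical names. Real policies are often request-based and may define intermediate responses between "deploy" and "freeze".

```python
WINDOW_30_DAYS_MINUTES = 30 * 24 * 60  # assumed 30-day window: 43,200 minutes


def error_budget_status(slo: float, window_minutes: float, bad_minutes: float) -> dict:
    """Return the budget, what is left of it, and the resulting action."""
    budget_minutes = (1.0 - slo) * window_minutes
    remaining_minutes = budget_minutes - bad_minutes
    return {
        "budget_minutes": budget_minutes,
        "remaining_minutes": remaining_minutes,
        # Budget left: keep shipping. Budget exhausted: freeze deploys.
        "action": "deploy" if remaining_minutes > 0 else "freeze",
    }


# Assumed example: 99.9% SLO with 12 minutes of downtime so far this window.
status = error_budget_status(
    slo=0.999, window_minutes=WINDOW_30_DAYS_MINUTES, bad_minutes=12.0
)
print(status["action"], round(status["remaining_minutes"], 1))
```

With a 99.9% SLO the 30-day budget is about 43.2 minutes, so 12 minutes of downtime still leaves room to deploy, while 50 minutes would trigger the freeze.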

SLO and error budget in practice

| SLO | Error budget/month | Error budget/year | Typical profile |
|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | Internal tools, batch jobs |
| 99.9% | 43.8 minutes | 8.77 hours | Production APIs, web services |
| 99.95% | 21.9 minutes | 4.38 hours | Business-critical services |
| 99.99% | 4.38 minutes | 52.6 minutes | Payment infrastructure, auth |
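
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 365.25-day year and an average month of one twelfth of that year, which is what the table's values appear to imply (for example, 8.77 hours per year at 99.9%). The function name `budget_minutes` is hypothetical.

```python
from typing import Tuple

MINUTES_PER_YEAR = 365.25 * 24 * 60         # assumed average year: 525,960 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # assumed average month: 43,830 minutes


def budget_minutes(slo_percent: float) -> Tuple[float, float]:
    """Return (minutes per month, minutes per year) of allowed downtime."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    bad_fraction = (100.0 - slo_percent) / 100.0
    return bad_fraction * MINUTES_PER_MONTH, bad_fraction * MINUTES_PER_YEAR


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month, per_year = budget_minutes(slo)
    print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the target should follow what users actually need rather than default to the strictest value.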

Practices

  • Eliminate toil through automation
  • Blameless postmortems after incidents
  • Data-driven capacity planning
  • Chaos engineering to test resilience

Why it matters

SRE replaces manual processes and heroics with engineering practice: it defines measurable SLOs, automates incident response, and treats reliability as a feature that is designed rather than something that just happens.

References

  • Site Reliability Engineering — Google, free book.
  • The Site Reliability Workbook — Google, practical exercises.
  • SRE Resources — Google, 2024. Additional SRE resources including articles and presentations.

Related content

  • DevOps

    Culture and set of practices that unify development (Dev) and operations (Ops) to deliver software with greater speed, quality, and reliability. It's not a role — it's a way of working.

  • Observability

    Ability to understand a system's internal state from its external outputs: logs, metrics, and traces, enabling problem diagnosis without direct system access.

  • SLOs, SLIs & SLAs

    Framework for defining, measuring, and communicating service reliability through service level objectives (SLOs), indicators (SLIs), and agreements (SLAs).

  • Metrics & Monitoring

    Collection and visualization of numerical system measurements over time to understand performance, detect anomalies, and make data-driven decisions.

  • Incident Management

    Processes and practices for detecting, responding to, resolving, and learning from production incidents in a structured and effective way.

  • Chaos Engineering

    Discipline of experimenting on production systems to discover weaknesses before they cause incidents, by injecting controlled failures.

  • Alerting Strategies

    Practices for configuring effective alerts that notify real problems without generating fatigue from excessive notifications.
