SLOs, SLIs & SLAs
Framework for defining, measuring, and communicating service reliability through service level objectives (SLOs), indicators (SLIs), and agreements (SLAs).
seed #slo #sli #sla #reliability #metrics #sre
What it is
SLOs, SLIs, and SLAs are a framework for defining and measuring service reliability:
- SLI (Service Level Indicator): a quantitative measure of one aspect of service behavior (e.g., p99 latency)
- SLO (Service Level Objective): an internal target for the SLI (e.g., p99 latency < 200 ms)
- SLA (Service Level Agreement): a contractual commitment with consequences if it is missed (e.g., 99.9% uptime or service credits)
Relationship
SLI (what we measure) → SLO (what we want) → SLA (what we promise)
The SLO should always be stricter than the SLA, so that a missed SLO warns the team before the contractual commitment is breached.
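The relationship can be sketched in Python as a check of one measured SLI against both thresholds. This is a minimal illustration, not a standard API: the `Targets` class, the `classify` function, the status strings, and the 99.95% / 99.9% values are all assumptions made up for the example. It covers an availability-style SLI, where "stricter" means a higher number; for a latency SLI the comparison would be reversed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical thresholds for one availability SLI (fractions in [0, 1])."""
    slo: float  # internal objective, e.g. 0.9995
    sla: float  # contractual commitment, e.g. 0.999

    def __post_init__(self):
        # The SLO must be stricter than the SLA to leave a safety margin.
        if self.slo <= self.sla:
            raise ValueError("SLO should be stricter (higher) than the SLA")


def classify(sli: float, targets: Targets) -> str:
    """Compare a measured availability SLI against the SLO and the SLA."""
    if sli >= targets.slo:
        return "ok"
    if sli >= targets.sla:
        return "slo_breached"  # internal warning, contract still met
    return "sla_breached"      # contractual consequences apply


targets = Targets(slo=0.9995, sla=0.999)
print(classify(0.9997, targets))  # ok
print(classify(0.9992, targets))  # slo_breached
print(classify(0.9985, targets))  # sla_breached
```

The middle case is the reason for the margin: the team learns about the problem while the contract is still being met.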
Common SLIs
| SLI | Measurement |
|---|---|
| Availability | % of successful requests |
| Latency | Response time percentile |
| Throughput | Requests per second |
| Error rate | % of requests with errors |
| Freshness | Data age |
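As a rough sketch, most of the SLIs in the table can be computed from a batch of request records. The `Request` type, its field names, and the sample data below are hypothetical; real systems usually derive these values from a metrics backend instead of raw records. The percentile uses the nearest-rank method, which is one of several common definitions, and freshness is omitted because it depends on data timestamps instead of requests.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative."""
    latency_ms: float
    ok: bool


# All functions below assume a non-empty list of requests.

def availability(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def error_rate(requests):
    """Fraction of requests that failed."""
    return sum(not r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile, with pct in (0, 100]."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests per second over the observation window."""
    return len(requests) / window_seconds


# Ten sample requests with latencies 10..100 ms; only the slowest one failed.
sample = [Request(latency_ms=10.0 * i, ok=(i != 10)) for i in range(1, 11)]

print(availability(sample))            # 0.9
print(error_rate(sample))              # 0.1
print(latency_percentile(sample, 50))  # 50.0
print(latency_percentile(sample, 99))  # 100.0
print(throughput(sample, 5))           # 2.0
```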
Error Budget
Error budget = 100% - SLO. With an SLO of 99.9%, the budget is 0.1%, which is about 43 minutes of downtime in a 30-day month. This budget is "spent" on deploys, experiments, and failures.
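The arithmetic can be checked with a short Python sketch. The function names and the 30-day window are assumptions chosen for illustration, and the budget is expressed as downtime minutes, which is only one way to count it (request-based budgets count failed requests instead).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


# 99.9% over 30 days: 0.1% of 43,200 minutes.
print(round(error_budget_minutes(0.999), 1))    # 43.2

# After 21.6 minutes of downtime, half of the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

A remaining fraction near zero is a signal to slow down risky changes, while a large unspent budget leaves room for deploys and experiments.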
Why it matters
SLOs turn reliability into a quantifiable engineering decision. Without them, teams don't know how much reliability is enough, and they oscillate between over-investing in stability and ignoring operational debt until an incident forces them to act.