SLOs and Error Budgets: A Practical Guide

How to define meaningful SLOs, track error budgets, and use them to make real engineering decisions about reliability investment.


Why most SLOs fail before they start

SLOs fail when they are defined in isolation: by a single team, without input from product, without connection to real user impact, and without the tooling to measure them continuously. The result is a document that exists but drives no decisions.

What an error budget actually is

An error budget is the complement of your SLO: 100% minus the target. If your availability SLO is 99.9%, your error budget is 0.1%, which is roughly 43 minutes of allowable downtime in a 30-day month. The budget is consumed by anything that produces bad events: incidents, degraded performance, and failed deployments.
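
The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration, assuming a time-based availability SLO and a 30-day window; the function names are hypothetical and not taken from any particular tool.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the share of the window that may be 'bad', e.g. 0.999 -> 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Convert the budget fraction into minutes of downtime for the window.

    The 30-day default is an assumption; use whatever window your SLO names.
    """
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_target) * window_minutes


# A 99.9% target over 30 days leaves about 43.2 minutes of budget.
print(round(allowed_downtime_minutes(0.999), 1))
```

Tightening the target by one "nine" cuts the budget by a factor of ten, which is why the target should be chosen with care.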

Defining your first SLO

Start with one service, one user journey, and one metric. Pick something you can measure today with existing telemetry. Availability (successful requests / total requests) and p99 latency are the most common starting points.

  • Choose a service with clear user-facing impact
  • Define "good" requests explicitly: for example, 5xx means bad and 2xx/3xx means good, and decide up front how 4xx client errors are counted
  • Set a realistic target based on historical performance, not aspiration
  • Build a dashboard that shows current burn rate against the budget
  • Review the budget in every engineering planning session
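
The availability metric and the burn rate from the list above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the input is a plain list of HTTP status codes, 4xx responses are counted as good (an assumption you should revisit for your service), and every name in the block is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WindowCounts:
    total: int  # all requests observed in the window
    bad: int    # requests classified as bad


def is_bad(status_code: int) -> bool:
    # 5xx is bad, as in the list above. Counting 4xx as good is an
    # assumption of this sketch; change it to match your own definition.
    return status_code >= 500


def count_requests(status_codes: Iterable[int]) -> WindowCounts:
    total = 0
    bad = 0
    for code in status_codes:
        total += 1
        if is_bad(code):
            bad += 1
    return WindowCounts(total=total, bad=bad)


def availability(counts: WindowCounts) -> float:
    # Successful requests / total requests. An empty window is treated
    # as fully available, which is also an assumption.
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.bad) / counts.total


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    # Observed error rate divided by the allowed error rate.
    # 1.0 means the budget lasts exactly until the end of the SLO period;
    # above 1.0 it runs out early.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if counts.total == 0:
        return 0.0
    observed_error_rate = counts.bad / counts.total
    return observed_error_rate / (1.0 - slo_target)


counts = count_requests([200] * 9990 + [302] * 5 + [503] * 5)
print(f"availability={availability(counts):.4%}")
print(f"burn_rate={burn_rate(counts, 0.999):.2f}")
```

In the sample data, 5 bad requests out of 10,000 against a 99.9% target consume the budget at about half the sustainable pace. A dashboard would typically compute the same ratio over a sliding window of real telemetry.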

When to burn the budget deliberately

Error budgets create a shared language for tradeoffs. When the budget is healthy, teams can move faster and spend it deliberately by taking more deployment risk. When the budget is nearly exhausted, teams should slow down, freeze deployments, and invest in reliability work. Applying that rule consistently, even when it is inconvenient, is the discipline that makes the SLO drive decisions.
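
One way to make that rule explicit is to encode it as a small policy function. The sketch below is illustrative only: the three policy levels and the 25% and 10% thresholds are assumptions chosen for the example, not values from this guide, and each team should agree on its own.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "ship normally and accept more deployment risk"
    SLOW_DOWN = "reduce deployment risk and review upcoming changes"
    FREEZE = "freeze deployments and prioritize reliability work"


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget left; negative means it is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # nothing observed yet, so nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


def release_policy(
    remaining: float,
    slow_down_below: float = 0.25,  # hypothetical threshold
    freeze_below: float = 0.10,     # hypothetical threshold
) -> ReleasePolicy:
    if remaining < freeze_below:
        return ReleasePolicy.FREEZE
    if remaining < slow_down_below:
        return ReleasePolicy.SLOW_DOWN
    return ReleasePolicy.NORMAL


# 4 bad requests out of 10,000 at a 99.9% target leaves about 60% of the budget.
remaining = budget_remaining(bad_events=4, total_events=10_000, slo_target=0.999)
print(release_policy(remaining).value)
```

Writing the thresholds down ahead of time keeps the freeze decision from being renegotiated during every incident.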