SLA, SLO, SLI. Three acronyms that get used like synonyms. They aren’t. They’re a stack, and the order matters.
What each means
SLI, Service Level Indicator. The thing you measure. “99.94% of GET requests in the last 5 minutes returned successfully.” A number from telemetry that describes reality.
SLO, Service Level Objective. The target you commit to internally. “GET availability ≥ 99.95% over any 30-day window.” What your team owes its future self.
SLA, Service Level Agreement. The contract you sign with the customer. “If GET availability drops below 99.9% in a month, service credits apply.” What you owe the world when reality misses the target.
SLI is what is. SLO is what we promise ourselves. SLA is what we promise the world.
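A minimal sketch of the SLI above, assuming the telemetry pipeline already yields success and total request counts per window. The function name and the sample counts are hypothetical, chosen only to reproduce the 99.94% figure.

```python
def availability_sli(successful: int, total: int) -> float:
    """Return availability as a percentage of requests that succeeded."""
    if total == 0:
        # Assumption: a window with no traffic counts as fully available.
        # Some teams report "no data" here instead.
        return 100.0
    return 100.0 * successful / total


# Hypothetical 5-minute window: 9,994 of 10,000 GET requests succeeded.
sli = availability_sli(9_994, 10_000)
print(f"{sli:.2f}%")  # 99.94%
```

The SLO and SLA are then just thresholds this number gets compared against, over whatever window each one names.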
Why the order matters
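The math, for a healthy service measured over the same window: SLI ≥ SLO ≥ SLA. The flow: Signal → Buffer → Contract. If your SLO equals your SLA, you’ve left no buffer: the first miss of the internal target is already a breach of the contract.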
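One way to picture the ordering is a three-state check. This is a sketch, not a monitoring policy: `classify` and its labels are hypothetical, and it assumes the SLI, SLO, and SLA are all expressed as percentages over the same window.

```python
def classify(sli: float, slo: float, sla: float) -> str:
    """Place a measured SLI relative to the internal SLO and the external SLA."""
    if slo < sla:
        # The internal target must be at least as strict as the contract.
        raise ValueError("SLO must be >= SLA")
    if sli >= slo:
        return "healthy"      # meeting the internal target
    if sli >= sla:
        return "in buffer"    # SLO missed, contract still intact
    return "SLA breach"       # contract missed


# Using the example targets from above: SLO 99.95%, SLA 99.9%.
print(classify(99.97, 99.95, 99.9))  # healthy
print(classify(99.94, 99.95, 99.9))  # in buffer
print(classify(99.85, 99.95, 99.9))  # SLA breach
```

With `slo == sla`, the "in buffer" state can never be returned, which is the no-buffer case in code form.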
Object Storage across hyperscale clouds
| Provider | Service | Public SLA | Typical internal SLO |
|---|---|---|---|
| AWS | S3 Standard | 99.9% | ≥ 99.99% |
| Microsoft Azure | Blob Storage (Hot) | 99.9% | ≥ 99.99% |
| Google Cloud | Cloud Storage (multi-region) | 99.95% | ≥ 99.99% |
| Oracle Cloud | OCI Object Storage | 99.9% | ≥ 99.99% |
Three nines is about 43 minutes of allowed downtime in a 30-day month. Operating teams don’t run at that line. They target 99.99% internally, which works out to about 4 minutes. Those 4 minutes are the error budget: room to absorb noise, ship risky changes, and run experiments. The remaining 39-minute gap is the buffer that keeps a blown budget from becoming a breached contract.
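The arithmetic behind those figures, as a short sketch. It assumes a 30-day month and treats availability as a simple fraction of wall-clock time; the function name is hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(target_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of downtime a given availability target permits in the window."""
    return window_minutes * (100.0 - target_percent) / 100.0


sla_minutes = allowed_downtime_minutes(99.9)    # contract line
slo_minutes = allowed_downtime_minutes(99.99)   # internal target
gap_minutes = sla_minutes - slo_minutes         # distance between the two

print(f"{sla_minutes:.1f}")  # 43.2
print(f"{slo_minutes:.2f}")  # 4.32
print(f"{gap_minutes:.1f}")  # 38.9
```

Real SLAs usually define downtime by error rate per interval rather than wall-clock minutes, so treat these as order-of-magnitude numbers.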
Missing an SLA
Missing the SLO is a signal. Missing the SLA is an event. Service credits are owed. A root cause analysis (RCA) is due inside a defined window. A problem ticket gets opened for the systemic fix. The trust loss is the one you’ll feel longest.
Where AI helps
The best AI use cases here aren’t auto-remediation. They’re earlier:
- Spotting SLO erosion before humans would. Slow drift. P99 creeping up 4% week-over-week with no single deploy to blame.
- Correlating SLI changes with deploys, traffic, and dependency health in seconds, not the hour it takes a human at 3 AM.
- Drafting the RCA. Timeline, impacted SLOs, blast radius, customer-note draft. The engineer’s time goes to the only part that matters: what to actually fix.
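To make "slow drift" concrete, here is a deliberately simple sketch. It is a plain threshold rule, not an AI model: a stand-in for the kind of signal such a system would surface. The function name, the 4% threshold, the three-week run, and the sample latencies are all illustrative assumptions.

```python
def sustained_drift(weekly_p99_ms: list[float],
                    threshold: float = 0.04,
                    min_weeks: int = 3) -> bool:
    """True if P99 rose by at least `threshold` week-over-week
    for the last `min_weeks` consecutive weeks."""
    if len(weekly_p99_ms) < min_weeks + 1:
        return False  # not enough history to judge
    recent = weekly_p99_ms[-(min_weeks + 1):]
    # Compare each week with the one before it.
    return all(curr >= prev * (1 + threshold)
               for prev, curr in zip(recent, recent[1:]))


# Hypothetical weekly P99 latencies in milliseconds.
print(sustained_drift([100.0, 105.0, 110.5, 116.0]))  # True
print(sustained_drift([100.0, 101.0, 100.0, 102.0]))  # False
```

No single week in the first series looks alarming on its own, which is why this kind of erosion tends to slip past a human watching dashboards.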
What AI shouldn’t do is make the decisions: what the fix is, how the SLO is defined, what the error-budget policy says, how severe the incident is. Those are judgment calls. Judgment is where senior operators earn their keep.
AI augments operators. It doesn’t replace them. That’s the whole game.