For eight years I ran the SRE team behind a storage system measured in exabytes. Over time, the dashboard I checked every morning shrank to a handful of numbers. These are the seven that told me whether the service was healthy.
Availability tells you if the system is up. Durability tells you if your data is still there. The two are not the same.
Here’s the short version.
| KPI | What it measures | How we tracked it |
|---|---|---|
| Availability | Percent of requests succeeding | 99.99% per region, per service |
| Durability | Probability your data survives | 11 nines (10^-11 annual loss probability) |
| TTFB | Time to first byte returned | p50, p95, p99 latency per object size |
| Canaries | Synthetic test traffic | Continuous PUT/GET from every region |
| Hotspots | Skew across storage nodes | Top-N node load vs cluster median |
| IOPS | Operations per second | Read/write IOPS per shard, per disk |
| DB Shards | Metadata partition health | Shard CPU, lag, hot-key skew |
Availability and durability are the two non-negotiables.
Availability is uptime. Durability is whether the data survives. You can be 100% available and still lose data, and you can be 100% durable while offline. Customers care about both. We hit 11 nines of durability by writing every object to multiple availability domains with erasure coding, and proved it monthly with a recovery drill.
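To make the two targets concrete, here is a minimal arithmetic sketch. It assumes the durability figure is an annual loss probability per object and that availability is counted per request; the function names are made up for illustration.

```python
def expected_annual_loss(object_count: int, nines: int) -> float:
    """Expected number of objects lost per year.

    Assumes each object is lost independently with annual
    probability 10^-nines (an assumption, not a measured figure).
    """
    return object_count * 10.0 ** (-nines)


def allowed_failed_requests(total_requests: int, availability_pct: float) -> float:
    """Error budget: how many requests may fail and still meet the target."""
    return total_requests * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # One trillion objects at 11 nines of durability.
    print(expected_annual_loss(10**12, 11))
    # One billion requests at 99.99% availability.
    print(allowed_failed_requests(10**9, 99.99))
```

At a trillion objects, 11 nines still works out to roughly 10 expected losses a year, and a billion requests at 99.99% leaves a budget of about 100,000 failures. The two budgets are independent of each other.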
TTFB is what users actually feel.
Aggregate availability hides slow tails. A 99.99% available service with a p99 TTFB of 2 seconds feels broken. Always track latency by object-size bucket. A 10 MB read should not share an SLO with a 100-byte HEAD.
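A rough sketch of per-bucket latency tracking might look like the following. The bucket boundaries and the nearest-rank percentile method are assumptions chosen for illustration, not the ones we ran in production.

```python
import math
from collections import defaultdict

# Hypothetical size buckets: (upper bound in bytes, label).
BUCKETS = [(1024, "<=1KiB"), (1024**2, "<=1MiB"), (float("inf"), ">1MiB")]


def bucket_for(size_bytes):
    """Return the label of the first bucket whose bound covers the size."""
    for limit, label in BUCKETS:
        if size_bytes <= limit:
            return label


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def ttfb_by_bucket(samples):
    """samples: iterable of (object_size_bytes, ttfb_ms) pairs."""
    grouped = defaultdict(list)
    for size_bytes, ttfb_ms in samples:
        grouped[bucket_for(size_bytes)].append(ttfb_ms)
    return {
        label: {f"p{p}": percentile(sorted(values), p) for p in (50, 95, 99)}
        for label, values in grouped.items()
    }
```

Each bucket gets its own p50, p95, and p99, so a slow tail on large reads cannot hide behind a flood of fast small requests.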
Canaries are your truth.
Customers don’t tell you when they’re sad. They leave. Canaries are synthetic PUT/GET/LIST traffic running continuously from every region. If a canary fails for 30 seconds, you find out before your customer’s pager goes off.
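The 30-second rule can be sketched as a check over recent canary results. This is a simplified illustration, assuming each probe reports a timestamp and a pass/fail flag; the function name and data shape are hypothetical.

```python
def failing_for(results, threshold_s=30.0):
    """Return True if the latest unbroken run of failures spans
    at least threshold_s seconds.

    results: list of (timestamp_s, ok) tuples in time order.
    """
    first_failure = None
    for timestamp_s, ok in results:
        if ok:
            # Any success resets the failure streak.
            first_failure = None
        elif first_failure is None:
            first_failure = timestamp_s
    if first_failure is None:
        return False
    return results[-1][0] - first_failure >= threshold_s


if __name__ == "__main__":
    probes = [(0, True), (10, False), (20, False), (40, False)]
    print(failing_for(probes))  # failures span 10s..40s
```

Requiring a sustained streak rather than a single failed probe keeps one dropped request from paging anyone, while still firing within the window.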
Hotspots and IOPS surface the silent failures.
A storage cluster can be 99.99% available while one node is on fire. Track per-node IOPS and bytes served, and alert when the top-N nodes diverge from the cluster median. Hotspots are the leading indicator of a customer key range overwhelming a shard.
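As a minimal sketch of the top-N versus median comparison, assuming per-node IOPS arrive as a plain dictionary. The 2x ratio threshold and the node names are placeholders, not tuned values.

```python
from statistics import median


def hot_nodes(iops_by_node, top_n=3, ratio_threshold=2.0):
    """Return (node, load/median) for top-N nodes at or above the threshold."""
    if not iops_by_node:
        return []
    cluster_median = median(iops_by_node.values())
    if cluster_median <= 0:
        return []
    # Rank nodes by load and keep only the N busiest.
    ranked = sorted(iops_by_node.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [
        (node, load / cluster_median)
        for node, load in ranked
        if load / cluster_median >= ratio_threshold
    ]


if __name__ == "__main__":
    loads = {"n1": 100, "n2": 110, "n3": 90, "n4": 105, "n5": 400}
    print(hot_nodes(loads))  # only n5 is far above the median of 105
```

Comparing against the median rather than the mean matters here: a single burning node drags the mean up with it, which makes the hotspot look smaller than it is.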
DB shards are the part nobody talks about.
Object storage looks stateless, but the metadata layer is a sharded database. One hot shard, one rebalance gone wrong, and your control plane stalls. Watch shard CPU, replication lag, and hot-key skew the same way you watch the data plane.
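Here is one way the three shard signals could be checked together. It is a sketch only: the `ShardSample` shape, the thresholds, and the definition of hot-key skew as the busiest key's share of requests are all assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ShardSample:
    shard_id: str
    cpu_pct: float
    replication_lag_s: float
    key_hits: Counter = field(default_factory=Counter)


def hot_key_share(key_hits):
    """Fraction of the shard's requests that hit its single busiest key."""
    total = sum(key_hits.values())
    if total == 0:
        return 0.0
    return key_hits.most_common(1)[0][1] / total


def shard_alerts(sample, max_cpu_pct=80.0, max_lag_s=5.0, max_key_share=0.5):
    """Return the names of the signals that exceed their (placeholder) limits."""
    alerts = []
    if sample.cpu_pct > max_cpu_pct:
        alerts.append("cpu")
    if sample.replication_lag_s > max_lag_s:
        alerts.append("replication_lag")
    if hot_key_share(sample.key_hits) > max_key_share:
        alerts.append("hot_key_skew")
    return alerts
```

For example, `shard_alerts(ShardSample("s1", 90.0, 1.0, Counter({"a": 9, "b": 1})))` flags both CPU and hot-key skew, which is the pattern you would expect when one customer key is hammering a shard.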
The data plane scales. The control plane bites.
Those seven numbers, watched together, told me almost everything I needed to know about whether the service was healthy.