Storage at scale: what I actually watched

For eight years I ran the SRE team behind a storage system measured in exabytes. Over time, the dashboard I checked every morning shrank to a handful of numbers. These are the seven that told me whether the service was healthy.

Availability tells you if the system is up. Durability tells you if your data is still there. The two are not the same.

Here’s the short version.

KPI	What it measures	How we tracked it
Availability	Percent of requests succeeding	99.99% per region, per service
Durability	Probability your data survives	11 nines (about 10^-11 annual loss probability per object)
TTFB	Time to first byte returned	p50, p95, p99 latency per object size
Canaries	Synthetic test traffic	Continuous PUT/GET from every region
Hotspots	Skew across storage nodes	Top-N node load vs cluster median
IOPS	Operations per second	Read/write IOPS per shard, per disk
DB Shards	Metadata partition health	Shard CPU, lag, hot-key skew

Availability and durability are the two non-negotiables.

Availability is whether requests succeed. Durability is whether the data survives. You can be 100% available and still lose data, or 100% durable and offline. Customers care about both. We hit 11 nines of durability by writing every object to multiple availability domains with erasure coding, and proved it monthly with a recovery drill.
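To make the two targets concrete, here is a small arithmetic sketch. The function names, request volume, and object count are made up for illustration; only the 99.99% and 11-nines figures come from the table above.

```python
# Illustrative arithmetic only. The request volume and object count below
# are hypothetical; the targets are the ones quoted in the table.

def allowed_failures(total_requests: int, availability: float) -> float:
    """Failed requests an availability target tolerates over a window."""
    return total_requests * (1.0 - availability)


def expected_objects_lost_per_year(object_count: int,
                                   annual_loss_probability: float) -> float:
    """Expected objects lost per year, assuming independent per-object loss."""
    return object_count * annual_loss_probability


# 99.99% availability over one million requests: roughly 100 failures allowed.
print(allowed_failures(1_000_000, 0.9999))

# 11 nines of durability over a trillion objects: roughly 10 lost per year.
print(expected_objects_lost_per_year(10**12, 1e-11))
```

The two numbers are independent of each other: the first is a budget of failed requests, the second is a budget of lost objects, and spending one tells you nothing about the other.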

TTFB is what users actually feel.

Aggregate availability hides slow tails. A 99.99% available service with a p99 TTFB of 2 seconds feels broken. Always track latency by object size bucket. A 10 MB read should not share an SLO with a 100-byte HEAD.
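One way to sketch per-bucket latency tracking in Python. The bucket boundaries, labels, and function names are assumptions for illustration, and the percentile uses the simple nearest-rank method rather than whatever a production metrics system would do.

```python
from collections import defaultdict

# Hypothetical size buckets (upper bound in bytes, label).
BUCKETS = [
    (1024, "<=1KiB"),
    (1024 ** 2, "<=1MiB"),
    (float("inf"), ">1MiB"),
]


def bucket_for(size_bytes: int) -> str:
    """Return the label of the first bucket that fits the object size."""
    for limit, label in BUCKETS:
        if size_bytes <= limit:
            return label
    return BUCKETS[-1][1]


def percentile(sorted_values: list, p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


def ttfb_by_bucket(samples):
    """Group (size_bytes, ttfb_ms) samples by bucket and report p50/p95/p99."""
    grouped = defaultdict(list)
    for size_bytes, ttfb_ms in samples:
        grouped[bucket_for(size_bytes)].append(ttfb_ms)
    report = {}
    for label, values in grouped.items():
        values.sort()
        report[label] = {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
    return report
```

Feeding it a mix of tiny and large requests keeps the slow large reads from distorting the tail of the small ones, which is the point of separate SLOs per bucket.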

Canaries are your truth.

Customers don’t tell you when they’re sad. They leave. Canaries are synthetic PUT/GET/LIST traffic running continuously from every region. If a canary fails for 30 seconds, you find out before your customer’s pager goes off.
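A minimal canary sketch, assuming the storage client is passed in as two plain callables (`put` and `get` are hypothetical stand-ins, not a real SDK). The 30-second window is the one from the paragraph above; everything else is illustrative.

```python
import os

FAILURE_WINDOW_SECONDS = 30  # window taken from the text above


def probe_once(put, get, key: str) -> bool:
    """Write a random payload, read it back, and report whether it matched.

    `put(key, data)` and `get(key)` are hypothetical client callables.
    Any exception counts as a failed probe.
    """
    payload = os.urandom(64)
    try:
        put(key, payload)
        return get(key) == payload
    except Exception:
        return False


class CanaryState:
    """Tracks how long a canary has been failing without interruption."""

    def __init__(self, window_seconds: float = FAILURE_WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.failing_since = None

    def record(self, ok: bool, now: float) -> bool:
        """Record one probe result; return True when it is time to alert."""
        if ok:
            self.failing_since = None  # any success resets the streak
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.window_seconds
```

In use, a loop in each region would call `probe_once` every few seconds and pass the result, along with the current time, to `record`. Passing the clock in explicitly keeps the alert logic easy to test.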

Hotspots and IOPS surface the silent failures.

A storage cluster can be 99.99% available while one node is on fire. Track per-node IOPS and bytes-served, and alert on the top-N nodes diverging from cluster median. Hotspots are the leading indicator of a customer key range overwhelming a shard.
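The top-N versus median check can be sketched in a few lines. The function name, the default `top_n`, and the 2x ratio threshold are assumptions for illustration; the right threshold depends on how evenly a healthy cluster spreads load.

```python
from statistics import median


def hot_nodes(load_by_node: dict, top_n: int = 3,
              ratio_threshold: float = 2.0) -> list:
    """Return (node, ratio) for top-N nodes running far above cluster median.

    `load_by_node` maps a node name to any load figure (IOPS, bytes served).
    The defaults are illustrative, not recommended values.
    """
    if not load_by_node:
        return []
    cluster_median = median(load_by_node.values())
    if cluster_median <= 0:
        return []  # idle cluster: a ratio against zero is meaningless
    ranked = sorted(load_by_node.items(), key=lambda kv: kv[1], reverse=True)
    flagged = []
    for node, load in ranked[:top_n]:
        ratio = load / cluster_median
        if ratio >= ratio_threshold:
            flagged.append((node, ratio))
    return flagged


# Hypothetical per-node IOPS: one node is doing about four times the median.
iops = {"n1": 100, "n2": 110, "n3": 90, "n4": 105, "n5": 400}
print(hot_nodes(iops))
```

Comparing against the median rather than the mean matters here: a single burning node drags the mean up and hides itself, while the median barely moves.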

DB shards are the part nobody talks about.

Object storage looks stateless, but the metadata layer is a sharded database. One hot shard, one rebalance gone wrong, and your control plane stalls. Watch shard CPU, replication lag, and hot-key skew the same way you watch the data plane.
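A rough sketch of the same idea for the metadata layer. The `ShardStats` fields and all three thresholds are hypothetical; the hot-key share is simply the fraction of a shard's requests that hit its single busiest key.

```python
from collections import Counter
from dataclasses import dataclass


def hot_key_share(keys_seen):
    """Return (hottest_key, share_of_requests), or None if no traffic."""
    counts = Counter(keys_seen)
    if not counts:
        return None
    key, hits = counts.most_common(1)[0]
    return key, hits / sum(counts.values())


@dataclass
class ShardStats:
    """Hypothetical per-shard snapshot; field names are illustrative."""
    name: str
    cpu_fraction: float        # 0.0 to 1.0
    replication_lag_s: float   # seconds behind the primary
    hot_key_share: float       # 0.0 to 1.0, as computed above


def unhealthy_shards(shards, max_cpu=0.8, max_lag_s=5.0, max_key_share=0.5):
    """Return {shard_name: [reasons]} for shards over any assumed threshold."""
    flagged = {}
    for s in shards:
        reasons = []
        if s.cpu_fraction > max_cpu:
            reasons.append("cpu")
        if s.replication_lag_s > max_lag_s:
            reasons.append("lag")
        if s.hot_key_share > max_key_share:
            reasons.append("hot-key")
        if reasons:
            flagged[s.name] = reasons
    return flagged
```

A shard flagged for "hot-key" alone is often the earliest of the three signals, since CPU and lag tend to follow once one key range starts dominating.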

The data plane scales. The control plane bites.

Those seven numbers, watched together, told me almost everything I needed to know about whether the service was healthy.
