Storage at scale: what I actually watched

For eight years I ran the SRE team behind a storage system measured in exabytes. Over time, the dashboard I checked every morning shrank to a handful of numbers. These are the seven that told me whether the service was healthy.

Availability tells you if the system is up. Durability tells you if your data is still there. The two are not the same.

Here’s the short version.

| KPI | What it measures | How we tracked it |
| --- | --- | --- |
| Availability | Percent of requests succeeding | 99.99% per region, per service |
| Durability | Probability your data survives | 11 nines (10^-11 annual loss) |
| TTFB | Time to first byte returned | p50, p95, p99 latency per object size |
| Canaries | Synthetic test traffic | Continuous PUT/GET from every region |
| Hotspots | Skew across storage nodes | Top-N node load vs cluster median |
| IOPS | Operations per second | Read/write IOPS per shard, per disk |
| DB Shards | Metadata partition health | Shard CPU, lag, hot-key skew |

Availability and durability are the two non-negotiables.

Availability is uptime. Durability is whether the data survives. You can be 100% available and still lose data, and you can be 100% durable and offline. Customers care about both. We hit 11 nines of durability by writing every object to multiple availability domains with erasure coding, and validated it monthly with a recovery drill.
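To make the two numbers concrete, here is a minimal sketch of the arithmetic. It assumes availability is computed as successful requests over total requests, and that the 10^-11 figure is an annual loss probability per object. The function names are mine for illustration, not from any real tooling.

```python
import math


def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded (1.0 if there was no traffic)."""
    return 1.0 if total == 0 else successes / total


def nines(fraction: float) -> float:
    """Express a success fraction as a number of nines, e.g. 0.9999 -> about 4."""
    return math.inf if fraction >= 1.0 else -math.log10(1.0 - fraction)


def expected_annual_losses(object_count: int, annual_loss_prob: float = 1e-11) -> float:
    """Expected objects lost per year (assumption: a per-object annual probability)."""
    return object_count * annual_loss_prob


if __name__ == "__main__":
    a = availability(successes=999_900, total=1_000_000)
    print(f"availability={a:.4%}, nines={nines(a):.2f}")
    # At a trillion objects, 11 nines still means roughly ten expected losses a year.
    print(f"expected losses/year: {expected_annual_losses(10**12):.1f}")
```

The second print is the reason durability targets look absurdly strict: at exabyte scale, even 11 nines leaves a nonzero expected loss.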

TTFB is what users actually feel.

Aggregate availability hides slow tails. A 99.99% available service with a p99 TTFB of 2 seconds feels broken. Always track latency by object size bucket. A 10 MB read should not share an SLO with a 100-byte HEAD.
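A rough sketch of per-bucket TTFB percentiles follows. The bucket boundaries and labels are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever a real metrics backend would do.

```python
from collections import defaultdict

# Hypothetical size buckets: (upper bound in bytes, label).
SIZE_BUCKETS = [(1024, "<=1KiB"), (1024 ** 2, "<=1MiB"), (float("inf"), ">1MiB")]


def bucket_for(size_bytes: int) -> str:
    """Return the label of the first bucket whose upper bound fits the object."""
    return next(label for upper, label in SIZE_BUCKETS if size_bytes <= upper)


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = -(-p * len(sorted_values) // 100)  # integer ceil(p * n / 100)
    return sorted_values[max(0, rank - 1)]


def ttfb_by_bucket(samples: list[tuple[int, float]]) -> dict[str, dict[str, float]]:
    """Group (object_size_bytes, ttfb_ms) samples by bucket and report p50/p95/p99."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for size_bytes, ttfb_ms in samples:
        grouped[bucket_for(size_bytes)].append(ttfb_ms)
    return {
        label: {f"p{p}": percentile(sorted(values), p) for p in (50, 95, 99)}
        for label, values in grouped.items()
    }


if __name__ == "__main__":
    samples = [(100, 12.0), (100, 15.0), (10 * 1024 * 1024, 180.0), (512, 900.0)]
    for label, stats in ttfb_by_bucket(samples).items():
        print(label, stats)
```

Keeping the buckets separate means a slow tail on small requests cannot hide behind the naturally higher latency of large reads.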

Canaries are your truth.

Customers don’t tell you when they’re sad. They leave. Canaries are synthetic PUT/GET/LIST traffic running continuously from every region. If a canary fails for 30 seconds, you find out before your customer’s pager goes off.
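The canary idea can be sketched as a PUT/GET round trip plus a timer on consecutive failures. Everything here is illustrative: `ObjectStore`, `InMemoryStore`, and `CanaryMonitor` are hypothetical names, the in-memory store stands in for a real client, and the 30-second threshold is taken from the paragraph above.

```python
import os
from typing import Optional, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal client interface the canary talks to."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a real storage client, used only for this example."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def probe(store: ObjectStore, key: str = "canary/probe") -> bool:
    """Write a random payload and read it back; any error or mismatch is a failure."""
    payload = os.urandom(32)
    try:
        store.put(key, payload)
        return store.get(key) == payload
    except Exception:
        return False


class CanaryMonitor:
    """Fires once failures have persisted for threshold_s seconds."""

    def __init__(self, threshold_s: float = 30.0) -> None:
        self.threshold_s = threshold_s
        self.failing_since: Optional[float] = None

    def record(self, ok: bool, now: float) -> bool:
        """Record one probe result at time `now`; return True if an alert should fire."""
        if ok:
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.threshold_s


if __name__ == "__main__":
    monitor = CanaryMonitor()
    print(monitor.record(probe(InMemoryStore()), now=0.0))
```

In practice a loop like this would run per region on a fixed interval and pass `time.monotonic()` as `now`; passing the clock in explicitly keeps the alert logic easy to test.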

Hotspots and IOPS surface the silent failures.

A storage cluster can be 99.99% available while one node is on fire. Track per-node IOPS and bytes-served, and alert on the top-N nodes diverging from cluster median. Hotspots are the leading indicator of a customer key range overwhelming a shard.
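One way to express "top-N nodes diverging from cluster median" is a ratio check like the sketch below. The `top_n` and `ratio` defaults are made-up values for illustration, and the load figure could be IOPS or bytes served.

```python
from statistics import median


def hot_nodes(
    load_by_node: dict[str, float], top_n: int = 3, ratio: float = 2.0
) -> list[tuple[str, float]]:
    """Return (node, load / median) for top-N nodes at or above `ratio` times the median."""
    if not load_by_node:
        return []
    cluster_median = median(load_by_node.values())
    top = sorted(load_by_node.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if cluster_median <= 0:
        # An idle cluster with a few busy nodes: treat any nonzero load as skewed.
        return [(node, float("inf")) for node, load in top if load > 0]
    return [
        (node, load / cluster_median)
        for node, load in top
        if load / cluster_median >= ratio
    ]


if __name__ == "__main__":
    iops = {"n1": 100.0, "n2": 110.0, "n3": 90.0, "n4": 105.0, "n5": 450.0}
    for node, skew in hot_nodes(iops):
        print(f"{node} is at {skew:.1f}x the cluster median")
```

Comparing against the median rather than the mean matters here: a single node on fire drags the mean up and partly masks itself.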

DB shards are the part nobody talks about.

Object storage looks stateless, but the metadata layer is a sharded database. One hot shard, one rebalance gone wrong, and your control plane stalls. Watch shard CPU, replication lag, and hot-key skew the same way you watch the data plane.
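The three shard signals can be checked together in a few lines. This is a sketch only: `ShardStats` is a hypothetical record, hot-key skew is approximated as the hottest key's share of requests on the shard, and the thresholds are placeholders rather than recommended values.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard snapshot of the metadata database."""

    shard_id: str
    cpu_pct: float
    replication_lag_s: float
    key_hits: Counter  # request count per key in the sampling window


def hot_key_share(key_hits: Counter) -> float:
    """Share of the shard's requests that hit its single hottest key."""
    total = sum(key_hits.values())
    return 0.0 if total == 0 else max(key_hits.values()) / total


def shard_warnings(
    stats: ShardStats,
    max_cpu_pct: float = 80.0,      # placeholder threshold
    max_lag_s: float = 5.0,         # placeholder threshold
    max_key_share: float = 0.2,     # placeholder threshold
) -> list[str]:
    """Return a human-readable warning for each signal over its threshold."""
    warnings = []
    if stats.cpu_pct > max_cpu_pct:
        warnings.append(f"{stats.shard_id}: CPU at {stats.cpu_pct:.0f}%")
    if stats.replication_lag_s > max_lag_s:
        warnings.append(f"{stats.shard_id}: replication lag {stats.replication_lag_s:.1f}s")
    share = hot_key_share(stats.key_hits)
    if share > max_key_share:
        warnings.append(f"{stats.shard_id}: hottest key takes {share:.0%} of requests")
    return warnings


if __name__ == "__main__":
    shard = ShardStats("shard-7", 92.0, 1.0, Counter({"bucket-a/": 90, "bucket-b/": 10}))
    for line in shard_warnings(shard):
        print(line)
```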

The data plane scales. The control plane bites.

Those seven numbers, watched together, told me almost everything I needed to know about whether the service was healthy.
