sre, incident-management, reliability


Five rules for running an incident

The difference between a 10-minute incident and a 3-hour outage is rarely technical. Five things I wish every on-call team locked in before their first big page.

When a system breaks at scale, the difference between a 10-minute incident and a 3-hour outage is rarely technical. It’s about how the team responds, who decides what, and how cleanly information flows.

An incident is anything degrading customer experience right now that needs coordinated action. Not every alert is an incident. Not every incident is a SEV1.

1. Severity is about customer pain, not technology.

A failed deployment that no customer noticed isn’t a SEV1. An outage in one small region that blocks a top-10 customer is. Get severity wrong and you either burn the team out on noise or under-respond when it matters.

| Sev | Customer impact | Typical response |
| --- | --- | --- |
| SEV1 | Major outage, broad impact | All hands, exec paged, status page within 15 min |
| SEV2 | Significant degradation or critical narrow impact | On-call + manager, status page within 30 min |
| SEV3 | Partial degradation, workaround exists | On-call team, internal comms only |
| SEV4 | Minor issue, no customer impact | Tracked, fixed during business hours |
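
If you want the severity rules to live somewhere other than people's heads, one option is to encode the table above as data. The sketch below is only an illustration: the `Response` type, the `classify` function, and its boolean inputs are hypothetical names, and the decision order is one possible reading of the table plus the top-10 customer example, not an official rubric.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Response:
    responders: str
    # None means the table lists no external status page deadline.
    status_page_within: Optional[timedelta]


# Values copied from the severity table above.
RESPONSES = {
    "SEV1": Response("all hands, exec paged", timedelta(minutes=15)),
    "SEV2": Response("on-call + manager", timedelta(minutes=30)),
    "SEV3": Response("on-call team, internal comms only", None),
    "SEV4": Response("tracked, fixed during business hours", None),
}


def classify(
    *,
    customers_affected: bool,
    outage: bool = False,
    broad: bool = False,
    blocks_key_customer: bool = False,  # assumption: mirrors the top-10 customer example
    workaround_exists: bool = False,
) -> str:
    """Pick a severity from customer pain only; no technical inputs on purpose."""
    if not customers_affected:
        return "SEV4"
    if outage and (broad or blocks_key_customer):
        return "SEV1"
    if workaround_exists:
        return "SEV3"
    return "SEV2"


# A failed deployment nobody noticed stays at the bottom of the scale.
print(classify(customers_affected=False))
# A narrow outage that blocks a key customer is treated as a SEV1.
print(classify(customers_affected=True, outage=True, blocks_key_customer=True))
```

Note that none of the inputs describe the technology that failed. Your own thresholds will differ; the point is that they are written down before the page arrives.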

2. The Incident Commander runs the incident.

Not the most senior person. Not the engineer who wrote the code. The IC coordinates, assigns work, drives comms, and makes the call to escalate. If your IC is also typing commands into the terminal, you don’t have an IC.
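
The "IC is not at the keyboard" rule is simple enough to check mechanically, for example in a chat bot that tracks who holds which role. This is a minimal sketch under that assumption; `IncidentRoster`, `hands_on`, and `problems` are made-up names, not part of any real tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class IncidentRoster:
    commander: Optional[str] = None
    # People actively running commands or applying fixes.
    hands_on: Set[str] = field(default_factory=set)

    def problems(self) -> List[str]:
        """Return human-readable issues with the current role assignment."""
        if not self.commander:
            return ["no Incident Commander assigned"]
        if self.commander in self.hands_on:
            return [f"{self.commander} is both IC and hands-on; hand off one role"]
        return []


roster = IncidentRoster(commander="alice", hands_on={"alice", "bob"})
for issue in roster.problems():
    print(issue)
```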

3. Mitigate first, root-cause later.

Stop the bleed. Roll back, fail over, drain traffic, flip a feature flag. The point of the incident is to restore service. Root cause analysis is a postmortem activity. Teams that try to debug live in production take longer to recover and often make it worse.

4. Comms cadence is the discipline.

Internal updates every 15 to 30 minutes, even if the update is “still investigating.” External status page updates on every meaningful change. Silence makes people assume the worst.
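
A cadence only holds if something nags when it slips. As a rough sketch, assuming you record the timestamp of the last internal update, a check like the one below could drive a reminder to the IC. The 30-minute ceiling comes from the range above; the function and constant names are illustrative.

```python
from datetime import datetime, timedelta

# Upper bound of the 15 to 30 minute internal update window.
MAX_SILENCE = timedelta(minutes=30)


def update_overdue(
    last_update: datetime,
    now: datetime,
    max_silence: timedelta = MAX_SILENCE,
) -> bool:
    """True once the team has gone the full window without an internal update."""
    return now - last_update >= max_silence


def next_update_due(last_update: datetime, max_silence: timedelta = MAX_SILENCE) -> datetime:
    """Latest time by which the next update should go out."""
    return last_update + max_silence


last = datetime(2024, 1, 1, 12, 0)
print(next_update_due(last))
print(update_overdue(last, now=datetime(2024, 1, 1, 12, 31)))
```

When the check fires and there is nothing new to report, "still investigating" is still a valid update.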

5. Postmortems are for learning, not blame.

Write what happened, why, what worked, what didn’t. Action items must have owners and dates. If postmortems become punishment, people stop being honest and you lose the most valuable signal you have.
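
The "owners and dates" rule is easy to enforce before a postmortem is marked done. The following is a small sketch, assuming action items are tracked as structured records; `ActionItem` and `incomplete_items` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: List[ActionItem]) -> List[str]:
    """Titles of action items that are missing an owner or a due date."""
    return [item.title for item in items if not item.owner or item.due is None]


items = [
    ActionItem("Add alert on queue depth", owner="bob", due=date(2024, 2, 1)),
    ActionItem("Document rollback procedure", owner="carol"),
    ActionItem("Review failover runbook"),
]
print(incomplete_items(items))
```

The check looks only at the action items, never at who was involved in the incident, which keeps it on the learning side of the line.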

Speed of recovery is not about how fast you find the cause. It’s about how fast you stop the impact.

The teams that handle incidents well aren’t the ones with the best tools. They’re the ones who’ve practiced the roles, agreed on severity rules, and built the muscle memory.
