Five rules for running an incident

When a system breaks at scale, the difference between a 10-minute incident and a 3-hour outage is rarely technical. It’s about how the team responds, who decides what, and how cleanly information flows.

An incident is anything degrading customer experience right now that needs coordinated action. Not every alert is an incident. Not every incident is a SEV1.

1. Severity is about customer pain, not technology.

A failed deployment that no customer noticed isn’t a SEV1. A small regional outage blocking a top-10 customer is. Get severity wrong and you either burn the team out on noise or under-respond when it matters.

Sev	Customer impact	Typical response
SEV1	Major outage, broad impact	All hands, exec paged, status page within 15 min
SEV2	Significant degradation or critical narrow impact	On-call + manager, status page within 30 min
SEV3	Partial degradation, workaround exists	On-call team, internal comms only
SEV4	Minor issue, no customer impact	Tracked, fixed during business hours
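
One way to keep the table from drifting into tribal knowledge is to encode it as data. The sketch below is only an illustration: the names `STATUS_PAGE_DEADLINE_MIN` and `status_page_due` are hypothetical, it assumes the clock starts when the incident is declared, and it reads SEV3 and SEV4 as needing no external status page update.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minutes allowed before the first external status page update, per the table.
# None is an assumption: the table lists no status page step for SEV3 or SEV4.
STATUS_PAGE_DEADLINE_MIN = {
    "SEV1": 15,
    "SEV2": 30,
    "SEV3": None,
    "SEV4": None,
}


def status_page_due(severity: str, declared_at: datetime) -> Optional[datetime]:
    """Return when the first status page update is due, or None if none is required."""
    minutes = STATUS_PAGE_DEADLINE_MIN[severity]  # raises KeyError on an unknown severity
    if minutes is None:
        return None
    # Assumption: the deadline is measured from the moment the incident is declared.
    return declared_at + timedelta(minutes=minutes)


declared = datetime(2024, 1, 1, 12, 0)
print(status_page_due("SEV1", declared))  # 2024-01-01 12:15:00
print(status_page_due("SEV3", declared))  # None
```

Deciding the severity itself stays a human call about customer pain; the lookup only answers what the agreed response is once that call is made.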

2. The Incident Commander runs the incident.

Not the most senior person. Not the engineer who wrote the code. The IC coordinates, assigns work, drives comms, and makes the call to escalate. If your IC is also typing commands into the terminal, you don’t have an IC.

3. Mitigate first, root-cause later.

Stop the bleed. Roll back, fail over, drain traffic, flip a feature flag. The point of the incident is to restore service. Root cause analysis is a postmortem activity. Teams that try to debug live in production take longer to recover and often make the outage worse.

4. Comms cadence is the discipline.

Internal updates every 15 to 30 minutes, even if the update is “still investigating.” External status page updates on every meaningful change. Silence makes people assume the worst.
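
The cadence is easy to forget mid-incident, so some teams have a bot nudge the IC. A minimal sketch of that check, assuming the 30-minute upper bound of the range above is the hard limit; `internal_update_overdue` and `MAX_INTERNAL_GAP` are hypothetical names, not part of any real tool.

```python
from datetime import datetime, timedelta

# Assumption: 30 minutes, the upper end of the 15-to-30-minute range, is the limit.
MAX_INTERNAL_GAP = timedelta(minutes=30)


def internal_update_overdue(
    last_update: datetime,
    now: datetime,
    max_gap: timedelta = MAX_INTERNAL_GAP,
) -> bool:
    """Return True if more than max_gap has passed since the last internal update."""
    return now - last_update > max_gap


last = datetime(2024, 1, 1, 12, 0)
print(internal_update_overdue(last, datetime(2024, 1, 1, 12, 20)))  # False
print(internal_update_overdue(last, datetime(2024, 1, 1, 12, 31)))  # True
```

When the check fires, a one-line “still investigating” update is enough to reset the clock.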

5. Postmortems are for learning, not blame.

Write what happened, why, what worked, what didn’t. Action items must have owners and dates. If postmortems become punishment, people stop being honest and you lose the most valuable signal you have.
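
The “owners and dates” rule can be checked mechanically before a postmortem is closed. The sketch below assumes action items are tracked as simple records; `ActionItem`, `incomplete_items`, and the sample data are hypothetical and only illustrate the check.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # person accountable for the item
    due: Optional[date] = None   # date the item should be done by


def incomplete_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that are missing an owner, a due date, or both."""
    return [item for item in items if not item.owner or item.due is None]


# Hypothetical example data.
items = [
    ActionItem("Add alert on queue depth", owner="alice", due=date(2024, 2, 1)),
    ActionItem("Document rollback steps", owner="bob"),
    ActionItem("Review failover runbook"),
]
for item in incomplete_items(items):
    print(item.description)
# Document rollback steps
# Review failover runbook
```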

Speed of recovery is not about how fast you find the cause. It’s about how fast you stop the impact.

The teams that handle incidents well aren’t the ones with the best tools. They’re the ones who’ve practiced the roles, agreed on severity rules, and built the muscle memory.
