An escalation process is the documented path a customer problem takes from “the CSM noticed it” to “an accountable owner is fixing it on a clock.” Without one, every fire gets worked at the speed of whoever happens to be in the Slack channel, and the customer’s perception of severity drifts from yours. This is a how-to: build the four pieces — severity levels, routing, communication cadence, and root-cause follow-up — in that order, because each depends on the one before it.
Prerequisites
- A ticketing or CS platform that can hold a severity field and timestamp state changes. Zendesk, Pylon, or Gainsight all work; the field matters more than the tool.
- A Slack workspace with the ability to create per-incident channels.
- A named on-call rotation for engineering, or at minimum one accountable owner per product area.
- A health-score or account-tier signal so a Sev assignment can factor in account value, not just technical impact.
Step 1 — Define severity levels with response and resolution SLAs
Severity is a two-axis decision: business impact times account weight. Write the matrix down once and apply it mechanically, so a Sev call is not a debate at 4pm on a Friday.
| Sev | Trigger | Response SLA | Update cadence | Resolution target |
|---|---|---|---|---|
| Sev 1 | Production down, data loss, or security exposure for a paying account | 15 min | Every 30 min | 4 hours |
| Sev 2 | Major feature broken with no workaround, or a Sev 3 issue on a top-tier/at-risk account | 1 hour | Every 2 hours | 1 business day |
| Sev 3 | Degraded function with a workaround | 1 business day | Daily | 5 business days |
| Sev 4 | Cosmetic, question, or feature request mislabeled as a bug | 2 business days | On change | Backlog |
The account-weight axis is what CS adds that pure support triage misses: a Sev 3 technical issue on a strategic account in its renewal quarter at 70% health is a Sev 2 in practice. Encode that as a rule: if account_tier = strategic or health_score < 60, raise the severity one level (Sev 3 becomes Sev 2), stopping at Sev 1. Do not let every CSM hand-bump; the rule does it.
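The bump rule is small enough to sketch in a few lines. This is a minimal illustration, not a platform integration: the `Account` class and `effective_sev` function are hypothetical names, and the only values taken from the text are the `strategic` tier label and the health threshold of 60.

```python
from dataclasses import dataclass

HEALTH_THRESHOLD = 60  # from the rule above: health_score < 60 triggers a bump


@dataclass(frozen=True)
class Account:
    name: str
    tier: str          # "strategic" is the only tier the rule cares about
    health_score: int  # assumed to be on a 0-100 scale


def effective_sev(technical_sev: int, account: Account) -> int:
    """Apply the account-weight bump to a technical severity.

    Lower number means more severe, so a bump subtracts one.
    The result never goes past Sev 1, and the bump is applied at most once
    even when both conditions hold.
    """
    if not 1 <= technical_sev <= 4:
        raise ValueError(f"severity must be 1-4, got {technical_sev}")
    needs_bump = (
        account.tier == "strategic" or account.health_score < HEALTH_THRESHOLD
    )
    return max(1, technical_sev - 1) if needs_bump else technical_sev
```

Keeping the rule in one function means the CSM supplies the technical severity and the account record supplies the rest, which is the point of not hand-bumping.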
Step 2 — Route to a named owner, not a queue
A queue is where escalations go to age. Routing means each Sev has a pre-agreed owner and a pre-agreed channel before the incident exists.
- On Sev assignment, auto-create a dedicated Slack channel named `#esc-<account>-<date>`. Per-incident channels beat one shared firehose because history stays searchable and the customer-facing summary is one scroll, not a reconstruction.
- Auto-invite the accountable owner from the on-call rotation, the account’s CSM, and, for Sev 1/2, the CS manager. Page, don’t @-mention, for Sev 1: a mention is a notification, a page is an obligation.
- Open a tracking issue in Linear (or your engineering tracker) linked to the channel, with the Sev, the account, and the SLA clock in the description. The CS-side record (in Gainsight or your CRM) links to the same issue so renewal and CSM context travel with it.
- The CSM owns the customer relationship throughout; the engineering owner owns the fix. Splitting these two roles explicitly prevents the failure where the CSM goes silent because they are waiting on engineering, and the customer hears nothing.
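The routing decisions above can be computed before any tool is called. The sketch below builds only the plan (channel name, invitees, whether to page) and deliberately makes no Slack, paging, or Linear API calls. `RoutingPlan`, `channel_name`, and `plan_routing` are hypothetical names, and the ISO date format and the slug rule (lowercase letters, digits, hyphens) are assumptions chosen to produce a safe channel name.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RoutingPlan:
    channel: str          # channel name without the leading "#"
    invitees: list[str]   # people to auto-invite
    page_owner: bool      # True means page the owner, not just mention


def channel_name(account: str, opened: date) -> str:
    """Build esc-<account>-<date> using a lowercase, hyphenated account slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", account.lower()).strip("-")
    return f"esc-{slug}-{opened.isoformat()}"


def plan_routing(sev: int, account: str, opened: date,
                 on_call_owner: str, csm: str, cs_manager: str) -> RoutingPlan:
    """Decide who joins the incident channel and whether the owner is paged."""
    invitees = [on_call_owner, csm]
    if sev <= 2:               # Sev 1/2 also bring in the CS manager
        invitees.append(cs_manager)
    return RoutingPlan(
        channel=channel_name(account, opened),
        invitees=invitees,
        page_owner=(sev == 1),  # only Sev 1 pages; lower Sevs get a mention
    )
```

Separating the plan from the side effects makes the routing rules easy to review and test; the code that actually creates the channel and sends the page can consume a `RoutingPlan` without re-deciding anything.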
Step 3 — Run the communication cadence
The customer’s anxiety is a function of silence, not severity. A Sev 1 with an update every 30 minutes feels handled; a Sev 3 with three days of silence feels like a Sev 1.
- Acknowledge inside the response SLA with a human message: what you understand the problem to be, the Sev you have assigned, and when the next update lands. Naming the next-update time is the single highest-impact move — it converts open-ended waiting into a kept promise.
- Update on the cadence even when there is no news. “Still investigating, root cause not yet isolated, next update at 3pm” is a valid update. Skipping it because nothing changed is the most common avoidable escalation-of-the-escalation.
- Separate internal and external language. The internal channel can say “the retry storm is hammering the queue”; the customer hears “we have identified the cause and are deploying a fix.” Never paste raw engineering chatter to the customer.
- Close explicitly. State what was fixed, confirm the customer sees it resolved, and ask for their confirmation before you mark it closed. A unilateral close reads as dismissive and frequently reopens.
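Naming the next-update time is easier to keep honest when it is computed rather than remembered. The following sketch derives it from the cadence column of the severity matrix. The function names are hypothetical, and treating "Daily" as 24 wall-clock hours is a simplifying assumption; Sev 4 updates "on change", so it has no scheduled time.

```python
from datetime import datetime, timedelta
from typing import Optional

# Update cadence per Sev, taken from the severity matrix.
# Assumption: "Daily" is modelled as 24 wall-clock hours.
UPDATE_CADENCE = {
    1: timedelta(minutes=30),
    2: timedelta(hours=2),
    3: timedelta(days=1),
}


def next_update_due(sev: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next customer update is due, or None for Sev 4."""
    cadence = UPDATE_CADENCE.get(sev)
    return last_update + cadence if cadence is not None else None


def is_update_overdue(sev: int, last_update: datetime, now: datetime) -> bool:
    """True when the promised update time has passed without an update."""
    due = next_update_due(sev, last_update)
    return due is not None and now > due


def holding_update(due: datetime) -> str:
    """A valid 'no news' update that still names the next update time."""
    return (
        "Still investigating, root cause not yet isolated. "
        f"Next update at {due:%H:%M}."
    )
```

A reminder bot in the incident channel could call `is_update_overdue` on a timer and post the `holding_update` text as a draft for the CSM to send.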
Step 4 — Root-cause follow-up
Closing the ticket is not closing the loop. Every Sev 1 and Sev 2 gets a blameless post-incident review within five business days.
- Timeline. Reconstruct from the channel and tracker timestamps: detection, acknowledgement, mitigation, resolution. Measure time-to-acknowledge and time-to-resolve against the SLA.
- Five whys to a systemic cause. Stop at the process or system gap, not the person. “The CSM did not see the alert” is not a root cause; “alerts route to email, which the CSM does not monitor during EU hours” is.
- Corrective actions with owners and dates. Each action is a tracked issue, not a meeting note. An action without an owner does not exist.
- Feed it back into Step 1. If the same Sev keeps recurring, the matrix or the routing is wrong — fix the system, not the next instance.
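The timeline measurement in the review can be sketched as a small calculation over the four timestamps. `Timeline`, `ReviewMetrics`, and `review_metrics` are hypothetical names, and both clocks are assumed to start at detection. The SLA values are passed in rather than looked up, because targets stated in business days need a calendar that this sketch does not model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Timeline:
    detected: datetime
    acknowledged: datetime
    mitigated: datetime
    resolved: datetime


@dataclass(frozen=True)
class ReviewMetrics:
    time_to_acknowledge: timedelta
    time_to_resolve: timedelta
    ack_within_sla: bool
    resolved_within_target: bool


def review_metrics(t: Timeline, response_sla: timedelta,
                   resolution_target: timedelta) -> ReviewMetrics:
    """Measure acknowledgement and resolution times against the SLA."""
    tta = t.acknowledged - t.detected   # assumption: clock starts at detection
    ttr = t.resolved - t.detected
    return ReviewMetrics(
        time_to_acknowledge=tta,
        time_to_resolve=ttr,
        ack_within_sla=tta <= response_sla,
        resolved_within_target=ttr <= resolution_target,
    )


# Example: a Sev 1 (15 min response, 4 hour resolution target).
example = review_metrics(
    Timeline(
        detected=datetime(2024, 5, 3, 9, 0),
        acknowledged=datetime(2024, 5, 3, 9, 10),
        mitigated=datetime(2024, 5, 3, 10, 30),
        resolved=datetime(2024, 5, 3, 13, 30),
    ),
    response_sla=timedelta(minutes=15),
    resolution_target=timedelta(hours=4),
)
```

In the example the acknowledgement lands inside the SLA but resolution takes 4.5 hours against a 4 hour target, which is the kind of miss the five-whys step should then explain.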
Common pitfalls
- Severity inflation. Everything becomes Sev 1, so nothing is. Guard: the matrix is the only authority; an override requires the CS manager’s name in the channel and a one-line reason.
- Routing to a queue instead of a person. Tickets age in shared inboxes. Guard: every Sev auto-assigns a named owner on creation; an unowned Sev older than its response SLA pages the manager.
- Silence between updates. Guard: the update cadence is enforced by a reminder bot in the incident channel, not by the owner’s memory.
- Closing without root cause. The same outage recurs because nobody fixed the system. Guard: a Sev 1/2 cannot be marked “resolved” until a linked post-incident issue exists with corrective actions and owners.
- No account-weight factor. Pure technical triage under-prioritizes a small bug on a churning whale. Guard: the Sev-bump rule on tier and health score runs automatically.
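Two of these guards are simple predicates, sketched below under stated assumptions. The names `CorrectiveAction`, `can_mark_resolved`, and `should_page_manager` are hypothetical, and how the result is enforced (a workflow rule in the ticketing tool, a bot, a tracker automation) depends on your platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CorrectiveAction:
    description: str
    owner: Optional[str]  # None means nobody has taken it


def can_mark_resolved(sev: int, post_incident_issue_id: Optional[str],
                      actions: list[CorrectiveAction]) -> bool:
    """Guard for 'closing without root cause'.

    A Sev 1/2 needs a linked post-incident issue with at least one
    corrective action, and every action must have an owner.
    """
    if sev > 2:
        return True  # Sev 3/4 are not gated on a post-incident review
    if not post_incident_issue_id:
        return False
    return bool(actions) and all(a.owner for a in actions)


def should_page_manager(owner: Optional[str], opened: datetime,
                        now: datetime, response_sla: timedelta) -> bool:
    """Guard for 'routing to a queue': unowned past the response SLA."""
    return owner is None and (now - opened) > response_sla
```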
Related
- NRR vs GRR — what repeated escalations cost you in retention
- Gainsight — health scores and account context for the Sev-bump rule
- Pylon — B2B support tooling that holds severity and per-account context
- Linear — the engineering-side tracking issue