The On-Call Handbook for Engineers Who Hate Being On-Call

The 12 hours before, the first hour of an incident, the runbook discipline that makes 3am pages survivable, and the post-rotation rituals that have stopped on-call from wrecking my health.

on-call

reliability

alerting

handling-failure

craftsmanship

By @gabrielkhalil

March 4, 2026

Updated May 18, 2026

I do not love being on call. I get paged less than I used to, but the dread before a rotation has not gone away in eight years of doing this. What has changed is that the rotation is no longer a constant low-grade panic. There is a small set of habits, before, during, and after, that turn on-call from "the worst week of my month" into "a week where I get less hobby time but otherwise survive".

This is the handbook I have written and rewritten for myself and the teams I have led. It is not for the senior SRE who genuinely enjoys incidents. It is for the rest of us: the engineers who would rather be writing features, who get paged, and who need a structure for not dropping the ball at 3am.

The 12 hours before the rotation starts

The single thing that has done the most for my on-call experience is treating the day before a rotation as part of the rotation. There are six things I do, in this order, and they take about an hour total.

1. Sync the runbooks. Pull the latest runbook repo. Skim the README and the runbook for any service that has changed in the last week.
2. Check the incident log. Read the last two weeks of incidents. Anything that recurred is likely to page me.
3. Verify my paging setup. Open PagerDuty (or whatever your tool is). Trigger a test page to my phone. Make sure it actually rings, even on Do Not Disturb. The single worst on-call story I have heard from a colleague was a missed page because the phone was on silent and the secondary did not get woken up either.
4. Confirm the on-call escalation chain. Who is my secondary? Who is the manager I escalate to if I am drowning? Their phone numbers go in my phone, not in a wiki I will not be able to load at 3am.
5. Set up the on-call laptop. VPN works. SSO works. Production access works. The log query interface loads. The dashboards are bookmarked. Test each one before the rotation starts.
6. Block low-importance work on my calendar. Cancel non-essential meetings. Push deep-work tasks. The week is going to be reactive; planning otherwise sets me up for guilt when I cannot deliver.
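
The laptop checks in that list lend themselves to a small script I can run the evening before. What follows is a minimal sketch, not a finished tool: the check names and the placeholder probes are assumptions for illustration, and each one would need to be swapped for a real test against your own VPN, SSO, and log tooling.

```python
from typing import Callable, Dict, List


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed.

    A check that raises is counted as failed, so one broken probe
    does not hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        print(f"[{'ok' if ok else 'FAIL'}] {name}")
        if not ok:
            failed.append(name)
    return failed


# Hypothetical probes: replace each lambda with a real test, for example
# an HTTP request to a dashboard or a trivial log query.
EXAMPLE_CHECKS: Dict[str, Callable[[], bool]] = {
    "vpn": lambda: True,
    "sso": lambda: True,
    "production access": lambda: True,
    "log query interface": lambda: True,
    "dashboards": lambda: True,
}
```

Calling `run_preflight(EXAMPLE_CHECKS)` prints one line per check and returns the list of failures; an empty list means the laptop is ready.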

This hour pays back tenfold. The first time I get paged, I am not also debugging my VPN.

The first hour of an incident: the only structure I can hold at 3am

The one structure that survives sleep deprivation, panic, and an unfamiliar service is also the simplest. I have it on a sticky note next to my monitor.

The first-hour incident structure (sticky-note version)
  T+0    Acknowledge the page. Open the runbook. Open the dashboard.
  T+5    Open an incident channel. Post symptoms, time, what I am looking at.
  T+10   Hypothesize the smallest possible cause. Verify or rule out in 10 minutes.
  T+20   If still down, page the next person. Two heads at 3am are 5x one head.
  T+30   Decision point: mitigate first, debug later (rollback, scale up, kill switch).
  T+45   Write a one-line update for the channel. Assign a scribe.
  T+60   Mitigation has either worked or you have escalated to a wider group.

The single most useful line in there is T+30. The instinct as an engineer is to find the root cause first and fix it correctly. The instinct that protects production is to mitigate first (roll back, scale up, flip the kill switch) and find the root cause once nobody is being affected. I have added hours of customer impact because I wanted to fix the bug rather than restore the service. Now I do the inverse: revert the deploy as early as T+10 if I have any reason to suspect it, and debug at T+90 once everything is back to normal.
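
For teams that want the sticky note inside a chat bot or a terminal prompt, the timeline can be sketched as a simple lookup. This is an illustration only; the name `current_step` is my own and the step texts are abbreviated from the note.

```python
from typing import List, Tuple

# (minutes since the page, step), abbreviated from the sticky note.
FIRST_HOUR: List[Tuple[int, str]] = [
    (0, "Acknowledge the page. Open the runbook. Open the dashboard."),
    (5, "Open an incident channel. Post symptoms, time, what I am looking at."),
    (10, "Hypothesize the smallest possible cause. Verify or rule out in 10 minutes."),
    (20, "If still down, page the next person."),
    (30, "Decision point: mitigate first, debug later."),
    (45, "Write a one-line update for the channel. Assign a scribe."),
    (60, "Mitigation has worked, or escalate to a wider group."),
]


def current_step(minutes_elapsed: int) -> str:
    """Return the latest step whose start time has been reached."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must not be negative")
    step = FIRST_HOUR[0][1]
    for start, text in FIRST_HOUR:
        if minutes_elapsed >= start:
            step = text
    return step
```

Asking for `current_step(32)` returns the T+30 decision point, which is the reminder I most need when I am deep in a log query.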

The runbook discipline that pays for itself

A runbook is the thing you wish you had at 3am: the symptoms of a known failure mode, the dashboard to check, the command to run, the rollback procedure. Most teams have runbooks; few of them are maintained.

Four rules that have kept my team's runbooks alive:

Rule 1: every alert has a runbook link or it gets deleted. If the alert is real but there is no runbook, write a stub before promoting the alert to production. If the alert is not real (no impact, no action), delete it. Runbook-link-required is the discipline that catches both.
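
Rule 1 is easy to enforce mechanically in CI. The sketch below assumes alert definitions have already been loaded into dictionaries and that the runbook link lives under an annotation key named `runbook_url`; both the shape and the key name are assumptions, so adapt them to however your alerting tool stores rules.

```python
from typing import Dict, List


def alerts_missing_runbook(alerts: List[Dict]) -> List[str]:
    """Return the names of alerts that have no non-empty runbook link."""
    missing = []
    for alert in alerts:
        annotations = alert.get("annotations") or {}
        # "runbook_url" is an assumed key name, not a requirement of any tool.
        url = str(annotations.get("runbook_url") or "").strip()
        if not url:
            missing.append(alert.get("name", "<unnamed>"))
    return missing
```

A CI job can fail the build whenever the returned list is not empty, which forces the "write a stub or delete the alert" decision at review time instead of at 3am.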

Rule 2: runbooks are tested by the alert firing. When an alert fires for the first time on someone other than the author, the runbook is exercised. Whoever is on call at that moment grades the runbook (followed verbatim, what was missing, what was wrong) and updates it before the next rotation.

Rule 3: runbooks are short and decision-oriented. A page-long runbook with five paragraphs of context goes unread at 3am. The format that survives:

Runbook template that I actually use
  Symptom         (the alert text or what the user sees)
  First check     (the dashboard URL or query)
  Common cause 1  (signal: ____  fix: ____)
  Common cause 2  (signal: ____  fix: ____)
  Mitigation      (the safe rollback or kill switch, with the exact command)
  Escalation      (who to page if mitigation fails)

No prose. No history. No design rationale. Those go in a separate doc that someone can read in a calmer moment.
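
The template can also be checked automatically so that runbooks do not drift back into essays. This is a rough sketch under two assumptions of mine: each field starts its own line with the field name, and 40 non-empty lines is an arbitrary length limit that you should tune.

```python
from typing import List

# Field names from the template; "Common cause" matches cause 1, 2, and so on.
REQUIRED_FIELDS = ["Symptom", "First check", "Common cause", "Mitigation", "Escalation"]


def check_runbook(text: str, max_lines: int = 40) -> List[str]:
    """Return a list of problems; an empty list means the runbook passes."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    for field in REQUIRED_FIELDS:
        if not any(line.lower().startswith(field.lower()) for line in lines):
            problems.append(f"missing field: {field}")
    if len(lines) > max_lines:
        problems.append(f"too long: {len(lines)} lines (limit {max_lines})")
    return problems
```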

Rule 4: a quarterly tabletop exercise. Once a quarter, the team picks an old incident, recreates it as a scenario, and walks through it as if paged. The runbook gaps surface immediately. It costs two hours, four times a year, and the team's collective on-call confidence is noticeably higher for it.

Mitigation tools that have to be in arm's reach

The four buttons I want to be able to press without thinking, on every service I am on call for:

Four mitigation buttons that should be one click away
  1. Roll back the last deploy           (single command or single button)
  2. Scale the service up by 2x          (autoscaling override or manual ASG edit)
  3. Flip the global kill switch         (the feature flag that disables the broken path)
  4. Drain traffic from a region         (Route 53 / load balancer weight to zero)

If any of those four takes longer than a minute to find at 3am, the team has a pre-incident-prep gap. In my first week on every team I have joined, I have filed and shipped a ticket to add whichever one was missing; the cost is small (a script in the runbook repo) and the payback is the time you do not waste reading internal docs while customers are affected.
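
One way to find the gap before an incident does is to keep a small registry of mitigation commands per service and audit it. The sketch below is illustrative only: the action keys, the service names, and the script paths in the usage note are all hypothetical.

```python
from typing import Dict, List

# Hypothetical keys for the four buttons listed above.
REQUIRED_ACTIONS = ("rollback", "scale_up", "kill_switch", "drain_region")


def missing_mitigations(services: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each service to the mitigation actions it has no command for."""
    gaps: Dict[str, List[str]] = {}
    for service, actions in services.items():
        missing = [a for a in REQUIRED_ACTIONS if not actions.get(a, "").strip()]
        if missing:
            gaps[service] = missing
    return gaps
```

For a registry such as `{"checkout": {"rollback": "./scripts/rollback.sh checkout"}}`, the function reports that `checkout` still lacks `scale_up`, `kill_switch`, and `drain_region`, and each of those is a ticket worth filing.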

Severity: the only call you have to make in the first 60 seconds

The instinct in the first 60 seconds of a page is to start debugging. The discipline that has served me better is to declare a severity first, before opening any dashboard. The severity is the question "how many users are affected, right now", and it has only three meaningful answers.

Severity buckets that I commit to within 60 seconds of paging
  SEV1   widely visible (homepage down, payments failing, login broken)
         action: page secondary, open incident channel, mitigate first
  SEV2   partially visible (one feature degraded, one region affected)
         action: open incident channel, follow runbook, escalate at T+30
  SEV3   internal-only or edge case (a cron lagging, a dashboard stale)
         action: ack, fix calmly, no incident channel needed

Why committing to a level matters: SEV1 and SEV2 deserve different mental modes (mitigate first vs. diagnose patiently), and the wrong mode burns time. I have prolonged outages by treating a SEV1 like a SEV2 (poking around in logs while customers were impacted, instead of rolling back); I have caused team noise by treating SEV3s like SEV1s (paging the whole team to debug a stale dashboard).

The rule that has worked: pick the severity in the first 60 seconds, write it in the incident channel, and adjust if new information arrives. Anchoring early is more important than picking exactly right.
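
The three buckets are small enough to encode directly, which helps if the incident channel is opened by a bot. This is a sketch under my own assumptions: the two yes/no questions are a simplification of "how many users are affected, right now", and the function names are invented for illustration.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    SEV1 = "widely visible"
    SEV2 = "partially visible"
    SEV3 = "internal-only or edge case"


# First actions per bucket, copied from the table above.
FIRST_ACTIONS: Dict[Severity, List[str]] = {
    Severity.SEV1: ["page secondary", "open incident channel", "mitigate first"],
    Severity.SEV2: ["open incident channel", "follow runbook", "escalate at T+30"],
    Severity.SEV3: ["ack", "fix calmly", "no incident channel needed"],
}


def pick_severity(widely_visible: bool, partially_visible: bool) -> Severity:
    """Pick the highest bucket that matches what users see right now."""
    if widely_visible:
        return Severity.SEV1
    if partially_visible:
        return Severity.SEV2
    return Severity.SEV3


def opening_message(sev: Severity) -> str:
    """One line to post in the channel within the first 60 seconds."""
    return f"{sev.name} ({sev.value}). Next: " + "; ".join(FIRST_ACTIONS[sev])
```

Posting `opening_message(pick_severity(True, False))` commits me to SEV1 in writing; if new information arrives, I post a new line with the adjusted level.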

After-action: the part I used to skip

The post-incident review is the step engineers most often phone in. "Cause: deployed bad code. Fix: do not deploy bad code." That is not an after-action; it is a way to feel like the meeting happened.

The shape of after-action that has actually changed my team's behaviour:

After-action questions worth answering, in this order
  1. What did the customer experience? (in numbers, not adjectives)
  2. What was the timeline? (alert -> ack -> mitigate -> resolve)
  3. What detection or response could have shaved 5 minutes?
  4. What contributed to it (multiple causes, not one)?
  5. What habit / runbook / alert would prevent the same shape?
  6. What is the action item, owner, and date for each finding?
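
Question 2 is easier to answer in numbers when the four timestamps are turned into durations. A minimal sketch, assuming the timestamps have already been collected by hand or from the paging tool; the dictionary keys are my own naming.

```python
from datetime import datetime
from typing import Dict

STAGES = ("alert", "ack", "mitigate", "resolve")


def timeline_minutes(events: Dict[str, datetime]) -> Dict[str, float]:
    """Minutes from the alert to each later stage of the incident."""
    start = events["alert"]
    previous = start
    result: Dict[str, float] = {}
    for stage in STAGES[1:]:
        moment = events[stage]
        if moment < previous:
            raise ValueError(f"{stage} is earlier than the stage before it")
        result[f"alert_to_{stage}"] = (moment - start).total_seconds() / 60
        previous = moment
    return result
```

The `alert_to_mitigate` figure is the one to compare against question 3: it shows directly whether shaving 5 minutes is plausible.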

The finding/action separation is critical. "We learned that the alert was too noisy" is a finding. "Raise the alert threshold from 1% to 5% by April 30, owned by Priya" is an action. Findings without actions show up in the next incident; actions without findings get done but address the wrong thing.
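
Both failure modes can be caught before the review document is closed. The sketch below is one possible shape, not a prescribed one: the `ActionItem` and `Review` types and their field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]
    finding: Optional[str] = None  # the finding this action addresses


@dataclass
class Review:
    findings: List[str]
    actions: List[ActionItem]


def review_gaps(review: Review) -> List[str]:
    """List findings with no action and actions with no finding, owner, or date."""
    problems = []
    addressed = {a.finding for a in review.actions if a.finding}
    for finding in review.findings:
        if finding not in addressed:
            problems.append(f"finding without action: {finding}")
    for action in review.actions:
        if action.finding not in review.findings:
            problems.append(f"action without finding: {action.description}")
        if not action.owner or action.due is None:
            problems.append(f"action missing owner or date: {action.description}")
    return problems
```

A review is ready to close when `review_gaps` returns an empty list.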

The rotation-aftermath rituals that protect me

The rotation does not end when my week ends. The cumulative tax of a hard week takes about three days to leave the body. Three rituals I follow after every hard rotation:

The next morning: a coffee and a list. I write down everything that bugged me about the rotation: a flaky alert, a missing dashboard, a runbook that was wrong, a mitigation that took too long. The list is for me; some of it becomes tickets, some is just venting on paper.

The next afternoon: I file the tickets. Every item from the list that has a clear improvement attached becomes a ticket. The team's on-call-improvement backlog is the most reliably funded pool of work I have ever seen, because the people filing the tickets are also the people who will benefit when they are done.

The next weekend: a deliberate break. Friends, exercise, a long walk, anything that is not Slack. The temptation to compensate for a week of low-output engineering by working through the weekend is real. It is also a good way to never recover. I have learned the hard way that the rotation tax is paid by the body, and the body's payment plan is sleep and movement. There is no faster route.

What I would tell someone starting their first rotation

If I could send a single message back to the version of me about to start their first on-call rotation, it would be: most of the dread is anticipation, most of the actual incidents are routine, and the few that are not routine are why the senior engineers exist. Page them. Ask the question that feels stupid. Mitigate before debugging. Take the next morning to update the runbook that almost failed you. Take the weekend off. The rotation is a system, not a heroic test of character; the engineers who thrive in it have built a small set of habits, not stronger willpower. Build the habits, follow them, and the rotation stops being a thing you survive and starts being a thing you operate.