Community Question Bundle

SRE Incident Drill Questions From Our On-Call

Four incident-shaped questions our on-call rotation uses to interview SREs. Each one starts with a symptom and asks you to write the smallest diagnostic snippet; the discussion matters more than the code.

Tags: Python, reliability, on-call, monitoring, interview-prep

By @alexsaeed · March 30, 2026 · Updated May 20, 2026

Pager fires at 03:14: "p99 latency on /checkout doubled in the last 5 minutes." Walk me through your first 5 minutes. Include the one-liner you would actually run, and the wrong move I am watching for.

On the pager

At 03:14, the sequence I'd run is first_five_minutes(alert), which returns ['confirm_in_dashboard', 'check_recent_deploys', 'check_dependency_health', 'form_hypothesis', 'decide_action']. The wrong move I'm watching for is running kubectl rollout undo before a hypothesis is formed: if the upstream recovers on its own at the same moment, you cannot tell whether your rollback fixed the problem or the issue cleared by itself.
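The sequence above can be written down as a minimal sketch. The step names come from the answer; the shape of the alert (a plain dict) and the rollback_allowed helper are assumptions added for illustration, not part of the original answer.

```python
# Ordered triage steps, taken from the answer above. Every step before
# 'decide_action' is read-only: none of them changes production state.
TRIAGE_STEPS = [
    "confirm_in_dashboard",
    "check_recent_deploys",
    "check_dependency_health",
    "form_hypothesis",
    "decide_action",
]


def first_five_minutes(alert: dict) -> list:
    """Return the ordered triage steps for an alert.

    The alert argument is a hypothetical dict (for example
    {"route": "/checkout", "metric": "p99_latency"}); this sketch does not
    inspect it, because the order is the same for every latency page.
    """
    return list(TRIAGE_STEPS)


def rollback_allowed(completed_steps: list) -> bool:
    """Hypothetical guard: a rollback is only on the table once a
    hypothesis has been formed."""
    return "form_hypothesis" in completed_steps
```

With only the dashboard confirmed, rollback_allowed(["confirm_in_dashboard"]) is False, which is exactly the situation the interviewer is watching for. Once the first four steps are done, the guard returns True and the rollback becomes a deliberate decision instead of a reflex.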