Community Question Bundle
SRE Incident Drill Questions From Our On-Call
Four incident-shaped questions our on-call rotation uses to interview SREs. Each one starts with a symptom and asks you to write the smallest diagnostic snippet; the discussion matters more than the code.
By @alexsaeed
March 30, 2026 · Updated May 20, 2026
Pager fires at 03:14: "p99 latency on /checkout doubled in the last 5 minutes." Walk me through your first 5 minutes. Include the one-liner you would actually run, and the wrong move I am watching for.
On the pager
On the pager at 03:14, the sequence I'd run is first_five_minutes(alert), which returns ['confirm_in_dashboard', 'check_recent_deploys', 'check_dependency_health', 'form_hypothesis', 'decide_action']. The wrong move I'm watching for is running kubectl rollout undo before a hypothesis is formed: if the upstream issue resolves on its own at the same moment, you cannot tell whether the rollback fixed it or the issue cleared by itself.
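The answer names first_five_minutes(alert) without showing it. A minimal sketch follows of how that checklist might be encoded so that the wrong move (acting before a hypothesis exists) is rejected. The Triage class, its complete() method, and the deployment name in the comment are assumptions for illustration, not part of the original answer.

```python
from dataclasses import dataclass, field

# Ordered steps, exactly as listed in the answer.
STEPS = [
    "confirm_in_dashboard",     # is the p99 jump real, and on which routes?
    "check_recent_deploys",     # e.g. kubectl rollout history deployment/checkout (name assumed)
    "check_dependency_health",
    "form_hypothesis",
    "decide_action",
]


def first_five_minutes(alert):
    """Return the ordered triage steps; the alert does not change them in this sketch."""
    return list(STEPS)


@dataclass
class Triage:
    """Hypothetical helper that enforces step order during an incident."""
    alert: str
    done: list = field(default_factory=list)
    hypothesis: str = ""

    def complete(self, step, note=""):
        if len(self.done) >= len(STEPS):
            raise RuntimeError("all steps already completed")
        expected = STEPS[len(self.done)]
        if step != expected:
            raise RuntimeError(f"{step!r} attempted before {expected!r}")
        if step == "form_hypothesis":
            if not note:
                raise ValueError("write the hypothesis down before acting")
            self.hypothesis = note
        self.done.append(step)
```

With this guard, calling complete("decide_action") first raises an error, which mirrors the interview point: a rollback is only informative once there is a stated hypothesis to test it against.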
We are hitting connection-pool exhaustion on the Postgres primary. Write a tiny script that lists current connections grouped by application name and state, and tell me what the values mean.
On the pager
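On the pager during pool exhaustion, list_connections(conn) on a healthy primary returns rows like { application_name: 'orders-api', state: 'active', n: 12, oldest_secs: 0.4 }, where n is the number of connections in that state for that application and oldest_secs is how long the oldest of them has been in it. During the incident you'd see { state: 'idle in transaction', n: 47, oldest_secs: 312.5 }: 47 sessions holding a transaction open while doing no work, the oldest for more than five minutes. That usually points to a leaked ORM session, and those PIDs need terminating (pg_terminate_backend) while the code path is patched.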
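The answer shows the output of list_connections(conn) but not the function. A minimal sketch is below, assuming conn is any DB-API 2.0 connection to Postgres (for example one opened with psycopg) and that the server is PostgreSQL 10 or newer, since it filters on the backend_type column of pg_stat_activity. The leaked_sessions helper and its 60-second threshold are assumptions added for illustration.

```python
# Group client connections by application and state; oldest_secs is the time
# since the oldest connection in the group last changed state.
QUERY = """
SELECT application_name,
       state,
       count(*) AS n,
       max(extract(epoch FROM now() - state_change)) AS oldest_secs
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY application_name, state
ORDER BY n DESC
"""


def list_connections(conn):
    """Return one dict per (application_name, state) group, largest group first."""
    cur = conn.cursor()
    try:
        cur.execute(QUERY)
        rows = cur.fetchall()
    finally:
        cur.close()
    return [
        {
            "application_name": app,
            "state": state,
            "n": int(n),
            # extract() may come back as Decimal or float depending on server version
            "oldest_secs": float(oldest) if oldest is not None else None,
        }
        for app, state, n, oldest in rows
    ]


def leaked_sessions(rows, max_idle_secs=60.0):
    """Hypothetical filter: groups idle in transaction for longer than the threshold."""
    return [
        r for r in rows
        if r["state"] == "idle in transaction"
        and r["oldest_secs"] is not None
        and r["oldest_secs"] > max_idle_secs
    ]
```

The query is read-only, so it is safe to run on a saturated primary as long as one connection slot is still available to the operator.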
A service is OOM-killed every 30 minutes. The pod request is 512Mi, limit 1Gi, and the on-call dashboard shows steady RSS around 700Mi between kills. Walk me through the diagnosis, then write the snippet that would help confirm a memory leak.
On the pager
On the pager:
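The call used in this answer, start_leak_watch, is not defined in the thread. A minimal sketch of how it might be built on the standard-library tracemalloc module is below. The top_n and report parameters, the top_growth helper, and the output format are assumptions. tracemalloc only sees memory allocated through Python's allocators, so growth in native extensions would not show up here.

```python
import threading
import tracemalloc


def top_growth(before, after, top_n=5):
    """Return (size_diff_bytes, 'file:line') for the sites that grew the most."""
    stats = after.compare_to(before, "lineno")
    growing = [s for s in stats if s.size_diff > 0]
    growing.sort(key=lambda s: s.size_diff, reverse=True)
    return [
        (s.size_diff, f"{s.traceback[0].filename}:{s.traceback[0].lineno}")
        for s in growing[:top_n]
    ]


def start_leak_watch(interval_s=60, top_n=5, report=print):
    """Start a daemon thread that reports allocation growth every interval_s seconds."""
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    stop = threading.Event()

    def watch():
        previous = tracemalloc.take_snapshot()
        # Event.wait returns False on timeout, True once stop.set() is called.
        while not stop.wait(interval_s):
            current = tracemalloc.take_snapshot()
            for size_diff, site in top_growth(previous, current, top_n):
                report(f"+{size_diff / 2**20:.1f} MiB at {site}")
            previous = current

    threading.Thread(target=watch, name="leak-watch", daemon=True).start()
    return stop  # call stop.set() to end the watch
```

A site that appears in the report interval after interval is the leak candidate; a site that grows once and then disappears is more likely a cache warming up.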
start_leak_watch(interval_s=60)
# Prints allocation deltas at 1-minute snapshots, e.g.:
# +12 MiB at requests/sessions.py:312 (cache append)
# +8 MiB at logging/__init__.py:1100 (records retained)
# A steady ~700Mi RSS reading alongside OOM kills at the 1Gi limit suggests growth the dashboard misses
# (a climb or spike between scrapes, or memory counted against the limit that is not in RSS);
# tracemalloc surfaces the growing Python allocation sites so you can confirm or rule out a leak.
Tail latency on a downstream HTTP client is high because the connection pool is exhausted under load. Show me a small Python httpx-style client setup with a sensible pool, plus the metric you would emit to actually catch this next time.
On the pager
On the pager: at a sustained 500 RPS, a default httpx.Client() with no explicit Limits (defaults: max_connections=100, max_keepalive_connections=20) saturates the connection pool, so requests queue for a connection and p99 climbs above 5s. After switching to httpx.Limits(max_connections=200, max_keepalive_connections=50) with a 0.2s pool timeout (pool=0.2 in httpx.Timeout), p99 returns under SLA, and a WaitMetric reporting total_wait_ms / count (the mean pool wait) exposes pool pressure so on-call catches it earlier next time.
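The client setup and metric described above can be sketched as follows. The Limits and Timeout values come from the answer. The 5.0s default for the other timeouts, the PoolGate class, and the shape of WaitMetric are assumptions. The sketch measures wait at an application-level semaphore sized to the pool as an approximation of pool wait, instead of relying on any httpx internals. httpx is imported inside build_client so the metric code runs without the dependency installed.

```python
import threading
import time
from contextlib import contextmanager


class WaitMetric:
    """Running total of time spent waiting for a connection slot."""

    def __init__(self):
        self.total_wait_ms = 0.0
        self.count = 0
        self._lock = threading.Lock()

    def observe(self, wait_ms):
        with self._lock:
            self.total_wait_ms += wait_ms
            self.count += 1

    @property
    def mean_wait_ms(self):
        return self.total_wait_ms / self.count if self.count else 0.0


class PoolGate:
    """Semaphore sized to the pool; time spent acquiring it approximates pool wait."""

    def __init__(self, max_connections, metric):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._metric = metric

    @contextmanager
    def slot(self, timeout_s=0.2):
        start = time.monotonic()
        acquired = self._slots.acquire(timeout=timeout_s)
        self._metric.observe((time.monotonic() - start) * 1000.0)
        if not acquired:
            raise TimeoutError(f"no connection slot within {timeout_s}s")
        try:
            yield
        finally:
            self._slots.release()


def build_client(max_connections=200, max_keepalive=50, pool_timeout_s=0.2):
    """Build an httpx.Client with explicit pool limits and a short pool timeout."""
    import httpx  # third-party dependency

    return httpx.Client(
        limits=httpx.Limits(
            max_connections=max_connections,
            max_keepalive_connections=max_keepalive,
        ),
        # 5.0s for connect/read/write is an assumed default; pool wait fails fast.
        timeout=httpx.Timeout(5.0, pool=pool_timeout_s),
    )
```

In use, each request would run inside `with gate.slot():` around the client call, and mean_wait_ms would be exported on every scrape. A mean hides tail behavior, so feeding each observed wait into a real histogram is likely the better long-term choice.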
