Community Question Bundle
Incident Debrief Questions They Asked Me
A 4-question set drawn from the debrief portion of an SRE-flavored loop. Every behavioral prompt about an on-call story got paired with a design follow-up the interviewer used to stress-test the takeaway.
Incident Debrief Questions They Asked Me
A 4-question set drawn from the debrief portion of an SRE-flavored loop. Every behavioral prompt about an on-call story got paired with a design follow-up the interviewer used to stress-test the takeaway.
By @lilyadeyemi
December 18, 2025
·
Updated May 18, 2026
798 views
25
Rate
They opened with: "Walk me through the worst on-call page you owned." After the story they asked: "Now show me the retry-with-jitter you would have added to that downstream call." Trace the backoff I sketched.
Post-incident
I traced one call on the board:
attempts = [retry_delay(0), retry_delay(1), retry_delay(2), retry_delay(3)]
# attempts roughly = [random in (0, 0.1), (0, 0.2), (0, 0.4), (0, 0.8)] seconds
# uniform jitter spreads retries instead of stampeding the downstream at fixed intervals.The debrief question: "What did you change in the runbook after that incident?" Follow-up: "Show me the structured log line you added so the next on-call would not have to grep three services." Walk through the schema I drew.
Post-incident
I drew the log entry on the board:
log_event(trace_id="abc-123", service="orders", level="ERROR",
event="payment_decline", customer_id="c-42",
downstream="stripe", status_code=429)
# Single line, machine-parseable, contains enough to pivot in Datadog without opening three dashboards.Asked: "What detection signal would have caught the regression earlier?" Then: "Show me the alert rule you would have added on that metric." Walk through the rate-of-change check I drew.
Post-incident
I drew the check on the board:
breach = alert_if_rate_drops(success_rate_per_minute, window=5, threshold=0.2)
# If the rolling 5-minute success rate falls more than 20% relative to the previous 5 minutes, page.Final ask: "What would you ship before next on-call rotation?" Follow-up: "Sketch the circuit breaker you would put in front of that flaky dependency." Walk through the three-state machine I drew.
Post-incident
I drew the breaker state transitions:
breaker = CircuitBreaker(failure_threshold=5, recovery_seconds=30)
breaker.call(downstream) # closed -> failing requests trip after 5 -> open
breaker.call(downstream) # open -> raises immediately, no downstream hit
# 30s later: half-open -> one trial call, success closes the breaker.