Interview Experience

DevOps / SRE Interview: The Production Postmortem Round

An SRE loop at a Series D infra company, anchored on the round where they handed me a real-feeling postmortem and asked me to find what was missing.

By @arjunrivera

April 28, 2026

Updated May 18, 2026

I had been an SRE at a 200-engineer company for three years before I tried to move to a Series D infrastructure company (about 600 engineers, public-cloud-adjacent product, well-known in the SRE world for their internal tooling). The loop was four rounds. Two of them were shaped exactly like the SRE rounds I had done at other companies, and one was a postmortem round that I had not seen before. The postmortem round was the one that decided the loop. I want to walk through it.

The four rounds, and the one that decided it

  • Round 1: 60 min coding round (a queue-rate-limiter shaped problem, write the code)
  • Round 2: 60 min systems debugging round (here is a small system, here is a symptom, find the cause)
  • Round 3: 60 min postmortem round (here is a real-feeling postmortem document, find what is missing or wrong)
  • Round 4: 45 min behavioral with the engineering manager

The coding round was straightforward: implement a token-bucket rate limiter, then extend it to support per-tenant limits with a fallback global limit. Standard. The debugging round was also familiar: the interviewer described a service that was returning the wrong cache header on a small percentage of responses, and I had to walk through the debugging tree (start with the load balancer, check upstream caches, check the service code, check the response middleware), eventually narrowing it to a middleware ordering bug. Both rounds were normal. I am skipping them because the postmortem round is the one worth reading.
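
For readers who have not seen this problem, here is a minimal sketch of a token-bucket limiter with per-tenant limits and a global fallback. It is not the code from the interview. The class names, the injectable `clock`, and the choice that tenants without an explicit limit share one global bucket are all assumptions made for illustration; "fallback global limit" could reasonably be read other ways.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Per-tenant buckets. Assumption: tenants with no explicit limit
    share a single global bucket as the fallback."""

    def __init__(self, global_rate: float, global_capacity: float,
                 tenant_limits: Optional[Dict[str, Tuple[float, float]]] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.global_bucket = TokenBucket(global_rate, global_capacity, clock)
        self.tenant_buckets = {
            tenant: TokenBucket(rate, capacity, clock)
            for tenant, (rate, capacity) in (tenant_limits or {}).items()
        }

    def allow(self, tenant: str) -> bool:
        bucket = self.tenant_buckets.get(tenant, self.global_bucket)
        return bucket.allow()
```

Passing the clock in as a parameter keeps the limiter testable without sleeping. This sketch is not thread-safe; a real service would need a lock around `allow` or an atomic store.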

What the postmortem round was

60 minutes. The interviewer joined the call, shared a Google Doc that was about 3 pages long, and said: "this is a real postmortem from one of our internal teams, anonymized. Read it. Then tell me what is wrong with it. Take 15 minutes to read; I will be quiet". The doc was a real-feeling postmortem with all the standard sections.

I read it twice. Here is what was in it, summarized:

incident: cache-tier saturation, customer-visible 504s
start:    2025-02-14, 03:42 UTC
detected: 2025-02-14, 03:51 UTC, alarm on 5xx rate
resolved: 2025-02-14, 04:34 UTC, after manual cache flush + capacity bump
impact:   ~3.4% of read traffic, 52 minutes
sections in the doc:
  - timeline (decent, with timestamps)
  - root cause (a misconfigured cache eviction policy)
  - what went well (the on-call detected fast, the runbook had the right command)
  - what went poorly (the cache config change had no review)
  - action items (3 items: add review for cache configs, alarm on cache hit rate,
                 add a cache capacity dashboard)

The interviewer let me read it twice and then asked: "what is wrong with this postmortem?"

What I said, and what was actually missing

I started with the obvious things. The action items were vague: "add review for cache configs" did not specify who, by when, or what "review" meant. The detection time of 9 minutes was actually slow given the customer impact, and the postmortem did not surface that. The blast radius (3.4% of read traffic) felt small but they did not break it down by customer tier; if 3.4% included the largest customer, that was a different incident.
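
The detection figure falls straight out of the timestamps in the summary. As a small sketch (the helper names are hypothetical; the three timestamps are the ones quoted from the doc):

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(a: datetime, b: datetime) -> int:
    """Whole minutes elapsed from a to b."""
    return int((b - a).total_seconds() // 60)


start = parse_utc("2025-02-14 03:42")
detected = parse_utc("2025-02-14 03:51")
resolved = parse_utc("2025-02-14 04:34")

time_to_detect = minutes_between(start, detected)       # 9 minutes
time_to_mitigate = minutes_between(detected, resolved)  # 43 minutes
total_impact = minutes_between(start, resolved)         # 52 minutes

# Share of the customer-visible window that passed before anyone was paged.
undetected_share = time_to_detect / total_impact

print(time_to_detect, time_to_mitigate, total_impact, round(undetected_share, 2))
```

Roughly a sixth of the 52-minute window passed before detection. A number like that makes the "detection was slow" point concrete in a critique.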

The interviewer let me list those, then said "keep going".

I thought for a minute and surfaced the bigger ones. The root cause as written (misconfigured cache eviction policy) was a proximate cause, not a root cause, and the doc said nothing about contributing factors. Why was the change unreviewed? Why did the alarm threshold for the cache subsystem not fire before the customer-visible alarm did? What other config changes had landed in the same window? The postmortem treated the misconfig as the entire story, which is the most common postmortem failure mode I have seen in real life.

The interviewer pushed harder: "now tell me what is missing about the impact". This is where I think I actually earned the round. The impact section had numbers but no narrative. There was no description of what a customer experienced ("a customer trying to read their dashboard at 03:50 UTC saw a 504 and a partially-rendered page"), no mention of whether downstream systems amplified the effect (queues filling up, retries adding load), and no explicit statement of whether any customer data was lost or delayed permanently. I have read postmortems where the engineering team thought a 504 was a 504, and the customer-facing team had to retrofit the customer experience three days later.
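
The "retries adding load" question can be put in rough numbers. The sketch below assumes each attempt fails independently with the same probability and that clients retry immediately up to a fixed cap. Both are simplifications, and the probabilities used are hypothetical, not figures from the postmortem.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected attempts per request when each attempt fails independently
    with probability `failure_prob` and the client retries up to
    `max_retries` times. Attempt k+1 happens only if the first k attempts
    all failed, so the expectation is the sum of p**k for k in 0..max_retries."""
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_prob ** k for k in range(max_retries + 1))


# Hypothetical scenarios: a healthy path, a half-failing path, a fully failing path.
for p in (0.0, 0.5, 1.0):
    print(p, expected_attempts(p, max_retries=3))
```

Under these assumptions, a path that fails every time with three retries sends four times the original load to the struggling tier. That is the kind of amplification an impact section should either rule out or quantify.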

The follow-up that was the actual round

With 15 minutes left, the interviewer pivoted. "Now imagine you are joining the team that wrote this. Your first week. The senior engineer who wrote this postmortem is the one you sit next to. How do you raise the things you just said without telling him his postmortem is bad?"

This is what the round was actually grading. SRE work at the senior level is not about finding the bug. It is about getting a team to internalize the postmortem culture you carry. The interviewer wanted to see whether I could land hard feedback in a way that the recipient would act on.

I thought out loud for a minute. My answer was:

  • I would not raise it as a list of things wrong with the postmortem
  • I would offer to co-author the next one with him, sit through his draft, and ask the questions I had asked here as questions, not assertions ("how would we know if downstream systems amplified the effect; do we have a metric for it")
  • I would ask the engineering manager whether the team had a postmortem template, and if not, propose one as a low-stakes artifact I could bring from my last team
  • I would not raise the action-item vagueness directly; I would propose adding a "by when, by whom" column to the action item table for the next postmortem and let the team feel the difference
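
The "by when, by whom" column can be sketched as a small data structure. The field names and the `missing_fields` helper are hypothetical. The three descriptions are the action items from the postmortem summary, which carried no owner or date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # "by whom"
    due: Optional[date] = None   # "by when"


def missing_fields(item: ActionItem) -> List[str]:
    """Return the accountability fields an action item has not filled in."""
    missing = []
    if not item.owner:
        missing.append("owner")
    if item.due is None:
        missing.append("due")
    return missing


items = [
    ActionItem("add review for cache configs"),
    ActionItem("alarm on cache hit rate"),
    ActionItem("add a cache capacity dashboard"),
]

for item in items:
    print(f"{item.description}: missing {', '.join(missing_fields(item)) or 'nothing'}")
```

With the fields in the template, an empty cell becomes visible without anyone having to call the item vague.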

The interviewer asked one follow-up: "what if the senior engineer pushes back?" I said the pushback is the round; if he pushed back I would ask him to walk me through why his current shape was working, and I would either learn something I had missed or I would have a real concrete disagreement to take to the manager. Either is fine.

The behavioral round and offer

The behavioral round was 45 minutes with the engineering manager. He had read a one-line summary of each technical round before the call. He spent 35 minutes asking about a real on-call rotation I had restructured at my previous job. The story was specific: the rotation went from 14 days on-call every quarter to 7 days every six weeks, and the self-reported on-call burnout score in the team's quarterly survey dropped from 6.4 to 2.9, a calibrated number rather than a vague "we improved on-call". He pushed back gently: "did you measure the engineers who left during that period, or only the ones who stayed?" That question was the round inside the behavioral, and my honest answer was that the restructure had not yet gone through its first attrition cycle. I told him that, and he nodded and wrote it down. He also pushed on the parts that had not gone well, of which there were several, and I told him about them honestly. The offer came two days later.

What I would change before the next postmortem round

I took the offer and joined. Two things I would do differently. The first is that I should have started the postmortem round with the impact section, not the action items. The action items are the easiest thing to critique because they are visibly weak; the impact section is where the deeper failure mode usually lives. Reordering my critique would have surfaced the bigger problems faster and given the interviewer more time on the follow-up.

The second is that I should have asked, before I started reading, whether the postmortem had been written by an engineer or by the manager. Postmortems written by managers and postmortems written by engineers fail in different ways, and the answer would have changed which questions I prioritized.