System Design Interview at Stripe

A senior backend system design round at Stripe centered on idempotent webhooks, the failure mode I missed, and how the interviewer pushed me from a clean diagram to a defensible one.

By @mianair · February 8, 2026 · Updated May 18, 2026

The Stripe system design round I sat through was 60 minutes, anchored on a single prompt: design a webhook delivery system. I had been warned that Stripe loops over-index on payments-shaped reliability questions, and this round was the cleanest example of why. The diagram I produced in the first 20 minutes was fine. The diagram I produced in the last 20 minutes, after the interviewer pushed back twice, was the one that actually got me to the next round.

I received the offer about two weeks later and signed it. I am writing this so the next person who walks into a Stripe design round understands what the interviewer is actually grading, which is not the topology.

How a payments-shaped design round differs from a generic one

The prompt was deliberately under-specified. Paraphrased: "Design a system that lets us deliver event notifications to merchant endpoints over HTTPS, with delivery guarantees we can defend." The phrase "guarantees we can defend" was the tell. In the loops I have done, FAANG-style design rounds tend to grade on whether you cover the standard four boxes (load balancer, queue, worker pool, store). The Stripe round graded on whether I could state the failure modes I was choosing to live with, by name, with the merchant impact spelled out.

I started with the obvious skeleton:

First pass (20 min in)
  event source -> queue -> worker pool -> merchant endpoint
                              |
                              -> retry queue with exponential backoff
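
The retry leg of that skeleton is the only part with any arithmetic in it. A minimal sketch of capped exponential backoff with full jitter follows. It is an illustration rather than something I wrote in the round, and the function name `backoff_delay`, the 1-second base, and the 1-hour cap are all assumptions.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 1.0,      # assumed base delay in seconds
    cap: float = 3600.0,    # assumed upper bound in seconds
    rng: Optional[random.Random] = None,
) -> float:
    """Return the seconds to wait before retry number `attempt` (1-based).

    The ceiling doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    seeded = random.Random(42)
    for n in range(1, 6):
        print(n, round(backoff_delay(n, rng=seeded), 2))
```

Jitter keeps a fleet of workers from retrying a recovering merchant in lockstep, and the cap bounds how long a single event can sit between attempts.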

The interviewer let me run for about eight minutes on this and then asked the question I was not prepared for: "What does the merchant see when our worker crashes after the HTTPS request goes out but before we record the ack?"

The failure mode I missed

I had implicitly assumed at-least-once delivery and an idempotency key on the merchant side. That is the textbook answer. The interviewer pushed me past textbook by asking what the merchant sees in the specific failure window where the request lands at their server, their handler runs (charging a card, sending an email, mutating their DB), and our worker dies before persisting the ack. On the next retry, the merchant gets the same event again and, if their idempotency layer is anything less than rigorous, the side effect runs twice.
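
That failure window is easy to reproduce in a toy model. The sketch below is illustrative only: `Merchant`, `WorkerCrash`, `deliver_once`, and `run` are hypothetical names, the "crash" is a raised exception standing in for a dead process, and the merchant's side effect is just a counter.

```python
class WorkerCrash(Exception):
    """Stands in for the worker process dying mid-delivery."""


class Merchant:
    """Toy merchant endpoint; `dedupe` toggles an idempotency check."""

    def __init__(self, dedupe: bool) -> None:
        self.dedupe = dedupe
        self.seen = set()
        self.side_effects = 0

    def handle(self, event_id: str) -> int:
        if self.dedupe and event_id in self.seen:
            return 200  # replay acknowledged, no second side effect
        self.seen.add(event_id)
        self.side_effects += 1  # e.g. charge a card, send an email
        return 200


def deliver_once(merchant: Merchant, event_id: str, acked: set,
                 crash_before_ack: bool) -> None:
    merchant.handle(event_id)        # request lands, handler runs
    if crash_before_ack:
        raise WorkerCrash(event_id)  # the ack is never persisted
    acked.add(event_id)


def run(merchant: Merchant) -> int:
    """Deliver one event, crash before the ack, then retry once."""
    acked = set()
    try:
        deliver_once(merchant, "evt_1", acked, crash_before_ack=True)
    except WorkerCrash:
        pass
    if "evt_1" not in acked:         # the platform has no record of success
        deliver_once(merchant, "evt_1", acked, crash_before_ack=False)
    return merchant.side_effects


if __name__ == "__main__":
    print("without dedupe:", run(Merchant(dedupe=False)))
    print("with dedupe:   ", run(Merchant(dedupe=True)))
```

From the platform's side the two runs are indistinguishable: both look like one failed attempt followed by one successful retry. Only the merchant's own bookkeeping decides whether the side effect happens once or twice.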

My first instinct was to wave at the merchant: "They should be idempotent." The interviewer's response, paraphrased: "That is true, and it is also the response that gets us paged at 2am when a real merchant has a real outage. What can our system do?"

This was the moment the round turned. The expected answer was not a single fix. It was a layered set of mitigations, each with a cost stated:

Layered mitigations (revised diagram, 35 min in)
  1. Persistent attempt log written BEFORE the HTTPS call       cost: extra write per attempt
  2. Per-event idempotency key surfaced in the request header   cost: protocol contract on merchants
  3. Bounded retry budget with explicit dead-letter             cost: events can permanently fail
  4. Webhook signature includes a delivery attempt counter      cost: merchants must accept replays
  5. Public dashboard so merchants can see attempted vs acked   cost: ops surface to maintain

Each line of that list was a 4-5 minute sub-discussion. The interviewer was not looking for me to invent these. They were looking for me to acknowledge that any honest answer involved an explicit set of tradeoffs the platform owner has to defend in writing.
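
Mitigations 2 and 4 can be sketched together. The following is an assumed layout, not Stripe's actual signature scheme: the signed message format, the `sign` and `verify` names, and the choice of HMAC-SHA256 are picked for illustration. The point is only that binding the attempt counter into the signature lets a merchant trust the counter it reads from the header.

```python
import hashlib
import hmac


def sign(secret: bytes, event_id: str, attempt: int, body: bytes) -> str:
    """HMAC-SHA256 over event id, attempt counter, and body.

    The "<event_id>.<attempt>.<body>" layout is hypothetical.
    """
    message = f"{event_id}.{attempt}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, event_id: str, attempt: int, body: bytes,
           signature: str) -> bool:
    """Merchant-side check; compare_digest avoids timing leaks."""
    expected = sign(secret, event_id, attempt, body)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    secret = b"whsec_example"  # placeholder shared secret
    body = b'{"type": "payment.succeeded"}'
    sig = sign(secret, "evt_1", 2, body)
    print(verify(secret, "evt_1", 2, body, sig))  # counter matches the signature
    print(verify(secret, "evt_1", 1, body, sig))  # counter was altered
```

A merchant that sees a verified attempt counter greater than 1 knows it is looking at a replay and can route it through its idempotency check keyed on the event id.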

The artifact that closed the round

With about 12 minutes left, the interviewer asked for a sketch of the worker's persistence write path. I drew this, in pseudocode, on the whiteboard:

def deliver(event_id: str, attempt: int) -> None:
    # Persist the attempt BEFORE the network call so a worker crash
    # cannot lose the fact that we tried.
    log_attempt_started(event_id, attempt)
    try:
        response = http_post(
            url=merchant_endpoint(event_id),
            headers={
                "Stripe-Signature": sign(event_id, attempt),
                "Stripe-Delivery-Attempt": str(attempt),
            },
            body=event_payload(event_id),
            timeout=10.0,
        )
    except (Timeout, NetworkError):
        log_attempt_failed(event_id, attempt, reason="network")
        schedule_retry(event_id, attempt + 1)
        return

    # The crash window we cannot eliminate sits between http_post returning
    # above and the write below. We document it, alarm on it, and surface
    # it to the merchant.
    log_attempt_finished(event_id, attempt, response.status)

The comment about the crash window mattered more than the code. I said it out loud while writing it: "There is a window between the response landing and the persistence write where a process crash will cause a duplicate on retry. We cannot remove this window without two-phase commit on the merchant side, which they will not do. We make it observable instead."

The interviewer wrote that down. After the round, the recruiter relayed that the design panel had specifically called out "named the unfixable window and instrumented around it" as the moment the round turned positive.
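
Making the window observable can be as small as a periodic sweep over the attempt log. The sketch below assumes a hypothetical `Attempt` record with a start time and an optional finish time, and a grace period longer than the 10-second request timeout. None of these names or values come from the round.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    event_id: str
    attempt: int
    started_at: float                    # written before the HTTPS call
    finished_at: Optional[float] = None  # written by the finished/failed log


def orphaned_attempts(log: List[Attempt], now: float,
                      grace_seconds: float = 30.0) -> List[Attempt]:
    """Attempts that started but recorded no outcome within the grace period.

    Each one is a delivery whose result is unknown: the merchant may or
    may not have processed it, so a retry may be a duplicate.
    """
    return [
        a for a in log
        if a.finished_at is None and now - a.started_at > grace_seconds
    ]


if __name__ == "__main__":
    log = [
        Attempt("evt_1", 1, started_at=0.0, finished_at=0.4),  # completed
        Attempt("evt_2", 1, started_at=0.0),                   # worker died
        Attempt("evt_3", 1, started_at=95.0),                  # still in flight
    ]
    for a in orphaned_attempts(log, now=100.0):
        print("outcome unknown:", a.event_id, "attempt", a.attempt)
```

The count of orphaned attempts is the number to alarm on, and the same records are what a merchant-facing dashboard would show as attempted but not acked.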

The unfixable window is the round

The first 20 minutes of a Stripe design round will feel like a normal load-balancer-and-queue exercise. The signal you are being graded on starts when the interviewer asks about a specific failure window and you have to choose between three uncomfortable answers (live with it, push the cost to the merchant, or carry the cost on the platform). Have a position. Defend it with the cost stated. The interviewer is not testing the topology. They are testing whether you have ever owned the pager on a system shaped like this.