Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

system-design

interview-prep

distributed-systems

monitoring

reliability

By @chloesaeed

April 30, 2026

Updated May 20, 2026

The Datadog senior backend onsite I did last year had five rounds. Four of them were system design. The fifth was a coding round that, in retrospect, was also design-flavored. That ratio is not a quirk of my loop. It tracks the team's day-to-day work, which is mostly about ingesting, indexing, querying, and aggregating telemetry at a scale where the design choices dominate the implementation choices. If you walk into a Datadog senior loop expecting a FAANG-shaped four-coding-plus-one-design ratio, you will be surprised in the wrong direction.

I signed an offer. I am writing this down because the design-heavy shape of the loop changes how you should prep, and the standard system design books are aimed at one design round, not four in a row.

Five Rounds, Four of Them Design

Loop sequence (one virtual onsite day, five hours total)
  R1   System design: ingest pipeline (60 min)
  R2   System design: query path (60 min)
  R3   Coding (45 min)
  R4   System design: storage and retention (60 min)
  R5   System design plus behavioral hybrid (75 min)

There was a fifteen-minute break between rounds and a thirty-minute lunch in the middle. By the third design round my whiteboarding hand was tired, which I had not seen coming and which you can prep for.

R1: Ingest Pipeline

Design a service that ingests a high-volume metrics stream from many agents, batches the writes, and forwards to a downstream store. The interviewer wanted me to anchor in the real failure modes early, so I did: backpressure, agent disconnects, partial batches, the long tail of bad data.

I sketched a pipeline with three stages: an edge ingestor, a per-shard aggregator, and a writer to the downstream store. The interviewer pushed on backpressure for about fifteen minutes. The shape of the discussion was: what does the ingestor do when the aggregator is slow, and what does the agent do when the ingestor is slow.

I had a real answer for both, anchored on a specific tradeoff. For ingestor-to-aggregator, I argued for a bounded queue with drop-oldest semantics, on the grounds that for metrics, freshness matters more than completeness in degraded mode. For agent-to-ingestor, I argued for client-side buffering with a cap, on the grounds that you want the agent to absorb a short outage but not run the host out of memory during a long one. The interviewer agreed with the first, pushed on the second, and we landed on a hybrid: client buffering plus a circuit breaker that drops to sampling when the buffer is over half full.
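The two policies in that paragraph can be sketched in a few lines. This is a minimal illustration, not what was drawn in the interview: the class names, the `sample_rate` value, and the "strictly more than half full" reading of the threshold are all assumptions.

```python
import random
from collections import deque


class DropOldestQueue:
    """Bounded ingestor-to-aggregator queue that evicts the oldest item when full."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)  # a full deque drops from the left on append
        self.dropped = 0

    def put(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the append below evicts the oldest entry
        self._items.append(item)

    def get(self):
        return self._items.popleft() if self._items else None


class AgentBuffer:
    """Capped client-side buffer that falls back to sampling above half full."""

    def __init__(self, cap: int, sample_rate: float = 0.1, rng=None):
        self.cap = cap
        self.sample_rate = sample_rate  # hypothetical value, not from the interview
        self._rng = rng or random.Random()
        self._items = []

    def offer(self, point) -> bool:
        """Return True if the point was buffered, False if it was shed."""
        if len(self._items) >= self.cap:
            return False  # hard cap: never run the host out of memory
        over_half = len(self._items) > self.cap // 2
        if over_half and self._rng.random() >= self.sample_rate:
            return False  # degraded mode: keep only a sample of new points
        self._items.append(point)
        return True

    def __len__(self) -> int:
        return len(self._items)
```

The queue favors freshness (new points always get in), while the agent buffer favors bounded memory (it sheds new points once it is under pressure). A real agent would also need a drain path and some way to report how much it shed.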

The one thing this round taught me is that for telemetry-shaped problems, the right answer almost always lives in the tradeoff between freshness, completeness, and cost. Naming those three out loud at the start of the round saved me time later.

R2: Query Path

Design the read side: time-series queries over a large index, with tag filtering and aggregation. The interviewer pushed on the tag-cardinality problem in the first ten minutes, which I had read about in public engineering posts but had not designed for from scratch.

The shape of my answer:

Read path layout
  Query parser  ->  query planner  ->  shard fanout  ->  per-shard executor  ->  merger

I used a columnar index per metric and a separate inverted index for tags. The interviewer asked what happens when a single tag has tens of millions of distinct values. I had a real answer (cardinality-aware planning, with a fallback to full-scan plus sampling for tags above a threshold), but the answer was thinner than I wanted because I had not pre-computed the threshold.
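As a rough sketch of what cardinality-aware planning could look like: the planner checks each tag's distinct-value count and chooses between the inverted index and a sampled scan. The threshold value, the strategy names, and the function name below are hypothetical, and treating unknown cardinality as high is an assumption made for safety.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the point of the round was to arrive with a real number.
HIGH_CARDINALITY_THRESHOLD = 1_000_000


@dataclass(frozen=True)
class TagFilterPlan:
    tag: str
    strategy: str  # "inverted_index" or "sampled_scan"


def plan_tag_filters(tags, cardinality, threshold=HIGH_CARDINALITY_THRESHOLD):
    """Choose a lookup strategy per tag, with index lookups ordered first."""
    plans = []
    for tag in tags:
        # Unknown cardinality is treated as high (assumption: fail toward the scan).
        distinct = cardinality.get(tag, float("inf"))
        strategy = "sampled_scan" if distinct > threshold else "inverted_index"
        plans.append(TagFilterPlan(tag, strategy))
    # Stable sort: cheap index lookups narrow the candidate set before any scan.
    return sorted(plans, key=lambda p: p.strategy != "inverted_index")


plans = plan_tag_filters(
    ["request_id", "host"],
    {"host": 50_000, "request_id": 40_000_000},
)
```

Here `host` is served from the inverted index and `request_id` falls back to the sampled scan. The interesting design question is where the cutoff sits, which depends on index memory cost versus scan cost for the actual workload.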

This round was on bar but not strong. The recruiter later told me it was the lowest of the four design rounds.

R3: Coding

Forty-five minutes, one problem with a follow-up. The base problem looked algorithmic but was really about choosing the right data structure for streaming aggregation. I picked a count-min sketch in the second minute, the interviewer nodded, and the rest of the round was about correctness and the parameter trade-off (width and depth versus memory).

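class CountMinSketch:
    """Approximate frequency counter; with non-negative counts it never undercounts."""

    def __init__(self, width: int, depth: int, hashes):
        if len(hashes) != depth:
            raise ValueError("need exactly one hash function per row")
        self.width = width
        self.depth = depth
        # One row of `width` counters per hash function.
        self.table = [[0] * width for _ in range(depth)]
        self.hashes = hashes  # depth-many independent hash functions

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter per row, at the column that row's hash selects.
        for row, h in enumerate(self.hashes):
            self.table[row][h(key) % self.width] += count

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the smallest one is the tightest bound.
        return min(self.table[row][h(key) % self.width] for row, h in enumerate(self.hashes))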

The follow-up was about how you would tune width and depth for a target error rate. I gave the standard analytical answer (width controls accuracy, depth controls confidence) with the math sketched. This round was clean.
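For reference, the standard sizing rule is short enough to write down. With total inserted count N, choosing width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)) bounds the overestimate by epsilon * N with probability at least 1 - delta. The function name below is made up for illustration.

```python
import math


def cms_dimensions(epsilon: float, delta: float) -> tuple[int, int]:
    """Return (width, depth) for error <= epsilon * N with probability >= 1 - delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must be in (0, 1)")
    width = math.ceil(math.e / epsilon)       # accuracy: more columns, fewer collisions
    depth = math.ceil(math.log(1 / delta))    # confidence: more rows, fewer unlucky rows
    return width, depth


width, depth = cms_dimensions(epsilon=0.001, delta=0.01)
counters = width * depth  # memory grows as width * depth
```

For a 0.1% error bound at 99% confidence this gives 2719 columns and 5 rows, or 13,595 counters. Halving epsilon doubles the memory, while tightening delta only adds rows logarithmically.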

R4: Storage and Retention

Design the storage layer for telemetry data with mixed retention tiers. The interviewer set the constraint up front: hot data must be query-fast, cold data must be cheap, and the migration from hot to cold must not require a downtime window.

I sketched a tiered storage layout with a hot tier on a fast columnar store, a warm tier on object storage with a cached metadata index, and a cold tier on object storage with no index at all (queries against cold data run a full scan over a date range, which is slow on purpose).

The interviewer drilled on the migration. The thing they wanted me to talk about was how you handle a query that crosses the tier boundary. I worked through it on the whiteboard: split the query at the tier boundary, run each half against its own tier, merge the results. The detail they pushed on was what happens when a query crosses the boundary while a migration is in progress. I gave a snapshot-isolation-style answer (read the metadata at query start, route based on that snapshot, ignore in-flight migrations) and the interviewer accepted it.
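A minimal sketch of the split-and-merge idea, simplified to two tiers with a single boundary timestamp. The `TierSnapshot` type, its `hot_start` field, and the executor callables are hypothetical names, and time ranges are assumed to be half-open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSnapshot:
    """Immutable view of the tier boundary, read once at query start."""
    hot_start: int  # timestamps >= hot_start live in the hot tier (assumed layout)


def split_at_boundary(start: int, end: int, snap: TierSnapshot):
    """Split the half-open range [start, end) into per-tier sub-ranges."""
    if start >= end:
        return []
    parts = []
    if start < snap.hot_start:
        parts.append(("cold", start, min(end, snap.hot_start)))
    if end > snap.hot_start:
        parts.append(("hot", max(start, snap.hot_start), end))
    return parts


def run_query(start: int, end: int, snap: TierSnapshot, executors):
    """Route each sub-range using the snapshot, then merge rows in time order."""
    rows = []
    for tier, lo, hi in split_at_boundary(start, end, snap):
        rows.extend(executors[tier](lo, hi))
    return sorted(rows)


# Toy executors that return one timestamp every 25 units.
executors = {
    "cold": lambda lo, hi: list(range(lo, hi, 25)),
    "hot": lambda lo, hi: list(range(lo, hi, 25)),
}
snap = TierSnapshot(hot_start=100)
merged = run_query(50, 150, snap, executors)
```

Because the snapshot is frozen, a migration that moves the boundary mid-query does not change how that query is routed. The unstated requirement is that migrated data stays readable in its old tier until every query holding an older snapshot has finished.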

This round was the strongest of the design rounds. The recruiter confirmed.

R5: System Design Plus Behavioral

Seventy-five minutes with the hiring manager. The first half was a smaller design problem (an alerting pipeline) and the second half was behavioral. The behavioral prompts were standard: a conflict, a project I would do differently, a recent technical decision. I had stories for all three.

The one thing that made this round work was that the design half and the behavioral half were not really separated. The hiring manager would ask a behavioral question, I would answer with a story, and they would pull a design question out of the story. ("You said the schema migration took six months. Walk me through what the schema actually looked like.") The smooth transition between modes was, I now believe, the point of the round.

Surviving Five Hours of Design

The non-obvious thing the loop tested was endurance. By round four my whiteboard hand was tired and my voice was hoarse. The things that helped, in retrospect:

I had practiced two design problems back-to-back in mocks the week before. I had not practiced four. I should have.
I drank water between every round, deliberately.
I had cards with my standard design framework written on them, by my hand, so I did not have to summon the structure cold each round.
I had pre-decided on a single design vocabulary (load balancer, work queue, sharded store, etc.) so I was not switching between near-synonyms in different rounds.

Three Things I Would Do Differently for a Datadog-Shaped Loop

In order of impact:

Practice the cardinality and retention questions specifically. These are core to the team's domain and they came up in two rounds. Generic system design prep does not cover them.
Pre-compute thresholds. When you say "we switch strategy when the tag cardinality crosses X," have a number for X. "Some threshold" is a senior-versus-staff tell.
Do a four-design mock. Fatigue is real and the standard one-design mock does not surface it.

The loop ended in an offer. The single most useful thing I learned was that for design-heavy companies the bar is not raw design talent: it is having enough fluency in the company's actual domain that you can show up to the fourth round of the day with the same crisp framing you had in the first one.
