Interview Experience

ML Engineer Onsite: The Whiteboard Math Round

An ML onsite at a Series D recommendation-systems company, anchored on the math round where I had to derive a logistic regression gradient on a whiteboard.

machine-learning

math

ml-system-design

interview-prep

whiteboard-interview

By @priyasharma

March 27, 2026

Updated May 20, 2026

I had been an ML engineer for four years at a small adtech company before I took an onsite at a Series D recommendation-systems company (about 400 people, ML-heavy product, headquartered in San Francisco, not a FAANG, well-known in the ML world). The loop was 5 rounds in one day, all onsite, and the third round was the math round. I had been told it would be there. I underestimated how much of an actual whiteboard round it was, and I want to write down what happened, what they were grading, and how I would prep for it now.

The five-round shape

Five 60-minute rounds with a 30-minute lunch:

Round 1, ML coding: implement k-means from scratch, no libraries
Round 2, ML system design: design a recommendation pipeline for a marketplace, end to end
Round 3, the math round: derive a model objective on the whiteboard, solve a probability problem, and answer a follow-up that turned out to be the actual round
Round 4, applied ML: a problem-framing round, take a vague business goal and translate it to ML problem statements
Round 5, behavioral with the hiring manager

I am going to focus on round 3 because it was the round I had not properly prepped for and the round that other ML candidates I have talked to since have also been surprised by.

The math round, minute by minute

The interviewer was a senior research engineer, late 30s, the kind of person whose office had three textbooks visible on the wall. He started with: "derive the gradient of the logistic regression loss with respect to the weights, on the whiteboard, no calculator, no laptop".

I froze for about 8 seconds. I knew the gradient, I had implemented it dozens of times, but I had not actually sat down and derived it from first principles in maybe two years. I started slowly.

I wrote the model out:

logistic regression
  prediction:    y_hat = sigmoid(w . x + b)
  sigmoid:       sigma(z) = 1 / (1 + exp(-z))
  loss per ex:   L = -[ y log(y_hat) + (1 - y) log(1 - y_hat) ]

Then I derived the gradient piece by piece, narrating. The trick I needed to remember was the sigmoid derivative identity: d/dz sigma(z) = sigma(z) * (1 - sigma(z)). With that, the chain rule collapses into a clean form:

dL/dw_j
  = dL/dy_hat * dy_hat/dz * dz/dw_j
  = -(y/y_hat - (1-y)/(1-y_hat)) * y_hat (1 - y_hat) * x_j
  = (y_hat - y) * x_j

The interviewer let me get to the second-to-last line and asked me to walk through the algebra that collapses to (y_hat - y) * x_j. I had to do it on the whiteboard, slowly. The collapse feels like magic the first time you see it; doing it under interview pressure with a marker in your hand is a different experience.
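For anyone who wants the intermediate steps spelled out, here is one way to write the collapse. It uses the same symbols as the derivation above and is a reconstruction of the standard algebra, not a transcript of what was on the whiteboard:

```latex
\begin{aligned}
\frac{\partial L}{\partial w_j}
  &= -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right)\hat{y}\,(1-\hat{y})\,x_j \\
  &= -\bigl(y\,(1-\hat{y}) - (1-y)\,\hat{y}\bigr)\,x_j \\
  &= -\bigl(y - y\hat{y} - \hat{y} + y\hat{y}\bigr)\,x_j \\
  &= (\hat{y} - y)\,x_j
\end{aligned}
```

The key move is the second line: multiplying through by y_hat (1 - y_hat) cancels both denominators, and the two y * y_hat cross terms then cancel each other.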

That took 12 minutes. He then asked: "why does this collapse to such a clean form, and is it a coincidence". The answer is no, it is not a coincidence: the Bernoulli likelihood behind the cross-entropy loss is in the exponential family, and logistic regression is the generalized linear model that uses its canonical (logit) link. For any exponential-family GLM with the canonical link, the gradient takes the form (prediction - target) * input. I knew this well enough to say it. He asked a follow-up: "so what does that imply for softmax with cross-entropy". The answer was that the same form applies, with y_hat - y now being a vector difference. I got it, with one false start.
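The softmax claim is easy to sanity-check numerically. The sketch below compares the analytic gradient with respect to the logits, y_hat - y, against a central finite difference. The logits and the one-hot target are made-up values for illustration, and the function names are my own. The gradient with respect to the weights follows by multiplying each component by the input, exactly as in the binary case.

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    # y is a one-hot (or any probability) target vector.
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

def analytic_grad(z, y):
    # Claimed closed form: dL/dz = y_hat - y.
    p = softmax(z)
    return [pi - yi for pi, yi in zip(p, y)]

def numeric_grad(z, y, eps=1e-6):
    # Central finite difference, one logit at a time.
    grads = []
    for k in range(len(z)):
        up = list(z)
        down = list(z)
        up[k] += eps
        down[k] -= eps
        grads.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]   # hypothetical values
target = [0.0, 0.0, 1.0]    # hypothetical one-hot label

max_gap = max(
    abs(a - n)
    for a, n in zip(analytic_grad(logits, target), numeric_grad(logits, target))
)
print(max_gap)
```

The printed gap should be tiny (limited by floating-point error), which is what you expect if the closed form is right.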

The second part of the round was a probability question. "Suppose I roll two fair dice. Given that at least one of them shows a 6, what is the probability both show 6?" This is the classic conditional probability trap: most people answer 1/6, the right answer is 1/11. I worked through the conditional formula slowly, listing the 11 outcomes in the conditioning event and the 1 outcome in the joint event. I got it right but the interviewer made me defend why it was not 1/6 by writing the sample space out.
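Writing the sample space out is also a few lines of code. This sketch enumerates all 36 rolls and counts the two events. It also shows where the tempting 1/6 comes from: that is the answer to a different question, where you condition on one specific die showing a 6.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two fair dice.
rolls = list(product(range(1, 7), repeat=2))

# Conditioning event: at least one die shows a 6.
at_least_one_six = [r for r in rolls if 6 in r]
both_six = [r for r in at_least_one_six if r == (6, 6)]
p_both = Fraction(len(both_six), len(at_least_one_six))

# The different question that yields 1/6: the FIRST die shows a 6.
first_is_six = [r for r in rolls if r[0] == 6]
p_given_first = Fraction(
    len([r for r in first_is_six if r == (6, 6)]), len(first_is_six)
)

print(len(at_least_one_six), p_both)  # 11 1/11
print(p_given_first)                  # 1/6
```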

The follow-up that was actually the round

With 15 minutes left, the interviewer asked: "in your last job, when did you actually use this kind of math, and when did you not use it because the framework hid it from you". This is what the round was actually grading. The first 45 minutes were a screen for whether I had the math. The last 15 minutes were a screen for whether I knew when to reach for it and when to trust the framework.

My answer was a real story. Two years earlier I had been debugging a model that was overconfident on its negative class. Reading the loss curves was not enough; I had to look at the empirical calibration of the predicted probabilities versus the true rates, which led me to Platt scaling, which is a tiny logistic regression on top of the model's outputs. The math I had just derived was the math I had used. The interviewer asked one follow-up: "why did you use platt scaling instead of isotonic regression". The honest answer was that Platt scaling has only two parameters to fit, so it is less prone to overfitting than isotonic regression when the calibration set is small, and ours was. He nodded.
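To make "a tiny logistic regression on top of the model's outputs" concrete, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the data is synthetic (raw scores that are twice as extreme as they should be), the names are hypothetical, and the fit is plain gradient descent using the (y_hat - y) gradient from earlier. It also skips the target smoothing that Platt's original method applies, so treat it as the shape of the technique rather than a production calibrator.

```python
import math
import random

def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(a, b, scores, labels):
    # Mean cross-entropy of sigmoid(a * score + b) against the labels.
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(a * s + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.1, steps=300):
    # Two parameters only: a slope and an intercept on the raw score.
    a, b = 1.0, 0.0  # start from "trust the raw score as-is"
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # the (y_hat - y) term
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Synthetic, overconfident model: the true log-odds are only half the raw score.
rng = random.Random(0)
scores = [rng.uniform(-4.0, 4.0) for _ in range(1000)]
labels = [1 if rng.random() < sigmoid(0.5 * s) else 0 for s in scores]

a, b = fit_platt(scores, labels)
loss_before = log_loss(1.0, 0.0, scores, labels)
loss_after = log_loss(a, b, scores, labels)
print(a, b, loss_before, loss_after)
```

On this synthetic data the fitted slope should land near 0.5, which is the calibrator learning to shrink the overconfident scores back toward the true rates.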

I think the round closed there, in his head. The 45 minutes of derivation had been the entry ticket. The last 15 minutes were the actual selection.

What the rest of the loop looked like

The ML coding round was straightforward k-means: I wrote the assignment step, the update step, the convergence check, and a small synthetic test. I did not implement k-means++, but I named it as the production thing to do. The system design round was a recommendation pipeline (candidate generation with embedding ANN, ranking with a gradient-boosted tree, online feature freshness, the cold-start problem). I had done this kind of design at my previous job; this round was comfortable. The applied ML round was framing a fuzzy business problem as a model; I struggled because I tried to over-formalize the problem in the first 10 minutes instead of asking what the actual decision being optimized was. I recovered, but not cleanly.
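For reference, the k-means pieces named above (assignment step, update step, convergence check, small synthetic test) fit in well under a page. This is a sketch of plain Lloyd's algorithm with random initialization, not the code I wrote in the round; the function names, tolerance, and the two-blob test data are all assumptions for illustration.

```python
import math
import random

def assign(points, centroids):
    # Assignment step: index of the nearest centroid for each point.
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    # Update step: move each centroid to the mean of its members.
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # keep an empty cluster where it was
        else:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
    return new_centroids

def kmeans(points, k, rng, tol=1e-9, max_iter=100):
    # Random initialization from the data; k-means++ would go here in production.
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        # Convergence check: stop once no centroid moves more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Small synthetic test: two well-separated 2D blobs.
rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
centroids, labels = kmeans(blob_a + blob_b, k=2, rng=rng)
print(sorted(centroids))
```

With blobs this far apart, the two centroids should end up near (0, 0) and (10, 10). Random initialization can do much worse on harder data, which is the reason to name k-means++.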

The behavioral round was 60 minutes with the hiring manager and was the round where I most wanted to talk about the math round. I did not. I let the hiring manager run the conversation. He spent 30 minutes asking about a project where I had killed a model that was technically working but had a fairness issue. The story took most of the round.

Why I would prep the muscle, not the math, next time

I got the offer. I joined. The single thing I would do differently if I prepped this loop again: I would spend a week, not a day, on the whiteboard math round. Not because the math was hard but because the muscle of deriving on a whiteboard while talking is a different skill from knowing the math. I had the math. I did not have the muscle. The interviewer was kind, but a less kind one would have read my 8-second freeze as a real signal rather than a performance issue.

The second thing I would do differently is the applied ML round. I would walk in with a one-line script for fuzzy business goals: "what is the decision this model would inform, and what does the right answer look like to the user". That single question reframes any vague ML prompt into a tractable problem statement, and I left mine on the table because I jumped to formalism too early.
