System Design Article
Design Food Delivery (DoorDash)
Difficulty: Medium
Design a food delivery service like DoorDash that links three actors (customer, restaurant, courier) with an end-to-end SLA of <40 minutes per order at 10M orders per day across 500K restaurants. The interview centerpiece is the courier dispatch problem, which is fundamentally different from ride-sharing: it is a 3-leg trip (courier-to-restaurant, wait for food, restaurant-to-customer) and the platform routinely batches multiple orders onto one courier to cut cost. We compare Uber's 1:1 matching to DoorDash's many-to-1 batching, design the ETA composition (prep time + assignment time + drive time + handoff), and walk through the order state machine that coordinates three independent humans.
Design Food Delivery (DoorDash)
Design a food delivery service like DoorDash that links three actors (customer, restaurant, courier) with an end-to-end SLA of <40 minutes per order at 10M orders per day across 500K restaurants. The interview centerpiece is the courier dispatch problem, which is fundamentally different from ride-sharing: it is a 3-leg trip (courier-to-restaurant, wait for food, restaurant-to-customer) and the platform routinely batches multiple orders onto one courier to cut cost. We compare Uber's 1:1 matching to DoorDash's many-to-1 batching, design the ETA composition (prep time + assignment time + drive time + handoff), and walk through the order state machine that coordinates three independent humans.
1,068 views
17
Requirements
Functional Requirements
- Browse restaurants: customer sees nearby restaurants ordered by distance, rating, ETA, price.
- Place an order: select items, customize, pay; order is sent to the restaurant.
- Restaurant accepts and prepares: tablet at restaurant accepts the order, marks 'ready' when food is bagged.
- Courier dispatch: platform assigns a nearby courier; courier may carry 2-3 orders at once (batching).
- Live tracking: customer sees courier moving on the map; gets ETA updates.
- Delivery confirmation: courier marks 'delivered'; photo or contactless drop-off.
- Ratings and tips: customer rates restaurant and courier, leaves a tip after delivery.
Out of Scope (state explicitly)
- Restaurant menu management UI (separate product).
- Payment processing details (we use the design-payment-system lesson).
- Routing engine internals (we use design-google-maps for the routing API).
- Fraud detection at depth.
Non-Functional Requirements
- Scale: 10M orders/day globally, ~700K orders peak hour, ~200 orders/sec average.
- Courier fleet: 1M active couriers; ~200K concurrent on-shift at peak.
- Restaurants: 500K active restaurants on platform.
- End-to-end SLA: 40 min p50 from order to delivery; 60 min p95.
- Dispatch latency: assign a courier within 30 s of restaurant accepting the order.
- Availability: 99.95%; an outage stops revenue immediately.
Back-of-the-Envelope Estimation
Order Volume
---------- Order load ----------
Orders/day: 10M globally
Orders/sec average: ~115
Orders/sec peak (lunch): ~700 (concentrated 11:30 AM - 1:30 PM local)
Per order data: ~3 KB (items, addresses, status, courier_id)
Daily order storage: ~30 GB (over years: ~10 TB)Dispatch QPS
---------- Dispatch load ----------
Dispatch attempts/order: ~1.3 (some couriers decline)
Dispatch decisions/sec: ~150 average, ~900 peak
Per-decision compute: ~50-200 ms (more for batching)
Per-decision data read: ~2K candidates from H3 indexCourier Location Updates
---------- Courier telemetry ----------
Concurrent on-shift couriers (peak): 200K
Update interval while moving: 5 seconds
Updates/sec: 200K / 5 = 40K
Ingress: 40K * 100 B = ~4 MB/s (small vs Uber's 25 MB/s)Restaurant Catalog
---------- Catalog sizing ----------
Restaurants: 500K
Menu items per restaurant:~50 average
Total menu items: 25M
Per-item data: ~1 KB (name, photo URL, modifiers, price)
Menu storage: ~25 GB (small; cached aggressively)High-Level Design
---------- High-level architecture ----------
[ Customer App ] [ Restaurant Tablet ] [ Courier App ]
| | |
v v v
+--------------------------------------------------------+
| API Gateway |
+--------------------------------------------------------+
| | | |
v v v v
+-----------+ +-----------+ +-----------+ +-----------+
| Catalog | | Order Svc | | Dispatch | | Courier |
| Service | | | | Engine | | Service |
+-----------+ +-----------+ +-----------+ +-----------+
| | | |
| v v v
| +-----------+ +----------+ +-----------+
| | Order DB | | H3 Index | | Courier |
| | (Postgres)| | (Redis) | | DB |
| +-----------+ +----------+ +-----------+
| ^
v |
+-----------+ +---------------+
| Menu KV | | ETA Service |
+-----------+ +---------------+
^
|
+-----------------+
| Stream Job: |
| prep-time model |
+-----------------+Core Services
| Service | Responsibility |
|---|---|
| Catalog Service | Restaurant + menu browse, ranking, search |
| Order Service | Owns order state machine; persists orders |
| Dispatch Engine | Assigns one or more orders to a courier |
| Courier Service | Courier profile, shift state, location ingest |
| ETA Service | Predicts prep time, drive time, handoff time |
| Restaurant Service | Restaurant tablet API, accept/reject, mark ready |
Order State Machine
---------- Order states ----------
created : customer placed; payment authorized
restaurant_pending : sent to restaurant tablet
accepted : restaurant confirmed; ETA broadcast to customer
preparing : food being made
ready_for_pickup : restaurant marked food ready
courier_assigned : dispatch picked a courier; courier en route
picked_up : courier collected the food
in_transit : courier en route to customer
delivered : courier marked delivered
cancelled : terminal failure (restaurant declined, no courier, customer cancel)
refunded : payment reversedThis is more complex than a ride trip because it involves three independent humans, each of whom can cancel or fail.
API Design
// Customer places an order
POST /api/v1/orders
{
"restaurant_id": "r_123",
"items": [
{ "menu_item_id": "mi_456", "qty": 2, "modifiers": [...] }
],
"delivery_address": { "lat": 37.78, "lng": -122.41, "line": "..." },
"payment_method_id": "pm_xyz",
"client_request_id": "req_xyz_789"
}
// Response
{
"order_id": "o_999",
"status": "created",
"eta": { "min_minutes": 28, "max_minutes": 40 },
"estimated_total_cents": 2350
}
// Restaurant marks ready
POST /api/v1/orders/o_999/ready
// Courier accepts dispatch (over WebSocket)
{
"event": "dispatch_offered",
"batch": [
{ "order_id": "o_999", "restaurant": {...}, "customer": {...} },
{ "order_id": "o_1000", "restaurant": {...}, "customer": {...} }
],
"expires_in_s": 30
}Detailed Design
The two interesting components are the batched courier dispatch (vehicle routing with time windows) and the ETA composition pipeline.
Why Food Delivery Dispatch Differs From Ride-Sharing
In ride-sharing the driver picks up a rider at point A and drops them at point B. One trip, two stops. In food delivery, the dispatcher must:
- Send the courier from their current position to the restaurant.
- Arrive at the right time (not too early, not too late). Couriers waste time waiting for food; restaurants want quick pickup.
- Pick up the food.
- Drive from restaurant to customer.
And the platform routinely batches orders so one courier picks up 2-3 orders from nearby restaurants and drops them at nearby customers. This is the Vehicle Routing Problem with Time Windows (VRPTW), which is NP-hard in general. We use heuristics.
Online Greedy Insertion (Production Default)
Most food delivery platforms use a streaming variant of greedy insertion:
---------- Greedy insertion algorithm ----------
When a new order O arrives (status: ready_for_pickup or close to it):
1. Find candidate couriers within ~3 km of the restaurant (H3 ring query).
2. For each candidate:
- If courier has 0 active orders: simulate inserting O; cost = drive_time(courier -> restaurant -> customer).
- If courier has 1-2 active orders: try inserting O at each position in the route;
pick the position with min total marginal time and total order time within SLA.
3. Pick the courier whose marginal cost is lowest AND who keeps all current orders within SLA.
4. Soft-hold the courier; offer the dispatch.A full vehicle routing solver would optimize globally but takes seconds. Greedy insertion is myopic but fast (10-50 ms per dispatch) and is the production default at DoorDash, Uber Eats, and Deliveroo.
The Batching Decision
Whether to batch is a tunable. Factors:
- Density: high order density makes batching natural; low density means waiting too long for a second order.
- Restaurant distance: orders from the same restaurant or adjacent restaurants batch easily.
- Customer distance: customers within ~1 km of each other are batchable.
- Premium tier: 'priority' customers pay extra to skip batching and get faster delivery.
A batched delivery saves the courier time and the platform pays one trip fee for two orders, but adds 5-10 min to each order's ETA. Most platforms target 30-50% of orders being batched at peak.
ETA Composition
---------- ETA composition ----------
ETA(order) = prep_time(restaurant, item_count, hour_of_day)
+ courier_assignment_delay
+ drive_time(courier_pos -> restaurant)
+ wait_at_restaurant
+ drive_time(restaurant -> customer)
+ handoff_timeEach component is its own predictor:
- prep_time: ML model trained on historical accepted-to-ready times per restaurant per hour. Updated nightly.
- courier_assignment_delay: rolling estimate from recent dispatches in the same H3 cell.
- drive_time: routing engine call (or great-circle / urban-speed proxy for fast estimates during browsing).
- wait_at_restaurant: slack to handle imprecise prep prediction. Couriers arriving early reduce customer ETA but waste courier time.
- handoff_time: ~2 min default; longer for apartment buildings.
We surface ETA as a range (min - max) rather than a single number; over-promising kills CSAT.
Prep Time Prediction
The biggest source of ETA error is restaurant prep time, which varies by 50%+ within the same restaurant on the same day. We predict per restaurant per item-count per hour bucket:
---------- prep time features ----------
Features: restaurant_id, item_count, complex_items_count, hour_of_day, day_of_week,
recent_avg_prep_time_15min, current_open_orders_at_restaurant.
Label: actual_minutes from accepted -> ready.
Model: Gradient-boosted trees, retrained nightly.
Serving: Feature store online; ~10 ms per inference.Live 'current_open_orders_at_restaurant' is the most powerful real-time feature: a backed-up restaurant takes longer per order.
Dispatch Engine in More Detail
async function dispatch(order) {
const candidates = await findCouriersInRing(
order.restaurant.h3,
2 // hex rings, ~750m
);
const scored = await Promise.all(
candidates.map(async (c) => {
const marginal = await simulateInsertion(c, order);
return { courier: c, marginalCost: marginal.timeMinutes };
})
);
scored.sort((a, b) => a.marginalCost - b.marginalCost);
for (const { courier } of scored) {
const held = await tryHoldCourier(courier.id, order.id, 30);
if (held) return offerDispatch(courier, order);
}
return null; // no courier available
}Restaurant Tablet Reliability
The restaurant tablet is a weak link: WiFi drops, tablet froze, staff missed the buzzer. We use:
- WebSocket with auto-reconnect; offline queue on the tablet.
- Audible alarm escalates if not accepted within 60 s.
- Restaurant ops team gets a phone call if the tablet has been offline for 5 min during business hours.
- A failed-to-accept order auto-cancels at 5 min and refunds.
Data Model
PostgreSQL (sharded by region): orders
CREATE TABLE orders (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
restaurant_id UUID NOT NULL,
courier_id UUID,
state VARCHAR(32) NOT NULL,
items JSONB NOT NULL,
subtotal_cents INT NOT NULL,
fees_cents INT NOT NULL,
tip_cents INT,
delivery_lat DOUBLE PRECISION NOT NULL,
delivery_lng DOUBLE PRECISION NOT NULL,
delivery_h3 BIGINT NOT NULL,
eta_min_min INT,
eta_max_min INT,
created_at TIMESTAMPTZ NOT NULL,
accepted_at TIMESTAMPTZ,
ready_at TIMESTAMPTZ,
picked_up_at TIMESTAMPTZ,
delivered_at TIMESTAMPTZ,
region VARCHAR(8) NOT NULL
);
CREATE INDEX idx_orders_customer ON orders (customer_id, created_at DESC);
CREATE INDEX idx_orders_restaurant ON orders (restaurant_id, created_at DESC);
CREATE INDEX idx_orders_courier ON orders (courier_id, created_at DESC);Shard by region (city / metro). Cross-region orders do not exist (delivery is local).
Redis: courier index, dispatch holds, batch state
---------- Redis keys ----------
couriers:hex:<h3> SET courier_ids in cell
courier:loc:<id> HASH { lat, lng, status, current_batch }
courier:batch:<id> LIST ordered list of order_ids in active batch
dispatch:hold:<courier_id> STRING order_id TTL 30s
restaurant:queue_depth:<id> INT current open orders, updated by Order SvcMenu Cache (Redis)
Menus change rarely (a few times per day per restaurant). Cached as menu:<restaurant_id> -> JSON with 1 h TTL. Browse traffic almost never hits Postgres for menus.
Kafka topics
| Topic | Partitions | Key | Use |
|---|---|---|---|
order.events | 100 | order_id | feeds analytics, dispatch trigger |
courier.locations | 50 | courier_id | feeds H3 index update |
restaurant.events | 30 | restaurant_id | accepted/ready signals |
Scaling and Bottlenecks
The lunch rush
11:30 AM - 1:30 PM local time across all metros, order volume spikes 5x. Dispatch and ETA services scale horizontally. The bottleneck is courier supply (couriers cannot multiply on demand), so the platform uses surge pay to bring more couriers online and may temporarily disable batching for high-paying tiers.
Restaurant overload
A popular restaurant during peak gets 30+ orders queued; prep time blows out. We:
- Cap concurrent open orders per restaurant via the catalog (mark restaurant 'busy' to incoming customers).
- Surface a longer ETA in the customer's UI when queue depth is high.
- Throttle dispatch to that restaurant: pause new pickups so couriers do not pile up.
Courier supply imbalance
A suburban area has 5 orders but 1 courier; an urban area has 100 couriers and 50 orders. The platform offers 'reposition' bonuses to nudge couriers to under-served zones. The dispatch engine will also assign a courier to a longer drive if no one closer is available, accepting the higher cost.
Multi-region
Each metro is its own dispatch domain. A courier in San Francisco never appears in San Jose dispatch. The order DB is sharded by region; Redis runs per region. Cross-region failover is not needed: a metro's outage is contained.
Restaurant tablet outage
The single highest-impact incident is the tablet going offline. We monitor heartbeats and have a fallback: phone the restaurant manually for orders if offline > 5 min. This is operational, not algorithmic, but interviewers expect you to mention it.
Order DB growth
10M orders/day = ~30 GB/day of orders. With ~10 TB after several years, sharding by region keeps any one shard manageable. Older orders archive to cold storage after 1 year.
Trade-offs and Alternatives
Greedy insertion vs full VRPTW solver
Greedy insertion is myopic but runs in 10-50 ms; a full vehicle routing solver could improve total fleet efficiency by 5-15% but takes seconds and assumes a known workload, which we do not have (orders arrive online). Greedy is the right default; some platforms run a periodic 'rebalance' solver that re-bundles existing batches.
Batched vs unbatched delivery
Batching saves the platform money and increases courier per-hour pay, but adds 5-10 min to each customer's ETA. Customers prefer unbatched (faster); platforms prefer batched (cheaper). Modern platforms charge a premium ('priority delivery') for unbatched and offer batched as the default.
Same-region restaurant pickup vs multi-region (DoorDash 'Drive')
We focused on consumer delivery. DoorDash also runs a B2B service ('Drive') that delivers from one address to another (e.g., business courier). The dispatch problem is similar but the SLA is longer (1-2 h) and batching is more aggressive.
Restaurant tablet vs phone-in vs API integration
Tablet: simple, but a single point of failure per restaurant. Phone-in: deeply unscalable, used only as fallback. API integration with the restaurant POS: best for chains (Chipotle, Domino's), avoids a separate device, accept/ready signals come from the POS automatically. We support all three; the chain integrations are the most reliable.
ETA as point estimate vs range
A point estimate (32 min) is anchoring; if we deliver in 35 min the customer feels late even if it was within reasonable variance. A range (28 - 40 min) sets expectations correctly. Showing the upper bound encourages tipping for early delivery and tolerates the inevitable variance.
Why three-sided is harder than two-sided
In ride-sharing, you optimize wait time and cost for the rider; the driver is one party with consistent objectives. In food delivery you have customer (fast), restaurant (timely pickup but not too early), and courier (efficient route, fair pay). The dispatch objective is a weighted combination, and the weights themselves are tunable per metro per time of day. This is the senior insight.
Couriers as gig workers vs employees
Gig courier model gives the platform supply elasticity (more couriers come online when pay is high) but loses control over quality. Employee model gives reliable supply at higher cost. Different markets pick differently (UK Deliveroo went hybrid; some EU countries forced employment classifications). This is a business trade-off but worth mentioning.
Real-World Examples
How real systems implement this in production
DoorDash's published dispatch system 'DeepRed' uses online greedy insertion with ML-predicted prep times and drive times. They report ~30% of US orders being batched at peak hours.
Trade-off: Greedy insertion sacrifices a few percent of theoretical fleet efficiency for sub-100 ms dispatch decisions; the alternative (full VRPTW solver) takes seconds and cannot keep up with online order arrivals.
Uber Eats reuses much of Uber's dispatch infrastructure (H3 indexing, soft-hold pattern, Kafka location streams) but added a separate dispatch service tuned for batching. They publish on the use of two-stage assignment (candidate generation + ML scorer).
Trade-off: Sharing infrastructure with ride-sharing reduced engineering cost but created tension where the same dispatch engine had to handle both 1:1 (rides) and many:1 (food); they eventually split the dispatch path.
Deliveroo's 'Frank' algorithm matches orders, restaurants, and riders globally rather than per-restaurant. Frank computes a global cost function and batches across multiple restaurants when the geometry works.
Trade-off: Global optimization yields better aggregate efficiency but is operationally harder to debug than per-restaurant greedy assignment; Deliveroo invested heavily in observability and replay tools.
Meituan, China's largest food delivery platform, serves 60M+ orders/day, around 6x DoorDash's volume. They use deep reinforcement learning for dispatch and have published on real-time multi-agent reinforcement learning for courier routing.
Trade-off: RL-based dispatch can outperform heuristics by 5-10% in dense markets but requires massive offline simulation infrastructure and is hard to interpret; most platforms outside China stick with greedy + handcrafted scoring for now.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Customer places order -> Order Service creates row in `created` state, payment authorized. Order routed to restaurant tablet via WebSocket; state -> `restaurant_pending`. Restaurant accepts within 60 s; state -> `accepted`. ETA Service computes ETA range, broadcasts to customer. Restaurant prepares food; periodic prep_time predictor refreshes ETA. About 5 min before predicted ready, Dispatch Engine triggers: queries H3 ring around the restaurant, scores candidates by marginal insertion cost, soft-holds the best courier (Redis SET NX EX 30), offers dispatch. Courier accepts; state -> `courier_assigned`. Restaurant marks ready; state -> `ready_for_pickup`. Courier arrives at restaurant, picks up; state -> `picked_up`. Courier drives to customer (live tracking via WebSocket); state -> `in_transit`. Courier marks delivered with photo; state -> `delivered`. Payment captured (was authorized at create); tip option presented to customer.
Batching means one courier carries multiple orders simultaneously. We batch when: (1) two restaurants are within ~1 km, (2) two customers are within ~1 km, (3) the second order is ready around the same time as the first, (4) inserting the second into the courier's route adds <5 min total. The greedy-insertion algorithm tries inserting the new order at each position in an existing courier's route; picks the position with minimum marginal time; assigns if SLA still holds for all orders. Target ~30-50% batched at peak. Premium tier customers pay extra to opt out and get a single-order delivery.
ETA = prep_time + courier_assignment_delay + drive_to_restaurant + wait_at_restaurant + drive_to_customer + handoff_time. Each is its own predictor. The dominant variance is prep_time (5-30 min range for the same restaurant on the same day). The biggest real-time signal is the restaurant's current queue depth: a restaurant with 15 open orders takes longer per order than one with 2. We retrain the prep model nightly with features (restaurant, hour, item count, queue depth, day of week) and serve from a feature store at 10 ms per inference. ETA refreshes every 30 s during the order. Surfaced as a range (28-40 min) not a point, because over-promising tanks CSAT.
Heartbeat from tablet to server every 30 s. If 2 minutes pass without heartbeat: alert restaurant operations team and switch new orders for that restaurant to 'phone confirmation' mode (rare). For in-flight orders: if the restaurant has not marked 'ready' within 5 min past predicted ready time, escalate via automated phone call. If still unreachable after 10 min, cancel the order and refund the customer. The customer is notified proactively at each escalation step. We also throttle new orders to that restaurant after 1 min of tablet downtime to prevent compounding failures.
Three layers. (1) Demand smoothing: the catalog can show 'fully booked, try in 15 min' for restaurants whose queue depth exceeds capacity. (2) Surge incentives: increase courier pay in the affected H3 cells via the courier app to bring more couriers online and prevent existing couriers from logging off. (3) Dispatch tuning: batch more aggressively (allow 3-4 orders per courier), accept slightly longer drives, defer non-priority orders. Capacity planning ensures dispatch and ETA services scale horizontally; the bottleneck during the spike is human supply (restaurants, couriers), not compute. Communicate longer ETAs honestly to customers; surprise lateness is worse than expected lateness.
Interview Tips
How to discuss this topic effectively
Open by contrasting with ride-sharing: 'food delivery is a 3-leg trip with a third human who can fail (the restaurant), and the platform routinely batches.' This frames the problem as harder, not just longer.
Use the term Vehicle Routing Problem with Time Windows (VRPTW) and immediately say 'NP-hard in general; we use online greedy insertion in production.' Both halves of that statement are senior-level.
Walk through ETA composition explicitly. Say 'ETA is prep + assignment + drive + wait + handoff; each is its own predictor; we surface a range, not a point.' This shows systems thinking.
Mention the restaurant queue-depth signal for prep prediction. Saying 'a backed-up restaurant takes longer per order; we use live open-order count as a real-time feature' is the production reality.
End with the batching trade-off. Quantify it: '30-50% of orders are batched at peak; batching saves ~30% courier hours but adds 5-10 min per order; premium tier opts out.' Numbers separate you from candidates who hand-wave.
Common Mistakes
Pitfalls to avoid in interviews
Treating courier dispatch as identical to ride-sharing
Ride-sharing is a 1:1 match (driver picks up rider, drops at destination). Food delivery is a 3-leg trip (courier to restaurant, wait for food, restaurant to customer) AND involves batching multiple orders onto one courier. The dispatch algorithm is different (greedy insertion or VRPTW solver, not simple nearest-neighbor) and the state machine is more complex.
Computing ETA as 'driving distance / speed'
Customer-visible ETA is the sum of: restaurant prep time (highly variable, 5-30 min), courier assignment delay, drive time to restaurant, wait at restaurant, drive to customer, handoff. The dominant variance is prep time, not drive time. Predict each component separately and surface a range, not a point estimate.
Assigning couriers as soon as the order is placed
If the food takes 20 min to prepare and the courier is 3 min away, the courier sits idle for 15 min. Production systems delay assignment until the food is close to ready (predicted ~5 min before ready), so the courier arrives just-in-time. Early assignment wastes courier time and lowers per-hour pay.
Ignoring restaurant tablet failure modes
Restaurant tablets routinely go offline (WiFi drops, staff distracted, tablet froze). Without escalation logic, orders sit unaccepted indefinitely. Use audible alarms, automatic phone-call escalation after 5 min, and auto-cancel/refund if the restaurant cannot be reached. Operational reliability is half the game.
Optimizing dispatch for one party only
Customers want speed; restaurants want orderly pickup; couriers want efficient routes. Optimizing only customer ETA causes courier exploitation; optimizing only courier efficiency causes long customer waits. The dispatch objective is a weighted combination, tunable per metro per hour. Showing you understand all three sides is the senior signal.
