System Design Article

Design Food Delivery (DoorDash)

Difficulty: Medium

Design a food delivery service like DoorDash that links three actors (customer, restaurant, courier) with an end-to-end SLA of <40 minutes per order at 10M orders per day across 500K restaurants. The interview centerpiece is the courier dispatch problem, which is fundamentally different from ride-sharing: it is a 3-leg trip (courier-to-restaurant, wait for food, restaurant-to-customer) and the platform routinely batches multiple orders onto one courier to cut cost. We compare Uber's 1:1 matching to DoorDash's many-to-1 batching, design the ETA composition (prep time + assignment time + drive time + handoff), and walk through the order state machine that coordinates three independent humans.

Design Food Delivery (DoorDash)

System Design

Medium

design-food-delivery

case-study

ride-sharing-and-maps

food-delivery

doordash

courier-dispatch

batched-dispatch

vehicle-routing-problem

eta-prediction

three-sided-marketplace

geospatial

h3-hex-grid

ride-sharing

kafka

system-design

intermediate

premium

1,068 views

Requirements

Functional Requirements

Browse restaurants: customer sees nearby restaurants ordered by distance, rating, ETA, price.
Place an order: select items, customize, pay; order is sent to the restaurant.
Restaurant accepts and prepares: tablet at restaurant accepts the order, marks 'ready' when food is bagged.
Courier dispatch: platform assigns a nearby courier; courier may carry 2-3 orders at once (batching).
Live tracking: customer sees courier moving on the map; gets ETA updates.
Delivery confirmation: courier marks 'delivered'; photo or contactless drop-off.
Ratings and tips: customer rates restaurant and courier, leaves a tip after delivery.

Out of Scope (state explicitly)

Restaurant menu management UI (separate product).
Payment processing details (we use the design-payment-system lesson).
Routing engine internals (we use design-google-maps for the routing API).
Fraud detection at depth.

Non-Functional Requirements

Scale: 10M orders/day globally, ~700K orders peak hour, ~200 orders/sec average.
Courier fleet: 1M active couriers; ~200K concurrent on-shift at peak.
Restaurants: 500K active restaurants on platform.
End-to-end SLA: 40 min p50 from order to delivery; 60 min p95.
Dispatch latency: assign a courier within 30 s of restaurant accepting the order.
Availability: 99.95%; an outage stops revenue immediately.

Back-of-the-Envelope Estimation

Order Volume

Text

---------- Order load ----------
Orders/day:               10M globally
Orders/sec average:       ~115
Orders/sec peak (lunch):  ~700 (concentrated 11:30 AM - 1:30 PM local)
Per order data:           ~3 KB (items, addresses, status, courier_id)
Daily order storage:      ~30 GB (over years: ~10 TB)

Dispatch QPS

Text

---------- Dispatch load ----------
Dispatch attempts/order:  ~1.3 (some couriers decline)
Dispatch decisions/sec:   ~150 average, ~900 peak
Per-decision compute:     ~50-200 ms (more for batching)
Per-decision data read:   ~2K candidates from H3 index

Courier Location Updates

Text

---------- Courier telemetry ----------
Concurrent on-shift couriers (peak):  200K
Update interval while moving:         5 seconds
Updates/sec:                          200K / 5 = 40K
Ingress:                              40K * 100 B = ~4 MB/s (small vs Uber's 25 MB/s)

Restaurant Catalog

Text

---------- Catalog sizing ----------
Restaurants:              500K
Menu items per restaurant:~50 average
Total menu items:         25M
Per-item data:            ~1 KB (name, photo URL, modifiers, price)
Menu storage:             ~25 GB (small; cached aggressively)

High-Level Design

Text

---------- High-level architecture ----------
  [ Customer App ]    [ Restaurant Tablet ]    [ Courier App ]
        |                     |                      |
        v                     v                      v
  +--------------------------------------------------------+
  |                  API Gateway                           |
  +--------------------------------------------------------+
        |              |              |              |
        v              v              v              v
  +-----------+  +-----------+  +-----------+  +-----------+
  | Catalog   |  | Order Svc |  | Dispatch  |  | Courier   |
  | Service   |  |           |  | Engine    |  | Service   |
  +-----------+  +-----------+  +-----------+  +-----------+
        |              |              |              |
        |              v              v              v
        |       +-----------+   +----------+   +-----------+
        |       | Order DB  |   | H3 Index |   | Courier   |
        |       | (Postgres)|   | (Redis)  |   | DB        |
        |       +-----------+   +----------+   +-----------+
        |                            ^
        v                            |
  +-----------+              +---------------+
  | Menu KV   |              | ETA Service   |
  +-----------+              +---------------+
                                     ^
                                     |
                            +-----------------+
                            | Stream Job:     |
                            | prep-time model |
                            +-----------------+

Core Services

Service	Responsibility
Catalog Service	Restaurant + menu browse, ranking, search
Order Service	Owns order state machine; persists orders
Dispatch Engine	Assigns one or more orders to a courier
Courier Service	Courier profile, shift state, location ingest
ETA Service	Predicts prep time, drive time, handoff time
Restaurant Service	Restaurant tablet API, accept/reject, mark ready

Order State Machine

Text

---------- Order states ----------
created            : customer placed; payment authorized
restaurant_pending : sent to restaurant tablet
accepted           : restaurant confirmed; ETA broadcast to customer
preparing          : food being made
ready_for_pickup   : restaurant marked food ready
courier_assigned   : dispatch picked a courier; courier en route
picked_up          : courier collected the food
in_transit         : courier en route to customer
delivered          : courier marked delivered
cancelled          : terminal failure (restaurant declined, no courier, customer cancel)
refunded           : payment reversed

This is more complex than a ride trip because it involves three independent humans, each of whom can cancel or fail.

API Design

Jsonc

// Customer places an order
POST /api/v1/orders
{
    "restaurant_id": "r_123",
    "items": [
        { "menu_item_id": "mi_456", "qty": 2, "modifiers": [...] }
    ],
    "delivery_address": { "lat": 37.78, "lng": -122.41, "line": "..." },
    "payment_method_id": "pm_xyz",
    "client_request_id": "req_xyz_789"
}
// Response
{
    "order_id": "o_999",
    "status": "created",
    "eta": { "min_minutes": 28, "max_minutes": 40 },
    "estimated_total_cents": 2350
}

// Restaurant marks ready
POST /api/v1/orders/o_999/ready

// Courier accepts dispatch (over WebSocket)
{
    "event": "dispatch_offered",
    "batch": [
        { "order_id": "o_999", "restaurant": {...}, "customer": {...} },
        { "order_id": "o_1000", "restaurant": {...}, "customer": {...} }
    ],
    "expires_in_s": 30
}

Detailed Design

The two interesting components are the batched courier dispatch (vehicle routing with time windows) and the ETA composition pipeline.

Why Food Delivery Dispatch Differs From Ride-Sharing

In ride-sharing the driver picks up a rider at point A and drops them at point B. One trip, two stops. In food delivery, the dispatcher must:

Send the courier from their current position to the restaurant.
Arrive at the right time (not too early, not too late). Couriers waste time waiting for food; restaurants want quick pickup.
Pick up the food.
Drive from restaurant to customer.

And the platform routinely batches orders so one courier picks up 2-3 orders from nearby restaurants and drops them at nearby customers. This is the Vehicle Routing Problem with Time Windows (VRPTW), which is NP-hard in general. We use heuristics.

Online Greedy Insertion (Production Default)

Most food delivery platforms use a streaming variant of greedy insertion:

Text

---------- Greedy insertion algorithm ----------
When a new order O arrives (status: ready_for_pickup or close to it):
  1. Find candidate couriers within ~3 km of the restaurant (H3 ring query).
  2. For each candidate:
     - If courier has 0 active orders: simulate inserting O; cost = drive_time(courier -> restaurant -> customer).
     - If courier has 1-2 active orders: try inserting O at each position in the route;
       pick the position with min total marginal time and total order time within SLA.
  3. Pick the courier whose marginal cost is lowest AND who keeps all current orders within SLA.
  4. Soft-hold the courier; offer the dispatch.

A full vehicle routing solver would optimize globally but takes seconds. Greedy insertion is myopic but fast (10-50 ms per dispatch) and is the production default at DoorDash, Uber Eats, and Deliveroo.

The Batching Decision

Whether to batch is a tunable. Factors:

Density: high order density makes batching natural; low density means waiting too long for a second order.
Restaurant distance: orders from the same restaurant or adjacent restaurants batch easily.
Customer distance: customers within ~1 km of each other are batchable.
Premium tier: 'priority' customers pay extra to skip batching and get faster delivery.

A batched delivery saves the courier time and the platform pays one trip fee for two orders, but adds 5-10 min to each order's ETA. Most platforms target 30-50% of orders being batched at peak.

ETA Composition

Text

---------- ETA composition ----------
ETA(order) = prep_time(restaurant, item_count, hour_of_day)
           + courier_assignment_delay
           + drive_time(courier_pos -> restaurant)
           + wait_at_restaurant
           + drive_time(restaurant -> customer)
           + handoff_time

Each component is its own predictor:

prep_time: ML model trained on historical accepted-to-ready times per restaurant per hour. Updated nightly.
courier_assignment_delay: rolling estimate from recent dispatches in the same H3 cell.
drive_time: routing engine call (or great-circle / urban-speed proxy for fast estimates during browsing).
wait_at_restaurant: slack to handle imprecise prep prediction. Couriers arriving early reduce customer ETA but waste courier time.
handoff_time: ~2 min default; longer for apartment buildings.

We surface ETA as a range (min - max) rather than a single number; over-promising kills CSAT.

Prep Time Prediction

The biggest source of ETA error is restaurant prep time, which varies by 50%+ within the same restaurant on the same day. We predict per restaurant per item-count per hour bucket:

Text

---------- prep time features ----------
Features: restaurant_id, item_count, complex_items_count, hour_of_day, day_of_week,
          recent_avg_prep_time_15min, current_open_orders_at_restaurant.
Label:    actual_minutes from accepted -> ready.
Model:    Gradient-boosted trees, retrained nightly.
Serving:  Feature store online; ~10 ms per inference.

Live 'current_open_orders_at_restaurant' is the most powerful real-time feature: a backed-up restaurant takes longer per order.

Dispatch Engine in More Detail

JavaScript

Python

async function dispatch(order) {
    const candidates = await findCouriersInRing(
        order.restaurant.h3,
        2 // hex rings, ~750m
    );
    const scored = await Promise.all(
        candidates.map(async (c) => {
            const marginal = await simulateInsertion(c, order);
            return { courier: c, marginalCost: marginal.timeMinutes };
        })
    );
    scored.sort((a, b) => a.marginalCost - b.marginalCost);
    for (const { courier } of scored) {
        const held = await tryHoldCourier(courier.id, order.id, 30);
        if (held) return offerDispatch(courier, order);
    }
    return null; // no courier available
}

Restaurant Tablet Reliability

The restaurant tablet is a weak link: WiFi drops, tablet froze, staff missed the buzzer. We use:

WebSocket with auto-reconnect; offline queue on the tablet.
Audible alarm escalates if not accepted within 60 s.
Restaurant ops team gets a phone call if the tablet has been offline for 5 min during business hours.
A failed-to-accept order auto-cancels at 5 min and refunds.

Data Model

PostgreSQL (sharded by region): orders

SQL

CREATE TABLE orders (
    id              UUID PRIMARY KEY,
    customer_id     UUID NOT NULL,
    restaurant_id   UUID NOT NULL,
    courier_id      UUID,
    state           VARCHAR(32) NOT NULL,
    items           JSONB NOT NULL,
    subtotal_cents  INT NOT NULL,
    fees_cents      INT NOT NULL,
    tip_cents       INT,
    delivery_lat    DOUBLE PRECISION NOT NULL,
    delivery_lng    DOUBLE PRECISION NOT NULL,
    delivery_h3     BIGINT NOT NULL,
    eta_min_min     INT,
    eta_max_min     INT,
    created_at      TIMESTAMPTZ NOT NULL,
    accepted_at     TIMESTAMPTZ,
    ready_at        TIMESTAMPTZ,
    picked_up_at    TIMESTAMPTZ,
    delivered_at    TIMESTAMPTZ,
    region          VARCHAR(8) NOT NULL
);

CREATE INDEX idx_orders_customer ON orders (customer_id, created_at DESC);
CREATE INDEX idx_orders_restaurant ON orders (restaurant_id, created_at DESC);
CREATE INDEX idx_orders_courier ON orders (courier_id, created_at DESC);

Shard by region (city / metro). Cross-region orders do not exist (delivery is local).

Redis: courier index, dispatch holds, batch state

Text

---------- Redis keys ----------
couriers:hex:<h3>            SET     courier_ids in cell
courier:loc:<id>             HASH    { lat, lng, status, current_batch }
courier:batch:<id>            LIST    ordered list of order_ids in active batch
dispatch:hold:<courier_id>    STRING  order_id  TTL 30s
restaurant:queue_depth:<id>   INT     current open orders, updated by Order Svc

Menu Cache (Redis)

Menus change rarely (a few times per day per restaurant). Cached as menu:<restaurant_id> -> JSON with 1 h TTL. Browse traffic almost never hits Postgres for menus.

Kafka topics

Topic	Partitions	Key	Use
`order.events`	100	order_id	feeds analytics, dispatch trigger
`courier.locations`	50	courier_id	feeds H3 index update
`restaurant.events`	30	restaurant_id	accepted/ready signals

Scaling and Bottlenecks

The lunch rush

11:30 AM - 1:30 PM local time across all metros, order volume spikes 5x. Dispatch and ETA services scale horizontally. The bottleneck is courier supply (couriers cannot multiply on demand), so the platform uses surge pay to bring more couriers online and may temporarily disable batching for high-paying tiers.

Restaurant overload

A popular restaurant during peak gets 30+ orders queued; prep time blows out. We:

Cap concurrent open orders per restaurant via the catalog (mark restaurant 'busy' to incoming customers).
Surface a longer ETA in the customer's UI when queue depth is high.
Throttle dispatch to that restaurant: pause new pickups so couriers do not pile up.

Courier supply imbalance

A suburban area has 5 orders but 1 courier; an urban area has 100 couriers and 50 orders. The platform offers 'reposition' bonuses to nudge couriers to under-served zones. The dispatch engine will also assign a courier to a longer drive if no one closer is available, accepting the higher cost.

Multi-region

Each metro is its own dispatch domain. A courier in San Francisco never appears in San Jose dispatch. The order DB is sharded by region; Redis runs per region. Cross-region failover is not needed: a metro's outage is contained.

Restaurant tablet outage

The single highest-impact incident is the tablet going offline. We monitor heartbeats and have a fallback: phone the restaurant manually for orders if offline > 5 min. This is operational, not algorithmic, but interviewers expect you to mention it.

Order DB growth

10M orders/day = ~30 GB/day of orders. With ~10 TB after several years, sharding by region keeps any one shard manageable. Older orders archive to cold storage after 1 year.

Trade-offs and Alternatives

Greedy insertion vs full VRPTW solver

Greedy insertion is myopic but runs in 10-50 ms; a full vehicle routing solver could improve total fleet efficiency by 5-15% but takes seconds and assumes a known workload, which we do not have (orders arrive online). Greedy is the right default; some platforms run a periodic 'rebalance' solver that re-bundles existing batches.

Batched vs unbatched delivery

Batching saves the platform money and increases courier per-hour pay, but adds 5-10 min to each customer's ETA. Customers prefer unbatched (faster); platforms prefer batched (cheaper). Modern platforms charge a premium ('priority delivery') for unbatched and offer batched as the default.

Same-region restaurant pickup vs multi-region (DoorDash 'Drive')

We focused on consumer delivery. DoorDash also runs a B2B service ('Drive') that delivers from one address to another (e.g., business courier). The dispatch problem is similar but the SLA is longer (1-2 h) and batching is more aggressive.

Restaurant tablet vs phone-in vs API integration

Tablet: simple, but a single point of failure per restaurant. Phone-in: deeply unscalable, used only as fallback. API integration with the restaurant POS: best for chains (Chipotle, Domino's), avoids a separate device, accept/ready signals come from the POS automatically. We support all three; the chain integrations are the most reliable.

ETA as point estimate vs range

A point estimate (32 min) is anchoring; if we deliver in 35 min the customer feels late even if it was within reasonable variance. A range (28 - 40 min) sets expectations correctly. Showing the upper bound encourages tipping for early delivery and tolerates the inevitable variance.

Why three-sided is harder than two-sided

In ride-sharing, you optimize wait time and cost for the rider; the driver is one party with consistent objectives. In food delivery you have customer (fast), restaurant (timely pickup but not too early), and courier (efficient route, fair pay). The dispatch objective is a weighted combination, and the weights themselves are tunable per metro per time of day. This is the senior insight.

Couriers as gig workers vs employees

Gig courier model gives the platform supply elasticity (more couriers come online when pay is high) but loses control over quality. Employee model gives reliable supply at higher cost. Different markets pick differently (UK Deliveroo went hybrid; some EU countries forced employment classifications). This is a business trade-off but worth mentioning.

Real-World Examples

How real systems implement this in production

DoorDash DeepRed dispatch

DoorDash's published dispatch system 'DeepRed' uses online greedy insertion with ML-predicted prep times and drive times. They report ~30% of US orders being batched at peak hours.

Trade-off: Greedy insertion sacrifices a few percent of theoretical fleet efficiency for sub-100 ms dispatch decisions; the alternative (full VRPTW solver) takes seconds and cannot keep up with online order arrivals.

Uber Eats

Uber Eats reuses much of Uber's dispatch infrastructure (H3 indexing, soft-hold pattern, Kafka location streams) but added a separate dispatch service tuned for batching. They publish on the use of two-stage assignment (candidate generation + ML scorer).

Trade-off: Sharing infrastructure with ride-sharing reduced engineering cost but created tension where the same dispatch engine had to handle both 1:1 (rides) and many:1 (food); they eventually split the dispatch path.

Deliveroo Frank

Deliveroo's 'Frank' algorithm matches orders, restaurants, and riders globally rather than per-restaurant. Frank computes a global cost function and batches across multiple restaurants when the geometry works.

Trade-off: Global optimization yields better aggregate efficiency but is operationally harder to debug than per-restaurant greedy assignment; Deliveroo invested heavily in observability and replay tools.

Meituan Waimai

Meituan, China's largest food delivery platform, serves 60M+ orders/day, around 6x DoorDash's volume. They use deep reinforcement learning for dispatch and have published on real-time multi-agent reinforcement learning for courier routing.

Trade-off: RL-based dispatch can outperform heuristics by 5-10% in dense markets but requires massive offline simulation infrastructure and is hard to interpret; most platforms outside China stick with greedy + handcrafted scoring for now.

Quick Interview Phrases

Key terms to use in your answer

three-sided marketplace dispatch with VRPTW heuristics

online greedy insertion for batched courier assignment

ETA composition: prep + assignment + drive + handoff

batched delivery is the unit-economics lever

restaurant queue depth drives ETA prediction

premium priority tier opts out of batching

Common Interview Questions

Questions you might be asked about this topic

Walk me through the full lifecycle of a food delivery order.

Customer places order -> Order Service creates row in `created` state, payment authorized. Order routed to restaurant tablet via WebSocket; state -> `restaurant_pending`. Restaurant accepts within 60 s; state -> `accepted`. ETA Service computes ETA range, broadcasts to customer. Restaurant prepares food; periodic prep_time predictor refreshes ETA. About 5 min before predicted ready, Dispatch Engine triggers: queries H3 ring around the restaurant, scores candidates by marginal insertion cost, soft-holds the best courier (Redis SET NX EX 30), offers dispatch. Courier accepts; state -> `courier_assigned`. Restaurant marks ready; state -> `ready_for_pickup`. Courier arrives at restaurant, picks up; state -> `picked_up`. Courier drives to customer (live tracking via WebSocket); state -> `in_transit`. Courier marks delivered with photo; state -> `delivered`. Payment captured (was authorized at create); tip option presented to customer.

How does batched delivery work? When do you batch and how?

How do you compute and update ETA? What's the dominant source of error?

What happens if the restaurant tablet goes offline mid-order?

How would you handle a sudden spike (e.g., a stadium event releases 5,000 simultaneous orders)?

Interview Tips

How to discuss this topic effectively

Open by contrasting with ride-sharing: 'food delivery is a 3-leg trip with a third human who can fail (the restaurant), and the platform routinely batches.' This frames the problem as harder, not just longer.

Use the term Vehicle Routing Problem with Time Windows (VRPTW) and immediately say 'NP-hard in general; we use online greedy insertion in production.' Both halves of that statement are senior-level.

Walk through ETA composition explicitly. Say 'ETA is prep + assignment + drive + wait + handoff; each is its own predictor; we surface a range, not a point.' This shows systems thinking.

Mention the restaurant queue-depth signal for prep prediction. Saying 'a backed-up restaurant takes longer per order; we use live open-order count as a real-time feature' is the production reality.

End with the batching trade-off. Quantify it: '30-50% of orders are batched at peak; batching saves ~30% courier hours but adds 5-10 min per order; premium tier opts out.' Numbers separate you from candidates who hand-wave.

Common Mistakes

Pitfalls to avoid in interviews

Treating courier dispatch as identical to ride-sharing

Ride-sharing is a 1:1 match (driver picks up rider, drops at destination). Food delivery is a 3-leg trip (courier to restaurant, wait for food, restaurant to customer) AND involves batching multiple orders onto one courier. The dispatch algorithm is different (greedy insertion or VRPTW solver, not simple nearest-neighbor) and the state machine is more complex.

Computing ETA as 'driving distance / speed'

Customer-visible ETA is the sum of: restaurant prep time (highly variable, 5-30 min), courier assignment delay, drive time to restaurant, wait at restaurant, drive to customer, handoff. The dominant variance is prep time, not drive time. Predict each component separately and surface a range, not a point estimate.

Assigning couriers as soon as the order is placed

If the food takes 20 min to prepare and the courier is 3 min away, the courier sits idle for 15 min. Production systems delay assignment until the food is close to ready (predicted ~5 min before ready), so the courier arrives just-in-time. Early assignment wastes courier time and lowers per-hour pay.

Ignoring restaurant tablet failure modes

Restaurant tablets routinely go offline (WiFi drops, staff distracted, tablet froze). Without escalation logic, orders sit unaccepted indefinitely. Use audible alarms, automatic phone-call escalation after 5 min, and auto-cancel/refund if the restaurant cannot be reached. Operational reliability is half the game.

Optimizing dispatch for one party only

Customers want speed; restaurants want orderly pickup; couriers want efficient routes. Optimizing only customer ETA causes courier exploitation; optimizing only courier efficiency causes long customer waits. The dispatch objective is a weighted combination, tunable per metro per hour. Showing you understand all three sides is the senior signal.

Back to System Design