System Design Article
Design Uber / Lyft (Ride-Sharing)
Difficulty: Medium
Design a ride-sharing service like Uber that matches a rider's request to a nearby driver in under 5 seconds, streams driver locations every 4 seconds, computes ETAs, and applies surge pricing in real time at 1M concurrent active drivers and 100K rides/min globally. The interview centerpiece is the dispatch path: how to find the nearest available driver, hold them briefly, and confirm the match without race conditions. We compare geohash, S2, and H3 for the driver index and recommend H3 hex grid for ride-sharing because hex neighbors are equidistant. We cover the trip state machine, surge multipliers per cell, and how location updates fan out without melting the network.
Design Uber / Lyft (Ride-Sharing)
Design a ride-sharing service like Uber that matches a rider's request to a nearby driver in under 5 seconds, streams driver locations every 4 seconds, computes ETAs, and applies surge pricing in real time at 1M concurrent active drivers and 100K rides/min globally. The interview centerpiece is the dispatch path: how to find the nearest available driver, hold them briefly, and confirm the match without race conditions. We compare geohash, S2, and H3 for the driver index and recommend H3 hex grid for ride-sharing because hex neighbors are equidistant. We cover the trip state machine, surge multipliers per cell, and how location updates fan out without melting the network.
285 views
9
Requirements
Functional Requirements
- Request a ride: rider picks pickup and drop-off locations and ride class (UberX, Black, XL).
- Match a driver: system finds the best nearby driver and offers them the trip; driver has a few seconds to accept.
- Real-time location streaming: driver's app sends location every 4 seconds; rider sees the driver moving on the map.
- ETA: estimate time to pickup and to drop-off; refresh as the driver moves.
- Surge pricing: dynamic price multiplier per area based on demand vs supply.
- Trip lifecycle: requested -> matched -> accepted -> arriving -> in_progress -> completed -> paid.
- Payment: charge rider, pay out driver after the trip ends.
- Rating: both rider and driver rate each other after the trip.
Out of Scope (state explicitly)
- The mapping engine itself (we use the design-google-maps lesson).
- Detailed payment processing (we use the design-payment-system lesson for ledger/reconciliation).
- Driver onboarding, KYC, background checks.
- UberPool / shared rides (covered as an extension).
Non-Functional Requirements
- Scale: 1M concurrent active drivers globally, 100K rides/min peak (Friday night New York / Saturday Sao Paulo).
- Latency: dispatch decision in <5 seconds; location updates from driver to rider in <1 second.
- Availability: 99.99% (an outage means cars and money do not move; very expensive).
- Consistency: a driver must never be matched to two trips simultaneously; trip state transitions are linearizable per trip.
- Geographic partitioning: a driver in Tokyo never appears in a Sao Paulo dispatch query.
Back-of-the-Envelope Estimation
Active Drivers and Location Updates
---------- Driver location stream ----------
Concurrent active drivers (peak): 1M globally
Location update interval: 4 seconds while moving
Updates/sec across fleet: 1M / 4 = 250K updates/sec
Per update payload: ~100 bytes (driver_id, lat, lng, heading, speed, ts)
Ingress bandwidth: 250K * 100 B = 25 MB/s = 200 Mbps (manageable per region)If we naively pushed every location update to every nearby rider over WebSocket, we would melt the network. We push only to riders who currently have a matched driver (around 100K riders at any moment), so outbound is also bounded.
Ride Volume
---------- Trip estimation ----------
Trips/day globally: 20M
Peak trips/min: 100K (Friday night across all metros)
Trips/sec: ~1.7K
Dispatch QPS: ~3K (some matches retry on driver decline)
Per trip storage: ~5 KB (route polyline, timestamps, fare breakdown)
Daily trip storage: 20M * 5 KB = 100 GB/day = ~36 TB/yearDriver Index (H3 cells)
---------- H3 index sizing ----------
1M active drivers worldwide
H3 resolution 9 (~150 m hex edge): 1M drivers spread across ~150K active cells
Index: cell_id -> set of driver_ids (Redis sorted set keyed by recency)
Index size: ~100 MB in Redis (small; lives in memory comfortably)Surge Grid
---------- Surge sizing ----------
Globally relevant cells (cities, suburbs): ~500K H3 cells at resolution 8 (~500 m)
Surge value per cell: 1 float (multiplier 1.0-5.0)
Update cadence: every 60 s
Serving path: cell_id -> multiplier; <1 ms lookupHigh-Level Design
---------- High-level architecture ----------
[ Rider App ] [ Driver App ]
| |
HTTPS + WSS HTTPS + WSS
| |
v v
+------------------------------------------------+
| API Gateway |
+------------------------------------------------+
| | |
v v v
+-----------+ +-----------+ +-----------+
| Rider Svc | | Trip Svc | | Driver Svc|
+-----------+ +-----------+ +-----------+
| | |
| v |
| +-----------------+ |
| | Dispatch Engine | |
| +-----------------+ |
| | | |
| v v |
| +----------+ +----------+ |
| | H3 Index | | Surge | |
| | (Redis) | | Service | |
| +----------+ +----------+ |
| |
v v
+-----------------------------------------+
| Location Ingest (Kafka + Stream Job) |
+-----------------------------------------+
|
v
+-------------------+ +-------------------+
| Trip Store (SQL) | | Analytics (Flink) |
+-------------------+ +-------------------+Core Services
| Service | Responsibility |
|---|---|
| API Gateway | TLS, auth, geo-routes a request to the regional cell |
| Rider Service | rider profile, payment methods, request validation |
| Driver Service | driver profile, online/offline state, vehicle info |
| Trip Service | owns trip state machine, persists trips, emits events |
| Dispatch Engine | finds candidate drivers, scores them, sends offers |
| Location Ingest | absorbs 250K updates/sec, writes to H3 index and Kafka |
| Surge Service | reads supply/demand stream, publishes multipliers per cell |
API Design
// Rider requests a ride
POST /api/v1/rides
{
"pickup": { "lat": 37.78, "lng": -122.41 },
"dropoff": { "lat": 37.79, "lng": -122.40 },
"product": "uberx",
"payment_method_id": "pm_abc",
"client_request_id": "req_xyz_123" // idempotency key
}
// Response
{
"trip_id": "trip_456",
"status": "requested",
"estimated_fare": { "amount_cents": 1850, "currency": "USD", "surge": 1.4 },
"eta_to_match_s": 30
}
// Driver receives a dispatch offer (over WebSocket)
{
"event": "trip_offered",
"trip_id": "trip_456",
"pickup": { "lat": 37.78, "lng": -122.41 },
"distance_to_pickup_m": 320,
"eta_to_pickup_s": 90,
"expires_in_s": 8
}
// Driver accepts
POST /api/v1/trips/trip_456/acceptDispatch Flow (Happy Path)
- Rider POSTs
/rideswith idempotency key. - Trip Service creates a trip row in
requestedstate, emitstrip.requestedevent. - Dispatch Engine looks up surge for the pickup cell, computes the candidate set:
H3 ring k=2around the pickup hex (the user's hex plus its 12 neighbors at radius 2). - Filter to drivers who are online, idle, accept this product (UberX), and have not declined this trip.
- Score each candidate by
eta_to_pickup_s(primary) plus driver acceptance rate. - Pick the top driver; soft-hold them in Redis with TTL 8s; send
trip_offeredover WebSocket. - If driver accepts within 8s, transition trip to
accepted, pushtrip_matchedevent to the rider, route both ends to the trip lifecycle. - If driver declines or times out, release the hold and offer to the next candidate. Repeat up to N attempts.
Detailed Design
The two interesting components are the driver location index with H3 hex cells and the trip state machine with idempotent transitions.
Driver Index: Why H3 Over Geohash
Geohash divides the world into rectangles; the four neighbors at the same depth are not equidistant from the center cell. H3 (Uber's open-source library) divides the world into hexagons, where every neighbor is exactly the same distance away. For ride dispatch we ask 'all drivers within 1-2 hex rings of the pickup', and hex rings are uniform, so the candidate set radius is consistent.
---------- Geohash vs H3 neighbor distances ----------
Geohash square cell, edge length E:
4 edge neighbors: distance E
4 corner neighbors: distance E*sqrt(2) (~1.41E) - 41% farther
H3 hex cell, edge length E:
6 neighbors: all distance ~E
Uniform; matches the 'within radius R' query naturallyFor a 150 m hex (resolution 9), a kRing(pickup, k=2) returns 19 hexes total covering roughly a 750 m radius. Querying those 19 cells in Redis pulls a few hundred candidate drivers in <5 ms.
Driver Index Storage in Redis
---------- Redis structures ----------
drivers:hex:<h3_cell> SET driver_ids currently in this cell
driver:loc:<driver_id> HASH { lat, lng, heading, speed, h3, last_update_ts }
driver:status:<driver_id> STRING online | offline | onTrip
dispatch:hold:<driver_id> STRING trip_id TTL 8 sLocation update flow:
- Driver app sends
{driver_id, lat, lng, ts}over WebSocket every 4 s. - Location Ingest publishes to Kafka topic
driver.locations(durable, replayable). - A consumer updates Redis: compute new H3 cell; if cell changed,
SREMfrom old cell andSADDto new cell; refreshdriver:lochash; set TTL 30 s on the cell membership so a crashed driver app falls out of the index automatically.
The Kafka topic also feeds the analytics pipeline (replay stream into Flink for trip reconstruction, fraud detection, surge inputs).
Trip State Machine
---------- States ----------
requested : rider asked, dispatch in progress
matched : driver accepted; en route to pickup
arrived : driver at pickup
in_progress : rider in car; meter running
completed : arrived at dropoff
paid : payment captured, driver paid out
cancelled : terminal failureValid transitions are encoded explicitly in the Trip Service. Every transition is gated by a check on the current state: if the trip is in accepted and a complete arrives, we reject it as out-of-order.
Idempotent Transitions
Network retries make duplicate events the norm. Each state transition uses an idempotency key so that retrying the same accept does not double-charge the rider or double-pay the driver.
-- Trip transitions table (audit log)
CREATE TABLE trip_transitions (
trip_id UUID NOT NULL,
seq INT NOT NULL,
from_state VARCHAR(32),
to_state VARCHAR(32) NOT NULL,
actor VARCHAR(32), -- rider | driver | system
idempotency_key VARCHAR(64) NOT NULL,
happened_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (trip_id, seq),
UNIQUE (trip_id, idempotency_key) -- duplicate event = no-op
);On a duplicate accept, the second insert violates the UNIQUE constraint; the service treats it as a no-op and returns the previous result. This is the same pattern as in the design-payment-system lesson.
Soft-Hold Pattern
When we offer a trip to a driver, we hold them in Redis for 8 seconds with SET dispatch:hold:<driver_id> <trip_id> NX EX 8. The NX flag means 'only set if not exists', so two concurrent dispatches racing for the same driver naturally serialize: one wins, the other moves on. When the driver accepts, we promote the hold into a real trip; when they decline or time out, the TTL expires and the driver is offerable again.
async function tryHoldDriver(driverId, tripId) {
const ok = await redis.set(
`dispatch:hold:${driverId}`,
tripId,
'NX',
'EX',
8
);
return ok === 'OK';
}Surge Pricing Pipeline
---------- Surge data flow ----------
[ Trip Requests ] [ Driver Locations ]
| |
v v
[ Kafka:trips ] [ Kafka:locs ]
| |
+-----------+-----------+
|
v
+---------------------+
| Flink Stream Job |
| (1-min tumbling) |
+---------------------+
|
v
+---------------------+
| surge:cell:<h3> | Redis, served by Surge Service
+---------------------+Flink computes per H3 resolution-8 cell every minute:
- demand = trip requests in the last minute
- supply = idle drivers currently in cell
- multiplier = max(1.0, min(5.0, alpha * demand / max(supply, 1)))
Smoothed with an exponential moving average to prevent oscillation. The dispatch service reads surge:cell:<h3> synchronously when quoting a fare; cell lookup is <1 ms.
Why Pull Surge into Quoting, Not Dispatch?
Surge influences the rider's quote and the driver's earnings, not the matching choice. The dispatcher always picks the best driver by ETA; surge only sets the price. Mixing surge into dispatch ranking would create perverse incentives (drivers steered toward high-surge areas at the cost of rider wait).
ETA Estimation
ETA to pickup and to dropoff is computed by the routing engine (covered in design-google-maps). For dispatch we use a fast proxy: great-circle distance divided by the rolling average urban speed for that cell at this hour (~25 km/h). The accurate ETA is fetched after match for the rider's UI; the dispatch path uses the proxy to score candidates without paying the full routing cost on every offer.
Data Model
PostgreSQL (sharded by city / region): trips
CREATE TABLE trips (
id UUID PRIMARY KEY,
rider_id UUID NOT NULL,
driver_id UUID,
state VARCHAR(32) NOT NULL,
product VARCHAR(16) NOT NULL,
pickup_lat DOUBLE PRECISION NOT NULL,
pickup_lng DOUBLE PRECISION NOT NULL,
dropoff_lat DOUBLE PRECISION,
dropoff_lng DOUBLE PRECISION,
pickup_h3 BIGINT NOT NULL,
requested_at TIMESTAMPTZ NOT NULL,
matched_at TIMESTAMPTZ,
started_at TIMESTAMPTZ,
ended_at TIMESTAMPTZ,
fare_cents INT,
surge NUMERIC(3,2),
payment_state VARCHAR(32),
region VARCHAR(8) NOT NULL
);
CREATE INDEX idx_trips_rider ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver ON trips (driver_id, requested_at DESC);Shard key is region (city-level). Cross-city trips are rare (a Tokyo trip lives entirely in the Tokyo shard). Per-shard size stays well under 1 TB indefinitely.
Redis: live driver index, dispatch holds, surge
---------- Redis keys ----------
drivers:hex:<h3_cell> SET no TTL on key, TTL 30 s on members
driver:loc:<driver_id> HASH TTL 30 s
driver:status:<driver_id> STRING updated on online/offline events
dispatch:hold:<driver_id> STRING TTL 8 s
surge:cell:<h3_cell> STRING (float) refreshed every 60 s by FlinkRedis is partitioned by region; a Sao Paulo driver lives only in the Sao Paulo Redis cluster.
Kafka: location stream and trip events
| Topic | Partitions | Key | Retention |
|---|---|---|---|
driver.locations | 100 | driver_id | 24 h |
trip.events | 50 | trip_id | 7 days |
dispatch.offers | 20 | trip_id | 24 h |
Partitioning by driver_id and trip_id preserves ordering per entity, which the trip state machine depends on.
Scaling and Bottlenecks
250K location updates/sec
The ingest path is the throughput hot spot. We:
- Run Location Ingest as a stateless Go service behind a regional load balancer.
- Write to Kafka first (fire-and-forget acks=1) for durability.
- Kafka consumers update Redis with a small batch (50-100 updates per pipeline call) to amortize the round trip.
- Suppress updates when the driver is stationary for more than 30 s (most idle drivers stop updating until they move).
This gets the effective Redis write rate down to ~50K/sec per region, which a single mid-sized Redis cluster handles.
Friday night surge in Manhattan
When demand spikes 5x in a small geo, the H3 index for those cells gets hammered. Mitigations:
- Per-cell membership uses Redis SETs; reads are O(1) cardinality + O(K) iteration.
- Hot cells (Times Square) live in dedicated Redis shards via consistent hashing on cell prefix.
- Surge naturally throttles demand by raising prices; this is a feature, not a bug.
Driver matched twice (the dreaded bug)
The dispatch:hold SET NX prevents two dispatches from offering to the same driver. The second component is the trip state machine: even if a hold leaks, the driver's accept for trip A transitions the driver to onTrip, and the dispatcher checks driver:status before offering. Belt-and-braces.
Cross-region driver moves
A driver near the SF/Oakland boundary occasionally crosses into a different region's index. The Location Ingest detects the region change (by H3 resolution-3 cell) and emits a re-registration event to the new region. Brief window of double-membership is acceptable; ride dispatch always defers to the trip's region.
Multi-region failover
Each region is independent: Tokyo's outage does not affect Sao Paulo. Within a region we run active-active across two AZs; Redis runs with replication; Kafka has multiple brokers. The trip database is the most state-heavy; we use Patroni-managed Postgres with synchronous replication for the trip in progress, and async for analytics replicas.
Trade-offs and Alternatives
Greedy nearest-driver vs batched optimal assignment
Greedy: for each rider, pick the closest available driver (current production default for low latency). Batched: collect all rider requests over a short window (e.g., 2-5 s), solve a bipartite assignment (Hungarian algorithm) to minimize total wait time across all riders.
Greedy wins on responsiveness; batched wins on aggregate efficiency by 5-10% but adds 2-5 s wait. Real systems use greedy for now and run batching only in narrow regimes (e.g., airport queues where wait is already structural). It is fair to mention both in an interview.
H3 vs S2 vs Geohash
- H3: hex grid, uniform neighbors, open-sourced by Uber, optimal for matching.
- S2: Google's hierarchical sphere cells; great for very large radii, more complex.
- Geohash: simplest, but neighbors are not equidistant; you must query 9 cells (3x3) and over-fetch.
H3 is the modern default for ride-sharing; the others are valid for adjacent problems (PostGIS for low-scale, S2 for global indexing of static data).
Push location to all riders vs only matched riders
We push live driver position only to the rider whose trip the driver is on. Pushing to all nearby riders ('cars on map') is a separate read-only stream from the H3 index, refreshed on demand (every 10 s when the rider opens the app), not pushed.
TTL on driver index entries
30 s TTL on cell membership means a crashed driver app disappears from the index within 30 s without explicit cleanup. The trade-off: a driver with a flaky network briefly drops out of the index and might miss offers. 30 s is a safe middle ground; tuning this is operational.
Surge cadence: 1 min vs 10 s
1-minute updates are smooth and cacheable; 10-second updates feel responsive but oscillate and confuse riders. Real Uber uses ~1-min surge with smoothing; a few products use shorter windows for events (concert lets out, all riders book at once).
Sharding by city vs by user
Sharding the trip database by city aligns with the operational reality (most trips are intra-city). Sharding by rider_id distributes load evenly but a single rider's trips end up scattered, hurting common queries like 'recent trips'. City sharding is the right answer for ride-sharing.
Synchronous vs async dispatch
The rider's POST /rides could either return immediately with a trip_id and stream match updates over WebSocket (current production), or block until matched (simpler but higher request latency under load). Async is the right choice; the ride request is created instantly, the rider sees 'finding a driver' UI, and the match arrives via WebSocket.
Real-World Examples
How real systems implement this in production
Uber open-sourced its hexagonal grid system, H3, in 2018 after using it internally for dispatch and surge. H3 supports 16 resolutions from 1100 km hex edges down to 0.5 m, making it suitable for both global analytics and meter-level dispatch.
Trade-off: Hex grids give uniform neighbor distance at the cost of a slightly more complex coordinate system; for 2D matching workloads the trade is clearly worth it.
Lyft's matching engine uses a similar greedy nearest-driver dispatch with periodic batched re-optimization at high-supply locations like airports. Their tech blog describes the move from purely greedy to hybrid as one of the largest efficiency gains in marketplace history.
Trade-off: Batched assignment improves total wait time by ~5-10% but adds dispatch latency; only worth it where supply is structurally queued (airports, event venues).
Uber runs one of the largest Kafka deployments in the world, with trillions of messages per day across location streams, trip events, and analytics. The location topic is partitioned by driver_id so per-driver ordering is preserved end-to-end.
Trade-off: Kafka decouples producers from consumers and absorbs spikes, but adds operational complexity (broker management, partition rebalancing) compared to direct service-to-service calls.
China's largest ride-hailing platform serves 30M+ rides per day, around 3x Uber's global volume. DiDi published research on combinatorial matching (auction-style assignment) that goes beyond Uber's greedy approach for dense urban areas.
Trade-off: Auction matching can lift acceptance rates by 5%+ at the cost of additional 2-5 s dispatch latency; only viable when supply density is high enough that batching does not starve outlying riders.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Rider posts /rides with idempotency key; trip row created in `requested` state; trip.requested event emitted. Dispatch reads pickup H3 cell, runs kRing(k=2), filters to online idle drivers accepting UberX, scores by ETA-to-pickup proxy, picks top driver, sets Redis SET NX EX 8 hold, sends trip_offered over WebSocket. Driver accepts within 8 s; trip transitions to `matched`; rider notified via WebSocket; driver navigates to pickup. Driver taps 'arrived' -> state `arrived`; rider boards -> `in_progress`; driver arrives at dropoff -> `completed`. Trip Service emits trip.completed event; Payment Service charges rider with idempotency key, settles to driver ledger; state -> `paid`. Each transition writes to trip_transitions with a unique idempotency key so retries are safe.
Two layers. First, a Redis soft-hold: when dispatch picks a driver, it does SET NX EX 8 on dispatch:hold:<driver_id>; the NX makes this atomic so two parallel dispatchers cannot both succeed. Second, before offering, dispatch checks driver:status:<driver_id> and skips drivers in `onTrip`. When the driver accepts, the trip state machine transitions driver to `onTrip` atomically with the trip transition, so even if a hold is missed, the status check catches it. The combination guarantees at-most-one active trip per driver.
Surge is a per-H3-cell multiplier between 1.0 and 5.0, refreshed every 60 s. A Flink job consumes the trip_requested and driver_location Kafka topics, computes per-cell demand and supply over a 1-min tumbling window, applies smoothing (EMA) to prevent oscillation, and writes the multiplier to Redis (surge:cell:<h3>). The Quote service reads it synchronously when the rider asks for a fare estimate; the Trip service applies it when computing the final fare. Surge does NOT influence dispatch ranking; we still pick the closest driver by ETA. Surge only changes the price, which throttles demand naturally and incentivizes more drivers to come online.
Driver app sends an update every 4 s while moving (250K/s across 1M drivers). Location Ingest is a stateless Go service that writes to a Kafka topic partitioned by driver_id (durability + replayability). A consumer batches 50-100 updates per pipeline call and writes Redis HSET on driver:loc and SADD/SREM on cell membership only when the cell changes. Stationary drivers throttle to 30 s updates client-side. Per-region Redis cluster, partitioned by H3 resolution-3 prefix, keeps per-shard write rate under 30K/s, well within Redis capacity. Kafka decouples ingest from indexing so a Redis hiccup does not drop location data.
Pool changes the matching from 1:1 to many:1. Instead of greedy nearest-driver, the dispatcher considers a small batch of riders and tries to bundle them into shared trips. New components: a route batcher that maintains in-progress trips' projected route polylines and inserts new pickups if the detour is bounded; a route service that recomputes pickup/dropoff order; a fare splitter that prorates the fare among riders. The trip state machine adds compound states (e.g., a single driver trip has multiple rider segments). The dispatch latency budget grows from 5 s to 10-30 s because batching deliberately delays to find good bundles. The H3 index and trip storage stay the same; the matching logic becomes a constrained optimization rather than nearest-neighbor.
Interview Tips
How to discuss this topic effectively
Open with the dispatch latency budget: <5 s end to end. Then decompose: candidate generation ~100 ms, scoring ~50 ms, offer round trip ~1-2 s, accept window ~5-8 s. This shows you reason backward from a user-visible number.
Pick H3 for the driver index and explain why hex neighbors beat geohash. Saying 'in a hex grid all 6 neighbors are equidistant, which matches a within-radius dispatch query naturally' is the senior-level answer.
Explicitly model the trip state machine. Draw 'requested -> matched -> in_progress -> completed -> paid' and call out that every transition needs an idempotency key. Interviewers love this because it surfaces correctness thinking.
Describe the soft-hold pattern with Redis SET NX EX 8. It is the cleanest way to prevent double-dispatch races and demonstrates deep distributed systems intuition without invoking Paxos.
Mention surge as a stream-processing job that feeds into the synchronous request path. Saying 'Flink computes a per-cell multiplier every 60 s and writes to Redis; the API reads it in <1 ms' shows you can compose batch and real-time.
Common Mistakes
Pitfalls to avoid in interviews
Using geohash for the driver index without acknowledging the corner-neighbor problem
Geohash square cells have 8 neighbors at unequal distances (corners are 1.41x farther than edges). For ride dispatch you query 'all drivers within radius', and the geohash 9-cell query over-fetches. H3 hexes give 6 equidistant neighbors and a clean kRing(k=2) query. Mention this explicitly.
Doing a synchronous routing call for ETA on every dispatch candidate
If you have 50 candidates and the routing engine takes 50 ms each, you spend 2.5 s on routing per dispatch. Use the great-circle / urban-speed proxy for ranking; only call the real router after the match for the rider's UI. Real systems decouple the dispatch ETA from the live navigation ETA.
Ignoring the double-dispatch race condition
Two parallel dispatchers in different regions or shards can offer the same driver to two riders if they read the index concurrently. The fix is a Redis SET NX EX hold per driver: the first dispatcher wins atomically; the second moves on. Without this, you eventually offer the same driver to two riders and both accept.
Storing trip state in Redis or in an event stream alone
Trips need durable transactional storage and audit transitions. Use Postgres for the trip row and a transitions table with a UNIQUE idempotency key. Redis is for transient indexes and holds, not the source of truth for trip state. Confusing these is the top correctness bug in dispatch designs.
Pushing all driver locations to all rider apps
Pushing 1M driver positions to 5M nearby rider apps is bandwidth suicide. Push live position only to the rider whose trip the driver is currently on. For the 'cars on map' background view, fetch on demand from the H3 index every 10 s, not via push.
