System Design Article
Design Nearby / Location Service (Yelp)
Difficulty: Medium
Design a 'nearby' service like Yelp that returns the top businesses within a search radius of the user's location, ranking by distance, rating, and category, scaling to 200M monthly users querying 100M businesses. The interview centerpiece is the geospatial index: how to find 'all businesses within 5 km of (lat, lng)' efficiently. We compare bounding-box scans, geohashes, quadtrees, R-trees, and PostGIS GIST indexes; we recommend geohash + secondary index for write-heavy systems and quadtree/R-tree for read-heavy. We cover business storage and search, review ranking, the infrequent-update vs frequent-query asymmetry, and how to handle the long tail of remote regions.
Design Nearby / Location Service (Yelp)
Design a 'nearby' service like Yelp that returns the top businesses within a search radius of the user's location, ranking by distance, rating, and category, scaling to 200M monthly users querying 100M businesses. The interview centerpiece is the geospatial index: how to find 'all businesses within 5 km of (lat, lng)' efficiently. We compare bounding-box scans, geohashes, quadtrees, R-trees, and PostGIS GIST indexes; we recommend geohash + secondary index for write-heavy systems and quadtree/R-tree for read-heavy. We cover business storage and search, review ranking, the infrequent-update vs frequent-query asymmetry, and how to handle the long tail of remote regions.
171 views
4
Requirements
Functional Requirements
- Find nearby businesses: given a user's (lat, lng) and a radius, return businesses within that radius.
- Filter by category: 'pizza near me', 'gas stations within 10 km', etc.
- Sort by distance, rating, or relevance (default is a blended score).
- Pagination: typical request returns top 20; user can scroll for more.
- Business detail page: full business info (name, address, hours, photos, reviews) on tap.
- Search by name ("Starbucks near me") with autocomplete.
- Reviews: users post 1-5 star reviews and short text; aggregated into business rating.
Out of Scope (state explicitly)
- Real-time location streaming (not Uber; users explicitly issue queries).
- Booking/reservations (separate product).
- Order placement / delivery (use food-delivery case study).
- The full review system at depth (we focus on the nearby-search problem).
Non-Functional Requirements
- Scale: 100M businesses, 200M MAU, 500K queries/sec at peak (geographic skew - dense city queries dominate).
- Latency: p99 < 200 ms for nearby search.
- Availability: 99.9%; if the index is briefly down, fall back to cached or stale results.
- Update rate: business attributes change rarely (~1M updates/day, mostly hours/photos); new businesses added ~10K/day.
- Read/write ratio: 10000:1 (very read-heavy).
- Eventually consistent: a hours change visible within minutes is acceptable.
Back-of-the-Envelope Estimation
Data and Storage
---------- Business catalog ----------
Businesses worldwide: 100M
Per-business record: ~5 KB (name, address, hours, ~5 categories, ~10 photos URL refs)
Total business storage: ~500 GB (small; fits one sharded SQL cluster)
Reviews: 1B (~50 reviews per active business)
Per review: ~1 KB
Review storage: ~1 TBQuery Volume
---------- Query estimation ----------
MAU: 200M
DAU: 60M
Queries per DAU per day: ~4
Total queries/day: ~240M
Queries/sec avg: ~3K
Queries/sec peak (lunch rush): ~500K <- huge spike, dense urbanThe peak/avg ratio is unusually high because queries cluster around mealtimes in dense cities. A small fraction of geographic cells (city centers) gets the bulk of queries.
Geospatial Index Size
---------- Geohash index ----------
100M businesses indexed by geohash (e.g., precision 6 = ~1.2 km cells)
Geohash -> [business_ids] inverted index: ~5 GB
With multiple precisions for fast multi-radius search: ~20 GB
Fits in a Redis cluster (~10 nodes).Cache Sizing
---------- Hot region cache ----------
Top ~10K geohash cells (city centers globally): receive 90% of queries
Per cell cache: top 100 businesses by rating in that cell, by category
Per cell entry: ~200 KB (100 businesses * ~2 KB summary)
Total cache: ~10K * 200 KB * ~10 categories = ~20 GB across distributed cacheHigh-Level Design
---------- High-level architecture ----------
+-----------+
| Client |
+-----------+
|
v
+-------------------+
| Search API |
| - validates |
| - resolves geo |
+-------------------+
|
v
+-------------------+
| Hot Region Cache | (Redis, per-cell pre-rendered top businesses)
+-------------------+
|
(cache miss)
|
v
+-------------------+
| Geo Index |
| (geohash) |
+-------------------+
|
v
+-------------------+
| Ranker |
| - distance |
| - rating |
| - personalization|
+-------------------+
|
v
+-------------------+
| Business Service |
| (hydrate details)|
+-------------------+
|
v
+-------------------+
| Business Store |
| (Postgres + S3) |
+-------------------+API Design
// Nearby search
GET /api/v1/businesses/nearby
?lat=37.7749&lng=-122.4194
&radius_km=2
&category=pizza
&limit=20
&sort=relevance
// Response
{
"results": [
{
"business_id": "b_abc",
"name": "Tony's Pizza",
"distance_m": 320,
"rating": 4.6,
"review_count": 1240,
"price": "$$",
"categories": ["pizza", "italian"],
"is_open_now": true,
"thumbnail_url": "https://...",
"score": 0.92
},
...
],
"next_cursor": "..."
}Detailed Design
The two interesting components are the geospatial index choice and the boundary-cell handling that most candidates miss.
Geospatial Index Comparison
| Structure | Update cost | Query cost | Strengths | Weaknesses |
|---|---|---|---|---|
| Lat/lng bounding box scan | O(1) | O(N) | Trivial | Doesn't scale; no real index |
| Geohash + secondary index | O(1) | O(K) | Simple, write-friendly | Boundary effects |
| Quadtree | O(log N) | O(log N + K) | Adaptive density; great for read-heavy | Tree rebalancing on updates |
| R-tree | O(log N) | O(log N + K) | Spatial queries beyond points | Worst-case complexity |
| S2 (Google's) | O(log N) | O(log N + K) | Multi-resolution, good for global | More complex to implement |
| PostGIS GIST | O(log N) | O(log N + K) | SQL native, easy queries | Heavier per-query cost |
For Yelp's read-heavy, write-light workload, geohash with cell expansion is the right default. We discuss why below.
Geohash Explained
A geohash encodes (lat, lng) into a base-32 string by interleaving bits of latitude and longitude.
---------- Geohash precision and cell size ----------
Precision 4 (~30 km cells): 'dr5r' = San Francisco area
Precision 5 (~5 km): 'dr5ru'
Precision 6 (~1 km): 'dr5ruv'
Precision 7 (~150 m): 'dr5ruvy'
Precision 8 (~40 m): 'dr5ruvy0'Nearby points share a longer common prefix. So 'businesses with geohash starting with dr5ru' are all roughly within 5 km of each other.
The Boundary Cell Problem
This is the senior insight every interviewer wants to hear.
---------- The boundary problem ----------
User at (lat=37.78, lng=-122.42): geohash precision 6 = 'dr5ru0'
Business at (lat=37.785, lng=-122.4205): geohash precision 6 = 'dr5ru9' (different cell)
They're 600m apart, well within a 1 km query radius, but live in different cells.
Naive query: 'find businesses with geohash dr5ru0' MISSES the business in dr5ru9.Fix: query the user's cell PLUS its 8 neighboring cells (north, south, east, west, NE, NW, SE, SW). For a 1 km radius at precision 6 (1 km cells), the 9-cell query reliably covers the radius; we then filter by exact distance.
---------- 9-cell query ----------
For user cell 'dr5ru0', compute the 8 neighbor cells:
'dr5ru1', 'dr5ru2', 'dr5ru3', 'dr5ru9', 'dr5rud', 'dr5ruc', 'dr5ru4', 'dr5ru6'
Query 9 cells in parallel; union the candidate businesses.
Compute exact distance to each; filter to within radius_m.
Apply ranking.Picking Geohash Precision Per Query Radius
The precision determines the cell size; you want precision such that the user's radius covers ~1-2 cells (so 9-cell query catches the result with low false-positive overhead).
---------- Precision selection ----------
radius_km == 1: precision 6 (~1 km cells, 9-cell query covers ~3km)
radius_km == 5: precision 5 (~5 km cells)
radius_km == 25: precision 4 (~30 km cells)
radius_km == 100: precision 3 (~150 km cells)Maintain multiple geohash columns per business at different precisions (one per common radius), making radius-based lookups O(1) cell calculation + O(K) candidate filter.
Quadtree Alternative
A quadtree partitions space into 4 quadrants recursively, splitting only when a node exceeds a capacity threshold. Adaptive: dense areas (downtown) have many small cells; sparse areas (open ocean) have few large cells.
---------- Quadtree structure ----------
Node {
bounds: (lat_min, lat_max, lng_min, lng_max)
businesses: list of up to K (e.g., 100) businesses
children: 4 sub-quadrants if business count > K
}
Query (lat, lng, radius):
Traverse from root.
For each node whose bounds intersect the query circle:
If leaf, scan businesses; filter by exact distance.
Else, recurse into intersecting children.Quadtrees are better when business density varies dramatically (urban vs rural). They're harder to implement and shard than geohashes; we'd recommend quadtree for read-only or static datasets and geohash for dynamic ones.
Why Geohash Over PostGIS for This Scale
PostGIS supports R-tree and GIST indexes for spatial queries with full SQL ergonomics:
SELECT * FROM businesses
WHERE ST_DWithin(location, ST_MakePoint(lng, lat)::geography, 5000)
ORDER BY ST_Distance(location, ST_MakePoint(lng, lat)::geography)
LIMIT 20;Works beautifully up to ~10M businesses on a single Postgres instance (single node, ~50 ms per query). At 100M businesses and 500K QPS, we'd shard, and PostGIS becomes operationally heavy. Geohash + Redis is simpler and cheaper at scale.
For smaller products (<10M businesses, <10K QPS), PostGIS is the right answer.
Ranking
Once we have the candidate set (businesses within radius), we rank.
---------- Ranking score (blended) ----------
score = w1 * (1 / distance_normalized) +
w2 * rating_normalized +
w3 * log(review_count + 1) +
w4 * personalization_score +
w5 * is_open_now * 0.1 (small bonus for open)Weights tuned offline against historical click data. For 'pizza near me', distance and rating dominate; for 'fancy restaurant', rating and review_count dominate.
Personalization
- User's cuisine preferences (from past reviews/searches).
- Price tier preference.
- Distance preference (some users walk; some drive).
Applied at the rerank stage on top-100 candidates (similar pattern to search engine).
Hot-Region Caching
90% of queries hit ~10K cells globally (city centers). We cache per-cell precomputed top businesses by category.
---------- Cache key structure ----------
Key: nearby:<geohash6>:<category>:<sort_by>
Value: serialized top-100 businesses with summary fields
TTL: 5 minutes for active categories; longer for stable onesFirst request to a cell warms the cache; subsequent requests for the same cell + category hit cache in <10 ms.
Cache invalidation: when a business updates its hours or rating, only the affected cell's cache is invalidated (lazy: bump version key).
Search by Name (Autocomplete)
Orthogonal to the geo index: a separate inverted index over business names with location-aware ranking.
---------- Name search flow ----------
1. User types 'star' near (lat, lng)
2. Typeahead service returns top 20 names matching 'star' across all cities
3. Filter to businesses whose location is within 50 km of user
4. Rank by name relevance + distance + rating
5. Show top 10This is essentially the typeahead system from the previous case study with a geographic filter applied to the candidate set.
Data Model
Postgres (sharded by region): businesses
CREATE TABLE businesses (
id BIGINT PRIMARY KEY,
name VARCHAR(255) NOT NULL,
address VARCHAR(512),
city VARCHAR(100),
country VARCHAR(2),
lat DOUBLE PRECISION NOT NULL,
lng DOUBLE PRECISION NOT NULL,
geohash_p4 CHAR(4), -- ~30 km cells
geohash_p5 CHAR(5), -- ~5 km
geohash_p6 CHAR(6), -- ~1 km
geohash_p7 CHAR(7), -- ~150 m
rating DOUBLE PRECISION,
review_count INT,
price_tier INT,
primary_category VARCHAR(64),
hours JSONB,
created_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_geo_p5 ON businesses (geohash_p5);
CREATE INDEX idx_geo_p6 ON businesses (geohash_p6);
CREATE INDEX idx_geo_cat_p6 ON businesses (geohash_p6, primary_category);Shard by country/region; the dataset is naturally partitionable.
Redis: geo index, hot caches
---------- Redis structure ----------
geoindex:p6:<geohash6> -> SET of business_ids no TTL
geoindex:p6:<geohash6>:<category> -> ZSET of (business_id, rating) no TTL
nearby:<geohash6>:<cat>:<sort> -> serialized top-100 result TTL 5 min
biz_summary:<biz_id> -> JSON business summary card TTL 1 hrWriting a new business updates both the SQL row and the Redis sets/sorted sets in a single transaction (or via change-data-capture).
S3: photos
Standard blob storage for business photos, served via CDN. Photo URLs stored in business JSONB.
Reviews (Cassandra, partitioned by business_id)
-- Cassandra
CREATE TABLE reviews (
business_id bigint,
review_id bigint, -- Snowflake
user_id bigint,
rating int,
text text,
created_at timestamp,
PRIMARY KEY ((business_id), review_id)
) WITH CLUSTERING ORDER BY (review_id DESC);Latest reviews per business: single partition scan.
Scaling and Bottlenecks
The lunch rush in San Francisco
At noon Pacific, every restaurant query in SF hits a few cells. Cache absorbs the bulk; the geo index sees ~10K QPS for the same handful of cells.
Mitigations:
- Per-cell cache with high TTL during peak hours (5 min is safe; ratings/hours rarely change in 5 min).
- Multiple Redis replicas for hot cells.
- Compute results once per cell per minute (precompute job) and serve from cache for the next 60s.
A new business opens; cache invalidation
When 'New Pizza Place' opens at (lat, lng), it should appear in nearby queries within minutes. Strategies:
- New business creation -> CDC event -> Redis update of all relevant geohash sets (precision 4-7) in one transaction.
- Active cell caches invalidated (version bump key).
- Next query for those cells repopulates the cache with the new business included.
Long-tail queries (a remote area)
A query in rural Wyoming hits a cold cell. No cache; first query is slow (~50-100 ms). Subsequent queries warm the cell. Acceptable; the volume in cold cells is small.
A user query spans multiple geohash precisions
If the user queries 50 km radius, we'd query 9 cells at precision 4 (~30 km). But the boundary cells could legitimately contain businesses up to ~60 km away, far beyond the radius. The exact-distance filter is what saves us; we're doing slight over-fetching but exact filtering.
Multi-region
The business catalog naturally partitions by region: US businesses in us-east, EU in eu-west, etc. Cross-region queries (a tourist in Tokyo from a US account) route to the relevant region's index.
Index size growth
100M businesses today; assume 200M in 5 years. Index doubles in size; still fits in <50 GB of Redis. Sharding is straightforward (geohash prefix as shard key).
Trade-offs and Alternatives
Geohash vs Quadtree
Geohash is simpler, easier to shard (geohash prefix is a natural shard key), and works well when business density is roughly uniform within precision boundaries. Quadtree adapts to density (more cells in cities, fewer in rural) but rebalances on writes (annoying for streaming updates). For Yelp-like read-heavy workloads, geohash + cell expansion is the standard answer.
Geohash precision: fixed vs adaptive
Fixed precision (use precision 6 for all queries) is simpler but wastes bandwidth on large-radius queries (you'd query thousands of cells for a 100 km radius). Multi-precision indexes (store each business at p4/p5/p6/p7) cost ~4x storage but enable O(1) precision selection per query.
Why not S2 (Google's hierarchical hex grid)?
S2 is more sophisticated than geohash: spherical projection (no equator distortion), hierarchical multi-resolution (every cell has parents at coarser resolution). Better for very large radii or pole-near queries. The trade-off is implementation complexity. Geohash is good enough for most products; S2 is the right answer if you're indexing the entire sky or doing trans-oceanic queries.
Distance computation: Haversine vs equirectangular
For exact distance filtering, Haversine formula is most accurate but slower. Equirectangular approximation is much faster and accurate within ~1% for ranges under ~100 km. We use equirectangular for ranking and Haversine for the final 'show me distance in meters' display.
Why precompute and cache aggressively?
The asymmetry: 1 update/sec, 500K queries/sec. Aggressive caching (5-min TTL) is cheap and safe; ratings/hours don't change often enough to make 5-min staleness a problem. The only updates that matter immediately are new business creation, handled by explicit cache invalidation.
Why Postgres for the canonical store but Redis for the index?
Postgres holds the source of truth (transactional updates, full schema). Redis holds the hot query path (geo SETs, hot-cell caches). The CDC-style sync from Postgres -> Redis is well-understood, and the read path never touches Postgres for nearby search; only the business detail page does.
Personalization vs simplicity
Personalization improves clickthrough by ~10-20% but breaks per-cell caching (each user sees different rankings). The compromise: cache the candidate set per cell; rerank with user features in a fast service. Adds 10-20 ms but unlocks personalization without losing cache effectiveness.
Real-World Examples
How real systems implement this in production
Yelp indexes ~6M businesses globally. They use a custom geospatial backend with quadtree-style indexes; reviews stored separately. Heavy use of caching for popular city + category combinations. Their search blends location, ratings, review velocity, and personalization based on past reviews and user location patterns.
Trade-off: Yelp's review-heavy ranking creates a reinforcement loop (popular places get more reviews, rank higher). They've experimented with various 'editorial' boosts to surface lesser-known businesses, with mixed success. The lesson: ranking signals can feed back on themselves and need correction.
Uber Eats handles ~10M restaurants worldwide and serves 'nearby restaurants' as a primary use case. Uses Google's S2 cell library (more accurate than geohash for a globe) plus heavy caching. Ranking includes ETA estimation in addition to distance/rating, since ETA matters more than absolute distance for delivery.
Trade-off: Uber chose S2 over geohash for global accuracy and hierarchical operations (zoom in/out as the user pans). The trade-off: S2 is more complex to implement; the precision pays off only at very large scale or for use cases where pole-near queries matter.
Tinder shows nearby potential matches sorted by distance with various preference filters. Uses geohash-style sharding plus per-user filter sets (gender, age range, distance). The query pattern is read-heavy and the dataset is 'businesses' that are people, with high update rate (people move, change location).
Trade-off: Tinder's high write rate (people moving) makes adaptive structures like quadtrees less appealing. Geohash + Redis with frequent SET updates handles the workload; quadtree's rebalance cost would be punishing.
Airbnb's nearby search differs because users search on a map by drawing a viewport ("show me everything within this map area"), not a radius. They use a quadtree-like structure tied to the map's tile system, with property availability checks layered on top. Caching is per-tile, similar to per-cell here.
Trade-off: Airbnb's tile-based query pattern fits quadtree better than radius-based queries fit geohash. The lesson: pick your spatial index to match the query shape, not just the data shape. Radius queries -> geohash; viewport/tile queries -> quadtree.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client sends GET /nearby?lat=37.78&lng=-122.42&category=pizza&radius_km=2. Server computes user's geohash6 'dr5ru0'. Cache lookup: 'nearby:dr5ru0:pizza:relevance'. If cached (90%+ in dense areas), return top-20 in <10 ms. If miss, compute 9 neighbor cells; query Redis SETs for each cell to get candidate business_ids (200-2000 candidates). Hydrate business summaries from Redis (or Postgres on miss). Compute exact distance for each (Haversine). Filter to within radius. Score by blended ranking (distance + rating + popularity + open_now). Top 100 candidates passed to reranker for personalization. Return top 20. Cache result with 5-min TTL. Total: 50-150 ms uncached, <10 ms cached.
The boundary cell problem is exactly why we never query just the user's cell. Always compute the 8 neighbor cells around the user's cell (north, south, east, west, NE, NW, SE, SW) and query all 9 in parallel. Union the results. Compute exact distance to each candidate; filter to within radius. The over-fetch is small (9 cells * ~K candidates each) and the exact-distance filter ensures correctness. Skipping the neighbor query is the most common bug.
Both work. Geohash is simpler: a string prefix uniquely identifies a cell, sharding by prefix is trivial, lookups are O(1), and the data structure is just an inverted index from cell to businesses. Quadtree adapts to density (a small leaf in dense cities, large in rural), which can give tighter candidate sets, but the tree must rebalance on writes and sharding is harder (cells are 2D regions, not strings). For our read-heavy, write-light workload with moderate density variation, geohash is the right default. Quadtree wins for very skewed density (like Uber's driver locations in Manhattan vs Wyoming) where adaptive partitioning matters more.
The business update writes to Postgres in a transaction; a CDC-style listener publishes a 'business.updated' event to Kafka. A consumer updates the Redis index entries for that business (the geo SETs). Then the consumer invalidates the affected cell caches by incrementing a version key (or directly evicting cache entries for the relevant geohash6 + categories). Next query for that cell triggers a cache miss and repopulates with the new data. End-to-end: a hours change is visible in queries within 30 seconds.
100 km radius needs precision 3 (~150 km cells). The 9-cell query covers ~450 km, plenty for a 100 km radius. Per-business multi-precision storage means we already have geohash_p3 indexed. Query the relevant cells; the candidate set might be small (100 gas stations across 9 large rural cells) which is fine. Rural cells are usually NOT cached (low query volume), so first query goes to Redis index + Postgres for hydration: 50-100 ms. Subsequent queries (rare) hit warm cache. The cold path is acceptable because rural query volume is low.
Interview Tips
How to discuss this topic effectively
Lead with the read-heavy asymmetry. Saying '10000:1 read/write ratio means we cache aggressively and precompute' shows you read the workload before designing.
Pick geohash for the index and explain the 9-cell neighbor query. It's the standard answer; mentioning boundary cells is the senior-level signal.
Compare geohash to quadtree explicitly. Saying 'I'd use geohash because writes are easy and density is moderate; quadtree wins for highly skewed density but is harder to shard' shows depth.
For ranking, mention BLENDING distance and rating with weights tuned to historical clicks. Pure distance ranking misses the 'mediocre place is closer than the great place' problem.
Cite PostGIS as the right answer for smaller scale. Saying 'PostGIS works up to ~10M businesses on a single node; beyond that we shard' shows you know when to choose simpler tools.
Common Mistakes
Pitfalls to avoid in interviews
Querying only the user's geohash cell (forgetting boundary cells)
A user near a cell edge has many nearby points in the adjacent cell; querying only the user's cell misses them. Always query the user's cell PLUS the 8 surrounding cells (3x3 grid), then filter by exact distance. Skipping this is the most common interview mistake.
Storing only one geohash precision
A 1 km radius query needs precision 6; a 100 km radius needs precision 3. Storing one precision means you either over-fetch on small radii or query thousands of cells for large radii. Index each business at multiple precisions (4-7); pick precision per query radius.
Using PostGIS at 100M businesses without sharding
PostGIS works up to ~10M businesses on one node. At 100M with 500K QPS, it becomes operationally heavy. Either shard PostGIS by region (works but complex) or move to a simpler geohash + Redis index (lighter, easier to scale).
Ranking purely by distance
Distance alone returns the closest businesses, which are often mediocre. Real users want a blend: distance + rating + review count + open-now + personalization. Tune weights against historical clickthrough; pure distance is a strawman ranker.
Caching at the user level instead of the cell level
Per-user caches don't share between users; cache hit rate stays low. Per-cell caches (Geohash6 -> top businesses for that cell) are shared across all users in the same area, giving 90%+ hit rates for queries in city centers.
