System Design Article

Design Nearby / Location Service (Yelp)

Difficulty: Medium

Design a 'nearby' service like Yelp that returns the top businesses within a search radius of the user's location, ranking by distance, rating, and category, scaling to 200M monthly users querying 100M businesses. The interview centerpiece is the geospatial index: how to find 'all businesses within 5 km of (lat, lng)' efficiently. We compare bounding-box scans, geohashes, quadtrees, R-trees, and PostGIS GIST indexes; we recommend geohash + secondary index for write-heavy systems and quadtree/R-tree for read-heavy. We cover business storage and search, review ranking, the infrequent-update vs frequent-query asymmetry, and how to handle the long tail of remote regions.

Design Nearby / Location Service (Yelp)

System Design

Medium

design-nearby-service

case-study

search-discovery

nearby-search

geospatial-index

geohash

quadtree

r-tree

postgis

yelp

location-based-services

spatial-indexing

system-design

intermediate

premium

171 views

Requirements

Functional Requirements

Find nearby businesses: given a user's (lat, lng) and a radius, return businesses within that radius.
Filter by category: 'pizza near me', 'gas stations within 10 km', etc.
Sort by distance, rating, or relevance (default is a blended score).
Pagination: typical request returns top 20; user can scroll for more.
Business detail page: full business info (name, address, hours, photos, reviews) on tap.
Search by name ("Starbucks near me") with autocomplete.
Reviews: users post 1-5 star reviews and short text; aggregated into business rating.

Out of Scope (state explicitly)

Real-time location streaming (not Uber; users explicitly issue queries).
Booking/reservations (separate product).
Order placement / delivery (use food-delivery case study).
The full review system at depth (we focus on the nearby-search problem).

Non-Functional Requirements

Scale: 100M businesses, 200M MAU, 500K queries/sec at peak (geographic skew - dense city queries dominate).
Latency: p99 < 200 ms for nearby search.
Availability: 99.9%; if the index is briefly down, fall back to cached or stale results.
Update rate: business attributes change rarely (~1M updates/day, mostly hours/photos); new businesses added ~10K/day.
Read/write ratio: 10000:1 (very read-heavy).
Eventually consistent: a hours change visible within minutes is acceptable.

Back-of-the-Envelope Estimation

Data and Storage

Text

---------- Business catalog ----------
Businesses worldwide:             100M
Per-business record:              ~5 KB (name, address, hours, ~5 categories, ~10 photos URL refs)
Total business storage:           ~500 GB (small; fits one sharded SQL cluster)
Reviews:                          1B (~50 reviews per active business)
Per review:                       ~1 KB
Review storage:                   ~1 TB

Query Volume

Text

---------- Query estimation ----------
MAU:                              200M
DAU:                              60M
Queries per DAU per day:          ~4
Total queries/day:                ~240M
Queries/sec avg:                  ~3K
Queries/sec peak (lunch rush):    ~500K  <- huge spike, dense urban

The peak/avg ratio is unusually high because queries cluster around mealtimes in dense cities. A small fraction of geographic cells (city centers) gets the bulk of queries.

Geospatial Index Size

Text

---------- Geohash index ----------
100M businesses indexed by geohash (e.g., precision 6 = ~1.2 km cells)
Geohash -> [business_ids] inverted index: ~5 GB
With multiple precisions for fast multi-radius search: ~20 GB
Fits in a Redis cluster (~10 nodes).

Cache Sizing

Text

---------- Hot region cache ----------
Top ~10K geohash cells (city centers globally): receive 90% of queries
Per cell cache: top 100 businesses by rating in that cell, by category
Per cell entry: ~200 KB (100 businesses * ~2 KB summary)
Total cache: ~10K * 200 KB * ~10 categories = ~20 GB across distributed cache

High-Level Design

Text

---------- High-level architecture ----------
   +-----------+
   |  Client   |
   +-----------+
        |
        v
   +-------------------+
   |  Search API       |
   |  - validates      |
   |  - resolves geo   |
   +-------------------+
        |
        v
   +-------------------+
   |  Hot Region Cache |  (Redis, per-cell pre-rendered top businesses)
   +-------------------+
        |
   (cache miss)
        |
        v
   +-------------------+
   |  Geo Index        |
   |  (geohash)        |
   +-------------------+
        |
        v
   +-------------------+
   |  Ranker           |
   |  - distance       |
   |  - rating         |
   |  - personalization|
   +-------------------+
        |
        v
   +-------------------+
   |  Business Service |
   |  (hydrate details)|
   +-------------------+
        |
        v
   +-------------------+
   |  Business Store   |
   |  (Postgres + S3)  |
   +-------------------+

API Design

Jsonc

// Nearby search
GET /api/v1/businesses/nearby
  ?lat=37.7749&lng=-122.4194
  &radius_km=2
  &category=pizza
  &limit=20
  &sort=relevance

// Response
{
    "results": [
        {
            "business_id": "b_abc",
            "name": "Tony's Pizza",
            "distance_m": 320,
            "rating": 4.6,
            "review_count": 1240,
            "price": "$$",
            "categories": ["pizza", "italian"],
            "is_open_now": true,
            "thumbnail_url": "https://...",
            "score": 0.92
        },
        ...
    ],
    "next_cursor": "..."
}

Detailed Design

The two interesting components are the geospatial index choice and the boundary-cell handling that most candidates miss.

Geospatial Index Comparison

Structure	Update cost	Query cost	Strengths	Weaknesses
Lat/lng bounding box scan	O(1)	O(N)	Trivial	Doesn't scale; no real index
Geohash + secondary index	O(1)	O(K)	Simple, write-friendly	Boundary effects
Quadtree	O(log N)	O(log N + K)	Adaptive density; great for read-heavy	Tree rebalancing on updates
R-tree	O(log N)	O(log N + K)	Spatial queries beyond points	Worst-case complexity
S2 (Google's)	O(log N)	O(log N + K)	Multi-resolution, good for global	More complex to implement
PostGIS GIST	O(log N)	O(log N + K)	SQL native, easy queries	Heavier per-query cost

For Yelp's read-heavy, write-light workload, geohash with cell expansion is the right default. We discuss why below.

Geohash Explained

A geohash encodes (lat, lng) into a base-32 string by interleaving bits of latitude and longitude.

Text

---------- Geohash precision and cell size ----------
Precision 4 (~30 km cells):    'dr5r' = San Francisco area
Precision 5 (~5 km):           'dr5ru'
Precision 6 (~1 km):           'dr5ruv'
Precision 7 (~150 m):          'dr5ruvy'
Precision 8 (~40 m):           'dr5ruvy0'

Nearby points share a longer common prefix. So 'businesses with geohash starting with dr5ru' are all roughly within 5 km of each other.

The Boundary Cell Problem

This is the senior insight every interviewer wants to hear.

Text

---------- The boundary problem ----------
User at (lat=37.78, lng=-122.42): geohash precision 6 = 'dr5ru0'
Business at (lat=37.785, lng=-122.4205): geohash precision 6 = 'dr5ru9'  (different cell)

They're 600m apart, well within a 1 km query radius, but live in different cells.
Naive query: 'find businesses with geohash dr5ru0' MISSES the business in dr5ru9.

Fix: query the user's cell PLUS its 8 neighboring cells (north, south, east, west, NE, NW, SE, SW). For a 1 km radius at precision 6 (1 km cells), the 9-cell query reliably covers the radius; we then filter by exact distance.

Text

---------- 9-cell query ----------
For user cell 'dr5ru0', compute the 8 neighbor cells:
  'dr5ru1', 'dr5ru2', 'dr5ru3', 'dr5ru9', 'dr5rud', 'dr5ruc', 'dr5ru4', 'dr5ru6'
Query 9 cells in parallel; union the candidate businesses.
Compute exact distance to each; filter to within radius_m.
Apply ranking.

Picking Geohash Precision Per Query Radius

The precision determines the cell size; you want precision such that the user's radius covers ~1-2 cells (so 9-cell query catches the result with low false-positive overhead).

Text

---------- Precision selection ----------
radius_km == 1:    precision 6 (~1 km cells, 9-cell query covers ~3km)
radius_km == 5:    precision 5 (~5 km cells)
radius_km == 25:   precision 4 (~30 km cells)
radius_km == 100:  precision 3 (~150 km cells)

Maintain multiple geohash columns per business at different precisions (one per common radius), making radius-based lookups O(1) cell calculation + O(K) candidate filter.

Quadtree Alternative

A quadtree partitions space into 4 quadrants recursively, splitting only when a node exceeds a capacity threshold. Adaptive: dense areas (downtown) have many small cells; sparse areas (open ocean) have few large cells.

Text

---------- Quadtree structure ----------
Node {
  bounds: (lat_min, lat_max, lng_min, lng_max)
  businesses: list of up to K (e.g., 100) businesses
  children: 4 sub-quadrants if business count > K
}

Query (lat, lng, radius):
  Traverse from root.
  For each node whose bounds intersect the query circle:
    If leaf, scan businesses; filter by exact distance.
    Else, recurse into intersecting children.

Quadtrees are better when business density varies dramatically (urban vs rural). They're harder to implement and shard than geohashes; we'd recommend quadtree for read-only or static datasets and geohash for dynamic ones.

Why Geohash Over PostGIS for This Scale

PostGIS supports R-tree and GIST indexes for spatial queries with full SQL ergonomics:

SQL

SELECT * FROM businesses
  WHERE ST_DWithin(location, ST_MakePoint(lng, lat)::geography, 5000)
  ORDER BY ST_Distance(location, ST_MakePoint(lng, lat)::geography)
  LIMIT 20;

Works beautifully up to ~10M businesses on a single Postgres instance (single node, ~50 ms per query). At 100M businesses and 500K QPS, we'd shard, and PostGIS becomes operationally heavy. Geohash + Redis is simpler and cheaper at scale.

For smaller products (<10M businesses, <10K QPS), PostGIS is the right answer.

Ranking

Once we have the candidate set (businesses within radius), we rank.

Text

---------- Ranking score (blended) ----------
score = w1 * (1 / distance_normalized) +
        w2 * rating_normalized +
        w3 * log(review_count + 1) +
        w4 * personalization_score +
        w5 * is_open_now * 0.1 (small bonus for open)

Weights tuned offline against historical click data. For 'pizza near me', distance and rating dominate; for 'fancy restaurant', rating and review_count dominate.

Personalization

User's cuisine preferences (from past reviews/searches).
Price tier preference.
Distance preference (some users walk; some drive).

Applied at the rerank stage on top-100 candidates (similar pattern to search engine).

Hot-Region Caching

90% of queries hit ~10K cells globally (city centers). We cache per-cell precomputed top businesses by category.

Text

---------- Cache key structure ----------
Key:    nearby:<geohash6>:<category>:<sort_by>
Value:  serialized top-100 businesses with summary fields
TTL:    5 minutes for active categories; longer for stable ones

First request to a cell warms the cache; subsequent requests for the same cell + category hit cache in <10 ms.

Cache invalidation: when a business updates its hours or rating, only the affected cell's cache is invalidated (lazy: bump version key).

Search by Name (Autocomplete)

Orthogonal to the geo index: a separate inverted index over business names with location-aware ranking.

Text

---------- Name search flow ----------
1. User types 'star' near (lat, lng)
2. Typeahead service returns top 20 names matching 'star' across all cities
3. Filter to businesses whose location is within 50 km of user
4. Rank by name relevance + distance + rating
5. Show top 10

This is essentially the typeahead system from the previous case study with a geographic filter applied to the candidate set.

Data Model

Postgres (sharded by region): businesses

SQL

CREATE TABLE businesses (
    id              BIGINT PRIMARY KEY,
    name            VARCHAR(255) NOT NULL,
    address         VARCHAR(512),
    city            VARCHAR(100),
    country         VARCHAR(2),
    lat             DOUBLE PRECISION NOT NULL,
    lng             DOUBLE PRECISION NOT NULL,
    geohash_p4      CHAR(4),                  -- ~30 km cells
    geohash_p5      CHAR(5),                  -- ~5 km
    geohash_p6      CHAR(6),                  -- ~1 km
    geohash_p7      CHAR(7),                  -- ~150 m
    rating          DOUBLE PRECISION,
    review_count    INT,
    price_tier      INT,
    primary_category VARCHAR(64),
    hours           JSONB,
    created_at      TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_geo_p5 ON businesses (geohash_p5);
CREATE INDEX idx_geo_p6 ON businesses (geohash_p6);
CREATE INDEX idx_geo_cat_p6 ON businesses (geohash_p6, primary_category);

Shard by country/region; the dataset is naturally partitionable.

Redis: geo index, hot caches

Text

---------- Redis structure ----------
geoindex:p6:<geohash6>            -> SET of business_ids                 no TTL
geoindex:p6:<geohash6>:<category> -> ZSET of (business_id, rating)        no TTL
nearby:<geohash6>:<cat>:<sort>    -> serialized top-100 result            TTL 5 min
biz_summary:<biz_id>              -> JSON business summary card           TTL 1 hr

Writing a new business updates both the SQL row and the Redis sets/sorted sets in a single transaction (or via change-data-capture).

S3: photos

Standard blob storage for business photos, served via CDN. Photo URLs stored in business JSONB.

Reviews (Cassandra, partitioned by business_id)

SQL

-- Cassandra
CREATE TABLE reviews (
    business_id     bigint,
    review_id       bigint,                   -- Snowflake
    user_id         bigint,
    rating          int,
    text            text,
    created_at      timestamp,
    PRIMARY KEY ((business_id), review_id)
) WITH CLUSTERING ORDER BY (review_id DESC);

Latest reviews per business: single partition scan.

Scaling and Bottlenecks

The lunch rush in San Francisco

At noon Pacific, every restaurant query in SF hits a few cells. Cache absorbs the bulk; the geo index sees ~10K QPS for the same handful of cells.

Mitigations:

Per-cell cache with high TTL during peak hours (5 min is safe; ratings/hours rarely change in 5 min).
Multiple Redis replicas for hot cells.
Compute results once per cell per minute (precompute job) and serve from cache for the next 60s.

A new business opens; cache invalidation

When 'New Pizza Place' opens at (lat, lng), it should appear in nearby queries within minutes. Strategies:

New business creation -> CDC event -> Redis update of all relevant geohash sets (precision 4-7) in one transaction.
Active cell caches invalidated (version bump key).
Next query for those cells repopulates the cache with the new business included.

Long-tail queries (a remote area)

A query in rural Wyoming hits a cold cell. No cache; first query is slow (~50-100 ms). Subsequent queries warm the cell. Acceptable; the volume in cold cells is small.

A user query spans multiple geohash precisions

If the user queries 50 km radius, we'd query 9 cells at precision 4 (~30 km). But the boundary cells could legitimately contain businesses up to ~60 km away, far beyond the radius. The exact-distance filter is what saves us; we're doing slight over-fetching but exact filtering.

Multi-region

The business catalog naturally partitions by region: US businesses in us-east, EU in eu-west, etc. Cross-region queries (a tourist in Tokyo from a US account) route to the relevant region's index.

Index size growth

100M businesses today; assume 200M in 5 years. Index doubles in size; still fits in <50 GB of Redis. Sharding is straightforward (geohash prefix as shard key).

Trade-offs and Alternatives

Geohash vs Quadtree

Geohash is simpler, easier to shard (geohash prefix is a natural shard key), and works well when business density is roughly uniform within precision boundaries. Quadtree adapts to density (more cells in cities, fewer in rural) but rebalances on writes (annoying for streaming updates). For Yelp-like read-heavy workloads, geohash + cell expansion is the standard answer.

Geohash precision: fixed vs adaptive

Fixed precision (use precision 6 for all queries) is simpler but wastes bandwidth on large-radius queries (you'd query thousands of cells for a 100 km radius). Multi-precision indexes (store each business at p4/p5/p6/p7) cost ~4x storage but enable O(1) precision selection per query.

Why not S2 (Google's hierarchical hex grid)?

S2 is more sophisticated than geohash: spherical projection (no equator distortion), hierarchical multi-resolution (every cell has parents at coarser resolution). Better for very large radii or pole-near queries. The trade-off is implementation complexity. Geohash is good enough for most products; S2 is the right answer if you're indexing the entire sky or doing trans-oceanic queries.

Distance computation: Haversine vs equirectangular

For exact distance filtering, Haversine formula is most accurate but slower. Equirectangular approximation is much faster and accurate within ~1% for ranges under ~100 km. We use equirectangular for ranking and Haversine for the final 'show me distance in meters' display.

Why precompute and cache aggressively?

The asymmetry: 1 update/sec, 500K queries/sec. Aggressive caching (5-min TTL) is cheap and safe; ratings/hours don't change often enough to make 5-min staleness a problem. The only updates that matter immediately are new business creation, handled by explicit cache invalidation.

Why Postgres for the canonical store but Redis for the index?

Postgres holds the source of truth (transactional updates, full schema). Redis holds the hot query path (geo SETs, hot-cell caches). The CDC-style sync from Postgres -> Redis is well-understood, and the read path never touches Postgres for nearby search; only the business detail page does.

Personalization vs simplicity

Personalization improves clickthrough by ~10-20% but breaks per-cell caching (each user sees different rankings). The compromise: cache the candidate set per cell; rerank with user features in a fast service. Adds 10-20 ms but unlocks personalization without losing cache effectiveness.

Real-World Examples

How real systems implement this in production

Yelp

Yelp indexes ~6M businesses globally. They use a custom geospatial backend with quadtree-style indexes; reviews stored separately. Heavy use of caching for popular city + category combinations. Their search blends location, ratings, review velocity, and personalization based on past reviews and user location patterns.

Trade-off: Yelp's review-heavy ranking creates a reinforcement loop (popular places get more reviews, rank higher). They've experimented with various 'editorial' boosts to surface lesser-known businesses, with mixed success. The lesson: ranking signals can feed back on themselves and need correction.

Uber's Eats search

Uber Eats handles ~10M restaurants worldwide and serves 'nearby restaurants' as a primary use case. Uses Google's S2 cell library (more accurate than geohash for a globe) plus heavy caching. Ranking includes ETA estimation in addition to distance/rating, since ETA matters more than absolute distance for delivery.

Trade-off: Uber chose S2 over geohash for global accuracy and hierarchical operations (zoom in/out as the user pans). The trade-off: S2 is more complex to implement; the precision pays off only at very large scale or for use cases where pole-near queries matter.

Tinder (location-based matching)

Tinder shows nearby potential matches sorted by distance with various preference filters. Uses geohash-style sharding plus per-user filter sets (gender, age range, distance). The query pattern is read-heavy and the dataset is 'businesses' that are people, with high update rate (people move, change location).

Trade-off: Tinder's high write rate (people moving) makes adaptive structures like quadtrees less appealing. Geohash + Redis with frequent SET updates handles the workload; quadtree's rebalance cost would be punishing.

Airbnb listings search

Airbnb's nearby search differs because users search on a map by drawing a viewport ("show me everything within this map area"), not a radius. They use a quadtree-like structure tied to the map's tile system, with property availability checks layered on top. Caching is per-tile, similar to per-cell here.

Trade-off: Airbnb's tile-based query pattern fits quadtree better than radius-based queries fit geohash. The lesson: pick your spatial index to match the query shape, not just the data shape. Radius queries -> geohash; viewport/tile queries -> quadtree.

Quick Interview Phrases

Key terms to use in your answer

geohash with 9-cell neighbor query for boundary handling

multi-precision geohash columns for radius-aware lookup

hot-region cache absorbs city-center traffic

candidate generation by cell, exact filtering by Haversine

rerank with rating and personalization on top-100

infrequent writes enable aggressive precomputation

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens when a user searches 'pizza near me' from downtown San Francisco.

Client sends GET /nearby?lat=37.78&lng=-122.42&category=pizza&radius_km=2. Server computes user's geohash6 'dr5ru0'. Cache lookup: 'nearby:dr5ru0:pizza:relevance'. If cached (90%+ in dense areas), return top-20 in <10 ms. If miss, compute 9 neighbor cells; query Redis SETs for each cell to get candidate business_ids (200-2000 candidates). Hydrate business summaries from Redis (or Postgres on miss). Compute exact distance for each (Haversine). Filter to within radius. Score by blended ranking (distance + rating + popularity + open_now). Top 100 candidates passed to reranker for personalization. Return top 20. Cache result with 5-min TTL. Total: 50-150 ms uncached, <10 ms cached.

How do you handle the boundary case where a business is just across a geohash cell line from the user?

Why geohash and not quadtree?

How does the cache stay fresh when a business updates its hours?

How do you make this work for 'gas stations within 100 km' in rural Wyoming?

Interview Tips

How to discuss this topic effectively

Lead with the read-heavy asymmetry. Saying '10000:1 read/write ratio means we cache aggressively and precompute' shows you read the workload before designing.

Pick geohash for the index and explain the 9-cell neighbor query. It's the standard answer; mentioning boundary cells is the senior-level signal.

Compare geohash to quadtree explicitly. Saying 'I'd use geohash because writes are easy and density is moderate; quadtree wins for highly skewed density but is harder to shard' shows depth.

For ranking, mention BLENDING distance and rating with weights tuned to historical clicks. Pure distance ranking misses the 'mediocre place is closer than the great place' problem.

Cite PostGIS as the right answer for smaller scale. Saying 'PostGIS works up to ~10M businesses on a single node; beyond that we shard' shows you know when to choose simpler tools.

Common Mistakes

Pitfalls to avoid in interviews

Querying only the user's geohash cell (forgetting boundary cells)

A user near a cell edge has many nearby points in the adjacent cell; querying only the user's cell misses them. Always query the user's cell PLUS the 8 surrounding cells (3x3 grid), then filter by exact distance. Skipping this is the most common interview mistake.

Storing only one geohash precision

A 1 km radius query needs precision 6; a 100 km radius needs precision 3. Storing one precision means you either over-fetch on small radii or query thousands of cells for large radii. Index each business at multiple precisions (4-7); pick precision per query radius.

Using PostGIS at 100M businesses without sharding

PostGIS works up to ~10M businesses on one node. At 100M with 500K QPS, it becomes operationally heavy. Either shard PostGIS by region (works but complex) or move to a simpler geohash + Redis index (lighter, easier to scale).

Ranking purely by distance

Distance alone returns the closest businesses, which are often mediocre. Real users want a blend: distance + rating + review count + open-now + personalization. Tune weights against historical clickthrough; pure distance is a strawman ranker.

Caching at the user level instead of the cell level

Per-user caches don't share between users; cache hit rate stays low. Per-cell caches (Geohash6 -> top businesses for that cell) are shared across all users in the same area, giving 90%+ hit rates for queries in city centers.

Back to System Design