System Design Article

Design Typeahead / Autocomplete

Difficulty: Medium

Design a typeahead/autocomplete service like Google Search's suggestion bar that returns the top 10 ranked completions for a query prefix in under 100ms p99, scaling to 5B searches per day with a multi-billion-entry suggestion index. The interview centerpiece is the data structure choice (trie vs sorted strings vs ngram index) and the offline pipeline that ranks suggestions by frequency, recency, personalization, and click-through rate. We cover the trie with precomputed top-K per node, edge n-gram indexes for typo tolerance, the MapReduce/Spark batch pipeline that rebuilds suggestions nightly, and the per-region edge cache that absorbs 99% of traffic.

System Design
/

Design Typeahead / Autocomplete

Design Typeahead / Autocomplete

Design a typeahead/autocomplete service like Google Search's suggestion bar that returns the top 10 ranked completions for a query prefix in under 100ms p99, scaling to 5B searches per day with a multi-billion-entry suggestion index. The interview centerpiece is the data structure choice (trie vs sorted strings vs ngram index) and the offline pipeline that ranks suggestions by frequency, recency, personalization, and click-through rate. We cover the trie with precomputed top-K per node, edge n-gram indexes for typo tolerance, the MapReduce/Spark batch pipeline that rebuilds suggestions nightly, and the per-region edge cache that absorbs 99% of traffic.

System Design
Medium
design-typeahead
case-study
search-discovery
autocomplete
trie
edge-ngrams
ranking
top-k-precomputation
edge-caching
personalization
system-design
intermediate
free

712 views

23

Requirements

Functional Requirements

  1. Return suggestions for a query prefix as the user types; fire on every keystroke after 2 characters.
  2. Top 10 suggestions ranked by global popularity, recency, and (optionally) personalized to the user.
  3. Typo tolerance: small typos in the prefix still surface relevant suggestions ("facbook" -> "facebook").
  4. Multi-language: support multiple locales; each locale has its own suggestion set.
  5. Freshness: trending queries (a breaking news term) appear within minutes, not next-day batch.
  6. Click logging: when a user picks a suggestion, log the click to feed back into ranking.

Out of Scope (state explicitly)

  • The actual search results page (this is just the suggestion box).
  • Spell correction in mid-string (we focus on prefix completion, not error correction).
  • Voice/image typeahead (different input modality).

Non-Functional Requirements

  1. Latency: p99 < 100 ms end-to-end (because it fires per keystroke; users notice 200 ms).
  2. Scale: 5B searches/day (~60K QPS avg, ~200K peak); each search means ~5-10 keystroke requests, so ~600K-1M typeahead QPS.
  3. Suggestion corpus: 1B unique queries seen historically, ~50M with enough signal to be a candidate suggestion.
  4. Availability: 99.95%; degrade to no suggestions rather than break the search box.
  5. Eventual freshness: a new query becomes a suggestion within ~5-15 min for trending terms; ~24 h for steady-state.

Back-of-the-Envelope Estimation

Traffic

Text
---------- Query and prefix volume ----------
Searches/day:                       5B
Keystrokes per search avg:          ~7 (most users type a few chars)
Typeahead requests/day:             5B * ~7 = 35B (only 5/7 actually fire suggestions)
Typeahead QPS avg:                  35B / 86400 ~= 405K /sec
Typeahead QPS peak (3x):            ~1.2M /sec

Unique prefixes seen:               ~10B (but 99% of traffic hits ~1M popular prefixes)

Suggestion Index

Text
---------- Suggestion corpus ----------
Unique queries with significant volume:    ~50M
Per entry storage (query, score, meta):    ~100 bytes
Flat suggestion table:                     ~5 GB
With trie nodes and per-node top-K cache:  ~30-50 GB

This fits easily on a single server. The challenge isn't size; it's QPS and freshness.

QPS per server

Text
---------- Per-server QPS ----------
A tuned in-memory trie server (Go/C++):
  - 50K-200K lookups/sec depending on trie size
For 1.2M peak QPS, with redundancy: ~30 trie servers per region

Most traffic should hit a cache (edge or app-tier) so the trie servers see only ~5-10% of total QPS.

High-Level Design

Text
---------- High-level architecture ----------
  +----------+
  |  Client  |  (search box)
  +----------+
        |
        v
  +---------------------+
  |   CDN / Edge Cache  |  (cache top suggestions for popular prefixes)
  +---------------------+
        |
        v
  +---------------------+
  |  Typeahead API      |  (validates, applies user context)
  +---------------------+
        |
        v
  +---------------------+
  |  Trie Service       |  (sharded by prefix; in-memory trie)
  |  -> top-K per node  |
  +---------------------+
        |
  +-----+----------+
  v                v
+--------+    +----------+
|Personal|    |Ranking   |
|izer    |    | (BM25,   |
|(reranker)   |  CTR, ...|
+--------+    +----------+

         (offline)
  +-------------------------+
  |   Click Log (Kafka)     |
  +-------------------------+
        |
        v
  +-------------------------+
  |  Spark Pipeline (daily) |
  |  - count queries        |
  |  - score by signals     |
  |  - rebuild trie         |
  +-------------------------+
        |
        v
  +-------------------------+
  |  Suggestion Snapshot    |
  |   (S3, then loaded into |
  |   trie servers)         |
  +-------------------------+

         (real-time)
  +-------------------------+
  |  Trending Detector      |
  |  - count last 5 min     |
  |  - inject hot prefixes  |
  +-------------------------+

API Design

Jsonc
// Typeahead request
GET /api/v1/typeahead?q=face&locale=en-US&limit=10

// Response
{
    "prefix": "face",
    "suggestions": [
        { "text": "facebook",        "score": 0.98 },
        { "text": "face mask",       "score": 0.72 },
        { "text": "face recognition", "score": 0.68 },
        { "text": "facetime",        "score": 0.65 }
    ],
    "latency_ms": 12,
    "served_from": "edge-cache"
}

// Click log (fire-and-forget from client when user picks a suggestion)
POST /api/v1/typeahead/click
{
    "prefix": "face",
    "selected": "facebook",
    "position": 0
}

Detailed Design

The two interesting components are the trie with precomputed top-K and the offline ranking pipeline.

Trie with Precomputed Top-K Per Node

Why a trie?

Lookups by prefix are exactly what tries are designed for: walk the trie character by character to reach the prefix node; everything below that node is a candidate completion.

Text
---------- Trie example for prefix "fac" ----------
             root
              |
              f
             / \
            a   ...
            |
            c
           / | \
          e  i  t
          |  |  |
         (book, (ial, (ory, ...)
          time,  ory)
          ...)

To find suggestions for "fac", walk f->a->c, then collect all leaves under that node.

The naive problem: collecting leaves is O(N)

A popular prefix like "how to" might have 100K leaves. Iterating all of them per query is slow.

The solution: precompute top-K at every node

At each trie node, store a precomputed list of the top 10 highest-scoring queries that complete this prefix. Reading top suggestions becomes O(prefix length) instead of O(leaves).

Text
---------- Annotated trie node ----------
Node 'fac' stores:
  top_k = [
    ("facebook",         0.98),
    ("face mask",        0.72),
    ("face recognition", 0.68),
    ...10 entries...
  ]

Memory cost: 10 entries * ~50 bytes * (number of trie nodes). For 50M queries with avg query length 15 chars, ~750M trie nodes -> ~375 GB if every node stored top-K.

Bounded memory: only nodes at certain depths

We don't store top-K at every node. Only at nodes between depths 2 and 6 (the typical prefix length where suggestions matter). At depth 7+, we walk down to the few remaining leaves.

Text
---------- Depth limits ----------
Depth < 2:      no top-K (would be ranked among all queries; not useful)
Depth 2-6:      top-K cached
Depth 7+:       walk to leaves (small subtree by then)

Reduces memory to ~30-50 GB; one server holds the whole trie.

Sharding the trie

For redundancy and QPS, replicate the trie across many servers. They're identical (read-only between rebuilds). Load balancer round-robins requests.

Partitioning by prefix (servers A-F, G-L, M-R, ...) is also possible but creates uneven load (some letters are far more common). Replication is simpler.

Lookup pseudocode
// Trie lookup
function lookup(prefix) {
    let node = root;
    for (const ch of prefix) {
        node = node.children[ch];
        if (!node) return [];        // unknown prefix
    }
    if (node.topK) return node.topK; // precomputed
    // walk subtree (small at depth 7+)
    return collectLeaves(node).slice(0, 10);
}

Offline Ranking Pipeline

The trie is rebuilt daily (or hourly for fresher data) by a batch pipeline.

Inputs
  • Search logs for the past 30 days: every query a user submitted, with timestamp.
  • Click logs from the typeahead service: which suggestion was selected for which prefix.
  • Spam/blocklist: terms we've manually banned (offensive or low-quality queries).
Pipeline stages (Spark)
Text
---------- Daily ranking pipeline ----------
Stage 1: aggregate query counts
  - Group by (query, day); compute count.
  - Filter queries with < 50 searches in 30 days (long tail noise).

Stage 2: compute features per query
  - frequency_30d:  total searches in last 30 days
  - frequency_24h:  recency boost
  - ctr_per_prefix: click-through rate when shown as a suggestion
  - personalization features (separate path)

Stage 3: score = weighted sum
  score = w1 * log(frequency_30d) + w2 * recency_bonus + w3 * ctr - w4 * spam_score

Stage 4: build trie
  - Sort queries by score.
  - For each query, insert into trie.
  - At each ancestor node up to depth 6, update its top-K heap.

Stage 5: snapshot
  - Serialize trie to a compact binary format.
  - Upload to S3.

Stage 6: deploy
  - Trie servers download the snapshot, load into memory, swap atomically.
  - Old version remains available until new version is loaded everywhere.

Daily rebuild from scratch keeps the index simple (no incremental complexity) and corrects for stale entries naturally.

Why rebuild rather than update incrementally?

Incremental updates to a trie are possible but complex (rebalancing, removing dead nodes, updating top-K heaps). Rebuilds are O(N) but happen offline and finish in hours. For typeahead's freshness requirement (~daily), this is the right trade-off.

Real-Time Trending Injection

Daily rebuilds miss breaking news. A 'Taylor Swift announces new album' query may surge in 5 minutes; we want it suggested within the next typeahead request.

Text
---------- Trending pipeline ----------
Real-time:
  1. Stream of queries from Kafka.
  2. Trending Detector counts queries per (locale, term) in 5-min sliding windows.
  3. Compute z-score: how anomalous is this term's count vs its 30-day baseline?
  4. If z > threshold, emit a 'trending' record.

Injection:
  5. Trie server periodically (every 60s) pulls the trending list.
  6. For each trending term, walk to its prefix nodes and insert/promote it in top-K.
  7. Trending entries decay after 1 hour if z-score drops.

This bypasses the daily pipeline for hot terms and makes typeahead feel real-time.

Edge Caching: The Real Workhorse

99% of typeahead traffic hits ~1M popular prefixes. Edge caching absorbs this load before it touches the trie servers.

Text
---------- Edge cache layout ----------
Cache key:    typeahead:<locale>:<prefix>
Cache value:  serialized JSON of top-10 suggestions
TTL:          30-60 minutes for popular prefixes
Storage:      CDN edge POPs (Cloudfront/Fastly) close to users

With ~1M popular prefixes * 1 KB = ~1 GB of cache per edge POP, we serve hundreds of thousands of QPS per POP at <10 ms latency.

Cache invalidation on rebuild: on snapshot deploy, bump a version number in the cache key (typeahead:en-US:v243:<prefix>), causing edges to refetch on next request. New version warms gradually (cold cache after rebuild).

Typo Tolerance

Users mistype. "facbook" should still suggest "facebook".

Edge n-grams

Precompute character n-grams (3-grams, 4-grams) of each indexed query and build an inverted index from n-gram -> queries. Lookup splits the prefix into n-grams and intersects the posting lists.

Text
---------- Edge n-gram example ----------
"facebook" -> ["fac", "ace", "ceb", "ebo", "boo", "ook"]

Query "facb" -> ["fac", "acb"]
  acb intersects {} (no exact match)
  Fall back to fuzzy: edit-distance 1 of "facb" includes "face"
  -> trie lookup with "face"
Damerau-Levenshtein at the prefix

For short prefixes (3-5 chars), enumerate all strings within edit-distance 1 and look each up in the trie. Costs O(prefix_length * alphabet_size) lookups but the trie's response is fast.

Production systems combine: fast trie lookup first, fall back to fuzzy only when zero results.

Personalization

A logged-in user might prefer different completions than a logged-out one (their search history, language, location).

Text
---------- Personalization tier ----------
Global trie:           returns top-10 candidates for the prefix
Personalizer:          re-ranks the top 50 (extra fetched) using user features
  - user's search history
  - user's location
  - user's language
  - user's followed entities
Returns top 10 to the client.

Personalization adds latency (~10-30 ms) and only fires for logged-in users. Anonymous users skip it.

Data Model

Search Logs (Kafka -> Hadoop/S3)

Queries flow through Kafka and land in S3 partitioned by day. Spark jobs read from S3 for the daily pipeline.

SQL
CREATE TABLE search_logs (
    user_id          BIGINT,           -- nullable for anonymous
    query            VARCHAR(255) NOT NULL,
    locale           VARCHAR(10),
    timestamp        TIMESTAMPTZ NOT NULL,
    is_typeahead_pick BOOLEAN,         -- true if from suggestion click
    selected_position INT              -- if applicable
);

Suggestion Index (in-memory trie + S3 snapshot)

At rest, the trie is a binary file in S3:

Text
---------- Snapshot format (conceptual) ----------
Header: version, locale, timestamp, total_nodes
Node array: each node has
  - char (1 byte)
  - children offsets (variable)
  - top_k entries (each entry is text offset + score)
String table: deduped suggestion strings

Loaded into the trie server at startup or on hot-swap.

Personalization Features (Cassandra, per-user)

SQL
-- Cassandra
CREATE TABLE user_search_features (
    user_id          bigint PRIMARY KEY,
    top_categories   map<text, double>,    -- category -> affinity
    location_lat_lng tuple<double, double>,
    language         text,
    last_updated     timestamp
);

Click Aggregates (Redis or Spark output)

Text
---------- Aggregated click features ----------
Key:    ctr:<prefix>:<suggestion>
Value:  {impressions, clicks, ctr}
Updated daily by Spark job feeding back into ranking.

Scaling and Bottlenecks

The Super Bowl moment

A halftime show search query surges 100x in 30 seconds. Mitigations:

  • Trending detector picks it up within 5 minutes.
  • Edge caches absorb the bulk of the surge once the trending term is in suggestions.
  • Trie servers are over-provisioned for 5x baseline; auto-scale on QPS for additional 3x.

Cold start after daily rebuild

Fresh edge caches mean every popular prefix misses on the first request after deploy. Without warming, the trie servers see a 100x QPS burst.

Warming strategy: a synthetic crawler issues the top 10K most popular prefixes against each edge POP after deploy, warming the cache before real users hit the cold edges.

Localized vs global suggestions

A query popular in Brazil might be unknown in Japan. We maintain separate tries per locale (en-US, en-GB, ja-JP, pt-BR, ...). Per-locale storage is small; the index server can hold many locales in memory.

Spam and abuse

Trending detector can be gamed (coordinated query injection). Mitigations:

  • Per-IP rate limiting on counted queries.
  • Heuristic filters (queries from new accounts weighted less).
  • Human moderation for trending suggestions in sensitive locales.

Multi-region

Deploy the trie + edge caches in every major region (us-east, us-west, eu-west, ap-east). Daily snapshot replicates to all regions; edge caches are independent. Personalization data stays user-regional.

Trade-offs and Alternatives

Trie vs sorted strings vs inverted index

StructureLookup timeSpaceBest for
TrieO(prefix length)High (lots of small nodes)Short prefixes, dense alphabet
Sorted strings + binary searchO(log N + scan)Low (just the array)Memory-constrained; slower
Inverted index (n-gram)O(intersection)MediumTypo tolerance, partial matches

Production systems use a trie for the fast path and an n-gram index for fuzzy fallback. Sorted-strings is rarely the right choice when memory is plentiful.

Daily batch vs streaming index updates

Streaming (e.g., insert each new query into the trie immediately) keeps freshness perfect but adds enormous complexity (concurrent index updates, top-K maintenance under writes). Daily batch + 5-min trending injection covers 99% of use cases at a fraction of the engineering cost.

Why precompute top-K at every node vs querying on demand?

Without precompute, a popular prefix like 'how' would scan thousands of completions per query. Precompute moves that work to offline batch (cheap because it runs once per day) and makes reads O(prefix length). Standard space-time trade-off.

Personalization on the hot path vs not at all

Personalization helps logged-in user UX but doubles latency and complicates caching (each user's response is unique, so edge caching is per-user, not per-prefix). We compromise: skip personalization for anonymous users (cache hits dominate), enable for logged-in users (cache miss but better UX).

Edit-distance fuzzy vs phonetic matching

Edit distance handles typos. Phonetic matching (Soundex, Metaphone) handles 'sounds like' (great for proper nouns). Production systems often layer both: edit distance first, phonetic as a secondary fallback. We focused on edit distance here; phonetic is a reasonable extension.

Why not Elasticsearch?

Elasticsearch can do prefix queries via 'completion suggester' or n-gram fields. It works but is heavier per-query than a custom in-memory trie (50-100 ms vs <10 ms) and harder to control for the top-K precomputation pattern. For a dedicated typeahead service at high QPS, custom code wins; for products where typeahead is a side feature on top of an existing ES cluster, ES is fine.

Real-World Examples

How real systems implement this in production

Google Search typeahead

Google serves billions of typeahead requests daily. Their suggestions blend global popularity, real-time trending (Google Trends), and personalization based on the user's search history and location. The infrastructure uses precomputed completions per prefix, served from globally distributed edges. They publish the 'YearInSearch' trending data which is computed by similar pipelines.

Trade-off: Google's strong personalization makes its suggestions feel uncanny but also creates filter-bubble concerns. The trade-off is debated publicly: more personal -> more useful but less serendipitous discovery.

Amazon product search typeahead

Amazon's typeahead surfaces both product titles and category navigations. The data structure is similar (trie + top-K) but with product-specific signals: stock availability, profit margin, conversion rate per query. Suggestions are heavily personalized to the user's purchase history.

Trade-off: Amazon ranks suggestions partly by what's profitable to recommend, not just what's most popular. This is a different optimization than search engines: typeahead is a sales funnel, not just a query helper.

LinkedIn people-search typeahead

LinkedIn typeahead must surface real names from a graph of 1B+ users. They use an inverted index of n-grams over user names plus social-graph signals (closer connections rank higher). Auto-complete is per-user (different results for different searchers).

Trade-off: LinkedIn cannot use a global popularity ranking because every user's relevant matches are different. The trade-off vs Google: more personalization upfront, no edge caching, much higher per-query cost.

Stack Overflow tag typeahead

Smaller scale (~50K tags) but interesting pattern: tags are a closed vocabulary, so a simple sorted list with prefix binary search works perfectly. No trie needed because the corpus is so small.

Trade-off: Stack Overflow's design shows that not every typeahead needs a trie. For small, closed vocabularies, simpler structures win on operational simplicity. Reach for the trie only when N grows past 100K.

Quick Interview Phrases

Key terms to use in your answer

trie with precomputed top-K per node
edge caching absorbs the long tail
daily batch rebuild plus real-time trending injection
edit-distance fallback for typo tolerance
sharded by replication, not by prefix
personalization at the rerank tier

Common Interview Questions

Questions you might be asked about this topic

Client fires GET /typeahead?q=f after a 50 ms debounce. Request hits CDN edge. If 'f' is in the edge cache (99% chance for any single letter), edge returns the precomputed top-10 in <10 ms. If miss, request goes to the typeahead API, which routes to a trie server (round-robin across replicas). Trie walks 'f' (depth 1, no top-K stored at depth 1), so it walks down to depth 2 ('fa', 'fb', 'fc', ...) and merges their top-K's. Returns top-10 in ~5 ms. Total latency from keystroke to render: ~50-100 ms.

Interview Tips

How to discuss this topic effectively

1

Lead with the latency budget. Saying 'p99 < 100 ms because it fires per keystroke' immediately frames every later decision (in-memory, edge cached, sharded for redundancy).

2

Pick a trie with top-K precomputed at each node. This is the textbook answer; saying it confidently signals you've seen this problem before.

3

Mention edge caching as 'the real workhorse'. 99% of traffic on popular prefixes is cached at CDN edges; the trie servers handle the long tail. This shows you understand traffic distribution.

4

Don't try to make the index real-time. Daily batch + 5-min trending injection is the standard pattern. Reaching for streaming index updates is a senior-level mistake.

5

Cite Google: 'they precompute completions per prefix and serve from edge caches'. It's the canonical example and grounds your answer in real production.

Common Mistakes

Pitfalls to avoid in interviews

Walking the trie subtree on every query without precomputing top-K

Popular prefixes can have 100K leaves. Iterating per query is slow. Precompute top-K (say, 10) at every trie node up to depth 6 during the offline build; reads become O(prefix length).

Storing the suggestion index in a database

Suggestion lookups are read-heavy (1M+ QPS), latency-critical (<100 ms), and the data fits in memory (~30-50 GB). An in-memory trie or DAWG on a dedicated server outperforms any database; databases add ~5-50 ms overhead per lookup.

Trying to update the trie in real-time per search

Concurrent trie updates with top-K maintenance are complex. Rebuild daily from search logs (simple, batch). Inject trending queries every 60 seconds via a separate code path that mutates only the top-K (not the trie structure). Best of both worlds.

Ignoring caching and sending every query to the trie service

99% of traffic hits ~1M popular prefixes. Cache them at the CDN edge with a 30-60 min TTL. Without this, your trie service capacity is 100x what it needs to be.

Personalizing every request by re-ranking on the hot path

Personalization breaks edge caching (every user's response is unique). Skip it for anonymous users (the majority of traffic). For logged-in users, fetch top-50 from the global trie and rerank in a personalizer service; latency is ~30 ms higher but cache miss is acceptable for that segment.