System Design Article
Design Typeahead / Autocomplete
Difficulty: Medium
Design a typeahead/autocomplete service like Google Search's suggestion bar that returns the top 10 ranked completions for a query prefix in under 100ms p99, scaling to 5B searches per day with a multi-billion-entry suggestion index. The interview centerpiece is the data structure choice (trie vs sorted strings vs ngram index) and the offline pipeline that ranks suggestions by frequency, recency, personalization, and click-through rate. We cover the trie with precomputed top-K per node, edge n-gram indexes for typo tolerance, the MapReduce/Spark batch pipeline that rebuilds suggestions nightly, and the per-region edge cache that absorbs 99% of traffic.
Design Typeahead / Autocomplete
Design a typeahead/autocomplete service like Google Search's suggestion bar that returns the top 10 ranked completions for a query prefix in under 100ms p99, scaling to 5B searches per day with a multi-billion-entry suggestion index. The interview centerpiece is the data structure choice (trie vs sorted strings vs ngram index) and the offline pipeline that ranks suggestions by frequency, recency, personalization, and click-through rate. We cover the trie with precomputed top-K per node, edge n-gram indexes for typo tolerance, the MapReduce/Spark batch pipeline that rebuilds suggestions nightly, and the per-region edge cache that absorbs 99% of traffic.
712 views
23
Requirements
Functional Requirements
- Return suggestions for a query prefix as the user types; fire on every keystroke after 2 characters.
- Top 10 suggestions ranked by global popularity, recency, and (optionally) personalized to the user.
- Typo tolerance: small typos in the prefix still surface relevant suggestions ("facbook" -> "facebook").
- Multi-language: support multiple locales; each locale has its own suggestion set.
- Freshness: trending queries (a breaking news term) appear within minutes, not next-day batch.
- Click logging: when a user picks a suggestion, log the click to feed back into ranking.
Out of Scope (state explicitly)
- The actual search results page (this is just the suggestion box).
- Spell correction in mid-string (we focus on prefix completion, not error correction).
- Voice/image typeahead (different input modality).
Non-Functional Requirements
- Latency: p99 < 100 ms end-to-end (because it fires per keystroke; users notice 200 ms).
- Scale: 5B searches/day (~60K QPS avg, ~200K peak); each search means ~5-10 keystroke requests, so ~600K-1M typeahead QPS.
- Suggestion corpus: 1B unique queries seen historically, ~50M with enough signal to be a candidate suggestion.
- Availability: 99.95%; degrade to no suggestions rather than break the search box.
- Eventual freshness: a new query becomes a suggestion within ~5-15 min for trending terms; ~24 h for steady-state.
Back-of-the-Envelope Estimation
Traffic
---------- Query and prefix volume ----------
Searches/day: 5B
Keystrokes per search avg: ~7 (most users type a few chars)
Typeahead requests/day: 5B * ~7 = 35B (only 5/7 actually fire suggestions)
Typeahead QPS avg: 35B / 86400 ~= 405K /sec
Typeahead QPS peak (3x): ~1.2M /sec
Unique prefixes seen: ~10B (but 99% of traffic hits ~1M popular prefixes)Suggestion Index
---------- Suggestion corpus ----------
Unique queries with significant volume: ~50M
Per entry storage (query, score, meta): ~100 bytes
Flat suggestion table: ~5 GB
With trie nodes and per-node top-K cache: ~30-50 GBThis fits easily on a single server. The challenge isn't size; it's QPS and freshness.
QPS per server
---------- Per-server QPS ----------
A tuned in-memory trie server (Go/C++):
- 50K-200K lookups/sec depending on trie size
For 1.2M peak QPS, with redundancy: ~30 trie servers per regionMost traffic should hit a cache (edge or app-tier) so the trie servers see only ~5-10% of total QPS.
High-Level Design
---------- High-level architecture ----------
+----------+
| Client | (search box)
+----------+
|
v
+---------------------+
| CDN / Edge Cache | (cache top suggestions for popular prefixes)
+---------------------+
|
v
+---------------------+
| Typeahead API | (validates, applies user context)
+---------------------+
|
v
+---------------------+
| Trie Service | (sharded by prefix; in-memory trie)
| -> top-K per node |
+---------------------+
|
+-----+----------+
v v
+--------+ +----------+
|Personal| |Ranking |
|izer | | (BM25, |
|(reranker) | CTR, ...|
+--------+ +----------+
(offline)
+-------------------------+
| Click Log (Kafka) |
+-------------------------+
|
v
+-------------------------+
| Spark Pipeline (daily) |
| - count queries |
| - score by signals |
| - rebuild trie |
+-------------------------+
|
v
+-------------------------+
| Suggestion Snapshot |
| (S3, then loaded into |
| trie servers) |
+-------------------------+
(real-time)
+-------------------------+
| Trending Detector |
| - count last 5 min |
| - inject hot prefixes |
+-------------------------+API Design
// Typeahead request
GET /api/v1/typeahead?q=face&locale=en-US&limit=10
// Response
{
"prefix": "face",
"suggestions": [
{ "text": "facebook", "score": 0.98 },
{ "text": "face mask", "score": 0.72 },
{ "text": "face recognition", "score": 0.68 },
{ "text": "facetime", "score": 0.65 }
],
"latency_ms": 12,
"served_from": "edge-cache"
}
// Click log (fire-and-forget from client when user picks a suggestion)
POST /api/v1/typeahead/click
{
"prefix": "face",
"selected": "facebook",
"position": 0
}Detailed Design
The two interesting components are the trie with precomputed top-K and the offline ranking pipeline.
Trie with Precomputed Top-K Per Node
Why a trie?
Lookups by prefix are exactly what tries are designed for: walk the trie character by character to reach the prefix node; everything below that node is a candidate completion.
---------- Trie example for prefix "fac" ----------
root
|
f
/ \
a ...
|
c
/ | \
e i t
| | |
(book, (ial, (ory, ...)
time, ory)
...)To find suggestions for "fac", walk f->a->c, then collect all leaves under that node.
The naive problem: collecting leaves is O(N)
A popular prefix like "how to" might have 100K leaves. Iterating all of them per query is slow.
The solution: precompute top-K at every node
At each trie node, store a precomputed list of the top 10 highest-scoring queries that complete this prefix. Reading top suggestions becomes O(prefix length) instead of O(leaves).
---------- Annotated trie node ----------
Node 'fac' stores:
top_k = [
("facebook", 0.98),
("face mask", 0.72),
("face recognition", 0.68),
...10 entries...
]Memory cost: 10 entries * ~50 bytes * (number of trie nodes). For 50M queries with avg query length 15 chars, ~750M trie nodes -> ~375 GB if every node stored top-K.
Bounded memory: only nodes at certain depths
We don't store top-K at every node. Only at nodes between depths 2 and 6 (the typical prefix length where suggestions matter). At depth 7+, we walk down to the few remaining leaves.
---------- Depth limits ----------
Depth < 2: no top-K (would be ranked among all queries; not useful)
Depth 2-6: top-K cached
Depth 7+: walk to leaves (small subtree by then)Reduces memory to ~30-50 GB; one server holds the whole trie.
Sharding the trie
For redundancy and QPS, replicate the trie across many servers. They're identical (read-only between rebuilds). Load balancer round-robins requests.
Partitioning by prefix (servers A-F, G-L, M-R, ...) is also possible but creates uneven load (some letters are far more common). Replication is simpler.
Lookup pseudocode
// Trie lookup
function lookup(prefix) {
let node = root;
for (const ch of prefix) {
node = node.children[ch];
if (!node) return []; // unknown prefix
}
if (node.topK) return node.topK; // precomputed
// walk subtree (small at depth 7+)
return collectLeaves(node).slice(0, 10);
}Offline Ranking Pipeline
The trie is rebuilt daily (or hourly for fresher data) by a batch pipeline.
Inputs
- Search logs for the past 30 days: every query a user submitted, with timestamp.
- Click logs from the typeahead service: which suggestion was selected for which prefix.
- Spam/blocklist: terms we've manually banned (offensive or low-quality queries).
Pipeline stages (Spark)
---------- Daily ranking pipeline ----------
Stage 1: aggregate query counts
- Group by (query, day); compute count.
- Filter queries with < 50 searches in 30 days (long tail noise).
Stage 2: compute features per query
- frequency_30d: total searches in last 30 days
- frequency_24h: recency boost
- ctr_per_prefix: click-through rate when shown as a suggestion
- personalization features (separate path)
Stage 3: score = weighted sum
score = w1 * log(frequency_30d) + w2 * recency_bonus + w3 * ctr - w4 * spam_score
Stage 4: build trie
- Sort queries by score.
- For each query, insert into trie.
- At each ancestor node up to depth 6, update its top-K heap.
Stage 5: snapshot
- Serialize trie to a compact binary format.
- Upload to S3.
Stage 6: deploy
- Trie servers download the snapshot, load into memory, swap atomically.
- Old version remains available until new version is loaded everywhere.Daily rebuild from scratch keeps the index simple (no incremental complexity) and corrects for stale entries naturally.
Why rebuild rather than update incrementally?
Incremental updates to a trie are possible but complex (rebalancing, removing dead nodes, updating top-K heaps). Rebuilds are O(N) but happen offline and finish in hours. For typeahead's freshness requirement (~daily), this is the right trade-off.
Real-Time Trending Injection
Daily rebuilds miss breaking news. A 'Taylor Swift announces new album' query may surge in 5 minutes; we want it suggested within the next typeahead request.
---------- Trending pipeline ----------
Real-time:
1. Stream of queries from Kafka.
2. Trending Detector counts queries per (locale, term) in 5-min sliding windows.
3. Compute z-score: how anomalous is this term's count vs its 30-day baseline?
4. If z > threshold, emit a 'trending' record.
Injection:
5. Trie server periodically (every 60s) pulls the trending list.
6. For each trending term, walk to its prefix nodes and insert/promote it in top-K.
7. Trending entries decay after 1 hour if z-score drops.This bypasses the daily pipeline for hot terms and makes typeahead feel real-time.
Edge Caching: The Real Workhorse
99% of typeahead traffic hits ~1M popular prefixes. Edge caching absorbs this load before it touches the trie servers.
---------- Edge cache layout ----------
Cache key: typeahead:<locale>:<prefix>
Cache value: serialized JSON of top-10 suggestions
TTL: 30-60 minutes for popular prefixes
Storage: CDN edge POPs (Cloudfront/Fastly) close to usersWith ~1M popular prefixes * 1 KB = ~1 GB of cache per edge POP, we serve hundreds of thousands of QPS per POP at <10 ms latency.
Cache invalidation on rebuild: on snapshot deploy, bump a version number in the cache key (typeahead:en-US:v243:<prefix>), causing edges to refetch on next request. New version warms gradually (cold cache after rebuild).
Typo Tolerance
Users mistype. "facbook" should still suggest "facebook".
Edge n-grams
Precompute character n-grams (3-grams, 4-grams) of each indexed query and build an inverted index from n-gram -> queries. Lookup splits the prefix into n-grams and intersects the posting lists.
---------- Edge n-gram example ----------
"facebook" -> ["fac", "ace", "ceb", "ebo", "boo", "ook"]
Query "facb" -> ["fac", "acb"]
acb intersects {} (no exact match)
Fall back to fuzzy: edit-distance 1 of "facb" includes "face"
-> trie lookup with "face"Damerau-Levenshtein at the prefix
For short prefixes (3-5 chars), enumerate all strings within edit-distance 1 and look each up in the trie. Costs O(prefix_length * alphabet_size) lookups but the trie's response is fast.
Production systems combine: fast trie lookup first, fall back to fuzzy only when zero results.
Personalization
A logged-in user might prefer different completions than a logged-out one (their search history, language, location).
---------- Personalization tier ----------
Global trie: returns top-10 candidates for the prefix
Personalizer: re-ranks the top 50 (extra fetched) using user features
- user's search history
- user's location
- user's language
- user's followed entities
Returns top 10 to the client.Personalization adds latency (~10-30 ms) and only fires for logged-in users. Anonymous users skip it.
Data Model
Search Logs (Kafka -> Hadoop/S3)
Queries flow through Kafka and land in S3 partitioned by day. Spark jobs read from S3 for the daily pipeline.
CREATE TABLE search_logs (
user_id BIGINT, -- nullable for anonymous
query VARCHAR(255) NOT NULL,
locale VARCHAR(10),
timestamp TIMESTAMPTZ NOT NULL,
is_typeahead_pick BOOLEAN, -- true if from suggestion click
selected_position INT -- if applicable
);Suggestion Index (in-memory trie + S3 snapshot)
At rest, the trie is a binary file in S3:
---------- Snapshot format (conceptual) ----------
Header: version, locale, timestamp, total_nodes
Node array: each node has
- char (1 byte)
- children offsets (variable)
- top_k entries (each entry is text offset + score)
String table: deduped suggestion stringsLoaded into the trie server at startup or on hot-swap.
Personalization Features (Cassandra, per-user)
-- Cassandra
CREATE TABLE user_search_features (
user_id bigint PRIMARY KEY,
top_categories map<text, double>, -- category -> affinity
location_lat_lng tuple<double, double>,
language text,
last_updated timestamp
);Click Aggregates (Redis or Spark output)
---------- Aggregated click features ----------
Key: ctr:<prefix>:<suggestion>
Value: {impressions, clicks, ctr}
Updated daily by Spark job feeding back into ranking.Scaling and Bottlenecks
The Super Bowl moment
A halftime show search query surges 100x in 30 seconds. Mitigations:
- Trending detector picks it up within 5 minutes.
- Edge caches absorb the bulk of the surge once the trending term is in suggestions.
- Trie servers are over-provisioned for 5x baseline; auto-scale on QPS for additional 3x.
Cold start after daily rebuild
Fresh edge caches mean every popular prefix misses on the first request after deploy. Without warming, the trie servers see a 100x QPS burst.
Warming strategy: a synthetic crawler issues the top 10K most popular prefixes against each edge POP after deploy, warming the cache before real users hit the cold edges.
Localized vs global suggestions
A query popular in Brazil might be unknown in Japan. We maintain separate tries per locale (en-US, en-GB, ja-JP, pt-BR, ...). Per-locale storage is small; the index server can hold many locales in memory.
Spam and abuse
Trending detector can be gamed (coordinated query injection). Mitigations:
- Per-IP rate limiting on counted queries.
- Heuristic filters (queries from new accounts weighted less).
- Human moderation for trending suggestions in sensitive locales.
Multi-region
Deploy the trie + edge caches in every major region (us-east, us-west, eu-west, ap-east). Daily snapshot replicates to all regions; edge caches are independent. Personalization data stays user-regional.
Trade-offs and Alternatives
Trie vs sorted strings vs inverted index
| Structure | Lookup time | Space | Best for |
|---|---|---|---|
| Trie | O(prefix length) | High (lots of small nodes) | Short prefixes, dense alphabet |
| Sorted strings + binary search | O(log N + scan) | Low (just the array) | Memory-constrained; slower |
| Inverted index (n-gram) | O(intersection) | Medium | Typo tolerance, partial matches |
Production systems use a trie for the fast path and an n-gram index for fuzzy fallback. Sorted-strings is rarely the right choice when memory is plentiful.
Daily batch vs streaming index updates
Streaming (e.g., insert each new query into the trie immediately) keeps freshness perfect but adds enormous complexity (concurrent index updates, top-K maintenance under writes). Daily batch + 5-min trending injection covers 99% of use cases at a fraction of the engineering cost.
Why precompute top-K at every node vs querying on demand?
Without precompute, a popular prefix like 'how' would scan thousands of completions per query. Precompute moves that work to offline batch (cheap because it runs once per day) and makes reads O(prefix length). Standard space-time trade-off.
Personalization on the hot path vs not at all
Personalization helps logged-in user UX but doubles latency and complicates caching (each user's response is unique, so edge caching is per-user, not per-prefix). We compromise: skip personalization for anonymous users (cache hits dominate), enable for logged-in users (cache miss but better UX).
Edit-distance fuzzy vs phonetic matching
Edit distance handles typos. Phonetic matching (Soundex, Metaphone) handles 'sounds like' (great for proper nouns). Production systems often layer both: edit distance first, phonetic as a secondary fallback. We focused on edit distance here; phonetic is a reasonable extension.
Why not Elasticsearch?
Elasticsearch can do prefix queries via 'completion suggester' or n-gram fields. It works but is heavier per-query than a custom in-memory trie (50-100 ms vs <10 ms) and harder to control for the top-K precomputation pattern. For a dedicated typeahead service at high QPS, custom code wins; for products where typeahead is a side feature on top of an existing ES cluster, ES is fine.
Real-World Examples
How real systems implement this in production
Google serves billions of typeahead requests daily. Their suggestions blend global popularity, real-time trending (Google Trends), and personalization based on the user's search history and location. The infrastructure uses precomputed completions per prefix, served from globally distributed edges. They publish the 'YearInSearch' trending data which is computed by similar pipelines.
Trade-off: Google's strong personalization makes its suggestions feel uncanny but also creates filter-bubble concerns. The trade-off is debated publicly: more personal -> more useful but less serendipitous discovery.
Amazon's typeahead surfaces both product titles and category navigations. The data structure is similar (trie + top-K) but with product-specific signals: stock availability, profit margin, conversion rate per query. Suggestions are heavily personalized to the user's purchase history.
Trade-off: Amazon ranks suggestions partly by what's profitable to recommend, not just what's most popular. This is a different optimization than search engines: typeahead is a sales funnel, not just a query helper.
LinkedIn typeahead must surface real names from a graph of 1B+ users. They use an inverted index of n-grams over user names plus social-graph signals (closer connections rank higher). Auto-complete is per-user (different results for different searchers).
Trade-off: LinkedIn cannot use a global popularity ranking because every user's relevant matches are different. The trade-off vs Google: more personalization upfront, no edge caching, much higher per-query cost.
Smaller scale (~50K tags) but interesting pattern: tags are a closed vocabulary, so a simple sorted list with prefix binary search works perfectly. No trie needed because the corpus is so small.
Trade-off: Stack Overflow's design shows that not every typeahead needs a trie. For small, closed vocabularies, simpler structures win on operational simplicity. Reach for the trie only when N grows past 100K.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client fires GET /typeahead?q=f after a 50 ms debounce. Request hits CDN edge. If 'f' is in the edge cache (99% chance for any single letter), edge returns the precomputed top-10 in <10 ms. If miss, request goes to the typeahead API, which routes to a trie server (round-robin across replicas). Trie walks 'f' (depth 1, no top-K stored at depth 1), so it walks down to depth 2 ('fa', 'fb', 'fc', ...) and merges their top-K's. Returns top-10 in ~5 ms. Total latency from keystroke to render: ~50-100 ms.
A separate streaming pipeline counts query frequency in 5-minute sliding windows partitioned by locale. Compute z-score: how unusual is this term's current count vs its 30-day baseline? If z > threshold, mark as trending. The trending list is published every 60 seconds. Trie servers pull the list and update top-K at the relevant prefix nodes (insert or promote the trending term). Edge caches expire normally, picking up the new top-K on next miss; or you can proactively bump cache version for trending terms. End-to-end: trending term is suggested within ~5-15 minutes of starting to surge.
Two layers. Layer 1: maintain an inverted index from character n-grams (3-grams, 4-grams) to candidate queries. For a misspelled prefix 'facbk', n-grams ['fac', 'acb', 'cbk'] are intersected; matching queries are scored. Layer 2 (fallback): for short prefixes, enumerate all strings within edit distance 1 of the prefix and look each up in the trie. The intersection of edit-distance candidates with the n-gram results gives 'looks like' matches. We always try the exact trie lookup first (fastest); fuzzy is a fallback when zero exact matches.
Two-tier ranking. The trie returns the global top-50 candidates for the prefix (instead of top-10). A Personalizer service then re-ranks these 50 using per-user features stored in Cassandra: search history, location, language, followed entities. The reranker can be a learned model (e.g., gradient-boosted trees) scoring (candidate, user_features). Returns top-10 to the client. This bypasses edge cache for logged-in users but skips personalization entirely for anonymous (~70% of traffic), so cache effectiveness is maintained for the majority.
Atomic snapshot swap. New trie binary uploaded to S3 with version number (e.g., v243). Trie servers asynchronously download and load the new version into memory; old version stays serving until the new one is ready. Then atomic pointer swap routes traffic to the new trie. Edge caches use the version number in the cache key (typeahead:en-US:v243:<prefix>), so all existing cached entries are stranded; on the next request for any prefix, a cache miss happens and the edge fetches fresh. To avoid a thundering-herd cold-cache hit, run a synthetic crawler that pre-warms the top 10K prefixes at every edge POP after deploy.
Interview Tips
How to discuss this topic effectively
Lead with the latency budget. Saying 'p99 < 100 ms because it fires per keystroke' immediately frames every later decision (in-memory, edge cached, sharded for redundancy).
Pick a trie with top-K precomputed at each node. This is the textbook answer; saying it confidently signals you've seen this problem before.
Mention edge caching as 'the real workhorse'. 99% of traffic on popular prefixes is cached at CDN edges; the trie servers handle the long tail. This shows you understand traffic distribution.
Don't try to make the index real-time. Daily batch + 5-min trending injection is the standard pattern. Reaching for streaming index updates is a senior-level mistake.
Cite Google: 'they precompute completions per prefix and serve from edge caches'. It's the canonical example and grounds your answer in real production.
Common Mistakes
Pitfalls to avoid in interviews
Walking the trie subtree on every query without precomputing top-K
Popular prefixes can have 100K leaves. Iterating per query is slow. Precompute top-K (say, 10) at every trie node up to depth 6 during the offline build; reads become O(prefix length).
Storing the suggestion index in a database
Suggestion lookups are read-heavy (1M+ QPS), latency-critical (<100 ms), and the data fits in memory (~30-50 GB). An in-memory trie or DAWG on a dedicated server outperforms any database; databases add ~5-50 ms overhead per lookup.
Trying to update the trie in real-time per search
Concurrent trie updates with top-K maintenance are complex. Rebuild daily from search logs (simple, batch). Inject trending queries every 60 seconds via a separate code path that mutates only the top-K (not the trie structure). Best of both worlds.
Ignoring caching and sending every query to the trie service
99% of traffic hits ~1M popular prefixes. Cache them at the CDN edge with a 30-60 min TTL. Without this, your trie service capacity is 100x what it needs to be.
Personalizing every request by re-ranking on the hot path
Personalization breaks edge caching (every user's response is unique). Skip it for anonymous users (the majority of traffic). For logged-in users, fetch top-50 from the global trie and rerank in a personalizer service; latency is ~30 ms higher but cache miss is acceptable for that segment.
