System Design Article
Design a Web Crawler
Difficulty: Medium
Design a distributed web crawler that fetches 5 billion pages per month from the public web while respecting robots.txt, applying per-host politeness limits, deduplicating URLs and content across a 50PB corpus, and feeding the indexer pipeline downstream. The interview centerpiece is the URL frontier: a priority-aware queue of pending URLs sharded by host so politeness rules can be enforced per domain, plus content deduplication via hashing and shingling. We cover the fetcher worker pool, DNS caching, content extraction, the bloom-filter URL seen set, and how to handle hostile sites (large pages, redirect loops, slow responses, deliberate spam).
Design a Web Crawler
Design a distributed web crawler that fetches 5 billion pages per month from the public web while respecting robots.txt, applying per-host politeness limits, deduplicating URLs and content across a 50PB corpus, and feeding the indexer pipeline downstream. The interview centerpiece is the URL frontier: a priority-aware queue of pending URLs sharded by host so politeness rules can be enforced per domain, plus content deduplication via hashing and shingling. We cover the fetcher worker pool, DNS caching, content extraction, the bloom-filter URL seen set, and how to handle hostile sites (large pages, redirect loops, slow responses, deliberate spam).
626 views
6
Requirements
Functional Requirements
- Crawl 5B pages/month from a seed list of URLs, expanding via discovered links.
- Respect robots.txt for every host: honor disallowed paths and crawl delay.
- Per-host politeness: at most N concurrent connections to one host; minimum delay between requests.
- Deduplicate URLs so the same URL isn't fetched twice.
- Deduplicate content so near-duplicate pages (e.g., the same article on a syndicated network) don't waste storage.
- Extract links from each fetched page and add to the frontier.
- Store raw HTML for the indexer pipeline to consume.
- Recrawl periodically to refresh changed pages.
Out of Scope (state explicitly)
- The actual search index (handled by the search engine case study).
- Rendering JavaScript-heavy pages (we focus on static HTML; headless browsers are a 100x cost extension).
- Crawling auth-protected content (we treat the crawler as anonymous).
- Real-time streaming of newly published pages (we focus on bulk crawling).
Non-Functional Requirements
- Scale: 5B pages/month (~2K pages/sec sustained, ~10K peak).
- Storage: ~50 PB of compressed HTML retained for indexing.
- Politeness: never violate per-host limits or robots.txt; crawler should be a 'good citizen'.
- Robustness: hostile sites (parser bombs, redirect loops, slow responses) must not crash the crawler.
- Freshness: high-priority pages (news sites) recrawled within hours; long-tail recrawled monthly.
- Coverage: at least 70% of pages discoverable from the seed set should be indexed within 30 days.
Back-of-the-Envelope Estimation
Fetch and Storage
---------- Fetch and storage estimation ----------
Pages per month: 5B
Pages per day: ~167M
Pages per second (avg): ~2K
Pages per second (peak 5x): ~10K
Avg page size (HTML, compressed): ~100 KB raw -> ~20 KB gzipped
Daily fetched bytes: 167M * 100 KB = 16.7 TB/day fetched
Stored compressed: 167M * 20 KB = 3.3 TB/day stored
Monthly storage: 100 TB/month
With 3x replication: 300 TB/month
Lifetime corpus (~5 years): ~18 PBURL Frontier
---------- URL frontier sizing ----------
Distinct URLs known: ~50B (most web URLs known to crawlers)
URL avg length: ~80 chars
Flat URL list: ~4 TB
URL bloom filter (1% FP rate): ~75 GB
Frontier (queued URLs): ~1B at any time -> 80 GBBandwidth and Compute
---------- Bandwidth and compute ----------
Ingress bandwidth: 10K pages/sec * 100 KB = 1 GB/s peak
Fetcher workers needed: ~1000 (each handling ~10 fetches/sec)
DNS lookups: 10K/sec * 1 unique host per ~100 URLs = ~100/sec
Parser CPU: 10K pages/sec * ~1 ms parse = 10 cores per nodeHigh-Level Design
---------- High-level architecture ----------
+-----------+
| Seed |
| URLs |
+-----------+
|
v
+-------------------+
| URL Frontier |
| (sharded by host) |
+-------------------+
|
v
+-------------------+
| Fetcher Workers |
| - DNS cache |
| - HTTP client |
| - politeness |
+-------------------+
| \
| \
v v
+-------------------+ +-------------------+
| robots.txt | | DNS Resolver |
| Cache | | (cached) |
+-------------------+ +-------------------+
|
v
+-------------------+
| Content Store |
| (S3 / HDFS) |
+-------------------+
|
v
+-------------------+
| Parser |
| - extract links |
| - clean text |
+-------------------+
|
v
+-------------------+
| URL Dedup |
| (Bloom filter + |
| exact set) |
+-------------------+
|
v (new URLs)
+-------------------+
| URL Frontier |
+-------------------+
+----------+
| |
v v
+-------------------+
| Content Dedup |
| (shingling / |
| MinHash) |
+-------------------+
|
v
+-------------------+
| Indexer Pipeline |
| (downstream) |
+-------------------+Crawl Loop
---------- Crawl loop ----------
1. Frontier returns next URL respecting per-host politeness.
2. Worker checks robots.txt cache for the host:
- If cached, validate URL is allowed.
- If not cached, fetch /robots.txt first, then check.
3. Worker resolves the host via DNS cache (TTL ~1 hr).
4. Worker fetches the URL with timeout (~10 sec) and size cap (~5 MB).
5. On success, content saved to Content Store with key = sha1(url).
6. Parser extracts: title, body text, outbound links.
7. Each new outbound link is hashed and checked against URL bloom filter.
- If not seen, added to frontier.
8. Content hash + shingles computed; if near-duplicate of existing, mark dup.
9. Worker reports completion; frontier updates last-fetched timestamp.Detailed Design
The two interesting components are the URL frontier (politeness-aware queue) and content deduplication via shingling.
URL Frontier: Politeness Through Sharding
Why per-host sharding?
Say the crawler has 1000 worker threads and a frontier with 10K URLs from cnn.com next. If workers grab URLs naively, they'll all hit cnn.com simultaneously, generating 1000 concurrent requests. This is rude (likely to get blocked, may DOS the site).
The rule: at most N concurrent connections per host (where N=2 is typical) AND at least M ms between requests to the same host (where M comes from robots.txt or defaults to 1000 ms).
To enforce this efficiently, shard the frontier by host:
---------- Frontier shard layout ----------
Frontier shard 0: URLs from hosts that hash to bucket 0
Frontier shard 1: URLs from hosts that hash to bucket 1
...
Each shard owns a subset of hosts.
Within a shard, URLs are queued per-host.Workers are also assigned to shards. A worker only fetches URLs from its shard's hosts, so politeness is enforced locally without coordination.
Frontier per-host queue structure
---------- Per-host queue ----------
For each host H:
- queue of pending URLs (FIFO with priority overlay)
- last_fetched_timestamp
- crawl_delay (from robots.txt; default 1000 ms)
- in_flight_count
- max_concurrency (default 2)Worker loop:
---------- Worker loop ----------
1. Pick a host H from this shard's eligible-now list:
- eligible if (now - last_fetched) >= crawl_delay AND in_flight < max_concurrency
2. Pop next URL from H's queue.
3. Increment in_flight; record last_fetched.
4. Fetch URL.
5. On completion (success or failure), decrement in_flight.
6. If H has more URLs and is eligible, mark eligible again.The 'eligible-now' list is a min-heap keyed by next_eligible_time. Workers pull the host whose eligibility expires soonest.
Priority within the frontier
Not all URLs are equal:
- News sites and high-PageRank domains: priority 1, recrawled hourly.
- Mid-tier (most blogs, forums): priority 2, recrawled weekly.
- Long tail (rarely-linked pages): priority 3, recrawled monthly.
Each host's queue is a small priority queue, not pure FIFO. The frontier ranker assigns priority at insertion time based on PageRank, last-modified hints, and content type.
Frontier durability
1B URLs in the frontier means losing it is catastrophic. We persist:
- The bloom filter of seen URLs (rebuildable but slow) -> S3 snapshots every hour.
- Per-host queues -> Kafka (each host's queue is a topic partition; partitioned by host hash).
Workers consume from Kafka with offset tracking; on restart, resume from offset.
Content Deduplication via Shingling
Why dedupe content?
Many pages are near-duplicates: syndicated news articles, RSS aggregators, content farms. Storing all of them wastes space and pollutes search results.
Exact dedup (hash of the entire page) catches identical copies but misses near-duplicates with minor differences (ad insertions, timestamps, A/B tests).
Shingling
A k-shingle is a sliding window of k consecutive words. "the quick brown fox jumps" with k=3 produces shingles {"the quick brown", "quick brown fox", "brown fox jumps"}.
Two documents are near-duplicates if their shingle sets have high Jaccard similarity (intersection / union > threshold like 0.9).
Full shingle sets are large. We compress with MinHash:
---------- MinHash sketch ----------
For each document:
Compute all shingles.
Hash each shingle with N independent hash functions (N ~= 100).
Take the minimum hash for each function.
-> N-element MinHash signature.
To estimate Jaccard between two docs:
Compare their MinHash signatures element-wise.
Fraction of matching positions ~= Jaccard similarity.MinHash signatures are ~400 bytes per document. Vastly smaller than full shingle sets and Jaccard estimable in O(N) time.
LSH for fast lookup
Naive Jaccard comparison is O(corpus size) per new document. Locality-Sensitive Hashing (LSH) buckets MinHash signatures so similar signatures land in the same bucket.
---------- LSH for near-dup lookup ----------
Split MinHash signature into B bands (each band = K rows).
Hash each band; use as a bucket key.
Documents sharing >=1 band-bucket are likely near-duplicates.
Verify with full Jaccard estimate.False positives (non-dup pairs sharing a bucket) are filtered by the verification step. False negatives (dup pairs not sharing any bucket) are rare with appropriate B,K choices.
Fetcher Pool
Per-worker capacity
A fetcher worker (Go/Python with async I/O) handles ~50-100 concurrent fetches. With ~1000 workers, total concurrency ~50K-100K active fetches; throughput ~10K pages/sec at avg 200 ms per fetch.
DNS caching
DNS resolution is slow (50-200 ms per uncached lookup). Cache resolved IPs locally in each worker with TTL ~1 hour. A central DNS resolver service handles cache misses, batching lookups and respecting DNS server limits.
---------- DNS cache ----------
Key: hostname
Value: [IP1, IP2, ...]
TTL: 1 hour (or DNS-provided TTL)
Miss: resolve via local DNS or DNSaaS, populate cacheConnection pooling
HTTP/2 connections to the same host are reused. Each worker maintains a pool of connections per host (sized to max_concurrency for that host).
Robots.txt handling
Fetched once per host per day, cached in Redis:
---------- Robots.txt cache ----------
Key: robots:<host>
Value: parsed robots.txt rules + crawl_delay
TTL: 24 hoursMissing robots.txt: assume allowed with default 1000 ms crawl delay.
Failure handling
---------- Failure modes ----------
200 OK: process content
3xx Redirect: follow up to 5 hops; track loops
4xx Client error: log; do not retry (page is broken)
5xx Server error: retry with exponential backoff; max 3 attempts;
if persistent, downgrade host's crawl rate
Timeout (10s): treat as transient; retry once
Response > 5 MB: truncate; log; stop crawling further on this host's deep pagesHosts that consistently return 5xx get their crawl rate slowed automatically (a separate adaptive feedback loop).
Data Model
URL Frontier (Kafka + per-shard state)
---------- Kafka topic ----------
Topic: url-frontier
Partitions: 1024 (sharded by hash(host))
Message: {url, priority, depth, discovered_at}Worker shards subscribe to specific partitions for politeness locality.
URL Seen Set (Bloom filter + exact backup)
---------- URL dedup ----------
Bloom filter (in-memory per shard): 1% FP rate, 75 GB across cluster
Exact set (Cassandra): for FP verification when bloom hits
- row key: (host, sha1(url))
- value: (last_fetched, etag, last_modified, status_code)When bloom says 'maybe seen', check Cassandra exact. When bloom says 'not seen', it's definitely not seen.
Content Store (S3 / HDFS)
---------- Content storage ----------
Bucket: web-crawl-content
raw/<sha1-prefix>/<sha1-of-url>.gz
metadata: { url, fetched_at, content_hash, size, mime_type }
Compressed; lifecycle to cold storage after 30 days for unindexed content.Per-Host Metadata (Postgres / Cassandra)
CREATE TABLE hosts (
host VARCHAR(253) PRIMARY KEY,
pagerank_score DOUBLE PRECISION,
crawl_delay_ms INT,
max_concurrency INT,
robots_last_fetch TIMESTAMPTZ,
last_5xx_count_24h INT,
blocked BOOLEAN DEFAULT FALSE
);Content Dedup (LSH buckets)
-- Cassandra
CREATE TABLE lsh_buckets (
band_idx int,
band_hash bigint,
document_id bigint, -- sha1(url)
PRIMARY KEY ((band_idx, band_hash), document_id)
);
CREATE TABLE document_signatures (
document_id bigint PRIMARY KEY,
signature blob, -- 400-byte MinHash
first_seen timestamp
);Scaling and Bottlenecks
A 1B-page host (reddit.com)
A single popular host has billions of URLs. Per-host politeness (e.g., 2 concurrent, 1 sec delay) caps fetches at ~2/sec, meaning crawling all of reddit.com takes years.
Mitigations:
- Negotiate higher crawl rate with the site's webmaster (most have a sitemap.xml).
- Use sitemap to skip dead URLs.
- Selective recrawl: prioritize pages by PageRank rather than crawling exhaustively.
Spam farms (AI-generated content)
A spam farm hosts 10M URLs of low-quality auto-generated content. We waste fetch budget. Mitigations:
- Spam classifier on fetched content (keyword density, structure heuristics, ML).
- Demote the host's PageRank if spam ratio > threshold.
- Add the host to a blocklist if spam ratio is overwhelming.
Frontier explosion
A single page can link to thousands of URLs. Naive insertion grows the frontier unboundedly. Mitigations:
- Per-domain link cap (max 1000 outbound links followed per page).
- Depth limit (URLs at depth > N don't expand further).
- Frontier size cap with eviction of low-priority URLs.
Fetcher worker failure
A worker crashing mid-fetch leaves an in-flight URL stranded. Mitigations:
- Workers commit Kafka offsets only after the fetch + parse + dedup pipeline completes.
- A dead worker's partition is reassigned (Kafka consumer group); the new owner restarts from the last committed offset.
- Idempotent: refetching a URL we've already crawled is not catastrophic (it's just wasted bandwidth).
Multi-region
Geographically distribute fetcher pools to be close to crawled content (cuts fetch latency and respects local network conditions). Frontier sharding considers host geographic affinity (a .jp host gets crawled from APAC region).
Trade-offs and Alternatives
Bloom filter vs exact set for URL dedup
Exact set is correct but costs memory (50B URLs * 80 bytes = 4 TB; sharded across cluster). Bloom filter is 75 GB, fits in memory, with 1% false positives (we'd skip ~1% of new URLs incorrectly). The cost-benefit favors bloom + Cassandra-as-fallback for the rare cases.
Why shingling vs simhash for near-dup?
Simhash is an alternative: a single 64-bit fingerprint with Hamming-distance comparison. It's smaller and faster but less precise for the 'is this 90% the same' question. Shingling + MinHash is the de facto choice for high-quality dup detection; simhash is good for 'roughly similar'.
Per-host sharding vs global politeness coordination
Global coordination (a central rate limiter) means every fetch checks a shared service. Doesn't scale. Per-host sharding moves politeness state to one node per host, no coordination needed; trade-off is hot hosts (lots of URLs from one domain) make their shard the bottleneck.
Fetching fully-rendered (JS) vs static HTML
Headless browsers (Puppeteer/Playwright) execute JavaScript and capture the rendered DOM. ~100x more expensive than static fetching (100 ms HTML fetch vs 5-10 sec JS render). Most crawlers fetch static HTML by default and selectively render only JS-heavy domains (sitemap-flagged or detected as JS apps).
Why Kafka for the frontier?
Durable, partitioned, replayable, supports high write rate. Alternatives:
- Redis: faster, in-memory, but losing it loses 1B URLs.
- Cassandra queue: works but Cassandra isn't optimized for queue access patterns.
- Custom: years of work to match Kafka's reliability.
Kafka is the standard answer for any 'durable distributed queue' problem.
Recrawl strategy: every N days vs adaptive
Fixed-interval recrawl wastes effort on rarely-changing pages and starves frequently-changing ones. Adaptive recrawl tracks per-page change rate (compare new content hash to previous) and adjusts the next-recrawl interval. Pages that change daily get crawled daily; pages that haven't changed in a year get crawled annually.
Why dedupe content if storage is cheap?
Dedup isn't just about storage. Near-duplicate pages pollute search results: a user asks for 'Apple Q3 earnings' and gets 50 versions of the same syndicated article. Dedup at crawl time prevents downstream indexing waste and improves search quality.
Real-World Examples
How real systems implement this in production
Google's crawler fetches ~50B pages/day across the public web. Adaptive recrawl based on page change rate and PageRank. Honors robots.txt strictly; publishes IP ranges so webmasters can identify legitimate Googlebot traffic. Distinct user-agents for search, image, video, mobile.
Trade-off: Google's adaptive recrawl saves bandwidth but means low-PageRank sites are crawled rarely (sometimes never). Webmasters complain that new pages on small sites can take weeks to be indexed; Google trades coverage for efficiency.
Microsoft Bing's crawler operates similarly to Googlebot but at smaller scale. Bing publishes a 'IndexNow' protocol where webmasters can push URL changes to Bing instead of waiting for crawl, reducing Bing's load and accelerating discovery.
Trade-off: IndexNow trades crawl autonomy for cooperation: Bing depends on webmasters notifying it of changes. Many sites adopt IndexNow because it's faster than waiting; Google has been slower to adopt similar protocols.
Common Crawl is an open web crawl publishing monthly snapshots of 5B+ pages to S3. They run a smaller cluster than Google but are explicitly polite: max 1 request/sec per host, prominent user-agent, contact info in headers. Researchers use the data for ML training and academic studies.
Trade-off: Common Crawl's politeness is so strict that they take days to crawl what Googlebot does in hours. The trade-off: they avoid being blocked anywhere, get cooperation from webmasters, and serve as a public good. Speed sacrificed for sustainability.
Open-source distributed crawler used by smaller search engines and academic projects. Architecturally similar to this lesson: Hadoop-backed URL frontier, distributed fetchers, deduplication via signatures. Lacks Google-scale tuning but illustrates the core patterns clearly.
Trade-off: Nutch's reliance on Hadoop made it operationally heavy. Modern reimplementations (StormCrawler, Heritrix) use streaming systems instead. The lesson: batch-oriented architectures (Hadoop MapReduce) are wrong for the fetch-process-dedup-queue loop, which wants streaming.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
URL hashed to determine frontier shard (by host). Sent to that shard's Kafka partition. Shard's worker pool consumes; the worker checks per-host eligibility (last_fetched + crawl_delay <= now AND in_flight < max_concurrency). When eligible, pops URL, increments in_flight. Robots.txt checked from cache; if disallowed, drop. DNS resolved from cache. HTTP fetch with timeout; content saved to S3 with key sha1(url). Parser extracts links and content; content_hash + MinHash signature computed; if near-duplicate of existing, marked dup. Each new outbound link checked against bloom filter; if not seen, queued back to frontier. Worker decrements in_flight. Total: ~200-500 ms per URL on average.
Per-host politeness enforced via frontier sharding. Each host has a queue, last_fetched_timestamp, crawl_delay (from robots.txt or default 1000 ms), and max_concurrency (default 2). A worker can only fetch from a host if (now - last_fetched) >= crawl_delay AND in_flight < max_concurrency. Adaptive throttling: if a host returns persistent 5xx, slow down (double crawl_delay until errors stop). Operator dashboards alert on hosts where we exceed politeness thresholds, allowing emergency interventions.
Compute k-shingles (sliding 5-word windows) for each document. Hash each shingle with 100 hash functions; take the minimum per function -> a 100-element MinHash signature (~400 bytes). The Jaccard similarity of two signatures is the fraction of matching positions, which approximates the document-level Jaccard similarity. For fast lookup, split signatures into bands and use Locality-Sensitive Hashing: documents sharing any band-bucket are candidates; verify with full Jaccard. Threshold ~0.9 for 'near-duplicate'.
Workers commit their Kafka offsets only after the entire pipeline (fetch + parse + dedup + queue new URLs) completes successfully. If a worker crashes mid-fetch, its Kafka offset is still pointing at the URL it was processing. Kafka's consumer group reassigns the partition to another worker, which restarts from the same offset and refetches the URL. Refetching is idempotent (we just overwrite the content_hash entry). Net effect: at-most-one-page lost? No, exactly: refetched once. Slight bandwidth cost, no correctness issue.
At 2 concurrent, 1-sec delay, we fetch Wikipedia at ~2 pages/sec, meaning 1B pages takes ~16 years. Mitigations: (1) request higher crawl rate from the site (sitemap publishers usually have a contact). (2) Use sitemaps to skip dead URLs and prioritize fresh ones. (3) Adaptive recrawl: don't refetch every page every cycle; prioritize by PageRank and last-modified hints. (4) Selective coverage: indexing the top 100M Wikipedia pages by PageRank captures 99% of search value at 10x less cost. Don't try to crawl exhaustively.
Interview Tips
How to discuss this topic effectively
Lead with politeness as the central constraint. Saying 'the URL frontier must shard by host because politeness is per-host' immediately frames the architecture correctly.
Mention robots.txt by name and explain crawl_delay. Many candidates skip this; it signals you understand the social contract of crawling.
Use bloom filter for URL dedup. It's the textbook answer; saying 'bloom filter with Cassandra fallback for false-positive verification' is the senior framing.
For near-duplicate detection, name shingling and MinHash specifically. These are real algorithms with real names; saying them confidently signals depth.
Always discuss hostile sites (parser bombs, redirect loops, slow responses). This shows operational maturity that pure algorithm-focused candidates miss.
Common Mistakes
Pitfalls to avoid in interviews
Letting workers grab URLs without per-host politeness coordination
Naively, 1000 workers all try the same popular host. They generate 1000 concurrent requests, get blocked, and look like a DOS attack. Shard the URL frontier by host so each shard's workers naturally rate-limit per-host. Politeness is the single biggest constraint.
Storing every URL in a SQL table for dedup
50B URLs at 80 bytes = 4 TB. Lookups during fetch would be slow. Use a bloom filter (~75 GB, in-memory, 1% FP rate) with Cassandra as exact-set fallback when the filter says 'maybe'. The 1% false-positive rate means rarely missing a new URL, which is acceptable.
Detecting duplicates only by exact hash
Many pages differ only in ads, timestamps, or trivial layout. Exact hashing misses these and you store thousands of near-duplicate copies. Shingling + MinHash + LSH catches near-duplicates at <100 byte signatures and bucketed lookup.
Ignoring robots.txt
Crawling without honoring robots.txt is rude and can get your IPs blacklisted across the web. Fetch /robots.txt before any fetch on a new host (cached daily); honor disallowed paths and crawl_delay. Even if your crawler is technically capable of more, politeness is a long-term cooperation game.
Following every link without depth or per-page caps
A single page can link to thousands of URLs (sitemaps, archives, category pages). Without limits the frontier explodes uncontrollably. Cap per-page outbound links (1000 max), depth (10 levels max), and total frontier size with low-priority eviction.
