System Design Article

Design a Web Crawler

Difficulty: Medium

Design a distributed web crawler that fetches 5 billion pages per month from the public web while respecting robots.txt, applying per-host politeness limits, deduplicating URLs and content across a 50PB corpus, and feeding the indexer pipeline downstream. The interview centerpiece is the URL frontier: a priority-aware queue of pending URLs sharded by host so politeness rules can be enforced per domain, plus content deduplication via hashing and shingling. We cover the fetcher worker pool, DNS caching, content extraction, the bloom-filter URL seen set, and how to handle hostile sites (large pages, redirect loops, slow responses, deliberate spam).

Design a Web Crawler

System Design

Medium

design-web-crawler

case-study

search-discovery

web-crawler

url-frontier

politeness

robots-txt

bloom-filter

shingling

minhash

content-dedup

distributed-fetching

system-design

intermediate

premium

626 views

Requirements

Functional Requirements

Crawl 5B pages/month from a seed list of URLs, expanding via discovered links.
Respect robots.txt for every host: honor disallowed paths and crawl delay.
Per-host politeness: at most N concurrent connections to one host; minimum delay between requests.
Deduplicate URLs so the same URL isn't fetched twice.
Deduplicate content so near-duplicate pages (e.g., the same article on a syndicated network) don't waste storage.
Extract links from each fetched page and add to the frontier.
Store raw HTML for the indexer pipeline to consume.
Recrawl periodically to refresh changed pages.

Out of Scope (state explicitly)

The actual search index (handled by the search engine case study).
Rendering JavaScript-heavy pages (we focus on static HTML; headless browsers are a 100x cost extension).
Crawling auth-protected content (we treat the crawler as anonymous).
Real-time streaming of newly published pages (we focus on bulk crawling).

Non-Functional Requirements

Scale: 5B pages/month (~2K pages/sec sustained, ~10K peak).
Storage: ~50 PB of compressed HTML retained for indexing.
Politeness: never violate per-host limits or robots.txt; crawler should be a 'good citizen'.
Robustness: hostile sites (parser bombs, redirect loops, slow responses) must not crash the crawler.
Freshness: high-priority pages (news sites) recrawled within hours; long-tail recrawled monthly.
Coverage: at least 70% of pages discoverable from the seed set should be indexed within 30 days.

Back-of-the-Envelope Estimation

Fetch and Storage

Text

---------- Fetch and storage estimation ----------
Pages per month:                 5B
Pages per day:                   ~167M
Pages per second (avg):          ~2K
Pages per second (peak 5x):      ~10K

Avg page size (HTML, compressed): ~100 KB raw -> ~20 KB gzipped
Daily fetched bytes:              167M * 100 KB = 16.7 TB/day fetched
Stored compressed:                 167M * 20 KB = 3.3 TB/day stored
Monthly storage:                  100 TB/month
With 3x replication:              300 TB/month
Lifetime corpus (~5 years):       ~18 PB

URL Frontier

Text

---------- URL frontier sizing ----------
Distinct URLs known:              ~50B (most web URLs known to crawlers)
URL avg length:                   ~80 chars
Flat URL list:                    ~4 TB
URL bloom filter (1% FP rate):    ~75 GB
Frontier (queued URLs):           ~1B at any time -> 80 GB

Bandwidth and Compute

Text

---------- Bandwidth and compute ----------
Ingress bandwidth:                10K pages/sec * 100 KB = 1 GB/s peak
Fetcher workers needed:           ~1000 (each handling ~10 fetches/sec)
DNS lookups:                      10K/sec * 1 unique host per ~100 URLs = ~100/sec
Parser CPU:                       10K pages/sec * ~1 ms parse = 10 cores per node

High-Level Design

Text

---------- High-level architecture ----------
   +-----------+
   |  Seed     |
   |  URLs     |
   +-----------+
        |
        v
   +-------------------+
   |  URL Frontier     |
   | (sharded by host) |
   +-------------------+
        |
        v
   +-------------------+
   |  Fetcher Workers  |
   |  - DNS cache      |
   |  - HTTP client    |
   |  - politeness     |
   +-------------------+
        |       \
        |        \
        v         v
   +-------------------+   +-------------------+
   |   robots.txt      |   |   DNS Resolver    |
   |   Cache           |   |   (cached)        |
   +-------------------+   +-------------------+
        |
        v
   +-------------------+
   |  Content Store    |
   |  (S3 / HDFS)      |
   +-------------------+
        |
        v
   +-------------------+
   |  Parser           |
   |  - extract links  |
   |  - clean text     |
   +-------------------+
        |
        v
   +-------------------+
   |  URL Dedup        |
   |  (Bloom filter +  |
   |   exact set)      |
   +-------------------+
        |
        v   (new URLs)
   +-------------------+
   |  URL Frontier     |
   +-------------------+
        
        +----------+
        |          |
        v          v
   +-------------------+
   | Content Dedup     |
   | (shingling /      |
   |  MinHash)         |
   +-------------------+
        |
        v
   +-------------------+
   |  Indexer Pipeline |
   |  (downstream)     |
   +-------------------+

Crawl Loop

Text

---------- Crawl loop ----------
1. Frontier returns next URL respecting per-host politeness.
2. Worker checks robots.txt cache for the host:
   - If cached, validate URL is allowed.
   - If not cached, fetch /robots.txt first, then check.
3. Worker resolves the host via DNS cache (TTL ~1 hr).
4. Worker fetches the URL with timeout (~10 sec) and size cap (~5 MB).
5. On success, content saved to Content Store with key = sha1(url).
6. Parser extracts: title, body text, outbound links.
7. Each new outbound link is hashed and checked against URL bloom filter.
   - If not seen, added to frontier.
8. Content hash + shingles computed; if near-duplicate of existing, mark dup.
9. Worker reports completion; frontier updates last-fetched timestamp.

Detailed Design

The two interesting components are the URL frontier (politeness-aware queue) and content deduplication via shingling.

URL Frontier: Politeness Through Sharding

Why per-host sharding?

Say the crawler has 1000 worker threads and a frontier with 10K URLs from cnn.com next. If workers grab URLs naively, they'll all hit cnn.com simultaneously, generating 1000 concurrent requests. This is rude (likely to get blocked, may DOS the site).

The rule: at most N concurrent connections per host (where N=2 is typical) AND at least M ms between requests to the same host (where M comes from robots.txt or defaults to 1000 ms).

To enforce this efficiently, shard the frontier by host:

Text

---------- Frontier shard layout ----------
Frontier shard 0:  URLs from hosts that hash to bucket 0
Frontier shard 1:  URLs from hosts that hash to bucket 1
...
Each shard owns a subset of hosts.
Within a shard, URLs are queued per-host.

Workers are also assigned to shards. A worker only fetches URLs from its shard's hosts, so politeness is enforced locally without coordination.

Frontier per-host queue structure

Text

---------- Per-host queue ----------
For each host H:
  - queue of pending URLs (FIFO with priority overlay)
  - last_fetched_timestamp
  - crawl_delay (from robots.txt; default 1000 ms)
  - in_flight_count
  - max_concurrency (default 2)

Worker loop:

Text

---------- Worker loop ----------
1. Pick a host H from this shard's eligible-now list:
   - eligible if (now - last_fetched) >= crawl_delay AND in_flight < max_concurrency
2. Pop next URL from H's queue.
3. Increment in_flight; record last_fetched.
4. Fetch URL.
5. On completion (success or failure), decrement in_flight.
6. If H has more URLs and is eligible, mark eligible again.

The 'eligible-now' list is a min-heap keyed by next_eligible_time. Workers pull the host whose eligibility expires soonest.

Priority within the frontier

Not all URLs are equal:

News sites and high-PageRank domains: priority 1, recrawled hourly.
Mid-tier (most blogs, forums): priority 2, recrawled weekly.
Long tail (rarely-linked pages): priority 3, recrawled monthly.

Each host's queue is a small priority queue, not pure FIFO. The frontier ranker assigns priority at insertion time based on PageRank, last-modified hints, and content type.

Frontier durability

1B URLs in the frontier means losing it is catastrophic. We persist:

The bloom filter of seen URLs (rebuildable but slow) -> S3 snapshots every hour.
Per-host queues -> Kafka (each host's queue is a topic partition; partitioned by host hash).

Workers consume from Kafka with offset tracking; on restart, resume from offset.

Content Deduplication via Shingling

Why dedupe content?

Many pages are near-duplicates: syndicated news articles, RSS aggregators, content farms. Storing all of them wastes space and pollutes search results.

Exact dedup (hash of the entire page) catches identical copies but misses near-duplicates with minor differences (ad insertions, timestamps, A/B tests).

Shingling

A k-shingle is a sliding window of k consecutive words. "the quick brown fox jumps" with k=3 produces shingles {"the quick brown", "quick brown fox", "brown fox jumps"}.

Two documents are near-duplicates if their shingle sets have high Jaccard similarity (intersection / union > threshold like 0.9).

Full shingle sets are large. We compress with MinHash:

Text

---------- MinHash sketch ----------
For each document:
  Compute all shingles.
  Hash each shingle with N independent hash functions (N ~= 100).
  Take the minimum hash for each function.
  -> N-element MinHash signature.

To estimate Jaccard between two docs:
  Compare their MinHash signatures element-wise.
  Fraction of matching positions ~= Jaccard similarity.

MinHash signatures are ~400 bytes per document. Vastly smaller than full shingle sets and Jaccard estimable in O(N) time.

LSH for fast lookup

Naive Jaccard comparison is O(corpus size) per new document. Locality-Sensitive Hashing (LSH) buckets MinHash signatures so similar signatures land in the same bucket.

Text

---------- LSH for near-dup lookup ----------
Split MinHash signature into B bands (each band = K rows).
Hash each band; use as a bucket key.
Documents sharing >=1 band-bucket are likely near-duplicates.
Verify with full Jaccard estimate.

False positives (non-dup pairs sharing a bucket) are filtered by the verification step. False negatives (dup pairs not sharing any bucket) are rare with appropriate B,K choices.

Fetcher Pool

Per-worker capacity

A fetcher worker (Go/Python with async I/O) handles ~50-100 concurrent fetches. With ~1000 workers, total concurrency ~50K-100K active fetches; throughput ~10K pages/sec at avg 200 ms per fetch.

DNS caching

DNS resolution is slow (50-200 ms per uncached lookup). Cache resolved IPs locally in each worker with TTL ~1 hour. A central DNS resolver service handles cache misses, batching lookups and respecting DNS server limits.

Text

---------- DNS cache ----------
Key:      hostname
Value:    [IP1, IP2, ...]
TTL:      1 hour (or DNS-provided TTL)
Miss:     resolve via local DNS or DNSaaS, populate cache

Connection pooling

HTTP/2 connections to the same host are reused. Each worker maintains a pool of connections per host (sized to max_concurrency for that host).

Robots.txt handling

Fetched once per host per day, cached in Redis:

Text

---------- Robots.txt cache ----------
Key:      robots:<host>
Value:    parsed robots.txt rules + crawl_delay
TTL:      24 hours

Missing robots.txt: assume allowed with default 1000 ms crawl delay.

Failure handling

Text

---------- Failure modes ----------
200 OK:                  process content
3xx Redirect:            follow up to 5 hops; track loops
4xx Client error:        log; do not retry (page is broken)
5xx Server error:        retry with exponential backoff; max 3 attempts;
                         if persistent, downgrade host's crawl rate
Timeout (10s):           treat as transient; retry once
Response > 5 MB:         truncate; log; stop crawling further on this host's deep pages

Hosts that consistently return 5xx get their crawl rate slowed automatically (a separate adaptive feedback loop).

Data Model

URL Frontier (Kafka + per-shard state)

Text

---------- Kafka topic ----------
Topic: url-frontier
Partitions: 1024 (sharded by hash(host))
Message: {url, priority, depth, discovered_at}

Worker shards subscribe to specific partitions for politeness locality.

URL Seen Set (Bloom filter + exact backup)

Text

---------- URL dedup ----------
Bloom filter (in-memory per shard): 1% FP rate, 75 GB across cluster
Exact set (Cassandra):              for FP verification when bloom hits
  - row key: (host, sha1(url))
  - value:   (last_fetched, etag, last_modified, status_code)

When bloom says 'maybe seen', check Cassandra exact. When bloom says 'not seen', it's definitely not seen.

Content Store (S3 / HDFS)

Text

---------- Content storage ----------
Bucket: web-crawl-content
  raw/<sha1-prefix>/<sha1-of-url>.gz
    metadata: { url, fetched_at, content_hash, size, mime_type }
Compressed; lifecycle to cold storage after 30 days for unindexed content.

Per-Host Metadata (Postgres / Cassandra)

SQL

CREATE TABLE hosts (
    host                VARCHAR(253) PRIMARY KEY,
    pagerank_score      DOUBLE PRECISION,
    crawl_delay_ms      INT,
    max_concurrency     INT,
    robots_last_fetch   TIMESTAMPTZ,
    last_5xx_count_24h  INT,
    blocked             BOOLEAN DEFAULT FALSE
);

Content Dedup (LSH buckets)

SQL

-- Cassandra
CREATE TABLE lsh_buckets (
    band_idx     int,
    band_hash    bigint,
    document_id  bigint,                  -- sha1(url)
    PRIMARY KEY ((band_idx, band_hash), document_id)
);

CREATE TABLE document_signatures (
    document_id  bigint PRIMARY KEY,
    signature    blob,                    -- 400-byte MinHash
    first_seen   timestamp
);

Scaling and Bottlenecks

A 1B-page host (reddit.com)

A single popular host has billions of URLs. Per-host politeness (e.g., 2 concurrent, 1 sec delay) caps fetches at ~2/sec, meaning crawling all of reddit.com takes years.

Mitigations:

Negotiate higher crawl rate with the site's webmaster (most have a sitemap.xml).
Use sitemap to skip dead URLs.
Selective recrawl: prioritize pages by PageRank rather than crawling exhaustively.

Spam farms (AI-generated content)

A spam farm hosts 10M URLs of low-quality auto-generated content. We waste fetch budget. Mitigations:

Spam classifier on fetched content (keyword density, structure heuristics, ML).
Demote the host's PageRank if spam ratio > threshold.
Add the host to a blocklist if spam ratio is overwhelming.

Frontier explosion

A single page can link to thousands of URLs. Naive insertion grows the frontier unboundedly. Mitigations:

Per-domain link cap (max 1000 outbound links followed per page).
Depth limit (URLs at depth > N don't expand further).
Frontier size cap with eviction of low-priority URLs.

Fetcher worker failure

A worker crashing mid-fetch leaves an in-flight URL stranded. Mitigations:

Workers commit Kafka offsets only after the fetch + parse + dedup pipeline completes.
A dead worker's partition is reassigned (Kafka consumer group); the new owner restarts from the last committed offset.
Idempotent: refetching a URL we've already crawled is not catastrophic (it's just wasted bandwidth).

Multi-region

Geographically distribute fetcher pools to be close to crawled content (cuts fetch latency and respects local network conditions). Frontier sharding considers host geographic affinity (a .jp host gets crawled from APAC region).

Trade-offs and Alternatives

Bloom filter vs exact set for URL dedup

Exact set is correct but costs memory (50B URLs * 80 bytes = 4 TB; sharded across cluster). Bloom filter is 75 GB, fits in memory, with 1% false positives (we'd skip ~1% of new URLs incorrectly). The cost-benefit favors bloom + Cassandra-as-fallback for the rare cases.

Why shingling vs simhash for near-dup?

Simhash is an alternative: a single 64-bit fingerprint with Hamming-distance comparison. It's smaller and faster but less precise for the 'is this 90% the same' question. Shingling + MinHash is the de facto choice for high-quality dup detection; simhash is good for 'roughly similar'.

Per-host sharding vs global politeness coordination

Global coordination (a central rate limiter) means every fetch checks a shared service. Doesn't scale. Per-host sharding moves politeness state to one node per host, no coordination needed; trade-off is hot hosts (lots of URLs from one domain) make their shard the bottleneck.

Fetching fully-rendered (JS) vs static HTML

Headless browsers (Puppeteer/Playwright) execute JavaScript and capture the rendered DOM. ~100x more expensive than static fetching (100 ms HTML fetch vs 5-10 sec JS render). Most crawlers fetch static HTML by default and selectively render only JS-heavy domains (sitemap-flagged or detected as JS apps).

Why Kafka for the frontier?

Durable, partitioned, replayable, supports high write rate. Alternatives:

Redis: faster, in-memory, but losing it loses 1B URLs.
Cassandra queue: works but Cassandra isn't optimized for queue access patterns.
Custom: years of work to match Kafka's reliability.

Kafka is the standard answer for any 'durable distributed queue' problem.

Recrawl strategy: every N days vs adaptive

Fixed-interval recrawl wastes effort on rarely-changing pages and starves frequently-changing ones. Adaptive recrawl tracks per-page change rate (compare new content hash to previous) and adjusts the next-recrawl interval. Pages that change daily get crawled daily; pages that haven't changed in a year get crawled annually.

Why dedupe content if storage is cheap?

Dedup isn't just about storage. Near-duplicate pages pollute search results: a user asks for 'Apple Q3 earnings' and gets 50 versions of the same syndicated article. Dedup at crawl time prevents downstream indexing waste and improves search quality.

Real-World Examples

How real systems implement this in production

Googlebot

Google's crawler fetches ~50B pages/day across the public web. Adaptive recrawl based on page change rate and PageRank. Honors robots.txt strictly; publishes IP ranges so webmasters can identify legitimate Googlebot traffic. Distinct user-agents for search, image, video, mobile.

Trade-off: Google's adaptive recrawl saves bandwidth but means low-PageRank sites are crawled rarely (sometimes never). Webmasters complain that new pages on small sites can take weeks to be indexed; Google trades coverage for efficiency.

Bingbot

Microsoft Bing's crawler operates similarly to Googlebot but at smaller scale. Bing publishes a 'IndexNow' protocol where webmasters can push URL changes to Bing instead of waiting for crawl, reducing Bing's load and accelerating discovery.

Trade-off: IndexNow trades crawl autonomy for cooperation: Bing depends on webmasters notifying it of changes. Many sites adopt IndexNow because it's faster than waiting; Google has been slower to adopt similar protocols.

Common Crawl

Common Crawl is an open web crawl publishing monthly snapshots of 5B+ pages to S3. They run a smaller cluster than Google but are explicitly polite: max 1 request/sec per host, prominent user-agent, contact info in headers. Researchers use the data for ML training and academic studies.

Trade-off: Common Crawl's politeness is so strict that they take days to crawl what Googlebot does in hours. The trade-off: they avoid being blocked anywhere, get cooperation from webmasters, and serve as a public good. Speed sacrificed for sustainability.

Apache Nutch

Open-source distributed crawler used by smaller search engines and academic projects. Architecturally similar to this lesson: Hadoop-backed URL frontier, distributed fetchers, deduplication via signatures. Lacks Google-scale tuning but illustrates the core patterns clearly.

Trade-off: Nutch's reliance on Hadoop made it operationally heavy. Modern reimplementations (StormCrawler, Heritrix) use streaming systems instead. The lesson: batch-oriented architectures (Hadoop MapReduce) are wrong for the fetch-process-dedup-queue loop, which wants streaming.

Quick Interview Phrases

Key terms to use in your answer

URL frontier sharded by host for politeness

bloom filter for URL dedup

shingling and MinHash for near-duplicate content

robots.txt cached per host

per-host crawl delay and concurrency cap

adaptive recrawl based on change rate

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens to a single URL from the moment it's added to the frontier.

URL hashed to determine frontier shard (by host). Sent to that shard's Kafka partition. Shard's worker pool consumes; the worker checks per-host eligibility (last_fetched + crawl_delay <= now AND in_flight < max_concurrency). When eligible, pops URL, increments in_flight. Robots.txt checked from cache; if disallowed, drop. DNS resolved from cache. HTTP fetch with timeout; content saved to S3 with key sha1(url). Parser extracts links and content; content_hash + MinHash signature computed; if near-duplicate of existing, marked dup. Each new outbound link checked against bloom filter; if not seen, queued back to frontier. Worker decrements in_flight. Total: ~200-500 ms per URL on average.

How do you make sure your crawler doesn't accidentally DOS a website?

How do you detect that two pages are near-duplicates?

What happens if a fetcher worker crashes mid-page?

How would you handle a 1B-page domain like Wikipedia at the politeness limit?

Interview Tips

How to discuss this topic effectively

Lead with politeness as the central constraint. Saying 'the URL frontier must shard by host because politeness is per-host' immediately frames the architecture correctly.

Mention robots.txt by name and explain crawl_delay. Many candidates skip this; it signals you understand the social contract of crawling.

Use bloom filter for URL dedup. It's the textbook answer; saying 'bloom filter with Cassandra fallback for false-positive verification' is the senior framing.

For near-duplicate detection, name shingling and MinHash specifically. These are real algorithms with real names; saying them confidently signals depth.

Always discuss hostile sites (parser bombs, redirect loops, slow responses). This shows operational maturity that pure algorithm-focused candidates miss.

Common Mistakes

Pitfalls to avoid in interviews

Letting workers grab URLs without per-host politeness coordination

Naively, 1000 workers all try the same popular host. They generate 1000 concurrent requests, get blocked, and look like a DOS attack. Shard the URL frontier by host so each shard's workers naturally rate-limit per-host. Politeness is the single biggest constraint.

Storing every URL in a SQL table for dedup

50B URLs at 80 bytes = 4 TB. Lookups during fetch would be slow. Use a bloom filter (~75 GB, in-memory, 1% FP rate) with Cassandra as exact-set fallback when the filter says 'maybe'. The 1% false-positive rate means rarely missing a new URL, which is acceptable.

Detecting duplicates only by exact hash

Many pages differ only in ads, timestamps, or trivial layout. Exact hashing misses these and you store thousands of near-duplicate copies. Shingling + MinHash + LSH catches near-duplicates at <100 byte signatures and bucketed lookup.

Ignoring robots.txt

Crawling without honoring robots.txt is rude and can get your IPs blacklisted across the web. Fetch /robots.txt before any fetch on a new host (cached daily); honor disallowed paths and crawl_delay. Even if your crawler is technically capable of more, politeness is a long-term cooperation game.

Following every link without depth or per-page caps

A single page can link to thousands of URLs (sitemaps, archives, category pages). Without limits the frontier explodes uncontrollably. Cap per-page outbound links (1000 max), depth (10 levels max), and total frontier size with low-priority eviction.

Back to System Design