System Design Article
Design an Email Service (Gmail)
Difficulty: Medium
Design an email service like Gmail handling 1.8B users storing 500 EB of email, accepting ~300B inbound messages per day from the public SMTP network while filtering 90%+ as spam, and serving full-text search over a user's entire inbox in under 200 ms. The interview centerpiece is the asymmetric architecture: SMTP is an untrusted public protocol with hostile traffic patterns (spam, phishing, sender forgery) that needs heavy gateway-side filtering, while the user-facing IMAP/web layer needs cheap reads, pagination of huge mailboxes, and per-user inverted indexes for search. We cover the SMTP MX gateway, the spam pipeline (SPF/DKIM/DMARC + ML), the per-user inverted index for search, and how mailboxes scale when one user holds 50 GB of email.
Requirements
Functional Requirements
- Receive email from the public internet via SMTP on port 25 (MX records point at us).
- Send email to other domains via outbound SMTP, with delivery status tracking.
- Web/IMAP/POP3 access: users read their inbox via a web app (preferred) or standard mail clients.
- Folders and labels: organize email into Inbox, Sent, Drafts, Trash, plus user-defined labels.
- Search: full-text search across the user's entire mailbox in <200 ms p99.
- Spam filtering: classify and route ~90% of inbound as spam without losing legitimate mail.
- Attachments: send/receive attachments up to 25 MB; deduplicate to save storage.
- Threading: group related messages by Message-ID/In-Reply-To headers.
Out of Scope (state explicitly)
- Calendar (separate product, even at Google).
- Contacts (similar separate product).
- End-to-end encryption (S/MIME, PGP) - assume server-side at-rest encryption only.
- Real-time chat (use the Chat System case study).
Non-Functional Requirements
- Scale: 1.8B users, ~300B inbound emails/day (after spam ~30B legitimate), 500 EB total storage.
- Latency: web inbox load <500 ms; search <200 ms p99.
- Durability: 11 nines (no email ever lost, even after spam filtering).
- Spam accuracy: <1% false-negative (spam in inbox), <0.1% false-positive (ham in spam).
- Outbound deliverability: stay above 99% inbox placement (sender reputation matters).
- Availability: 99.99%. Email is asynchronous so brief outages are tolerable; SMTP retries handle most.
Back-of-the-Envelope Estimation
Inbound Volume
---------- Inbound traffic estimation ----------
Users: 1.8B
Inbound emails/day total: ~300B (mostly spam)
Legitimate after filtering: ~30B (~10%)
Per legitimate user/day: ~17 emails
Peak inbound rate: 300B / 86400 * 3 = ~10M emails/sec peak
Average email size:
  Plain text body: ~2 KB
  With HTML: ~30 KB
  With one attachment: ~500 KB (avg)
  Without attachment (bulk): ~10 KB avg
Storage Growth
---------- Storage estimation ----------
Legitimate email/day: 30B * 10 KB = 300 TB/day raw
After dedup (attachments): ~200 TB/day effective
Legitimate email/year: 200 TB/day * 365 = ~73 PB/year
Total inbox storage today: ~500 EB across 20 years
  (most users have ~10 GB; power users 50-100 GB)
3x replication: ~1,500 EB (1.5 ZB) stored
After erasure coding (1.5x): 750 EB stored
Search Index
---------- Search index size ----------
Words per email: ~500 avg
Total words indexed: 500 EB / 10 KB = 5 * 10^16 emails * 500 = 2.5 * 10^19 tokens
Inverted index size (~30% of body): ~150 EB
Per-user index (typical): ~3 GB for a 10 GB mailbox
The key insight: we build per-user inverted indexes, not a single global one, because access is always per-user.
Bandwidth and Compute
---------- Compute / bandwidth ----------
Inbound bandwidth: 10M msg/s * 10 KB = 100 GB/s peak
Spam classification: 10M msg/s, ~5 ms each = 50K cores
Index updates: 30B/day legitimate, ~350K updates/sec avg
Search QPS: 1.8B users * ~3 searches/day = ~60K searches/sec avg
High-Level Design
---------- High-level architecture ----------
        +--------------+
        | Sender (any  |
        |   domain)    |
        +--------------+
                | SMTP (port 25)
                v
      +-------------------+
      | MX Gateway        |
      | - SPF/DKIM/DMARC  |
      | - rate-limit IP   |
      | - greylist        |
      +-------------------+
                |
                v  Kafka: ingress.raw
      +-------------------+
      | Spam Pipeline     |
      | - rule engine     |
      | - ML classifier   |
      | - virus scan      |
      +-------------------+
                |
            +---+---+
            v       v
          spam    inbox
          topic   topic
            |       |
            v       v
      +-------------------+
      | Mailbox Writer    |
      | - dedupe attach   |
      | - persist body    |
      | - thread linkage  |
      +-------------------+
                |
            +---+---+
            v       v
        +------+ +------+
        |Bigtab| | Blob |
        |/     | |Store |
        |Spann.| | (S3) |
        +------+ +------+
            |
            v
      +-------------------+
      | Index Pipeline    |
      | per-user index    |
      +-------------------+
                |
                v
      +-------------------+
      | Search Index      |
      | (Lucene-style,    |
      | per-user shards)  |
      +-------------------+
                ^
                | search query
                |
      +-------------------+
      | Web / IMAP        |
      | Front-End         |
      +-------------------+
                ^
                | HTTPS / IMAP
                |
      +-------------------+
      | User Client       |
      +-------------------+
Public-facing Surface
Four distinct entry points; each is its own protocol with a different threat model.
---------- Public surfaces ----------
Port 25 (SMTP): inbound from the world; UNTRUSTED
Port 993 (IMAPS): authenticated user access via mail clients
Port 443 (HTTPS): authenticated user access via web/mobile app
Port 587 (Submission): authenticated outbound from user's mail client
Port 25 is the dangerous one: anyone on the internet can connect and try to deliver mail. The MX Gateway must be ruthlessly defensive.
Inbound SMTP Flow
---------- Inbound SMTP flow ----------
1. Sender's MTA looks up MX record for example.com
   -> mx1.example.com
2. Connects to mx1:25, runs SMTP handshake
3. MX Gateway:
   a. Checks IP reputation (RBL: Spamhaus, etc.)
      - if listed, return 5xx -> sender retries elsewhere or drops
   b. SPF check: does the sending IP appear in the SPF record of the envelope sender (MAIL FROM) domain?
   c. Greylist: defer first attempt with 4xx (real senders retry; spam often doesn't)
   d. Rate-limit: max N emails/min per source IP
4. If all pass, accept the DATA stream (raw RFC 5322 message)
5. DKIM verification: check the cryptographic signature in the headers
6. DMARC alignment: does the From-header domain align with the domain that passed SPF or DKIM, and what policy does it publish?
7. Enqueue to Kafka topic ingress.raw partitioned by recipient_user_id
8. Spam Pipeline consumes; classifies; routes to inbox or spam topic
9. Mailbox Writer persists; updates user's mailbox view
10. Index Pipeline updates per-user inverted index
User Read Flow
---------- User read flow ----------
1. User opens Gmail web app -> HTTPS to Web Front-End
2. Front-End queries Mailbox API:
   GET /api/v1/inbox?cursor=<>&limit=50
3. Mailbox Service queries Bigtable:
   SELECT * FROM mailbox WHERE user_id = ? AND folder = 'inbox'
   ORDER BY received_at DESC LIMIT 50
4. Returns 50 message stubs (subject, from, snippet, has_attachment, labels)
5. User clicks a message -> GET /api/v1/messages/<id>
   Returns full body (HTML), inline attachments via signed URLs to S3
User Search Flow
---------- User search flow ----------
1. User types 'invoice from:acme'
2. Front-End queries Search API:
   GET /api/v1/search?q=invoice%20from%3Aacme
3. Query parsed: full-text 'invoice', filter from:acme
4. Search Index router maps user_id -> shard 17
5. Shard 17 looks up per-user index for this user
6. Inverted index lookup, BM25 scoring, returns top 100 message_ids
7. Mailbox Service hydrates message stubs from Bigtable
8. Result rendered in <200 ms
Detailed Design
The two interesting components are the SMTP gateway with spam filtering and the per-user search index.
MX Gateway and Spam Pipeline
Why filter at the edge?
300B inbound/day with 90% spam means we'd waste 270B writes per day if we accepted everything. The MX Gateway aggressively rejects at SMTP time so spam never enters our storage layer.
SPF, DKIM, DMARC: the three checks every interview asks about
---------- Email auth protocols ----------
SPF: DNS TXT record on the sender's (MAIL FROM) domain lists IPs allowed to send.
     "v=spf1 ip4:1.2.3.0/24 -all"
     We check: is the connecting IP in the list?
DKIM: Sender signs the message with a private key; public key in DNS.
      We verify the signature in the DKIM-Signature header.
      Proves the signed headers and body weren't altered in transit.
DMARC: Policy that ties SPF and DKIM to the visible From domain.
       "v=DMARC1; p=reject; aspf=s; adkim=s"
       Tells receivers what to do when neither SPF nor DKIM passes with alignment.
A message passing all three is highly likely to be from the claimed sender. A message failing all three is highly likely to be forged spam. Real spam classifiers weight these heavily but don't rely on them alone (DMARC isn't universal).
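How the three checks combine into one verdict can be sketched as below. This is a simplified illustration, not a full RFC 7489 implementation: the function name `evaluateDmarc`, the input shape, and the naive `orgDomain` helper (last two labels, no Public Suffix List) are all assumptions made for the example.

```javascript
// Hypothetical helper: naive organizational domain (last two labels).
// Real receivers consult the Public Suffix List; this shortcut is an assumption.
function orgDomain(domain) {
  return domain.toLowerCase().split(".").slice(-2).join(".");
}

// Strict mode ("s") needs an exact domain match; relaxed mode ("r") only
// needs the same organizational domain.
function aligned(fromDomain, authDomain, mode) {
  if (!authDomain) return false;
  const a = fromDomain.toLowerCase();
  const b = authDomain.toLowerCase();
  return mode === "s" ? a === b : orgDomain(a) === orgDomain(b);
}

// Returns "pass" when SPF or DKIM both passed AND aligns with the From-header
// domain; otherwise returns the policy the domain publishes.
function evaluateDmarc({ fromDomain, spf, dkim, policy }) {
  const spfOk =
    spf.result === "pass" && aligned(fromDomain, spf.domain, policy.aspf ?? "r");
  const dkimOk =
    dkim.result === "pass" && aligned(fromDomain, dkim.domain, policy.adkim ?? "r");
  if (spfOk || dkimOk) return "pass";
  return policy.p; // "none" | "quarantine" | "reject"
}

// Usage: SPF passed for a subdomain, DKIM failed, relaxed SPF alignment.
const verdict = evaluateDmarc({
  fromDomain: "example.com",
  spf: { result: "pass", domain: "bounce.example.com" },
  dkim: { result: "fail", domain: "example.com" },
  policy: { p: "reject", aspf: "r", adkim: "s" },
});
console.log(verdict); // "pass"
```

The point of the sketch is that one aligned pass is enough; a forged message typically passes SPF only for the attacker's own domain, which does not align with the visible From domain.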
Greylisting
A first-time (sender, recipient, IP) tuple gets a temporary 4xx error. Real MTAs queue and retry after a few minutes; spam blasters often don't retry (they move on to other targets). It is a simple trick that catches a large fraction of low-effort spam at near-zero cost.
Downside: legitimate mail is delayed by ~5 minutes on first contact. Most receivers cache the accepted (sender, recipient, IP) tuple for 30 days so subsequent mail flows immediately.
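The greylisting decision can be sketched as a small state machine. This is a minimal in-memory illustration: the `Greylist` class, the `Map`-based state, and the 451 reply code are assumptions for the example, while the 5-minute delay and 30-day allow window come from the figures above. A real gateway would keep this state in a shared store.

```javascript
const MIN_RETRY_MS = 5 * 60 * 1000; // sender must wait this long before retrying
const ALLOW_TTL_MS = 30 * 24 * 60 * 60 * 1000; // how long an accepted tuple is remembered

class Greylist {
  constructor() {
    this.firstSeen = new Map(); // tuple -> timestamp of first attempt
    this.allowUntil = new Map(); // tuple -> expiry of the allow entry
  }

  // Returns an SMTP-style decision for one delivery attempt.
  check(senderIp, mailFrom, rcptTo, now = Date.now()) {
    const key = `${senderIp}|${mailFrom.toLowerCase()}|${rcptTo.toLowerCase()}`;
    if ((this.allowUntil.get(key) ?? 0) > now) {
      return { code: 250, action: "accept" }; // known tuple, no delay
    }
    const first = this.firstSeen.get(key);
    if (first === undefined) {
      this.firstSeen.set(key, now);
      return { code: 451, action: "defer" }; // first contact: ask sender to retry
    }
    if (now - first < MIN_RETRY_MS) {
      return { code: 451, action: "defer" }; // retried too quickly
    }
    this.firstSeen.delete(key);
    this.allowUntil.set(key, now + ALLOW_TTL_MS);
    return { code: 250, action: "accept" };
  }
}

// Usage with explicit timestamps (milliseconds):
const greylist = new Greylist();
console.log(greylist.check("203.0.113.9", "a@sender.test", "b@example.com", 0).action); // "defer"
console.log(greylist.check("203.0.113.9", "a@sender.test", "b@example.com", 6 * 60 * 1000).action); // "accept"
```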
Rule engine + ML classifier
After accepting the DATA, the spam pipeline runs:
- Rule engine: heuristics (keyword density, suspicious links, malformed headers, all-caps subject). Fast, ~1 ms per email.
- Reputation lookup: sender IP + sender domain reputation in Redis. ~0.5 ms.
- URL scanning: extract URLs, look up in real-time blacklists (Google Safe Browsing-style). ~5 ms.
- ML classifier: features from above + bag-of-words + TF-IDF; gradient boosted model returns spam probability. ~3 ms.
- Virus scan: attachments scanned by ClamAV-style engine and a custom ML model.
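Combined latency budget: ~10 ms per email. If every message at the 10M emails/sec peak ran the full pipeline, we would need ~100K classification cores (the ~50K figure in the estimation section assumed ~5 ms per message).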
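The core count follows from multiplying arrival rate by per-message processing time. A rough sketch of that arithmetic, assuming each core processes one message at a time at full utilization (real deployments would add headroom); the `coresNeeded` helper and the `stageMs` object are illustrative names, and the stage costs are the ones listed above:

```javascript
// Cores needed = messages per second * seconds of processing per message.
function coresNeeded(messagesPerSec, msPerMessage) {
  return Math.ceil((messagesPerSec * msPerMessage) / 1000);
}

// Per-stage costs from the list above (virus scan has no stated figure).
const stageMs = { rules: 1, reputation: 0.5, urlScan: 5, ml: 3 };
const totalMs = Object.values(stageMs).reduce((sum, ms) => sum + ms, 0);

console.log(totalMs); // 9.5
console.log(coresNeeded(10e6, totalMs)); // 95000
console.log(coresNeeded(10e6, 10)); // 100000
```

Lookups such as reputation and URL scanning are mostly I/O wait rather than CPU, so treating the whole budget as core time gives an upper bound.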
Output: spam vs inbox
---------- Routing decision ----------
spam_probability < 0.1 -> inbox
0.1 <= prob < 0.5 -> inbox with 'suspicious' label
0.5 <= prob < 0.95 -> spam folder (recoverable)
prob >= 0.95 -> spam folder + alert if user opens
We never delete; users can recover from spam. False positives are worse than false negatives in email (a missed legitimate email is a real-world failure).
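The routing table translates directly into a small function. The thresholds are the ones above; the function name `routeBySpamProbability` and the shape of the returned object are assumptions for the sketch.

```javascript
// Maps a classifier score in [0, 1] to a delivery decision.
// Nothing is ever dropped: every branch lands in a user-visible folder.
function routeBySpamProbability(p) {
  if (typeof p !== "number" || Number.isNaN(p) || p < 0 || p > 1) {
    throw new RangeError(`spam probability must be in [0, 1], got ${p}`);
  }
  if (p < 0.1) return { folder: "inbox", labels: [], warnOnOpen: false };
  if (p < 0.5) return { folder: "inbox", labels: ["suspicious"], warnOnOpen: false };
  if (p < 0.95) return { folder: "spam", labels: [], warnOnOpen: false };
  return { folder: "spam", labels: [], warnOnOpen: true };
}

console.log(routeBySpamProbability(0.3));
// { folder: 'inbox', labels: [ 'suspicious' ], warnOnOpen: false }
```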
Per-User Search Index
Why per-user, not global?
A global index would mean every search joins the user's mailbox to a global posting list. The posting lists for common words ("order", "invoice") have billions of entries; filtering down to one user's hits is wasteful.
Per-user indexes invert the question: "in this user's posting list for 'invoice', return the top hits". Posting lists are small (the user's own mailbox); BM25 ranking is fast.
Sharding the index
---------- Search index sharding ----------
Users: 1.8B
Shards: 10,000 (180K users per shard avg)
Users per shard: bucketed by hash(user_id)
Shard storage: ~50 TB (per-user indexes for that shard's users)
A shard contains many small per-user indexes. Reads always target one user's index within one shard. Writes (new email) update one user's index within one shard. No cross-shard joins.
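One way the user-to-shard mapping could look is a stable hash of the user ID modulo the shard count. The sketch below uses 32-bit FNV-1a purely as an example of a stable hash; the choice of hash and the names `fnv1a32` and `hashShard` are assumptions, not something the design above prescribes.

```javascript
const NUM_SHARDS = 10000;

// 32-bit FNV-1a over the UTF-8 bytes of the input.
function fnv1a32(input) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of Buffer.from(String(input), "utf8")) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash;
}

// Same user ID always maps to the same shard, so reads and writes for one
// user never touch more than one shard.
function hashShard(userId, numShards = NUM_SHARDS) {
  return fnv1a32(userId) % numShards;
}

console.log(hashShard("user-1234") === hashShard("user-1234")); // true
```

Plain modulo remaps most users whenever the shard count changes; a production system would more likely use consistent hashing or a directory service so that shards can be split without moving every index.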
Index data structure
Lucene-style segments:
---------- Per-user index segment ----------
For each (user_id, term):
  posting list = [(message_id_1, positions), (message_id_2, positions), ...]
Stored on disk in compressed format.
Merged periodically (small new segments -> larger old segments).
Query: open segments, intersect posting lists, score with BM25, return top K.
For a 10 GB mailbox (50K messages), the per-user index is ~3 GB. A search across this index takes ~10-50 ms on SSD.
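The query path (intersect posting lists, score with BM25, return top K) can be illustrated with a toy in-memory index for a single user. This is a sketch under simplifying assumptions: the `UserIndex` class is hypothetical, there are no on-disk segments or positions, tokenization is ASCII-only, and BM25 uses the common defaults k1 = 1.2 and b = 0.75.

```javascript
class UserIndex {
  constructor() {
    this.postings = new Map(); // term -> Map(messageId -> term frequency)
    this.docLengths = new Map(); // messageId -> token count
  }

  static tokenize(text) {
    return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  }

  add(messageId, text) {
    const tokens = UserIndex.tokenize(text);
    this.docLengths.set(messageId, tokens.length);
    for (const term of tokens) {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const list = this.postings.get(term);
      list.set(messageId, (list.get(messageId) ?? 0) + 1);
    }
  }

  // AND query: a message must contain every term; hits are ranked by BM25.
  search(query, limit = 10, k1 = 1.2, b = 0.75) {
    const terms = [...new Set(UserIndex.tokenize(query))];
    const lists = terms.map((term) => this.postings.get(term));
    if (terms.length === 0 || lists.some((l) => l === undefined)) return [];
    const docCount = this.docLengths.size;
    let totalLen = 0;
    for (const len of this.docLengths.values()) totalLen += len;
    const avgLen = totalLen / docCount;
    // Walk the shortest posting list so the intersection stays cheap.
    const [shortest] = [...lists].sort((x, y) => x.size - y.size);
    const hits = [];
    for (const id of shortest.keys()) {
      if (!lists.every((l) => l.has(id))) continue;
      let score = 0;
      for (const l of lists) {
        const tf = l.get(id);
        const idf = Math.log(1 + (docCount - l.size + 0.5) / (l.size + 0.5));
        const lengthNorm = 1 - b + b * (this.docLengths.get(id) / avgLen);
        score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
      }
      hits.push({ messageId: id, score });
    }
    return hits.sort((x, y) => y.score - x.score).slice(0, limit);
  }
}

const index = new UserIndex();
index.add("m1", "Invoice from Acme for March");
index.add("m2", "Acme newsletter");
console.log(index.search("invoice acme").map((h) => h.messageId)); // [ 'm1' ]
```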
Updates: near-real-time but not synchronous
After the Mailbox Writer persists a message, an email.persisted event is emitted to Kafka. The Index Pipeline consumes and updates the per-user index. End-to-end indexing latency is ~1-5 seconds, which is fast enough for 'just received' email to appear in search.
Search query
// Search service routes to the right shard, then queries the user's index.
// hashShard, getShardClient, parseQuery, mailbox, and mergeScores are
// service-level helpers assumed to exist elsewhere.
async function search(userId, query) {
  const shardId = hashShard(userId); // e.g. user 1234 -> shard 17
  const shard = await getShardClient(shardId);
  const parsed = parseQuery(query); // 'invoice from:acme'
  const results = await shard.userSearch({
    userId,
    terms: parsed.terms, // ['invoice']
    filters: parsed.filters, // [{from: 'acme'}]
    limit: 100
  });
  // results is [{ messageId, score }, ...]
  const messages = await mailbox.bulkGet(
    userId,
    results.map((r) => r.messageId)
  );
  return mergeScores(messages, results);
}
Attachment Deduplication
The same PDF gets emailed to 1000 recipients of a corporate newsletter. Naively, we store 1000 copies. Dedup:
---------- Attachment dedup flow ----------
1. On message receive, compute SHA-256 of each attachment.
2. Look up SHA-256 in attachments table.
3. If exists, increment ref_count, link the new message to existing blob.
4. If not, write blob to S3, insert row in attachments table with ref_count=1.
5. Per-user mailbox stores only (message_id, attachment_sha256) tuple.
Result: ~3-5x storage savings on attachments across all users. The corporate newsletter PDF is stored once, referenced from 1000 mailboxes.
Garbage collection: when a user deletes a message, decrement ref_count. When ref_count hits 0, delete the blob. Run as a background job; deletion is eventual.
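The dedup and garbage-collection steps can be sketched with a reference-counted store keyed by content hash. In this illustration the `AttachmentStore` class is hypothetical and two in-memory `Map`s stand in for the attachments table and the blob store; only `crypto.createHash` is a real Node.js API.

```javascript
const { createHash } = require("crypto");

class AttachmentStore {
  constructor() {
    this.blobs = new Map(); // sha256 hex -> Buffer (stand-in for object storage)
    this.refCounts = new Map(); // sha256 hex -> number of referencing messages
  }

  // Stores the bytes once per distinct content and returns the content hash
  // that the mailbox row keeps as its attachment reference.
  put(bytes) {
    const sha = createHash("sha256").update(bytes).digest("hex");
    if (this.refCounts.has(sha)) {
      this.refCounts.set(sha, this.refCounts.get(sha) + 1);
    } else {
      this.blobs.set(sha, Buffer.from(bytes));
      this.refCounts.set(sha, 1);
    }
    return sha;
  }

  // Called when a message is deleted. Returns true if the blob was removed.
  release(sha) {
    const count = this.refCounts.get(sha);
    if (count === undefined) return false;
    if (count > 1) {
      this.refCounts.set(sha, count - 1);
      return false;
    }
    this.refCounts.delete(sha);
    this.blobs.delete(sha);
    return true;
  }
}

const store = new AttachmentStore();
const first = store.put(Buffer.from("newsletter.pdf bytes"));
const second = store.put(Buffer.from("newsletter.pdf bytes"));
console.log(first === second, store.blobs.size); // true 1
```

In a distributed setting the increment and the blob write must be atomic per hash (for example a conditional insert), and the delete would be deferred to the background job described above so that a concurrent `put` cannot race with removal.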
Outbound Email
Submission flow
The user's mail client connects to port 587 with auth. We accept the message, DKIM-sign it on the way out and send it from IPs listed in our SPF record (so receiving servers can verify), then deliver to the recipient's MX.
Sender reputation
Outbound IP reputation is everything in email. Each outbound IP has a reputation score with major receivers (Gmail, Yahoo, Microsoft). Sending spam from an IP destroys its reputation; recovery takes weeks.
Mitigations:
- Multiple outbound IP pools, used for different categories (transactional vs marketing vs internal).
- A single bad sender doesn't poison the whole pool: rate-limit per-user outbound, terminate accounts that send spam.
- 'IP warming': new IPs ramp up volume slowly to build reputation.
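Per-user outbound rate limiting is commonly done with a token bucket; the sketch below is one possible shape rather than a description of any production limiter. The class names, the capacity, and the refill rate are illustrative assumptions, and the clock is passed in explicitly so the behavior is easy to follow.

```javascript
class TokenBucket {
  constructor(capacity, refillPerSec, now = Date.now()) {
    this.capacity = capacity; // maximum burst size
    this.refillPerSec = refillPerSec; // sustained send rate
    this.tokens = capacity;
    this.last = now;
  }

  // Refills based on elapsed time, then spends one token if available.
  tryRemove(now = Date.now()) {
    const elapsedSec = Math.max(0, now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class OutboundLimiter {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.buckets = new Map(); // userId -> TokenBucket
  }

  // True if this user may send one more message right now.
  allowSend(userId, now = Date.now()) {
    let bucket = this.buckets.get(userId);
    if (!bucket) {
      bucket = new TokenBucket(this.capacity, this.refillPerSec, now);
      this.buckets.set(userId, bucket);
    }
    return bucket.tryRemove(now);
  }
}

// Burst of 2, then 1 message per second per user.
const limiter = new OutboundLimiter(2, 1);
console.log(limiter.allowSend("u1", 0), limiter.allowSend("u1", 0), limiter.allowSend("u1", 0));
// true true false
```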
Bounce handling
The receiving server may reject (5xx) immediately or accept-then-bounce later via a delivery status notification (DSN). We classify bounces:
---------- Bounce categories ----------
Hard bounce (5xx, permanent): mailbox doesn't exist; remove from active sends.
Soft bounce (4xx, transient): full mailbox / temporary block; retry with backoff.
Block (5xx, reputation): receiving server blacklisted us; alert ops.
Data Model
Bigtable / Spanner: messages and mailbox
Gmail historically used Bigtable; modern Gmail uses Spanner. The data model is wide-row with column families.
---------- Bigtable schema (conceptual) ----------
Row key: <user_id>#<message_id>
Column families:
  meta: from, to, subject, received_at, size, has_attach
  body: text, html, headers (full RFC 5322)
  labels: inbox=true, important=false, label_xyz=true
  flags: read, starred, trash
Reads scan a row range for one user efficiently. Writes target one row.
Folders/labels are inverted to 'rows-with-this-label' lists per user, also stored in Bigtable.
Postgres: users, accounts, settings
CREATE TABLE users (
  id BIGINT PRIMARY KEY,
  primary_email VARCHAR(254) UNIQUE NOT NULL,
  storage_used BIGINT NOT NULL DEFAULT 0,
  plan VARCHAR(16) NOT NULL, -- free, premium
  created_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE aliases (
  user_id BIGINT NOT NULL REFERENCES users(id),
  address VARCHAR(254) NOT NULL,
  PRIMARY KEY (address)
);
CREATE TABLE labels (
  user_id BIGINT NOT NULL,
  label_id INT NOT NULL,
  name VARCHAR(64) NOT NULL,
  color VARCHAR(7),
  PRIMARY KEY (user_id, label_id)
);
Object Storage (S3 / GCS): blobs and attachments
---------- Blob storage layout ----------
Bucket: gmail-blobs-prod
  attachments/<sha256-prefix>/<sha256>.bin (deduped; referenced by msgs)
  raw/<user_id>/<message_id>.eml (full RFC 5322 archive)
Search Index Storage
Distributed file system (Colossus / HDFS-like) holding Lucene-style segments per user. Index nodes load segments on demand with caching.
Redis: hot caches
---------- Redis keys ----------
inbox:<user_id>:cursor -> latest 50 message_id stubs TTL 1h
folder:<user_id>:<folder_name> -> recent message_ids TTL 1h
ip_reputation:<ip> -> score 0-100 no TTL
sender_reputation:<domain> -> score 0-100 no TTL
greylist:<sender_ip>:<rcpt>:<from> -> defer count TTL 1d
Scaling and Bottlenecks
The 50 GB inbox problem
A power user's mailbox with 100K messages strains paging. Mitigations:
- Cursor-based pagination (always 'next 50 after received_at = X'), not offset (which scans the whole prefix).
- Aggressive caching of the most recent N messages in Redis.
- Background generation of label aggregates: the count for 'unread in inbox' is precomputed, not counted on demand.
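Cursor-based pagination can be sketched as follows. The cursor format (base64-encoded JSON of the last row's `receivedAt` and `messageId`) and the function names are assumptions for illustration; the in-memory array stands in for a row-range scan, where a real store would seek straight to the cursor key instead of scanning.

```javascript
// Opaque cursor: base64 of the sort key of the last item on the page.
function encodeCursor(stub) {
  const key = { t: stub.receivedAt, id: stub.messageId };
  return Buffer.from(JSON.stringify(key), "utf8").toString("base64");
}

function decodeCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, "base64").toString("utf8"));
}

// `messages` must be sorted newest-first by (receivedAt, messageId).
function listPage(messages, cursor = null, limit = 50) {
  let start = 0;
  if (cursor) {
    const { t, id } = decodeCursor(cursor);
    // First row strictly older than the cursor key; ties broken by messageId.
    start = messages.findIndex(
      (m) => m.receivedAt < t || (m.receivedAt === t && m.messageId < id)
    );
    if (start === -1) start = messages.length;
  }
  const items = messages.slice(start, start + limit);
  const hasMore = start + items.length < messages.length;
  return {
    items,
    nextCursor: hasMore ? encodeCursor(items[items.length - 1]) : null,
  };
}

const inbox = [5, 4, 3, 2, 1].map((n) => ({ messageId: `m${n}`, receivedAt: n * 1000 }));
const page1 = listPage(inbox, null, 2);
const page2 = listPage(inbox, page1.nextCursor, 2);
console.log(page2.items.map((m) => m.messageId)); // [ 'm3', 'm2' ]
```

Because the cursor encodes a position in the sort order rather than an offset, new mail arriving at the top of the inbox does not shift or duplicate rows on later pages.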
The 100K-recipient mailing list
A legitimate corporate announcement sends to 100K employees. Our SMTP gateway sees the same DATA payload addressed to 100K recipients, either as long runs of RCPT TO commands or as many separate SMTP transactions. We can't reject it (it's legitimate), and we shouldn't create 100K storage copies (waste).
Mitigations:
- Detect identical-body messages via content hash and dedupe at ingestion (only the headers differ).
- Spread the 100K writes across mailbox shards by recipient.
Search index updates lag
Index updates run asynchronously. Under heavy load, lag may grow from 1-5s to minutes. Acceptable for search; not acceptable for the 'I just sent this email and want to find it' case. Solution: a 'recent inbox' overlay queried directly from Bigtable for the last 5 minutes, merged with the index for older results.
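The overlay merge might look like the sketch below. The function name `searchWithOverlay`, the message shape, the substring matching for overlay rows, and the choice to list overlay hits ahead of indexed hits are all assumptions; the 5-minute window is the one mentioned above.

```javascript
// indexHits: [{ messageId, score }] from the (possibly lagging) search index.
// recentMessages: rows read directly from the mailbox store.
function searchWithOverlay(indexHits, recentMessages, terms, now, windowMs = 5 * 60 * 1000) {
  const alreadyIndexed = new Set(indexHits.map((h) => h.messageId));
  const needles = terms.map((t) => t.toLowerCase());
  const overlay = recentMessages
    .filter((m) => now - m.receivedAt <= windowMs) // only the recent window
    .filter((m) => !alreadyIndexed.has(m.messageId)) // avoid duplicates
    .filter((m) => {
      const text = `${m.subject} ${m.body}`.toLowerCase();
      return needles.every((n) => text.includes(n)); // naive match, no ranking
    })
    .sort((a, b) => b.receivedAt - a.receivedAt)
    .map((m) => ({ messageId: m.messageId, score: null, source: "overlay" }));
  const indexed = indexHits.map((h) => ({ ...h, source: "index" }));
  return [...overlay, ...indexed];
}

const now = 10 * 60 * 1000;
const merged = searchWithOverlay(
  [{ messageId: "old-1", score: 2.1 }],
  [{ messageId: "new-1", subject: "Invoice 42", body: "attached", receivedAt: now - 30000 }],
  ["invoice"],
  now
);
console.log(merged.map((h) => h.messageId)); // [ 'new-1', 'old-1' ]
```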
Hot index shard
A shard containing a viral newsletter recipient (someone who got 1M emails in a day) becomes a hotspot for index updates. Solution: split that user's index across multiple sub-shards by date range; reads merge across them.
Multi-region durability
All writes replicate to at least 2 regions before ack. Reads serve from the nearest region. For email, eventual cross-region consistency is acceptable: 'I sent an email from US, opened it from EU 30 seconds later' might briefly show stale state.
The migration problem (a 50 GB user wants to leave)
Users can export their entire mailbox. For huge mailboxes, this is a multi-day archive job. Not a hot-path issue, but the design must support it (e.g., scheduled Spanner export to S3, then a downloadable archive).
Trade-offs and Alternatives
Per-user index vs global index
A global Lucene-style index would let us run ranking algorithms across all users (popularity signals, etc.). But every query would have to filter by user_id, and posting lists for common terms would be billions of entries. Per-user indexes lose cross-user signals but make queries fast and naturally enforce isolation. Gmail picks per-user; web search engines like Google Search go global.
Bigtable vs Spanner for mailboxes
Bigtable: high-throughput, simple key-value. Gmail historically used it. Spanner: a globally distributed SQL database with strongly consistent multi-row transactions. Modern Gmail uses Spanner because the new threading and labeling features need transactional updates across multiple rows. Trade-off: Spanner is more expensive but makes multi-row updates and schema evolution simpler.
Why not store full bodies in the mailbox row?
Large HTML emails can be 1 MB. Embedding them in the row inflates row reads. We store body in a separate column family or in object storage with a pointer; metadata in the main row stays small.
Why greylist, given the user-experience cost?
Before ML classifiers, greylisting blocked huge fractions of spam at near-zero cost. Today ML does most of the work, but greylisting still catches the cheapest blasts. It's a 5-minute first-mail latency in exchange for ~30% spam volume reduction; usually worth it for new senders.
Why dedupe attachments only?
Message bodies are usually unique enough that dedup wouldn't save much. Attachments (PDFs, images) are often identical across many recipients. The cost-benefit of a content-hash lookup makes sense for binary blobs but not for short HTML.
Why per-user storage quotas?
Free-tier storage quotas (15 GB at Google) prevent a single user from costing more than the ad revenue they generate. A small fraction of users would fill petabytes if uncapped. Trade-off: paid tiers are a real revenue driver because many users prefer paying to deleting email.
Real-time vs batch index updates
Batch (every 10 minutes) is cheaper and simpler. Real-time (per email) means the index is never more than a few seconds behind. For a 'search my latest email' UX, real-time wins. The cost: significantly more complex infrastructure (a streaming index update pipeline vs periodic batch jobs).
Real-World Examples
How real systems implement this in production
Gmail handles ~1.8B users with mailboxes stored on Spanner (migrated from Bigtable). Search uses per-user Lucene-style indexes. Spam filtering relies heavily on ML; Google reports blocking more than 99.9% of spam, phishing, and malware before it reaches the inbox. Attachment storage uses Colossus (Google's distributed file system) with content-hash dedup.
Trade-off: Gmail's investment in ML spam filtering created a moat: small competitors can't match Gmail's spam recall because they lack training data. The lesson: in spam, scale begets accuracy begets more scale.
ProtonMail emphasizes end-to-end encryption: the server stores message bodies only as ciphertext. Full-text search over bodies therefore runs client-side, against an index built on the user's device after messages are downloaded and decrypted; server-side search is limited to metadata the server can read.
Trade-off: ProtonMail's privacy guarantees make server-side ML spam filtering much weaker wherever the server can't read the body. They compensate with header-based filtering and IP reputation: privacy at the cost of spam recall.
Microsoft runs Exchange Online on a different model: tenant-isolated mailboxes (each company gets its own logical isolation), stored in a custom database (ESE engine evolved from on-premises Exchange). Search uses per-mailbox indexes managed by FAST Search.
Trade-off: Per-tenant isolation simplifies compliance (a US tenant's mail never leaves US data centers) but means cross-tenant features (forwarding, calendar share) need extra plumbing. Microsoft trades some flexibility for enterprise compliance wins.
Yahoo Mail stored mailboxes in MySQL clusters historically, sharded by user. They famously suffered the 2013 breach of 3B accounts. Modern Yahoo Mail uses a custom NoSQL message store with per-user sharding and Lucene-style search.
Trade-off: Yahoo's MySQL legacy made some operations expensive (cross-mailbox search for legal hold required scatter-gather over hundreds of shards). The lesson: pick a storage model that supports your auditing and compliance access patterns from day one.
Common Interview Questions
Questions you might be asked about this topic
Sender's MTA looks up MX, connects to mx-gateway:25. Gateway runs IP reputation, SPF check, greylist (defer if first contact). Accepts DATA, runs DKIM signature verify, DMARC alignment. Enqueues raw to Kafka topic ingress.raw partitioned by recipient_user_id. Spam pipeline pulls, runs rules + ML + virus scan; routes to inbox or spam topic. Mailbox Writer dedupes attachments (SHA-256 lookup), persists message in Bigtable row keyed by user#message_id, links thread, emits email.persisted event. Index Pipeline updates per-user inverted index. Total ingestion latency is ~5 seconds (excluding any greylisting deferral); the message is visible in the inbox as soon as the Mailbox Writer persists it and becomes searchable once the index update lands.
Per-user inverted index sharded by user_id (10K shards). Each user's index is a Lucene-style segment file (~3 GB for a 10 GB mailbox). Search routes to the user's shard via consistent hash, opens segments, intersects posting lists for query terms, scores with BM25, returns top 100 message_ids. Hydration of stubs from Bigtable in parallel. Index nodes cache hot segments in memory. End-to-end: 10-50 ms index query + 30-100 ms hydration = comfortably under 200 ms.
All search queries are user-scoped (a user only searches their own mail). A global index would have billion-entry posting lists for common words and every query would have to filter to one user. Per-user indexes invert the question: open this user's small index and return top hits. Posting lists are tiny (the user's mailbox); BM25 is fast; no cross-user data leakage. The trade-off is no cross-user signals (popularity, etc.) but those don't apply to personal email anyway.
Layered filtering. At SMTP time: IP reputation (RBLs), greylisting, SPF/DKIM/DMARC checks, per-IP rate limits. After accepting: rule engine (heuristics on headers/content), URL blacklist lookups, ML classifier (gradient-boosted model on bag-of-words + signals), virus scan on attachments. Final spam_probability routes to inbox / suspicious-label / spam folder / spam folder + alert. Critically, never delete: false positives are worse than false negatives in email.
Sender IP reputation is everything. Maintain separate IP pools per category (transactional/marketing/user-submitted) so a bad sender on one pool doesn't poison the others. Sign all outbound with DKIM and publish strict SPF/DMARC. Rate-limit outbound per user; terminate spam senders aggressively. Warm new IPs slowly (start at low volume, ramp over weeks). Monitor delivery rates per receiver; if Gmail starts deferring our mail, slow down and investigate. Recovery from a reputation hit takes weeks, so prevention dominates.
Interview Tips
How to discuss this topic effectively
Lead by separating the public SMTP ingress from the user-facing read path. The asymmetry is the entire design; saying 'untrusted public protocol meets trusted user reads' wins points immediately.
Mention SPF + DKIM + DMARC by name and explain what each does. Most candidates only know SPF; full coverage signals you've actually run a mail server.
When asked about search, commit to per-user indexes from the start. Designing a global index for personal email is a classic anti-pattern that an experienced interviewer will catch.
Cite the spam ratio (~90% of inbound is spam). It justifies why we filter at the SMTP gateway, not in storage, and it is an easy number to anchor the rest of the estimate on.
Highlight attachment dedup. It's a small mention but signals you think about cost at scale; 1 PB saved is real money.
Common Mistakes
Pitfalls to avoid in interviews
Accepting all SMTP traffic and filtering after storage
If 90% of inbound is spam and you accept everything, you waste 90% of writes, 90% of bandwidth, and 90% of disk. The MX gateway must reject at SMTP time using IP reputation, SPF/DKIM, greylisting, and rate limits. Only ~10% should hit the spam pipeline.
Single global inverted index for search
A global index has billion-entry posting lists for common words; filtering down to one user is wasteful. Use per-user inverted indexes sharded by user_id; a search hits one user's small index and returns in <50 ms.
Storing every attachment copy without dedup
A corporate newsletter with a 10 MB PDF sent to 100K employees stores 1 TB without dedup. Hash attachments (SHA-256), reference-count, and store the blob once. Saves 3-5x on storage.
Treating outbound IP reputation as something you can fix on demand
IP reputation takes weeks to build and weeks to recover after a hit. You must isolate outbound traffic by category (transactional/marketing/user-submitted) on separate IP pools, rate-limit per user, and warm new IPs slowly. There is no 'just rotate the IP' shortcut.
Ignoring threading and treating each email as standalone
Users see threads, not individual messages. Group by Message-ID/In-Reply-To/References headers at write time. Without threading, a 50-reply conversation looks like 50 separate inbox entries and the UX collapses.
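Write-time thread assignment can be sketched as a lookup from parent Message-IDs to an existing thread. The `Threader` class and the `t1`, `t2` style thread IDs are hypothetical; in a real mailbox the map would be scoped per user and persisted. The sketch also does not merge threads when a parent arrives after its reply, which a production threader would need to handle.

```javascript
class Threader {
  constructor() {
    this.threadOf = new Map(); // Message-ID -> thread ID
    this.nextId = 1;
  }

  // Assigns a message to the thread of its nearest known ancestor, or starts
  // a new thread when no ancestor has been seen.
  assign({ messageId, inReplyTo = null, references = [] }) {
    // In-Reply-To is the direct parent; References lists ancestors oldest-first,
    // so walk it in reverse to try the closest ancestor first.
    const ancestors = [inReplyTo, ...[...references].reverse()].filter(Boolean);
    let threadId = null;
    for (const parent of ancestors) {
      if (this.threadOf.has(parent)) {
        threadId = this.threadOf.get(parent);
        break;
      }
    }
    if (threadId === null) threadId = `t${this.nextId++}`;
    this.threadOf.set(messageId, threadId);
    return threadId;
  }
}

const threader = new Threader();
console.log(threader.assign({ messageId: "<a@x>" })); // "t1"
console.log(threader.assign({ messageId: "<b@x>", inReplyTo: "<a@x>" })); // "t1"
console.log(threader.assign({ messageId: "<c@x>" })); // "t2"
```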
