System Design Article

Design an Email Service (Gmail)

Difficulty: Medium

Design an email service like Gmail handling 1.8B users storing 500 EB of email, accepting ~300B inbound messages per day from the public SMTP network while filtering 90%+ as spam, and serving full-text search over a user's entire inbox in under 200 ms. The interview centerpiece is the asymmetric architecture: SMTP is an untrusted public protocol with hostile traffic patterns (spam, phishing, sender forgery) that needs heavy gateway-side filtering, while the user-facing IMAP/web layer needs cheap reads, pagination of huge mailboxes, and per-user inverted indexes for search. We cover the SMTP MX gateway, the spam pipeline (SPF/DKIM/DMARC + ML), the per-user inverted index for search, and how mailboxes scale when one user holds 50 GB of email.

Requirements

Functional Requirements

  1. Receive email from the public internet via SMTP on port 25 (MX records point at us).
  2. Send email to other domains via outbound SMTP, with delivery status tracking.
  3. Web/IMAP/POP3 access: users read their inbox via a web app (preferred) or standard mail clients.
  4. Folders and labels: organize email into Inbox, Sent, Drafts, Trash, plus user-defined labels.
  5. Search: full-text search across the user's entire mailbox in <200 ms p99.
  6. Spam filtering: classify and route ~90% of inbound as spam without losing legitimate mail.
  7. Attachments: send/receive attachments up to 25 MB; deduplicate to save storage.
  8. Threading: group related messages by Message-ID / In-Reply-To headers.

Out of Scope (state explicitly)

  • Calendar (separate product, even at Google).
  • Contacts (similar separate product).
  • End-to-end encryption (S/MIME, PGP) - assume server-side at-rest encryption only.
  • Real-time chat (use the Chat System case study).

Non-Functional Requirements

  1. Scale: 1.8B users, ~300B inbound emails/day (after spam ~30B legitimate), 500 EB total storage.
  2. Latency: web inbox load <500 ms; search <200 ms p99.
  3. Durability: 11 nines (no email ever lost, even after spam filtering).
  4. Spam accuracy: <1% false-negative rate (spam reaching the inbox), <0.1% false-positive rate (legitimate mail landing in spam).
  5. Outbound deliverability: stay above 99% inbox placement (sender reputation matters).
  6. Availability: 99.99%. Email is asynchronous so brief outages are tolerable; SMTP retries handle most.

Back-of-the-Envelope Estimation

Inbound Volume

Text
---------- Inbound traffic estimation ----------
Users:                          1.8B
Inbound emails/day total:       ~300B (mostly spam)
  Legitimate after filtering:   ~30B (~10%)
  Legitimate per user/day:      ~17 emails
Peak inbound rate:              300B / 86400 * 3 = ~10M emails/sec peak

Average email size:
  Plain text body:              ~2 KB
  With HTML:                    ~30 KB
  With one attachment:          ~500 KB (avg)
  Without attachment (bulk):    ~10 KB avg

Storage Growth

Text
---------- Storage estimation ----------
Legitimate email/day:           30B * 10 KB = 300 TB/day raw
  After dedup (attachments):    ~200 TB/day effective
Legitimate email/year:          73 PB/year
Total inbox storage today:      ~500 EB across 20 years
  (most users have ~10 GB; power users 50-100 GB)

3x replication:                 1,500 EB (1.5 ZB) stored
After erasure coding (1.5x):    750 EB stored

Search Index

Text
---------- Search index size ----------
Words per email:                ~500 avg
Total words indexed:            500 EB / 10 KB = ~5 x 10^16 emails * 500 = ~2.5 x 10^19 tokens
Inverted index size (~30% of body): ~150 EB
Per-user index (mostly):         ~3 GB for a 10 GB mailbox

The key insight: we do per-user inverted indexes, not a single global one, because access is always per-user.

Bandwidth and Compute

Text
---------- Compute / bandwidth ----------
Inbound bandwidth:              10M msg/s * 10 KB = 100 GB/s peak
Spam classification:            10M msg/s, ~10 ms each = ~100K cores
Index updates:                  30B/day legitimate, ~350K updates/sec avg
Search QPS:                     1.8B users * ~3 searches/day = ~60K searches/sec avg
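
The arithmetic in these tables is easy to get wrong under interview pressure, so here is a minimal sketch that recomputes it. The inputs come from the tables above; the function name `estimate`, the 3x peak factor as a named constant, and the per-message classification cost as a parameter are illustrative choices, not part of any real tool.

```javascript
// Back-of-the-envelope inputs, taken from the estimation tables.
const SECONDS_PER_DAY = 86400;
const INBOUND_PER_DAY = 300e9;   // all inbound messages, mostly spam
const LEGIT_PER_DAY = 30e9;      // ~10% survive filtering
const USERS = 1.8e9;
const AVG_MESSAGE_BYTES = 10e3;  // ~10 KB without attachment
const PEAK_FACTOR = 3;           // assumed peak-to-average ratio
const SEARCHES_PER_USER_PER_DAY = 3;

// classifyMs is the assumed CPU time spent classifying one message.
function estimate({ classifyMs }) {
  const peakMsgsPerSec = (INBOUND_PER_DAY / SECONDS_PER_DAY) * PEAK_FACTOR;
  return {
    legitPerUserPerDay: LEGIT_PER_DAY / USERS,
    peakMsgsPerSec,
    peakInboundBytesPerSec: peakMsgsPerSec * AVG_MESSAGE_BYTES,
    rawBytesPerDay: LEGIT_PER_DAY * AVG_MESSAGE_BYTES,
    // One core handles 1000 / classifyMs messages per second.
    classifierCores: peakMsgsPerSec * (classifyMs / 1000),
    indexUpdatesPerSec: LEGIT_PER_DAY / SECONDS_PER_DAY,
    searchesPerSec: (USERS * SEARCHES_PER_USER_PER_DAY) / SECONDS_PER_DAY,
  };
}

console.log(estimate({ classifyMs: 10 }));
```

The core count scales linearly with the per-message budget, so doubling the classification cost doubles the fleet. That is why the spam pipeline's latency budget is worth stating explicitly.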

High-Level Design

Text
---------- High-level architecture ----------
   +--------------+
   | Sender (any  |
   |  domain)     |
   +--------------+
          | SMTP (port 25)
          v
   +-------------------+
   | MX Gateway        |
   |  - SPF/DKIM/DMARC |
   |  - rate-limit IP  |
   |  - greylist       |
   +-------------------+
          |
          v Kafka: ingress.raw
   +-------------------+
   | Spam Pipeline     |
   |  - rule engine    |
   |  - ML classifier  |
   |  - virus scan     |
   +-------------------+
          |
      +---+---+
      v       v
   spam     inbox
   topic    topic
      |       |
      v       v
   +-------------------+
   | Mailbox Writer    |
   |  - dedupe attach  |
   |  - persist body   |
   |  - thread linkage |
   +-------------------+
          |
      +---+---+
      v       v
   +------+ +------+
   |Bigtab| | Blob |
   |/     | |Store |
   |Spann.| | (S3) |
   +------+ +------+
          |
          v
   +-------------------+
   | Index Pipeline    |
   |  per-user index   |
   +-------------------+
          |
          v
   +-------------------+
   | Search Index      |
   | (Lucene-style,    |
   |  per-user shards) |
   +-------------------+
          ^
          | search query
          |
   +-------------------+
   | Web / IMAP        |
   | Front-End         |
   +-------------------+
          ^
          | HTTPS / IMAP
          |
   +-------------------+
   |   User Client     |
   +-------------------+

Public-facing Surface

Three distinct entry points; each is its own protocol with different threat models.

Text
---------- Public surfaces ----------
Port 25 (SMTP):    inbound from the world; UNTRUSTED
Port 993 (IMAPS):  authenticated user access via mail clients
Port 443 (HTTPS):  authenticated user access via web/mobile app
Port 587 (Submission): authenticated outbound from user's mail client

Port 25 is the dangerous one: anyone on the internet can connect and try to deliver mail. The MX Gateway must be ruthlessly defensive.

Inbound SMTP Flow

Text
---------- Inbound SMTP flow ----------
1. Sender's MTA looks up MX record for example.com
   -> mx1.example.com
2. Connects to mx1:25, runs SMTP handshake
3. MX Gateway:
   a. Checks IP reputation (RBL: Spamhaus, etc.)
      - if listed, return 5xx -> sender retries elsewhere or drops
   b. SPF check: does the sending IP appear in the SPF record of the envelope sender (MAIL FROM) domain?
   c. Greylist: defer first attempt with 4xx (real senders retry; spam often doesn't)
   d. Rate-limit: max N emails/min per source IP
4. If all pass, accept the DATA stream (raw RFC 5322 message)
5. DKIM verification: check the cryptographic signature in the headers
6. DMARC alignment: does the header From domain match the domain that passed SPF or DKIM, and what does its policy say to do on failure?
7. Enqueue to Kafka topic ingress.raw partitioned by recipient_user_id
8. Spam Pipeline consumes; classifies; routes to inbox or spam topic
9. Mailbox Writer persists; updates user's mailbox view
10. Index Pipeline updates per-user inverted index

User Read Flow

Text
---------- User read flow ----------
1. User opens Gmail web app -> HTTPS to Web Front-End
2. Front-End queries Mailbox API:
   GET /api/v1/inbox?cursor=<>&limit=50
3. Mailbox Service queries Bigtable:
   SELECT * FROM mailbox WHERE user_id = ? AND folder = 'inbox'
     ORDER BY received_at DESC LIMIT 50
4. Returns 50 message stubs (subject, from, snippet, has_attachment, labels)
5. User clicks a message -> GET /api/v1/messages/<id>
   Returns full body (HTML), inline attachments via signed URLs to S3

User Search Flow

Text
---------- User search flow ----------
1. User types 'invoice from:acme'
2. Front-End queries Search API:
   GET /api/v1/search?q=invoice%20from%3Aacme
3. Query parsed: full-text 'invoice', filter from:acme
4. Search Index router maps user_id -> shard 17
5. Shard 17 looks up per-user index for this user
6. Inverted index lookup, BM25 scoring, returns top 100 message_ids
7. Mailbox Service hydrates message stubs from Bigtable
8. Result rendered in <200 ms

Detailed Design

The two interesting components are the SMTP gateway with spam filtering and the per-user search index.

MX Gateway and Spam Pipeline

Why filter at the edge?

300B inbound/day with 90% spam means we'd waste 270B writes per day if we accepted everything. The MX Gateway aggressively rejects at SMTP time so spam never enters our storage layer.

SPF, DKIM, DMARC: the three checks every interview asks about
Text
---------- Email auth protocols ----------
SPF:    DNS TXT record on the sender's domain lists IPs allowed to send.
        "v=spf1 ip4:1.2.3.0/24 -all"
        We check: is the connecting IP in the list?

DKIM:   Sender signs the message with a private key; public key in DNS.
        We verify the signature in the DKIM-Signature header.
        Proves the message wasn't tampered with in transit.

DMARC:  Policy that ties SPF and DKIM together.
        "v=DMARC1; p=reject; aspf=s; adkim=s"
        Tells receivers what to do when SPF/DKIM fail.

A message passing all three is highly likely to be from the claimed sender. A message failing all three is highly likely to be forged spam. Real spam classifiers weight these heavily but don't rely on them alone (DMARC isn't universal).
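
How the three results combine can be sketched as follows. DMARC passes when at least one of SPF or DKIM passes and the authenticated domain aligns with the header From domain. The function names, the message and policy object shapes, and the disposition labels are hypothetical. Reducing a domain to its last two labels is a simplifying assumption; a real implementation would consult the Public Suffix List.

```javascript
// Simplified organizational domain: the last two labels.
// Assumption for illustration; real code needs the Public Suffix List.
function orgDomain(domain) {
  return domain.toLowerCase().split(".").slice(-2).join(".");
}

// Strict alignment needs an exact match; relaxed needs the same org domain.
function aligned(fromDomain, authDomain, strict) {
  if (!authDomain) return false;
  const a = fromDomain.toLowerCase();
  const b = authDomain.toLowerCase();
  return strict ? a === b : orgDomain(a) === orgDomain(b);
}

// msg:    { headerFrom, spf: { result, domain }, dkim: [{ result, domain }] }
// policy: { p: "none" | "quarantine" | "reject", aspf: "r" | "s", adkim: "r" | "s" }
function evaluateDmarc(msg, policy) {
  const spfOk =
    msg.spf.result === "pass" &&
    aligned(msg.headerFrom, msg.spf.domain, policy.aspf === "s");
  const dkimOk = msg.dkim.some(
    (sig) =>
      sig.result === "pass" &&
      aligned(msg.headerFrom, sig.domain, policy.adkim === "s")
  );
  const pass = spfOk || dkimOk;
  const onFail = { none: "accept", quarantine: "spam", reject: "reject" };
  return { pass, disposition: pass ? "accept" : onFail[policy.p] };
}

// A forwarded message: SPF fails at the forwarder, DKIM still verifies.
console.log(
  evaluateDmarc(
    {
      headerFrom: "billing.acme.com",
      spf: { result: "fail", domain: "forwarder.net" },
      dkim: [{ result: "pass", domain: "acme.com" }],
    },
    { p: "reject", aspf: "r", adkim: "r" }
  )
);
```

The forwarding case shows why DKIM matters alongside SPF: forwarding changes the connecting IP and breaks SPF, but an intact signature still lets DMARC pass.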

Greylisting

First-time sender from a new (sender, recipient, IP) tuple gets a temporary 4xx error. Real MTAs queue and retry in a few minutes; spam blasters often don't retry (they move on to other targets). One simple trick that catches a huge fraction of low-effort spam at near-zero cost.

Downside: legitimate mail is delayed by ~5 minutes on first contact. Most receivers cache the tuple for about 30 days once a retry succeeds, so subsequent mail flows immediately.
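
A minimal in-memory sketch of the greylisting decision, assuming a 5-minute minimum retry delay and a 30-day allow window as described above. The `Greylist` class and its method names are hypothetical. A production gateway would keep this state in a shared store such as Redis rather than a local `Map`.

```javascript
class Greylist {
  constructor({ minDelayMs = 5 * 60 * 1000, allowTtlMs = 30 * 24 * 3600 * 1000 } = {}) {
    this.minDelayMs = minDelayMs;
    this.allowTtlMs = allowTtlMs;
    this.firstSeen = new Map(); // tuple -> time of first delivery attempt
    this.allowed = new Map();   // tuple -> time the allow entry expires
  }

  // Returns the SMTP reply to give for this (IP, sender, recipient) tuple.
  check(ip, sender, recipient, now = Date.now()) {
    const key = `${ip}|${sender}|${recipient}`;

    const expiry = this.allowed.get(key);
    if (expiry !== undefined && expiry > now) {
      return { code: 250, action: "accept" };
    }

    const first = this.firstSeen.get(key);
    if (first === undefined) {
      this.firstSeen.set(key, now);
      return { code: 451, action: "defer" }; // first contact: temporary failure
    }
    if (now - first < this.minDelayMs) {
      return { code: 451, action: "defer" }; // retried too quickly
    }

    // The sender queued the message and retried like a real MTA.
    this.firstSeen.delete(key);
    this.allowed.set(key, now + this.allowTtlMs);
    return { code: 250, action: "accept" };
  }
}

const greylist = new Greylist();
console.log(greylist.check("203.0.113.7", "a@sender.example", "b@example.com"));
```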

Rule engine + ML classifier

After accepting the DATA, the spam pipeline runs:

  1. Rule engine: heuristics (keyword density, suspicious links, malformed headers, all-caps subject). Fast, ~1 ms per email.
  2. Reputation lookup: sender IP + sender domain reputation in Redis. ~0.5 ms.
  3. URL scanning: extract URLs, look up in real-time blacklists (Google Safe Browsing-style). ~5 ms.
  4. ML classifier: features from above + bag-of-words + TF-IDF; gradient boosted model returns spam probability. ~3 ms.
  5. Virus scan: attachments scanned by ClamAV-style engine and a custom ML model.

Combined latency budget: ~10 ms per email. With 10M emails/sec peak, we need ~100K classification cores.

Output: spam vs inbox
Text
---------- Routing decision ----------
spam_probability < 0.1   -> inbox
0.1 <= prob < 0.5        -> inbox with 'suspicious' label
0.5 <= prob < 0.95       -> spam folder (recoverable)
prob >= 0.95             -> spam folder + alert if user opens

We never delete accepted mail; users can recover anything from the spam folder. False positives are worse than false negatives in email: a missed legitimate email is a real-world failure, while a spam message in the inbox is an annoyance.
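
The thresholds above translate directly into a small routing function. This is a sketch: the function name and the shape of the returned object are illustrative assumptions, and only the cut-off values come from the table.

```javascript
// Maps the classifier's spam probability to a delivery decision.
function routeMessage(spamProbability) {
  const p = spamProbability;
  if (!(p >= 0 && p <= 1)) {
    throw new RangeError(`spam probability must be in [0, 1], got ${p}`);
  }
  if (p < 0.1) return { folder: "inbox", labels: [], warnOnOpen: false };
  if (p < 0.5) return { folder: "inbox", labels: ["suspicious"], warnOnOpen: false };
  if (p < 0.95) return { folder: "spam", labels: [], warnOnOpen: false };
  return { folder: "spam", labels: [], warnOnOpen: true };
}

console.log(routeMessage(0.3)); // lands in the inbox with a 'suspicious' label
```

Every branch returns a folder, so no input in range can cause a message to be dropped.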

Per-User Search Index

Why per-user, not global?

A global index would mean every search joins the user's mailbox to a global posting list. The posting lists for common words ("order", "invoice") have billions of entries; filtering down to one user's hits is wasteful.

Per-user indexes invert the question: "in this user's posting list for 'invoice', return the top hits". Posting lists are small (the user's own mailbox); BM25 ranking is fast.

Sharding the index
Text
---------- Search index sharding ----------
Users:                          1.8B
Shards:                         10,000 (180K users per shard avg)
Users per shard:                bucketed by hash(user_id)
Shard storage:                  ~50 TB (per-user indexes for that shard's users)

A shard contains many small per-user indexes. Reads always target one user's index within one shard. Writes (new email) update one user's index within one shard. No cross-shard joins.
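
One way to bucket users into shards is a stable hash of the user ID modulo the shard count. The sketch below uses 32-bit FNV-1a as an assumed hash function and takes the shard count of 10,000 from the table. The article does not say which hash a real system uses.

```javascript
const SHARD_COUNT = 10000;

// 32-bit FNV-1a over the characters of the key (ASCII input assumed).
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i) & 0xff;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Every read and write for one user lands on the same shard.
function hashShard(userId, shardCount = SHARD_COUNT) {
  return fnv1a32(String(userId)) % shardCount;
}

console.log(hashShard(1234));
```

Plain modulo hashing remaps most users when the shard count changes. Consistent hashing or a directory that maps users to shards limits that movement, at the cost of an extra lookup.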

Index data structure

Lucene-style segments:

Text
---------- Per-user index segment ----------
For each (user_id, term):
  posting list = [(message_id_1, positions), (message_id_2, positions), ...]

Stored on disk in compressed format.
Merged periodically (small new segments -> larger old segments).
Query: open segments, intersect posting lists, score with BM25, return top K.

For a 10 GB mailbox (50K messages), the per-user index is ~3 GB. A search across this index takes ~10-50 ms on SSD.
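
To make the query path concrete, here is a toy in-memory per-user index: posting lists per term, AND-style intersection, and BM25 scoring. It is a sketch under simplifying assumptions. It stores term frequencies rather than positions, keeps everything in one segment with no compression or merging, and expects each message to be added once. The `UserIndex` class and its defaults (`k1 = 1.2`, `b = 0.75`) are illustrative, not Lucene's API.

```javascript
class UserIndex {
  constructor({ k1 = 1.2, b = 0.75 } = {}) {
    this.k1 = k1;
    this.b = b;
    this.postings = new Map();  // term -> Map(messageId -> term frequency)
    this.docLength = new Map(); // messageId -> token count
    this.totalLength = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().match(/[a-z0-9]+/g) || [];
  }

  add(messageId, text) {
    const tokens = UserIndex.tokenize(text);
    this.docLength.set(messageId, tokens.length);
    this.totalLength += tokens.length;
    for (const term of tokens) {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const list = this.postings.get(term);
      list.set(messageId, (list.get(messageId) || 0) + 1);
    }
  }

  search(query, limit = 10) {
    const terms = [...new Set(UserIndex.tokenize(query))];
    const lists = terms.map((term) => this.postings.get(term));
    if (lists.length === 0 || lists.some((list) => !list)) return [];

    // Intersect starting from the shortest posting list.
    lists.sort((x, y) => x.size - y.size);
    const docCount = this.docLength.size;
    const avgLength = this.totalLength / docCount;
    const results = [];
    for (const id of lists[0].keys()) {
      if (!lists.every((list) => list.has(id))) continue;
      let score = 0;
      for (const list of lists) {
        const tf = list.get(id);
        const idf = Math.log(1 + (docCount - list.size + 0.5) / (list.size + 0.5));
        const norm = 1 - this.b + this.b * (this.docLength.get(id) / avgLength);
        score += (idf * tf * (this.k1 + 1)) / (tf + this.k1 * norm);
      }
      results.push({ messageId: id, score });
    }
    return results.sort((x, y) => y.score - x.score).slice(0, limit);
  }
}

const index = new UserIndex();
index.add("m1", "Invoice from Acme");
index.add("m2", "Lunch plans today");
console.log(index.search("invoice"));
```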

Updates: near-real-time but not synchronous

After the Mailbox Writer persists a message, an email.persisted event is emitted to Kafka. The Index Pipeline consumes and updates the per-user index. End-to-end indexing latency is ~1-5 seconds, which is fast enough for 'just received' email to appear in search.

Search query
// Search service routes to the right shard then queries the user's index
async function search(userId, query) {
    const shardId = hashShard(userId);              // e.g. user 1234 -> shard 17
    const shard = await getShardClient(shardId);
    const parsed = parseQuery(query);               // 'invoice from:acme'
    const results = await shard.userSearch({
        userId,
        terms: parsed.terms,                        // ['invoice']
        filters: parsed.filters,                    // [{from: 'acme'}]
        limit: 100
    });
    // results is [{ messageId, score }, ...]
    const messages = await mailbox.bulkGet(
        userId,
        results.map((r) => r.messageId)
    );
    return mergeScores(messages, results);
}

Attachment Deduplication

The same PDF gets emailed to 1000 recipients of a corporate newsletter. Naively, we store 1000 copies. Dedup:

Text
---------- Attachment dedup flow ----------
1. On message receive, compute SHA-256 of each attachment.
2. Look up SHA-256 in attachments table.
3. If exists, increment ref_count, link the new message to existing blob.
4. If not, write blob to S3, insert row in attachments table with ref_count=1.
5. Per-user mailbox stores only (message_id, attachment_sha256) tuple.

Result: ~3-5x storage savings on attachments across all users. The corporate newsletter PDF is stored once, referenced from 1000 mailboxes.

Garbage collection: when a user deletes a message, decrement ref_count. When ref_count hits 0, delete the blob. Run as a background job; deletion is eventual.
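
The dedup and reference-counting flow can be sketched with an in-memory stand-in for the attachments table and the blob store. The `AttachmentStore` class is hypothetical. It deletes the blob as soon as the count reaches zero, whereas the design above defers deletion to a background job.

```javascript
const { createHash } = require("crypto");

class AttachmentStore {
  constructor() {
    this.blobs = new Map();     // sha256 hex -> bytes (stand-in for object storage)
    this.refCounts = new Map(); // sha256 hex -> number of referencing messages
  }

  // Stores the bytes once and returns the content hash the message should keep.
  put(bytes) {
    const digest = createHash("sha256").update(bytes).digest("hex");
    const count = this.refCounts.get(digest) || 0;
    if (count === 0) this.blobs.set(digest, Buffer.from(bytes));
    this.refCounts.set(digest, count + 1);
    return digest;
  }

  // Called when a message referencing the attachment is deleted.
  release(digest) {
    const count = this.refCounts.get(digest);
    if (!count) throw new Error(`unknown attachment ${digest}`);
    if (count > 1) {
      this.refCounts.set(digest, count - 1);
      return;
    }
    this.refCounts.delete(digest);
    this.blobs.delete(digest); // last reference gone: reclaim the blob
  }
}

const store = new AttachmentStore();
const pdf = Buffer.from("newsletter.pdf contents");
const first = store.put(pdf);
const second = store.put(pdf);
console.log(first === second, store.blobs.size); // same hash, one stored blob
```

In a distributed setting the lookup, increment, and blob write need to be atomic or idempotent. Otherwise a delete racing with a new reference could reclaim a blob that is still in use.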

Outbound Email

Submission flow

The user's mail client connects to port 587 with auth. We accept the message, DKIM-sign it on the way out, send it from IPs listed in our domain's SPF record (so receiving servers can verify both), and submit it to the recipient's MX.

Sender reputation

Outbound IP reputation is everything in email. Each outbound IP has a reputation score with major receivers (Gmail, Yahoo, Microsoft). Sending spam from an IP destroys its reputation; recovery takes weeks.

Mitigations:

  • Multiple outbound IP pools, used for different categories (transactional vs marketing vs internal).
  • A single bad sender doesn't poison the whole pool: rate-limit per-user outbound, terminate accounts that send spam.
  • 'IP warming': new IPs ramp up volume slowly to build reputation.

Bounce handling

The receiving server may reject (5xx) immediately or accept-then-bounce later via a delivery status notification (DSN). We classify bounces:

Text
---------- Bounce categories ----------
Hard bounce (5xx, permanent): mailbox doesn't exist; remove from active sends.
Soft bounce (4xx, transient): full mailbox / temporary block; retry with backoff.
Block (5xx, reputation):       receiving server blacklisted us; alert ops.
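
A sketch of how the three categories might be told apart from an SMTP reply. The function names, action labels, and backoff constants are hypothetical. Treating enhanced status codes in the 5.7.x (security or policy) class and certain keywords as a reputation block is a heuristic assumption, since receivers word their rejections differently.

```javascript
// replyCode: 3-digit SMTP code; enhancedStatus: e.g. "5.1.1"; text: reply text.
function classifyBounce(replyCode, enhancedStatus = "", text = "") {
  const permanent = replyCode >= 500 && replyCode < 600;
  const transient = replyCode >= 400 && replyCode < 500;
  const looksLikeBlock =
    enhancedStatus.startsWith("5.7.") ||
    /blocked|blocklist|blacklist|reputation/i.test(text);

  if (permanent && looksLikeBlock) return { category: "block", action: "alert_ops" };
  if (permanent) return { category: "hard", action: "suppress_recipient" };
  if (transient) return { category: "soft", action: "retry_with_backoff" };
  return { category: "unknown", action: "log" };
}

// Exponential backoff for soft bounces: 1 min, 2 min, 4 min, ... capped at 4 h.
function retryDelayMs(attempt, baseMs = 60000, maxMs = 4 * 3600 * 1000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

console.log(classifyBounce(550, "5.1.1", "user unknown"));
console.log(classifyBounce(452, "4.2.2", "mailbox full"), retryDelayMs(3));
```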

Data Model

Bigtable / Spanner: messages and mailbox

Gmail historically used Bigtable; modern Gmail uses Spanner. The data model is wide-row with column families.

Text
---------- Bigtable schema (conceptual) ----------
Row key: <user_id>#<message_id>
Column families:
  meta:       from, to, subject, received_at, size, has_attach
  body:       text, html, headers (full RFC 5322)
  labels:     inbox=true, important=false, label_xyz=true
  flags:      read, starred, trash

Reads scan a row range for one user efficiently. Writes target one row.

Folders/labels are inverted to 'rows-with-this-label' lists per user, also stored in Bigtable.

Postgres: users, accounts, settings

SQL
CREATE TABLE users (
    id              BIGINT PRIMARY KEY,
    primary_email   VARCHAR(254) UNIQUE NOT NULL,
    storage_used    BIGINT NOT NULL DEFAULT 0,
    plan            VARCHAR(16) NOT NULL,           -- free, premium
    created_at      TIMESTAMPTZ NOT NULL
);

CREATE TABLE aliases (
    user_id  BIGINT NOT NULL REFERENCES users(id),
    address  VARCHAR(254) NOT NULL,
    PRIMARY KEY (address)
);

CREATE TABLE labels (
    user_id     BIGINT NOT NULL,
    label_id    INT NOT NULL,
    name        VARCHAR(64) NOT NULL,
    color       VARCHAR(7),
    PRIMARY KEY (user_id, label_id)
);

Object Storage (S3 / GCS): blobs and attachments

Text
---------- Blob storage layout ----------
Bucket: gmail-blobs-prod
  attachments/<sha256-prefix>/<sha256>.bin   (deduped; referenced by msgs)
  raw/<user_id>/<message_id>.eml             (full RFC 5322 archive)

Search Index Storage

Distributed file system (Colossus / HDFS-like) holding Lucene-style segments per user. Index nodes load segments on demand with caching.

Redis: hot caches

Text
---------- Redis keys ----------
inbox:<user_id>:cursor             -> latest 50 message_id stubs   TTL 1h
folder:<user_id>:<folder_name>     -> recent message_ids           TTL 1h
ip_reputation:<ip>                  -> score 0-100                 no TTL
sender_reputation:<domain>          -> score 0-100                 no TTL
greylist:<sender_ip>:<rcpt>:<from> -> defer count                 TTL 1d

Scaling and Bottlenecks

The 50 GB inbox problem

A power user's mailbox with 100K messages strains paging. Mitigations:

  • Cursor-based pagination (always 'next 50 after received_at = X'), not offset (which scans the whole prefix).
  • Aggressive caching of the most recent N messages in Redis.
  • Background generation of label aggregates: the count for 'unread in inbox' is precomputed, not counted on demand.
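
Cursor-based pagination can be sketched as below, assuming messages are ordered newest-first by `(receivedAt, id)` and the cursor encodes the last item returned. The function names and cursor format are illustrative. The linear `findIndex` stands in for what would be a direct seek on the row key in a real store.

```javascript
// Opaque cursor: position of the last item the client has seen.
function encodeCursor(message) {
  return Buffer.from(JSON.stringify([message.receivedAt, message.id])).toString("base64");
}

function decodeCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, "base64").toString("utf8"));
}

// messages must be sorted by receivedAt descending, then id descending.
function listPage(messages, cursor = null, limit = 50) {
  let start = 0;
  if (cursor) {
    const [receivedAt, id] = decodeCursor(cursor);
    // First message strictly older than the cursor position.
    start = messages.findIndex(
      (m) => m.receivedAt < receivedAt || (m.receivedAt === receivedAt && m.id < id)
    );
    if (start === -1) return { items: [], nextCursor: null };
  }
  const items = messages.slice(start, start + limit);
  const hasMore = start + items.length < messages.length;
  return { items, nextCursor: hasMore ? encodeCursor(items[items.length - 1]) : null };
}

const inbox = [
  { id: 3, receivedAt: 300 },
  { id: 2, receivedAt: 200 },
  { id: 1, receivedAt: 100 },
];
const page1 = listPage(inbox, null, 2);
const page2 = listPage(inbox, page1.nextCursor, 2);
console.log(page1.items.map((m) => m.id), page2.items.map((m) => m.id));
```

Including the message ID as a tie-breaker keeps pages stable when several messages share the same timestamp.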

The 100K-recipient mailing list

A legitimate corporate announcement sends to 100K employees. Our SMTP gateway sees 100K RCPT TO recipients (spread over many SMTP transactions) for the same DATA. We can't reject (it's legitimate); we can't create 100K storage copies (waste).

Mitigations:

  • Detect identical-body messages via content hash and dedupe at ingestion (only the headers differ).
  • Spread the 100K writes across mailbox shards by recipient.

Search index updates lag

Index updates run asynchronously. Under heavy load, lag may grow from 1-5s to minutes. Acceptable for search; not acceptable for the 'I just sent this email and want to find it' case. Solution: a 'recent inbox' overlay queried directly from Bigtable for the last 5 minutes, merged with the index for older results.
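
The overlay idea can be sketched as two steps: scan the last few minutes of messages directly, then merge those hits with the index results and deduplicate by message ID. The function names and hit shape are hypothetical, and ordering the merged list by date is an assumption; a real system might blend relevance scores instead.

```javascript
const OVERLAY_WINDOW_MS = 5 * 60 * 1000;

// Brute-force match over recent messages read straight from the mailbox store.
function scanRecent(recentMessages, terms, now = Date.now()) {
  const cutoff = now - OVERLAY_WINDOW_MS;
  return recentMessages
    .filter((m) => m.receivedAt >= cutoff)
    .filter((m) => {
      const text = m.text.toLowerCase();
      return terms.every((term) => text.includes(term.toLowerCase()));
    })
    .map((m) => ({ messageId: m.messageId, receivedAt: m.receivedAt }));
}

// Union of both sources, one entry per message, newest first.
function mergeSearchResults(indexHits, recentHits, limit = 100) {
  const byId = new Map();
  for (const hit of indexHits) byId.set(hit.messageId, hit);
  for (const hit of recentHits) {
    if (!byId.has(hit.messageId)) byId.set(hit.messageId, hit);
  }
  return [...byId.values()]
    .sort((a, b) => b.receivedAt - a.receivedAt)
    .slice(0, limit);
}

const now = 1000000000;
const recent = scanRecent(
  [{ messageId: "c", receivedAt: now - 1000, text: "Invoice attached" }],
  ["invoice"],
  now
);
console.log(mergeSearchResults([{ messageId: "a", receivedAt: now - 86400000 }], recent));
```

The scan is cheap because the window bounds it to a handful of messages per user, however far the index has fallen behind.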

Hot index shard

A shard containing a viral newsletter recipient (someone who got 1M emails in a day) becomes a hotspot for index updates. Solution: split that user's index across multiple sub-shards by date range; reads merge across them.

Multi-region durability

All writes replicate to at least 2 regions before ack. Reads serve from the nearest region. For email, eventual cross-region consistency is acceptable: 'I sent an email from US, opened it from EU 30 seconds later' might briefly show stale state.

The migration problem (a power user with a 50-100 GB mailbox wants to leave)

Users can export their entire mailbox. For huge mailboxes, this is a multi-day archive job. Not a hot-path issue, but the design must support it (e.g., scheduled Spanner export to S3, then a downloadable archive).

Trade-offs and Alternatives

Per-user index vs global index

A global Lucene-style index would let us run ranking algorithms across all users (popularity signals, etc.). But every query would have to filter by user_id, and posting lists for common terms would be billions of entries. Per-user indexes lose cross-user signals but make queries fast and naturally enforce isolation. Gmail picks per-user; web search engines like Google Search go global.

Bigtable vs Spanner for mailboxes

Bigtable: high-throughput, simple key-value. Gmail historically used it. Spanner: a globally distributed SQL database with strongly consistent, multi-row transactions. Modern Gmail uses Spanner because the new threading and labeling features need transactional updates across multiple rows. Trade-off: Spanner is more expensive but simpler for evolving schemas.

Why not store full bodies in the mailbox row?

Large HTML emails can be 1 MB. Embedding them in the row inflates row reads. We store body in a separate column family or in object storage with a pointer; metadata in the main row stays small.

Why greylist, given the user-experience cost?

Before ML classifiers, greylisting blocked huge fractions of spam at near-zero cost. Today ML does most of the work, but greylisting still catches the cheapest blasts. It's a 5-minute first-mail latency in exchange for ~30% spam volume reduction; usually worth it for new senders.

Why dedupe attachments only?

Message bodies are usually unique enough that dedup wouldn't save much. Attachments (PDFs, images) are often identical across many recipients. The cost-benefit of a content-hash lookup makes sense for binary blobs but not for short HTML.

Why per-user storage quotas?

Free-tier storage quotas (15 GB at Google) prevent a single user from costing more than the ad revenue they generate. A small fraction of users would fill petabytes if uncapped. Trade-off: paid tiers are a real revenue driver because many users would rather pay than delete email.

Real-time vs batch index updates

Batch (every 10 minutes) is cheaper and simpler. Real-time (per email) means the index can never be more than a few seconds behind. For a 'search my latest email' UX, real-time wins. The cost: significantly more complex infrastructure (a streaming index update pipeline vs periodic batch jobs).

Real-World Examples

How real systems implement this in production

Gmail

Gmail handles ~1.8B users with mailboxes stored on Spanner (migrated from Bigtable). Search uses per-user Lucene-style indexes. Spam filtering involves heavy ML; Google reports that Gmail blocks more than 99.9% of spam, phishing, and malware before it reaches the inbox. Attachment storage uses Colossus (Google's distributed file system) with content-hash dedup.

Trade-off: Gmail's investment in ML spam filtering created a moat: small competitors can't match Gmail's spam recall because they lack training data. The lesson: in spam, scale begets accuracy begets more scale.

ProtonMail

ProtonMail emphasizes end-to-end encryption: the server stores message bodies only as ciphertext. Full-text search over message content runs client-side, over an index built locally after messages are downloaded and decrypted; the server can search only unencrypted metadata such as sender and subject.

Trade-off: ProtonMail's privacy guarantees mean server-side ML spam filtering is much weaker (can't read the body). They compensate with header-only filtering and IP reputation, trading spam recall for privacy.

Outlook/Office 365

Microsoft runs Exchange Online on a different model: tenant-isolated mailboxes (each company gets its own logical isolation), stored in a custom database (ESE engine evolved from on-premises Exchange). Search uses per-mailbox indexes managed by FAST Search.

Trade-off: Per-tenant isolation simplifies compliance (a US tenant's mail never leaves US data centers) but means cross-tenant features (forwarding, calendar share) need extra plumbing. Microsoft trades some flexibility for enterprise compliance wins.

Yahoo Mail

Yahoo Mail stored mailboxes in MySQL clusters historically, sharded by user. They famously suffered the 2013 breach of 3B accounts. Modern Yahoo Mail uses a custom NoSQL message store with per-user sharding and Lucene-style search.

Trade-off: Yahoo's MySQL legacy made some operations expensive (cross-mailbox search for legal hold required scatter-gather over hundreds of shards). The lesson: pick a storage model that supports your auditing and compliance access patterns from day one.

Quick Interview Phrases

Key terms to use in your answer

MX gateway with SPF/DKIM/DMARC
greylisting and IP reputation
per-user inverted index
attachment deduplication by content hash
sender reputation and IP warming
Bigtable wide-row mailbox

Common Interview Questions

Questions you might be asked about this topic

Sender's MTA looks up MX, connects to mx-gateway:25. Gateway runs IP reputation, SPF check, greylist (defer if first contact). Accepts DATA, runs DKIM signature verify, DMARC alignment. Enqueues raw to Kafka topic ingress.raw partitioned by recipient_user_id. Spam pipeline pulls, runs rules + ML + virus scan; routes to inbox or spam topic. Mailbox Writer dedupes attachments (SHA-256 lookup), persists message in Bigtable row keyed by user#message_id, links thread, emits email.persisted event. Index Pipeline updates per-user inverted index. Total ingestion latency through indexing is ~5 seconds; the message is visible in the user's inbox as soon as the Mailbox Writer persists it, before it becomes searchable.

Interview Tips

How to discuss this topic effectively

  1. Lead by separating the public SMTP ingress from the user-facing read path. The asymmetry is the entire design; saying 'untrusted public protocol meets trusted user reads' wins points immediately.
  2. Mention SPF + DKIM + DMARC by name and explain what each does. Most candidates only know SPF; full coverage signals you've actually run a mail server.
  3. When asked about search, commit to per-user indexes from the start. Designing a global index for personal email is a classic anti-pattern that an experienced interviewer will catch.
  4. Cite the spam ratio (~90% of inbound is spam). It justifies why we filter at the SMTP gateway, not in storage. Google's public claim that Gmail blocks more than 99.9% of spam, phishing, and malware is a useful number to quote alongside it.
  5. Highlight attachment dedup. It's a small mention but signals you think about cost at scale; 1 PB saved is real money.

Common Mistakes

Pitfalls to avoid in interviews

Accepting all SMTP traffic and filtering after storage

If 90% of inbound is spam and you accept everything, you waste 90% of writes, 90% of bandwidth, and 90% of disk. The MX gateway must reject at SMTP time using IP reputation, SPF/DKIM, greylisting, and rate limits. Only ~10% should hit the spam pipeline.

Single global inverted index for search

A global index has billion-entry posting lists for common words; filtering down to one user is wasteful. Use per-user inverted indexes sharded by user_id; a search hits one user's small index and returns in <50 ms.

Storing every attachment copy without dedup

A corporate newsletter with a 10 MB PDF sent to 100K employees stores 1 TB without dedup. Hash attachments (SHA-256), reference-count, and store the blob once. Saves 3-5x on storage.

Treating outbound IP reputation as something you can fix on demand

IP reputation takes weeks to build and weeks to recover after a hit. You must isolate outbound traffic by category (transactional/marketing/user-submitted) on separate IP pools, rate-limit per user, and warm new IPs slowly. There is no 'just rotate the IP' shortcut.

Ignoring threading and treating each email as standalone

Users see threads, not individual messages. Group by Message-ID/In-Reply-To/References headers at write time. Without threading, a 50-reply conversation looks like 50 separate inbox entries and the UX collapses.
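
Write-time threading can be sketched as a lookup from message IDs to thread IDs. The `Threader` class is hypothetical and keeps its map in memory. It handles replies that arrive before their parent by reserving the parent's ID, but it does not merge two threads that were created separately and later turn out to be linked.

```javascript
class Threader {
  constructor() {
    this.threadOf = new Map(); // Message-ID -> thread id
    this.nextThreadId = 1;
  }

  // message: { messageId, inReplyTo?: string, references?: string[] }
  assign({ messageId, inReplyTo = null, references = [] }) {
    // Nearest ancestors first: In-Reply-To, then References from newest to oldest.
    const ancestors = [...references].reverse();
    if (inReplyTo) ancestors.unshift(inReplyTo);

    // A reply that arrived earlier may already have reserved this Message-ID.
    let threadId = this.threadOf.get(messageId);
    for (const id of ancestors) {
      if (threadId !== undefined) break;
      threadId = this.threadOf.get(id);
    }
    if (threadId === undefined) threadId = this.nextThreadId++;

    // Record this message and any ancestors we have not seen yet.
    for (const id of [messageId, ...ancestors]) {
      if (!this.threadOf.has(id)) this.threadOf.set(id, threadId);
    }
    return threadId;
  }
}

const threader = new Threader();
const root = threader.assign({ messageId: "<1@example.com>" });
const reply = threader.assign({
  messageId: "<2@example.com>",
  inReplyTo: "<1@example.com>",
  references: ["<1@example.com>"],
});
console.log(root === reply); // both messages share one thread
```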