System Design Article

Design Pastebin

Difficulty: Easy

Design a service like Pastebin or GitHub Gist where users dump up to 10 MB of text and share a link. The interview twist over a URL shortener: pastes are big, so you store them in object storage (S3) and only keep metadata in your database. This lesson covers the metadata vs blob split, expiration via S3 lifecycle policies, presigned URLs for direct uploads, syntax highlighting strategy, and how to handle the read pattern when most pastes are read once and never again.


Requirements

Pastebin is deceptively simple, which is why it makes a great interview problem: the interviewer is watching to see if you slow down and ask the right scoping questions, or rush into a URL shortener answer that doesn't fit.

Functional Requirements

  1. Create a paste: user submits text (up to 10 MB), gets back a short URL.
  2. View a paste: anyone with the URL can read the content.
  3. Expiration: each paste can have a TTL (10 min, 1 hour, 1 day, 1 week, 1 month, never).
  4. Visibility: public (listed/searchable), unlisted (only by URL), private (auth required).
  5. Syntax highlighting: detect language and render with highlighted code.
  6. Optional features: edit history, paste forks, comments. (Mention them, defer them.)

Non-Functional Requirements

  1. Highly available reads (99.99%); creates can tolerate 99.9%.
  2. Read latency p99 < 200 ms to first byte (we are serving up to 10 MB; the bytes themselves take additional time over the wire).
  3. Write latency p99 < 500 ms (uploads of large pastes are inherently slow).
  4. Scale: 1M new pastes per day, 10M reads per day. Read:write ratio 10:1 (much lower than URL shortener).
  5. Durable: pastes that should live forever must survive any single AZ failure.
  6. Eventual consistency on view counts is fine; the paste content itself must be exactly what was submitted.

Out of Scope (state explicitly)

  • Full-text search across all pastes (Pastebin actually has it; we'll mention how to bolt it on).
  • Real-time collaborative editing (that's Google Docs, a different problem).
  • DRM / paid content / paywalls.

Back-of-the-Envelope Estimation

Traffic

Text
---------- Traffic estimation ----------
Writes per day:        1M
Writes per second:     1M / 86400        ~= 12 writes/sec
Reads per day:         10M
Reads per second:      10M / 86400       ~= 116 reads/sec

Peak (3x):             writes ~ 36/sec, reads ~ 350/sec

This is a tiny-QPS service. The challenge is size, not rate.

Storage

Paste size distribution (real-world from Pastebin telemetry):

  • 50% are < 1 KB (configs, short scripts)
  • 40% are 1 KB to 100 KB (logs, code snippets)
  • 9% are 100 KB to 1 MB (longer logs, dumps)
  • 1% are 1 MB to 10 MB (memory dumps, full SQL dumps)

Average size: ~50 KB.

Text
---------- Storage growth ----------
Daily new content:     1M * 50 KB        = 50 GB/day
Yearly:                50 GB * 365       = 18 TB/year
5 years:               18 TB * 5         = 90 TB

In S3 Standard at $0.023/GB/month:       ~$2,000/month for 90 TB
In S3 Standard-IA at $0.0125/GB/month:   ~$1,125/month for 90 TB
Mostly we move old pastes to IA after 30 days.

Plus metadata (~500 bytes/paste): 5 years * 365 * 1M * 500 bytes = 900 GB. Trivial. Goes in a SQL database.

Bandwidth

Text
---------- Bandwidth ----------
Write bandwidth:    36 * 50 KB           = 1.8 MB/s
Read bandwidth:     350 * 50 KB          = 17.5 MB/s

Big pastes hurt: a single 10 MB paste read = 10 MB egress.
10 MB at 1 Gbps takes ~80 ms just to push bytes.

This is why we use a CDN for reads.
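
The traffic, storage, and bandwidth figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the numbers from the text; the Standard-IA price of $0.0125/GB/month and the decimal unit convention (1 GB = 1,000,000 KB) are assumptions made for the calculation.

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 1_000_000
reads_per_day = 10_000_000
avg_paste_kb = 50          # average paste size from the distribution above
peak_factor = 3

writes_per_sec = writes_per_day / SECONDS_PER_DAY    # about 11.6
reads_per_sec = reads_per_day / SECONDS_PER_DAY      # about 115.7

# Storage growth in decimal units (1 GB = 1,000,000 KB).
daily_gb = writes_per_day * avg_paste_kb / 1_000_000
five_year_tb = daily_gb * 365 * 5 / 1_000

# Monthly storage cost once five years of content has accumulated.
standard_cost = five_year_tb * 1_000 * 0.023
ia_cost = five_year_tb * 1_000 * 0.0125              # assumed Standard-IA price

# Metadata at roughly 500 bytes per paste.
metadata_gb = writes_per_day * 365 * 5 * 500 / 1e9

# Peak read bandwidth, and time to push one 10 MB paste over a 1 Gbps link.
peak_read_mb_s = reads_per_sec * peak_factor * avg_paste_kb / 1_000
transfer_ms = (10 * 8) / 1_000 * 1_000               # 80 Mbit / 1000 Mbit/s, in ms

print(f"{writes_per_sec:.1f} writes/s, {reads_per_sec:.1f} reads/s")
print(f"{daily_gb:.0f} GB/day, {five_year_tb:.2f} TB after 5 years")
print(f"${standard_cost:,.0f}/month Standard, ${ia_cost:,.0f}/month Standard-IA")
print(f"{metadata_gb:.1f} GB metadata, {peak_read_mb_s:.1f} MB/s peak reads")
```

The exact outputs differ slightly from the rounded figures in the estimate (for example, 91.25 TB rather than 90 TB), which is the level of precision an interview estimate needs.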

Read Pattern (the interesting part)

Unlike URL shorteners (where 80% of reads hit 20% of URLs), pastebin reads are flat: most pastes are shared with one or two people who read them once, then never again. The few viral pastes are outliers.

Implication: a hot-key cache buys you almost nothing. The right cache strategy is CDN edge caching with short TTLs (the few hot pastes warm the CDN naturally). No application-side Redis cache needed.

High-Level Design

Text
---------- High-level architecture ----------
             +-------------+
             |   Client    |
             +-------------+
                    |
                    v
         +----------------------+
         |  CDN (CloudFront)    |  <- caches paste content for 5 min
         +----------------------+
                    |
                    v
         +----------------------+
         |    Load Balancer     |
         +----------------------+
              |             |
              v             v
   +-----------------+  +------------------+
   |  API Service    |  |   Read Service   |
   |  (POST /pastes) |  |  (GET /:id)      |
   +-----------------+  +------------------+
              |             |        |
              v             v        v
   +-----------------+  +-----------+ +----------+
   | Snowflake ID    |  | Postgres  | |    S3    |
   | Generator       |  | (metadata)| | (blobs)  |
   +-----------------+  +-----------+ +----------+
                                          |
                                          v
                                +-----------------------+
                                | Lifecycle: expire     |
                                | + tier to IA after 30d|
                                +-----------------------+

API Design

Jsonc
// Create a paste (small, < 1 MB content goes inline)
POST /api/v1/pastes
Content-Type: application/json

{
    "content": "function hello() {\n  console.log('hi');\n}",
    "language": "javascript",        // optional, auto-detected if missing
    "visibility": "unlisted",        // public | unlisted | private
    "expires_in_seconds": 86400,     // 1 day
    "title": "Quick test"            // optional
}

// Response 201 Created
{
    "id": "aZbK9eR",
    "url": "https://pasted.io/aZbK9eR",
    "raw_url": "https://pasted.io/raw/aZbK9eR",
    "created_at": "2026-04-26T10:00:00Z",
    "expires_at": "2026-04-27T10:00:00Z"
}

Jsonc
// Create a LARGE paste (> 1 MB): get a presigned upload URL
POST /api/v1/pastes/upload-url

{
    "content_length": 5242880,       // 5 MB
    "visibility": "unlisted",
    "expires_in_seconds": 0          // never
}

// Response 200 OK
{
    "id": "aZbK9eR",
    "upload_url": "https://s3.amazonaws.com/pastebin/aZbK9eR?X-Amz-Signature=...",
    "upload_expires_in": 900         // 15 min to upload
}

// Then client PUTs directly to S3:
PUT https://s3.amazonaws.com/pastebin/aZbK9eR?X-Amz-Signature=...
Content-Type: text/plain
Body: <5 MB of text>

// After successful upload, client confirms:
POST /api/v1/pastes/aZbK9eR/finalize

Jsonc
// Read a paste
GET /api/v1/pastes/:id

// Response 200 OK
{
    "id": "aZbK9eR",
    "content": "function hello() {...",
    "language": "javascript",
    "created_at": "2026-04-26T10:00:00Z",
    "expires_at": "2026-04-27T10:00:00Z",
    "view_count": 12
}

// Or get raw text only (for curl / wget)
GET /raw/:id
Content-Type: text/plain
Body: <paste content>

Read Path

For most reads (paste < 1 MB, common case):

  1. Client requests GET /api/v1/pastes/aZbK9eR.
  2. CDN edge: cache hit -> return immediately. (Most popular pastes hit here.)
  3. CDN miss -> origin Read Service.
  4. Read Service queries Postgres for metadata (returns 404 if not found, 410 if expired).
  5. Reads paste content from S3 by key.
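  6. Returns JSON to CDN with Cache-Control: public, max-age=300 (5 min). Private pastes must bypass the shared cache (Cache-Control: private), and max-age should never exceed the time left until expires_at, or the CDN keeps serving a paste after it has expired.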

For large pastes (> 1 MB), we redirect to a presigned S3 URL so the client downloads directly from S3, bypassing our service:

Text
---------- Large paste read ----------
1. Client GET /raw/aZbK9eR
2. Read Service checks metadata (Postgres)
3. Returns 302 with Location: <presigned S3 URL valid 1 hour>
4. Client downloads from S3 directly (CDN-fronted bucket)

This is the same pattern as YouTube serving video segments: never proxy big bytes through your application.
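
The two read paths above can be sketched as one decision function that looks only at metadata. This is an illustrative sketch, not the service's actual handler: `PasteMeta`, `plan_read`, and the returned dictionary shape are hypothetical names, and the 1 MB cutoff, 5-minute cache TTL, and 1-hour presign lifetime are taken from the text. It also caps the cache and presign lifetimes at the time the paste has left, so a short-lived paste is not served from cache after it expires.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

INLINE_LIMIT_BYTES = 1_000_000      # larger pastes are redirected to S3
CDN_MAX_AGE_SECONDS = 300           # 5-minute edge TTL from the text
PRESIGN_MAX_SECONDS = 3600          # 1-hour presigned URL from the text


@dataclass
class PasteMeta:
    """Subset of the metadata row the read path needs (hypothetical shape)."""
    s3_key: str
    content_size: int
    visibility: str
    expires_at: Optional[datetime]   # None means the paste never expires


def plan_read(meta: Optional[PasteMeta], now: datetime) -> dict:
    """Decide the HTTP response for a read using metadata only."""
    if meta is None:
        return {"status": 404}
    if meta.expires_at is not None and meta.expires_at <= now:
        return {"status": 410}       # lazy expiration on read

    # Cap cache and presign lifetimes at the time the paste has left.
    max_age, presign = CDN_MAX_AGE_SECONDS, PRESIGN_MAX_SECONDS
    if meta.expires_at is not None:
        remaining = max(1, int((meta.expires_at - now).total_seconds()))
        max_age, presign = min(max_age, remaining), min(presign, remaining)

    if meta.content_size > INLINE_LIMIT_BYTES:
        # The caller signs a GET URL for s3_key and returns it in Location.
        return {"status": 302, "s3_key": meta.s3_key, "presign_seconds": presign}

    scope = "private" if meta.visibility == "private" else "public"
    return {"status": 200, "s3_key": meta.s3_key,
            "cache_control": f"{scope}, max-age={max_age}"}


now = datetime(2026, 4, 26, 10, 0, tzinfo=timezone.utc)
small = PasteMeta("pastes/aZ/aZbK9eR", 2_048, "unlisted", now + timedelta(seconds=60))
print(plan_read(small, now))
```

Keeping the decision free of I/O makes it easy to unit test; the S3 fetch or URL signing happens only after the function has ruled out 404 and 410.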

Detailed Design

The two interesting components are the metadata/blob split and expiration via lifecycle policies.

Metadata vs Blob Split

The core decision: where does the paste content actually live?

Why not store content in Postgres?

  • A 10 MB value lands in TOAST storage, bloats the table, and hurts cache locality for every other row.
  • Backups become huge.
  • Every read replica has to receive those 10 MB values through the replication stream, which is wasteful for content that is rarely read.

Why not store metadata in S3 too?

  • S3 has no query language. Listing 'all unexpired pastes by user X' would require a full scan or a separate index.
  • S3 latency (10-50 ms) is too high to do multiple lookups per request.

The split

| Data | Where | Why |
|---|---|---|
| paste_id, owner_id, created_at, expires_at, language, visibility, view_count, content_size | Postgres (or DynamoDB) | Cheap to query, indexed lookups, joins for user pages |
| Paste content (the actual bytes) | S3 | Cheap per GB, scales independently, native CDN integration, lifecycle policies |

S3 Key Convention

Text
---------- S3 keys ----------
Bucket: pastebin-content-prod
Key:    pastes/<first-2-chars-of-id>/<id>
        e.g., pastes/aZ/aZbK9eR

The 2-char prefix avoids S3's old hot-prefix problem (a single physical partition handling millions of keys with the same prefix). Modern S3 auto-shards prefixes, but the convention is still good hygiene at petabyte scale.
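
The key convention is small enough to show directly. In this sketch the function name `s3_key_for` and the base62-only validation are assumptions; rejecting anything outside `[0-9A-Za-z]` keeps a user-supplied ID from injecting path separators into the key.

```python
import re

_ID_PATTERN = re.compile(r"[0-9A-Za-z]{2,}")


def s3_key_for(paste_id: str) -> str:
    """Build the object key pastes/<first-2-chars-of-id>/<id>."""
    if not _ID_PATTERN.fullmatch(paste_id):
        raise ValueError(f"invalid paste id: {paste_id!r}")
    return f"pastes/{paste_id[:2]}/{paste_id}"


print(s3_key_for("aZbK9eR"))
```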

Expiration via S3 Lifecycle Policies

Most designs propose a 'sweeper job' that scans the database for expired rows. Don't. S3 has built-in lifecycle policies that run for free.

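Approach 1: Tag the object on PUT and expire it by lifecycle rule

When uploading, tag the object with its TTL bucket (for example, through the x-amz-tagging request header) and let a tag-filtered lifecycle rule expire it. The x-amz-expiration header is something S3 returns in responses to report the scheduled expiry; the client cannot set it. S3 deletes the object on the schedule, so the application doesn't run any cron.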

Text
---------- Lifecycle rule example ----------
Rule: "Delete expired pastes"
  Filter: tag PasteExpiry=24h
  Action: Expiration after 1 day

Rule: "Tier old pastes to IA"
  Filter: prefix pastes/
  Action: Transition to STANDARD_IA after 30 days
  Action: Transition to GLACIER_IR after 365 days

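For variable TTLs (10 min, 1 hour, etc.), tag each object with its TTL bucket and have one lifecycle rule per TTL bucket. Lifecycle expiration works in whole days, so sub-day TTLs are physically deleted roughly a day later; the read-path check in Approach 2 hides them in the meantime. Or: store all pastes in 'permanent' S3 and run a tiny cleanup Lambda that deletes the object when a paste's expires_at is reached, triggered by a delayed message or a one-time schedule (SQS delivery delay alone is capped at 15 minutes, so it only covers the shortest TTLs).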
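
The tag-per-TTL-bucket idea can be sketched as a mapping from the requested TTL to a tag value, plus the matching rule set. The tag values (`1d`, `7d`, `30d`), rule IDs, and function names here are hypothetical; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`, and `GLACIER_IR` is used because the article refers to Glacier Instant Retrieval later on.

```python
import math
from typing import Optional

TTL_TAG_KEY = "PasteExpiry"
SECONDS_PER_DAY = 86_400


def expiry_tag(ttl_seconds: int) -> Optional[str]:
    """Map a TTL to a day-granularity tag value; 0 or less means never expire."""
    if ttl_seconds <= 0:
        return None
    # Lifecycle rules work in whole days, so sub-day TTLs round up to one day.
    days = max(1, math.ceil(ttl_seconds / SECONDS_PER_DAY))
    return f"{days}d"


def lifecycle_rules(ttl_days=(1, 7, 30)) -> dict:
    """Build one expiration rule per TTL bucket plus one tiering rule."""
    rules = [
        {
            "ID": f"expire-{days}d",
            "Filter": {"Tag": {"Key": TTL_TAG_KEY, "Value": f"{days}d"}},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }
        for days in ttl_days
    ]
    rules.append({
        "ID": "tier-old-pastes",
        "Filter": {"Prefix": "pastes/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 365, "StorageClass": "GLACIER_IR"},
        ],
    })
    return {"Rules": rules}


print(expiry_tag(600), expiry_tag(604_800), expiry_tag(0))
```

A 10-minute paste therefore gets the one-day tag: users stop seeing it after 10 minutes because of the read-path check, and S3 removes the bytes later.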

Approach 2: Lazy expiration on read

Even without lifecycle rules, the read path checks expires_at and returns 410 Gone if past. This catches everything immediately for the user, while the lifecycle rule handles physical deletion eventually.

Always do both
  • Lazy expiration: instant correctness for users.
  • Lifecycle rule: eventual physical deletion (saves storage cost).
  • Postgres metadata: keep an index on expires_at for the rare 'list my expired pastes' query.

Syntax Highlighting

Do NOT highlight on the server. Two reasons:

  1. Highlighter libraries (like Prism.js, highlight.js) are designed for the browser; they handle 200+ languages.
  2. Server-side rendering forces you to ship HTML, which is bigger than the raw text and breaks Content-Type: text/plain consumers like curl.

Server returns raw content + a detected language hint. Browser highlights with a JS library:

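JavaScript
import Prism from 'prismjs';

// Highlight in the browser; fall back to plain text for unknown languages.
function renderPaste(content, language) {
    const target = document.getElementById('content');
    const grammar = Prism.languages[language];
    if (grammar) {
        target.innerHTML = Prism.highlight(content, grammar, language);
    } else {
        target.textContent = content;  // no grammar loaded: show raw text
    }
}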

Language Detection

Use a small library like Linguist or a heuristic on shebang / file-extension hints in the title. If we cannot decide, default to 'plain text'. This is a non-critical feature; do not block paste creation on it.
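
A heuristic of this kind might look like the sketch below. The hint tables are deliberately tiny and illustrative, and `detect_language` is a hypothetical helper rather than the API of any particular library; a real deployment would likely use a fuller classifier.

```python
import posixpath
from typing import Optional

# Interpreter name in a shebang line -> language hint (illustrative subset).
SHEBANG_HINTS = {"python": "python", "bash": "bash", "sh": "bash",
                 "node": "javascript", "ruby": "ruby", "perl": "perl"}
# File extension found in the title -> language hint (illustrative subset).
EXTENSION_HINTS = {".py": "python", ".js": "javascript", ".sh": "bash",
                   ".rb": "ruby", ".sql": "sql", ".json": "json"}


def detect_language(content: str, title: Optional[str] = None) -> str:
    """Guess a language from the shebang line, then from the title's extension."""
    first_line = content.split("\n", 1)[0].strip()
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            name = posixpath.basename(parts[0])
            # "#!/usr/bin/env python3" names the interpreter in the next token.
            if name == "env" and len(parts) > 1:
                name = parts[1]
            name = name.rstrip("0123456789.")   # python3.11 -> python
            if name in SHEBANG_HINTS:
                return SHEBANG_HINTS[name]
    if title:
        extension = posixpath.splitext(title)[1].lower()
        if extension in EXTENSION_HINTS:
            return EXTENSION_HINTS[extension]
    return "plain text"   # undecided: never block paste creation on this


print(detect_language("#!/usr/bin/env python3\nprint('hi')"))
```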

Data Model

Postgres: pastes table

SQL
CREATE TABLE pastes (
    id              VARCHAR(8) PRIMARY KEY,
    owner_id        UUID,                    -- nullable for anonymous
    title           VARCHAR(255),
    language        VARCHAR(32),
    visibility      VARCHAR(16) NOT NULL,    -- public | unlisted | private
    content_size    INTEGER NOT NULL,        -- bytes; for billing/limits
    s3_key          VARCHAR(255) NOT NULL,
    view_count      BIGINT DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL,
    expires_at      TIMESTAMPTZ              -- nullable = never
);

CREATE INDEX idx_pastes_owner ON pastes (owner_id, created_at DESC);
CREATE INDEX idx_pastes_expires ON pastes (expires_at) WHERE expires_at IS NOT NULL;
CREATE INDEX idx_pastes_public_recent ON pastes (created_at DESC) WHERE visibility = 'public';

Why Postgres over Cassandra/DynamoDB?

  • The metadata is small (~1 TB after five years of growth) and still fits one Postgres instance.
  • We need range queries ("my pastes", "recent public pastes") and rich indexing.
  • We rarely need horizontal scaling on metadata; the bytes scale separately in S3.

Partitioning (when we need it): shard by id after a few hundred GB. For the first few years a single primary + read replicas is plenty.

S3 layout

Text
---------- S3 layout ----------
Bucket: pastebin-content-prod (versioning OFF, encryption AES256)
  pastes/aZ/aZbK9eR        <- raw paste content (text/plain)
  pastes/aZ/aZbK9eR.meta   <- optional: highlighted HTML cache

View Counter

Most designs put an UPDATE pastes SET view_count = view_count + 1 on the read path. Don't. That's a write per read; it ruins your read scalability and creates row-lock contention in Postgres on hot pastes.

Instead:

  1. Read path publishes a fire-and-forget event to Kafka (or Redis Stream).
  2. A consumer batches every 10 seconds and runs UPDATE pastes SET view_count = view_count + N WHERE id = ? for each paste with N views in the batch.
  3. This collapses 10K writes/sec on a viral paste into 1 write per 10 seconds.
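
The batching step in the consumer can be sketched as follows. The function name `collapse_views` is hypothetical, and the `%s` placeholder style assumes a DB-API driver such as psycopg; the stream consumer and the 10-second timer are left out so the example stays self-contained.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Statement the consumer would run once per paste per batch window.
UPDATE_SQL = "UPDATE pastes SET view_count = view_count + %s WHERE id = %s"


def collapse_views(events: Iterable[str]) -> List[Tuple[int, str]]:
    """Turn a window of paste-id view events into (increment, paste_id) rows."""
    counts = Counter(events)
    # Sorted by id so concurrent consumers take row locks in a stable order.
    return [(n, paste_id) for paste_id, n in sorted(counts.items())]


window = ["aZbK9eR", "Qx81LmP", "aZbK9eR", "aZbK9eR"]
params = collapse_views(window)
print(params)
# A consumer would then call something like cursor.executemany(UPDATE_SQL, params).
```

Four view events become two UPDATE statements here; for a viral paste the ratio is thousands to one.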

Scaling and Bottlenecks

This is a tiny-QPS service so scaling is mostly about cost and storage growth, not throughput.

When the metadata DB stops fitting

At ~5 years (1.8 billion paste rows, ~1 TB of metadata), Postgres still works on a beefy instance, but query latency creeps up. Two paths:

  1. Vertical scale: bigger instance, faster disk, more RAM. Easy, expensive, has a ceiling.
  2. Sharding by paste_id: route paste 'aZbK9eR' to shard hash(id) % N. The application picks the shard on every read. No cross-shard queries needed for the hot path.
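
Shard routing by ID might be sketched like this. The choice of SHA-256 is an assumption: any stable hash works, but Python's built-in `hash()` is salted per process for strings, so it would route the same ID to different shards on different servers.

```python
import hashlib


def shard_for(paste_id: str, num_shards: int) -> int:
    """Map a paste id to a shard index in [0, num_shards) with a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("aZbK9eR", 8))
```

Plain modulo routing moves most keys when N changes, so a design that expects to add shards over time may prefer consistent hashing or a fixed number of logical shards mapped onto physical hosts.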

When S3 cost becomes painful

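  • Move pastes older than 30 days to S3 Standard-IA: roughly 45% cheaper per GB, and retrieval latency is unchanged (still on the order of 10 ms). Standard-IA bills a 128 KB minimum per object, so the savings come mainly from the larger pastes.
  • Pastes older than a year: Glacier Instant Retrieval. ~80% savings, and retrieval is still in milliseconds.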
  • Hard-delete expired pastes (lifecycle rule).

Hot Paste Goes Viral

A paste linked from Hacker News can do 10K reads/sec for an hour. Mitigations:

  • CDN absorbs ~99% with Cache-Control: max-age=300.
  • The remaining 1% (cache misses, fresh edge nodes) hits Postgres + S3, both of which can handle hundreds of QPS for a single key trivially.

No Redis cache needed because:

  • Postgres in-memory page cache holds the metadata row.
  • S3 GET is fast for objects we're requesting frequently.
  • CDN is the actual cache.

Search

Full-text search across pastes is its own beast (think Elasticsearch). The pragmatic add-on:

  • A Kafka consumer indexes new public pastes into Elasticsearch.
  • Search hits Elasticsearch, returns paste IDs, then fetches metadata from Postgres for display.
  • Index lag of a few minutes is fine for Pastebin search.

Avoid Postgres full-text search at this scale: maintaining the text index and running search queries compete with the OLTP workload for the same CPU and I/O.

Trade-offs and Alternatives

Why not store everything in DynamoDB?

DynamoDB has a 400 KB item limit. Pastes can be 10 MB. You'd hit the limit and end up storing pointers to S3 anyway, which is the design we chose. DynamoDB is fine for the metadata if you need DynamoDB; Postgres is simpler for this size.

Why presigned URLs for large uploads?

If 10 MB pastes flowed through your application servers, you'd need:

  • Inbound bandwidth for every concurrent upload (a 10 MB body is 80 Mbit on the wire).
  • Memory to buffer the body.
  • Risk of timeouts on slow client connections.

With presigned URLs, the client uploads directly to S3. Your server returns immediately. S3 absorbs the bandwidth and bills you the standard storage rate. Always use presigned URLs for files > 1 MB.
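
The server side of the upload-url endpoint could look like the sketch below. It assumes a boto3 S3 client is passed in (`generate_presigned_url` is the boto3 client method); the function name `create_upload_url` and the byte limit constant are hypothetical, and the response fields mirror the API example earlier in the article.

```python
from typing import Any, Dict

UPLOAD_URL_TTL_SECONDS = 900            # 15 minutes to upload, as in the API example
MAX_PASTE_BYTES = 10 * 1024 * 1024      # 10 MB limit from the requirements


def create_upload_url(s3_client: Any, bucket: str, paste_id: str,
                      content_length: int) -> Dict[str, Any]:
    """Return the upload-url response body for a large paste."""
    if not 0 < content_length <= MAX_PASTE_BYTES:
        raise ValueError("content_length must be between 1 byte and 10 MB")
    key = f"pastes/{paste_id[:2]}/{paste_id}"
    # s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": "text/plain"},
        ExpiresIn=UPLOAD_URL_TTL_SECONDS,
    )
    return {"id": paste_id, "upload_url": url,
            "upload_expires_in": UPLOAD_URL_TTL_SECONDS}
```

The declared content_length is only checked up front in this sketch; the finalize call is a natural place to confirm the size of the object that actually landed in S3 before marking the paste readable.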

Why not run a daily expiration cron?

  • Crons miss while they sleep. A paste with expires_at between cron runs is served stale until the next run.
  • Lazy expiration on read catches everything instantly.
  • S3 lifecycle policies are managed by AWS; no operational burden.

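A cron only adds value if you must purge metadata rows from Postgres on a schedule. Even then, prefer a delayed-message approach (a queue with delivery delay, or a scheduler that fires at expires_at) so each paste is purged when it expires rather than at the next sweep.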

Why not use the same design as URL Shortener?

Key differences from URL shortener:

  • Pastebin reads return up to 10 MB of bytes, not a 32-byte HTTP redirect. Bandwidth dominates.
  • Read pattern is flat (no big hot keys), so the cache strategy is CDN-only, not Redis.
  • Storage growth is gigabytes per day, not megabytes; lifecycle policies and tiered storage matter.
  • No 100:1 read:write skew; closer to 10:1.

Misidentifying these differences and copying the URL shortener design is the most common mistake. Slow down. Re-derive the access pattern.

Public visibility and abuse

Public pastes can be used for malware, dumps of stolen credentials, and spam. Public paste sites are scraped for leaked credentials (Have I Been Pwned indexes public pastes, for example), and operators run automated scanning of their own. For the interview, mention:

  • Rate limiting per IP/account.
  • Asynchronous content scan (regex for known credential formats; ML for malware).
  • A take_down flag on the metadata that the read path checks.
  • Reporting endpoint for users to flag pastes.
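
The asynchronous regex scan might start from something like the sketch below. The two patterns (AWS access key IDs beginning with AKIA, and PEM private key headers) are common examples chosen for illustration; the names `PATTERNS` and `scan_paste` are hypothetical, and a production scanner would carry many more rules.

```python
import re
from typing import List

# Illustrative patterns for well-known credential formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_paste(content: str) -> List[str]:
    """Return the names of every pattern that matches the paste content."""
    return sorted(name for name, pattern in PATTERNS.items()
                  if pattern.search(content))


print(scan_paste("aws_key=AKIAIOSFODNN7EXAMPLE"))
```

A non-empty result would feed the take_down flag or a moderation queue; because the scan runs off the write path, it never slows paste creation.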

Real-World Examples

How real systems implement this in production

Pastebin.com

The original since 2002. Stores ~250M pastes, supports 100+ syntax languages, offers paid 'Pro' tier with longer TTLs and private pastes. Uses MySQL for metadata and a custom blob store for content. Heavy use of Cloudflare for edge caching.

Trade-off: Pastebin's challenge is abuse (credential dumps, malware): they invest heavily in scanning rather than expanding features. Open paste services attract spam; design moderation in from day one.

GitHub Gist

Pastebin tied to a GitHub account. Stores gists as actual git repositories under the hood (each gist is a full repo with edit history). Supports comments, forks, embedding. Auth gates private gists.

Trade-off: Using git as the storage engine gives you free history, diffing, and cloning, but git operations are slower than a flat blob read. GitHub trades read latency for collaboration features.

hastebin / dpaste

Minimalist open-source pastebins (Node.js + Redis, often). No accounts, no expiration UI, no syntax highlighting beyond client-side libraries. Designed to be deployable in 5 minutes.

Trade-off: Skipping object storage and using Redis for content limits paste size to a few MB and limits durability to whatever Redis persistence gives you. Fine for internal tools, not for public services with billions of pastes.

JSFiddle / CodePen

Pastebins specialized for runnable web code (HTML/CSS/JS). Add execution sandboxes, npm package resolution, and live preview. Storage is similar (small inline metadata + larger content blob), but the runtime is the differentiator.

Trade-off: Specializing for one content type (web code) lets you add huge value (live preview, package management) at the cost of being useless for general text. Choose a niche or stay general; it's hard to be both.

Quick Interview Phrases

Key terms to use in your answer

metadata vs blob split
presigned upload URL
S3 lifecycle policy
lazy expiration on read
fire-and-forget view counter
CDN edge caching

Common Interview Questions

Questions you might be asked about this topic

First, the CDN absorbs most reads with a 5-minute TTL on the Cache-Control header. Misses go to the Read Service, which serves the metadata from Postgres and either returns the content inline (small) or returns a 302 to a presigned S3 URL (large). At 10 MB per read, even 100 concurrent reads reaching your origin means 1 GB of data in flight, so the CDN is doing the real work. Add tiered CDN caching (edge -> regional shield -> S3) to reduce S3 GET cost.

Interview Tips

How to discuss this topic effectively

  1. First sentence in the interview: 'Pastes can be up to 10 MB, so the access pattern is bandwidth-bound, not QPS-bound.' This shifts the discussion to S3, lifecycle policies, and CDNs immediately.
  2. When the interviewer asks about caching, say 'CDN at the edge, no application Redis. Reads are flat; there are no hot keys to memoize.' This shows you reasoned from the access pattern.
  3. Bring up presigned URLs proactively for any upload > 1 MB. It's the single most important pattern for systems that handle user files.
  4. Mention S3 lifecycle policies for expiration. Most candidates default to a cron job; lifecycle policies are simpler, cheaper, and more reliable.
  5. Avoid copying the URL shortener answer wholesale. Pastebin's read pattern is fundamentally different and your design should reflect that.

Common Mistakes

Pitfalls to avoid in interviews

Storing paste content in the database alongside metadata

Multi-megabyte rows in Postgres ruin replication, backups, and cache locality. Always split: metadata in the database, content in object storage with the database holding the S3 key as a pointer.

Proxying large uploads through application servers

A 10 MB paste through your servers means 80 Mbit of inbound transfer per upload (about 80 ms of a fully saturated 1 Gbps link, and far longer on a slow client connection), plus server memory for buffering and timeout risk. Use presigned S3 URLs so the client uploads directly to S3 and your server returns immediately with the metadata record.

Running a cron job to delete expired pastes

Crons run on intervals; pastes can be served past their expires_at between runs. Use lazy expiration on read (check expires_at, return 410 Gone) plus S3 lifecycle policies for physical deletion. No cron needed.

Adding a Redis cache for paste content

Pastebin reads are flat: most pastes are read once. A cache helps only when there's a hot key. The CDN already handles the few viral pastes. Skipping Redis here is a sign you understood the access pattern.

Updating view_count on every read with a synchronous SQL UPDATE

Hot pastes generate thousands of writes per second to a single row, killing throughput and creating lock contention. Publish a fire-and-forget event, batch in a consumer, and apply view increments every 10 seconds. The hot read path stays read-only.