System Design Article
Design Pastebin
Difficulty: Easy
Design a service like Pastebin or GitHub Gist where users dump up to 10 MB of text and share a link. The interview twist over a URL shortener: pastes are big, so you store them in object storage (S3) and only keep metadata in your database. This lesson covers the metadata vs blob split, expiration via S3 lifecycle policies, presigned URLs for direct uploads, syntax highlighting strategy, and how to handle the read pattern when most pastes are read once and never again.
Requirements
Pastebin is deceptively simple, which is why it makes a great interview problem: the interviewer is watching to see if you slow down and ask the right scoping questions, or rush into a URL shortener answer that doesn't fit.
Functional Requirements
- Create a paste: user submits text (up to 10 MB), gets back a short URL.
- View a paste: anyone with the URL can read the content.
- Expiration: pastes can have TTL (10 min, 1 hour, 1 day, 1 week, 1 month, never).
- Visibility: public (listed/searchable), unlisted (only by URL), private (auth required).
- Syntax highlighting: detect language and render with highlighted code.
- Optional features: edit history, paste forks, comments. (Mention them, defer them.)
Non-Functional Requirements
- Highly available reads (99.99%); creates can tolerate 99.9%.
- Read latency p99 < 200 ms (we are serving up to 10 MB; the bytes themselves take time over the wire).
- Write latency p99 < 500 ms (uploads of large pastes are inherently slow).
- Scale: 1M new pastes per day, 10M reads per day. Read:write ratio 10:1 (much lower than URL shortener).
- Durable: pastes that should live forever must survive any single AZ failure.
- Eventual consistency on view counts is fine; the paste content itself must be exactly what was submitted.
Out of Scope (state explicitly)
- Full-text search across all pastes (Pastebin actually has it; we'll mention how to bolt it on).
- Real-time collaborative editing (that's Google Docs, a different problem).
- DRM / paid content / paywalls.
Back-of-the-Envelope Estimation
Traffic
---------- Traffic estimation ----------
Writes per day: 1M
Writes per second: 1M / 86400 ~= 12 writes/sec
Reads per day: 10M
Reads per second: 10M / 86400 ~= 116 reads/sec
Peak (3x): writes ~ 36/sec, reads ~ 350/sec
This is a tiny-QPS service. The challenge is size, not rate.
Storage
Paste size distribution (real-world from Pastebin telemetry):
- 50% are < 1 KB (configs, short scripts)
- 40% are 1 KB to 100 KB (logs, code snippets)
- 9% are 100 KB to 1 MB (longer logs, dumps)
- 1% are 1 MB to 10 MB (memory dumps, full SQL dumps)
Average size: ~50 KB.
---------- Storage growth ----------
Daily new content: 1M * 50 KB = 50 GB/day
Yearly: 50 GB * 365 = 18 TB/year
5 years: 18 TB * 5 = 90 TB
In S3 standard at $0.023/GB/month: ~$2,000/month for 90 TB
In S3 IA (infrequent access) at $0.0125/GB/month: ~$1,125/month
Mostly we move old pastes to IA after 30 days.
Plus metadata (~500 bytes/paste): 5 years * 365 * 1M * 500 bytes = ~900 GB. Trivial. Goes in a SQL database.
Bandwidth
---------- Bandwidth ----------
Write bandwidth: 36 * 50 KB = 1.8 MB/s
Read bandwidth: 350 * 50 KB = 17.5 MB/s
Big pastes hurt: a single 10 MB paste read = 10 MB egress.
10 MB at 1 Gbps takes ~80 ms just to push bytes.
This is why we use a CDN for reads.
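The traffic, storage, and bandwidth figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the inputs quoted in the text (1M writes/day, 10M reads/day, 50 KB average paste, $0.023/GB/month); the function and constant names are illustrative.

```python
# Back-of-the-envelope inputs taken from the estimates above.
SECONDS_PER_DAY = 86_400
WRITES_PER_DAY = 1_000_000
READS_PER_DAY = 10_000_000
AVG_PASTE_KB = 50
S3_STANDARD_USD_PER_GB_MONTH = 0.023


def per_second(per_day: int) -> float:
    """Average requests per second for a daily total."""
    return per_day / SECONDS_PER_DAY


def storage_tb(years: int) -> float:
    """Content stored after `years`, in decimal TB (1 TB = 1e9 KB)."""
    return WRITES_PER_DAY * AVG_PASTE_KB * 365 * years / 1e9


def transfer_ms(size_mb: float, link_gbps: float) -> float:
    """Milliseconds needed to push `size_mb` megabytes over a `link_gbps` link."""
    megabits = size_mb * 8
    return megabits / (link_gbps * 1000) * 1000


write_qps = per_second(WRITES_PER_DAY)
read_qps = per_second(READS_PER_DAY)
five_year_tb = storage_tb(5)
monthly_cost_usd = five_year_tb * 1000 * S3_STANDARD_USD_PER_GB_MONTH

print(f"writes/sec ~ {write_qps:.1f}, reads/sec ~ {read_qps:.1f}")
print(f"5-year content ~ {five_year_tb:.1f} TB, ~ ${monthly_cost_usd:,.0f}/month in S3 Standard")
print(f"10 MB over 1 Gbps ~ {transfer_ms(10, 1):.0f} ms")
```

Keeping the inputs as named constants makes it easy to rerun the estimate when the interviewer changes an assumption, for example a 200 KB average paste.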
Read Pattern (the interesting part)
Unlike URL shorteners (where 80% of reads hit 20% of URLs), pastebin reads are flat: most pastes are shared with one or two people who read them once, then never again. The few viral pastes are outliers.
Implication: a hot-key cache buys you almost nothing. The right cache strategy is CDN edge caching with short TTLs (the few hot pastes warm the CDN naturally). No application-side Redis cache needed.
High-Level Design
---------- High-level architecture ----------
                 +-------------+
                 |   Client    |
                 +-------------+
                        |
                        v
             +----------------------+
             |   CDN (CloudFront)   |  <- caches paste content for 5 min
             +----------------------+
                        |
                        v
             +----------------------+
             |    Load Balancer     |
             +----------------------+
                |                |
                v                v
+-----------------+       +------------------+
|   API Service   |       |   Read Service   |
| (POST /pastes)  |       |    (GET /:id)    |
+-----------------+       +------------------+
         |                  |              |
         v                  v              v
+-----------------+   +-----------+   +----------+
|  Snowflake ID   |   | Postgres  |   |    S3    |
|   Generator     |   | (metadata)|   | (blobs)  |
+-----------------+   +-----------+   +----------+
                                           |
                                           v
                               +-----------------------+
                               | Lifecycle: expire     |
                               | + tier to IA after 30d|
                               +-----------------------+
API Design
// Create a paste (small, < 1 MB content goes inline)
POST /api/v1/pastes
Content-Type: application/json
{
"content": "function hello() {\n console.log('hi');\n}",
"language": "javascript", // optional, auto-detected if missing
"visibility": "unlisted", // public | unlisted | private
"expires_in_seconds": 86400, // 1 day
"title": "Quick test" // optional
}
// Response 201 Created
{
"id": "aZbK9eR",
"url": "https://pasted.io/aZbK9eR",
"raw_url": "https://pasted.io/raw/aZbK9eR",
"created_at": "2026-04-26T10:00:00Z",
"expires_at": "2026-04-27T10:00:00Z"
}
// Create a LARGE paste (> 1 MB): get a presigned upload URL
POST /api/v1/pastes/upload-url
{
"content_length": 5242880, // 5 MB
"visibility": "unlisted",
"expires_in_seconds": 0 // never
}
// Response 200 OK
{
"id": "aZbK9eR",
"upload_url": "https://s3.amazonaws.com/pastebin/aZbK9eR?X-Amz-Signature=...",
"upload_expires_in": 900 // 15 min to upload
}
// Then client PUTs directly to S3:
PUT https://s3.amazonaws.com/pastebin/aZbK9eR?X-Amz-Signature=...
Content-Type: text/plain
Body: <5 MB of text>
// After successful upload, client confirms:
POST /api/v1/pastes/aZbK9eR/finalize
// Read a paste
GET /api/v1/pastes/:id
// Response 200 OK
{
"id": "aZbK9eR",
"content": "function hello() {...",
"language": "javascript",
"created_at": "2026-04-26T10:00:00Z",
"expires_at": "2026-04-27T10:00:00Z",
"view_count": 12
}
// Or get raw text only (for curl / wget)
GET /raw/:id
Content-Type: text/plain
Body: <paste content>
Read Path
For most reads (paste < 1 MB, common case):
- Client requests GET /api/v1/pastes/aZbK9eR.
- CDN edge: cache hit -> return immediately. (Most popular pastes hit here.)
- CDN miss -> origin Read Service.
- Read Service queries Postgres for metadata (returns 404 if not found, 410 if expired).
- Reads paste content from S3 by key.
- Returns JSON to CDN with Cache-Control: public, max-age=300 (5 min).
For large pastes (> 1 MB), we redirect to a presigned S3 URL so the client downloads directly from S3, bypassing our service:
---------- Large paste read ----------
1. Client GET /raw/aZbK9eR
2. Read Service checks metadata (Postgres)
3. Returns 302 with Location: <presigned S3 URL valid 1 hour>
4. Client downloads from S3 directly (CDN-fronted bucket)
This is the same pattern as YouTube serving video segments: never proxy big bytes through your application.
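The small-versus-large read dispatch can be sketched as a single function. This is an illustration under stated assumptions, not the service's actual handler: `presign_get` is a hypothetical stand-in for a real S3 presigner, `load_content` is whatever callable fetches bytes from object storage, and the 1 MB threshold is taken as 1,048,576 bytes.

```python
from dataclasses import dataclass, field

INLINE_LIMIT_BYTES = 1024 * 1024   # "large" threshold from the text (1 MB)
CDN_TTL_SECONDS = 300              # 5-minute edge cache
PRESIGN_TTL_SECONDS = 3600         # presigned URL valid for 1 hour


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def presign_get(s3_key: str, ttl_seconds: int) -> str:
    """Hypothetical stand-in for an S3 presigner; returns a placeholder URL."""
    return f"https://example-bucket.invalid/{s3_key}?expires_in={ttl_seconds}"


def serve_paste(content_size: int, s3_key: str, load_content) -> Response:
    """Serve small pastes inline; redirect large ones to object storage."""
    if content_size > INLINE_LIMIT_BYTES:
        # Large paste: never proxy the bytes, hand the client a direct URL.
        location = presign_get(s3_key, PRESIGN_TTL_SECONDS)
        return Response(302, {"Location": location})
    # Small paste: fetch the content and let the CDN cache it briefly.
    headers = {"Cache-Control": f"public, max-age={CDN_TTL_SECONDS}"}
    return Response(200, headers, load_content(s3_key))
```

The metadata check (404/410) is assumed to have happened before this function is called.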
Detailed Design
The two interesting components are the metadata/blob split and expiration via lifecycle policies.
Metadata vs Blob Split
The core decision: where does the paste content actually live?
Why not store content in Postgres?
- A 10 MB row blows up TOAST tables in Postgres, kills replication throughput, and hurts cache locality on every other row.
- Backups become huge.
- Every read replica has to receive each 10 MB row through the replication stream, which wastes bandwidth.
Why not store metadata in S3 too?
- S3 has no query language. Listing 'all unexpired pastes by user X' would require a full scan or a separate index.
- S3 latency (10-50 ms) is too high to do multiple lookups per request.
The split
| Data | Where | Why |
|---|---|---|
| paste_id, owner_id, created_at, expires_at, language, visibility, view_count, content_size | Postgres (or DynamoDB) | Cheap to query, indexed lookups, joins for user pages |
| Paste content (the actual bytes) | S3 | Cheap per GB, scales independently, native CDN integration, lifecycle policies |
S3 Key Convention
---------- S3 keys ----------
Bucket: pastebin-content-prod
Key: pastes/<first-2-chars-of-id>/<id>
e.g., pastes/aZ/aZbK9eR
The 2-char prefix avoids S3's old hot-prefix problem (a single physical partition handling millions of keys with the same prefix). Modern S3 auto-shards prefixes, but the convention is still good hygiene at petabyte scale.
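A minimal sketch of the key convention, assuming paste IDs are at least two characters long; the function name is illustrative.

```python
def s3_key_for(paste_id: str) -> str:
    """Build the object key pastes/<first-2-chars-of-id>/<id>."""
    if len(paste_id) < 2:
        raise ValueError("paste_id must have at least 2 characters")
    return f"pastes/{paste_id[:2]}/{paste_id}"


print(s3_key_for("aZbK9eR"))
```

Deriving the key from the ID alone means the read path could rebuild it without a lookup, although the schema in this lesson also stores it in the s3_key column.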
Expiration via S3 Lifecycle Policies
Most designs propose a 'sweeper job' that scans the database for expired rows. Don't. S3 has built-in lifecycle policies that run for free.
Approach 1: Set object expiration on PUT
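When uploading, tag the object with its TTL bucket (for example PasteExpiry=24h); the matching lifecycle rule then deletes it on schedule. S3 has no per-object TTL parameter on PUT; x-amz-expiration is a response header that reports the expiry a lifecycle rule has assigned. The application doesn't run any cron.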
---------- Lifecycle rule example ----------
Rule: "Delete expired pastes"
Filter: tag PasteExpiry=24h
Action: Expiration after 1 day
Rule: "Tier old pastes to IA"
Filter: prefix pastes/
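Action: Transition to STANDARD_IA after 30 days
Action: Transition to GLACIER_IR after 365 days
For variable TTLs, tag each object with its TTL bucket and keep one lifecycle rule per bucket. Lifecycle expiration is specified in whole days and applied asynchronously, so sub-day TTLs (10 min, 1 hour) share the 1-day rule and rely on lazy expiration for correctness. Alternatively, keep all pastes in 'permanent' S3 and run a small cleanup Lambda that deletes the object once expires_at has passed, driven by a delayed message. SQS delays cap at 15 minutes, so longer TTLs need a scheduler or re-enqueueing.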
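Mapping a requested TTL to a lifecycle tag can be sketched as below. It assumes S3 lifecycle expiration is configured in whole days, so every TTL is rounded up to the nearest available rule and physical deletion never happens before expires_at. The tag values other than 24h, and the choice of 30 days for "1 month", are assumptions for illustration.

```python
import math
from typing import Optional

SECONDS_PER_DAY = 86_400
# Lifecycle rules assumed to exist: expiration in days -> PasteExpiry tag value.
RULE_DAYS_TO_TAG = {1: "24h", 7: "7d", 30: "30d"}


def lifecycle_tag(expires_in_seconds: int) -> Optional[str]:
    """Return the PasteExpiry tag value, or None for pastes that never expire."""
    if expires_in_seconds < 0:
        raise ValueError("TTL must be non-negative")
    if expires_in_seconds == 0:
        # 0 means "never" in the API: no tag, so no lifecycle rule matches.
        return None
    days_needed = math.ceil(expires_in_seconds / SECONDS_PER_DAY)
    # Pick the shortest rule that does not delete the object early.
    for days in sorted(RULE_DAYS_TO_TAG):
        if days >= days_needed:
            return RULE_DAYS_TO_TAG[days]
    raise ValueError("TTL exceeds the longest lifecycle rule")
```

A 10-minute paste therefore gets the 24h tag: the object lingers for up to a day, while the read path is responsible for refusing to serve it after expires_at.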
Approach 2: Lazy expiration on read
Even without lifecycle rules, the read path checks expires_at and returns 410 Gone if past. This catches everything immediately for the user, while the lifecycle rule handles physical deletion eventually.
Always do both
- Lazy expiration: instant correctness for users.
- Lifecycle rule: eventual physical deletion (saves storage cost).
- Postgres metadata: keep an index on expires_at for the rare 'list my expired pastes' query.
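The lazy-expiration check on the read path comes down to a comparison against expires_at. A minimal sketch, assuming timezone-aware timestamps and that a missing expires_at means the paste never expires (as in the data model); the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def read_status(found: bool, expires_at: Optional[datetime], now: datetime) -> int:
    """HTTP status for a read: 404 unknown, 410 expired, 200 otherwise."""
    if not found:
        return 404
    if expires_at is not None and now >= expires_at:
        # Expired for the user even if the object still exists in storage.
        return 410
    return 200


now = datetime(2026, 4, 27, 10, 0, tzinfo=timezone.utc)
print(read_status(True, now - timedelta(minutes=1), now))
```

Passing `now` in explicitly keeps the function deterministic and easy to test.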
Syntax Highlighting
Do NOT highlight on the server. Two reasons:
- Highlighter libraries (like Prism.js, highlight.js) are designed for the browser; they handle 200+ languages.
- Server-side rendering forces you to ship HTML, which is bigger than the raw text and breaks Content-Type: text/plain consumers like curl.
Server returns raw content + a detected language hint. Browser highlights with a JS library:
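import Prism from 'prismjs';
function renderPaste(content, language) {
  const target = document.getElementById('content');
  const grammar = Prism.languages[language]; // undefined if the grammar is not loaded
  if (grammar) {
    target.innerHTML = Prism.highlight(content, grammar, language); // Prism escapes the source
  } else {
    target.textContent = content; // plain-text fallback
  }
}
Language Detection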
Use a small library like Linguist or a heuristic on shebang / file-extension hints in the title. If we cannot decide, default to 'plain text'. This is a non-critical feature; do not block paste creation on it.
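The shebang and file-extension heuristic might look like the sketch below. The lookup tables are deliberately tiny and illustrative; a real deployment would cover far more languages or delegate to a detection library.

```python
import os
import re
from typing import Optional

# Illustrative lookup tables, not an exhaustive list.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".sql": "sql",
    ".sh": "bash", ".json": "json",
}
INTERPRETER_TO_LANGUAGE = {
    "python": "python", "python3": "python",
    "node": "javascript", "bash": "bash", "sh": "bash",
}
SHEBANG = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def detect_language(content: str, title: Optional[str] = None) -> str:
    """Best-effort language hint; falls back to 'plain text'."""
    first_line = content.split("\n", 1)[0]
    match = SHEBANG.match(first_line)
    if match:
        program = match.group(1).rsplit("/", 1)[-1]
        if program == "env" and match.group(2):
            # "#!/usr/bin/env python3": the interpreter is the next token.
            program = match.group(2)
        if program in INTERPRETER_TO_LANGUAGE:
            return INTERPRETER_TO_LANGUAGE[program]
    if title:
        _, extension = os.path.splitext(title.lower())
        if extension in EXTENSION_TO_LANGUAGE:
            return EXTENSION_TO_LANGUAGE[extension]
    return "plain text"
```

Because the function always returns a value and never raises on unknown input, paste creation is not blocked when detection fails.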
Data Model
Postgres: pastes table
CREATE TABLE pastes (
id VARCHAR(8) PRIMARY KEY,
owner_id UUID, -- nullable for anonymous
title VARCHAR(255),
language VARCHAR(32),
visibility VARCHAR(16) NOT NULL, -- public | unlisted | private
content_size INTEGER NOT NULL, -- bytes; for billing/limits
s3_key VARCHAR(255) NOT NULL,
view_count BIGINT DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ -- nullable = never
);
CREATE INDEX idx_pastes_owner ON pastes (owner_id, created_at DESC);
CREATE INDEX idx_pastes_expires ON pastes (expires_at) WHERE expires_at IS NOT NULL;
CREATE INDEX idx_pastes_public_recent ON pastes (created_at DESC) WHERE visibility = 'public';
Why Postgres over Cassandra/DynamoDB?
- The metadata is small (~1 TB after five years of growth) and still fits one Postgres instance.
- We need range queries ("my pastes", "recent public pastes") and rich indexing.
- We rarely need horizontal scaling on metadata; the bytes scale separately in S3.
Partitioning (when we need it): shard by id after a few hundred GB. For the first few years a single primary + read replicas is plenty.
S3 layout
---------- S3 layout ----------
Bucket: pastebin-content-prod (versioning OFF, encryption AES256)
pastes/aZ/aZbK9eR <- raw paste content (text/plain)
pastes/aZ/aZbK9eR.meta <- optional: highlighted HTML cache
View Counter
Most designs slap an UPDATE pastes SET view_count = view_count + 1 on the read path. Don't. That's a write per read; it ruins your read scalability and causes row-lock contention in Postgres on hot pastes.
Instead:
- Read path publishes a fire-and-forget event to Kafka (or Redis Stream).
- A consumer batches every 10 seconds and runs UPDATE pastes SET view_count = view_count + N WHERE id = ? for each paste with N views in the batch.
- This collapses 10K writes/sec on a viral paste into 1 write per 10 seconds.
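The batching step in the consumer can be sketched as follows. It assumes each view event carries just the paste ID and that the consumer hands the resulting statements to a database driver using `%s` placeholders; reading from Kafka or a Redis Stream is left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple

UPDATE_SQL = "UPDATE pastes SET view_count = view_count + %s WHERE id = %s"


def build_batch_updates(view_events: Iterable[str]) -> List[Tuple[str, Tuple[int, str]]]:
    """Collapse one batch of view events into one UPDATE per distinct paste."""
    counts = Counter(view_events)
    # Sort by paste ID so the statement order is deterministic across runs.
    return [(UPDATE_SQL, (views, paste_id)) for paste_id, views in sorted(counts.items())]


batch = ["aZbK9eR", "aZbK9eR", "Qx81LmP", "aZbK9eR"]
for sql, params in build_batch_updates(batch):
    print(sql, params)
```

However many views arrive in the 10-second window, a single paste produces exactly one write.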
Scaling and Bottlenecks
This is a tiny-QPS service so scaling is mostly about cost and storage growth, not throughput.
When the metadata DB stops fitting
At ~5 years (1.8 billion paste rows, ~1 TB of metadata), Postgres still works on a beefy instance, but query latency creeps up. Two paths:
- Vertical scale: bigger instance, faster disk, more RAM. Easy, expensive, has a ceiling.
- Sharding by paste_id: route paste 'aZbK9eR' to shard hash(id) % N. The application picks the shard on every read. No cross-shard queries needed for the hot path.
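Shard routing needs a hash that is stable across processes and deployments. The sketch below assumes SHA-256 truncated to 64 bits; Python's built-in `hash()` is unsuitable because string hashing is randomized per process. The function name is illustrative.

```python
import hashlib


def shard_for(paste_id: str, shard_count: int) -> int:
    """Map a paste ID to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_for("aZbK9eR", 4))
```

Plain modulo routing moves most keys when N changes, so resharding usually relies on consistent hashing or a lookup table instead.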
When S3 cost becomes painful
- Move pastes older than 30 days to S3 Standard-IA: roughly 45% cheaper per GB, and occasional retrieval is still ~10 ms. Standard-IA bills a 128 KB minimum per object, so the saving comes from the larger pastes.
- Pastes older than a year: Glacier Instant Retrieval. ~80% savings, with millisecond retrieval.
- Hard-delete expired pastes (lifecycle rule).
Hot Paste Goes Viral
A paste linked from Hacker News can do 10K reads/sec for an hour. Mitigations:
- CDN absorbs ~99% with Cache-Control: max-age=300.
- The remaining 1% (cache misses, fresh edge nodes) hits Postgres + S3, both of which can handle hundreds of QPS for a single key trivially.
No Redis cache needed because:
- Postgres in-memory page cache holds the metadata row.
- S3 GET is fast for objects we're requesting frequently.
- CDN is the actual cache.
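The viral-paste numbers can be checked with a short calculation. This sketch takes the 10K reads/sec and 99% CDN hit ratio from above and assumes the 50 KB average paste size for the egress figure; the function names are illustrative.

```python
def origin_qps(total_qps: float, cdn_hit_ratio: float) -> float:
    """Requests per second that miss the CDN and reach the origin."""
    if not 0.0 <= cdn_hit_ratio <= 1.0:
        raise ValueError("cdn_hit_ratio must be between 0 and 1")
    return total_qps * (1.0 - cdn_hit_ratio)


def origin_egress_mb_per_s(total_qps: float, cdn_hit_ratio: float, paste_mb: float) -> float:
    """Origin egress in MB/s for pastes of `paste_mb` megabytes."""
    return origin_qps(total_qps, cdn_hit_ratio) * paste_mb


print(f"origin load ~ {origin_qps(10_000, 0.99):.0f} reads/sec")
print(f"origin egress ~ {origin_egress_mb_per_s(10_000, 0.99, 0.05):.1f} MB/s")
```

The same function shows how sensitive the origin is to the hit ratio: at 90% instead of 99%, the origin would see ten times the load.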
Search
Full-text search across pastes is its own beast (think Elasticsearch). The pragmatic add-on:
- A Kafka consumer indexes new public pastes into Elasticsearch.
- Search hits Elasticsearch, returns paste IDs, then fetches metadata from Postgres for display.
- Index lag of a few minutes is fine for Pastebin search.
Do not use Postgres full-text search at scale: it locks during reindex and competes with the OLTP workload.
Trade-offs and Alternatives
Why not store everything in DynamoDB?
DynamoDB has a 400 KB item limit. Pastes can be 10 MB. You'd hit the limit and end up storing pointers to S3 anyway, which is the design we chose. DynamoDB is fine for the metadata if you need DynamoDB; Postgres is simpler for this size.
Why presigned URLs for large uploads?
If 10 MB pastes flowed through your application servers, you'd need:
- Roughly 80 Mbps of inbound bandwidth per concurrent upload if each 10 MB body is to arrive in about a second.
- Memory to buffer the body.
- Risk of timeouts on slow client connections.
With presigned URLs, the client uploads directly to S3. Your server returns immediately. S3 absorbs the bandwidth and bills you the standard storage rate. Always use presigned URLs for files > 1 MB.
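On the create side, the server only needs the declared size to decide which path a paste takes. A minimal sketch, assuming the 1 MB and 10 MB limits are binary megabytes (matching the 5242880-byte example in the API) and that a paste of exactly 1 MB still goes inline; the route names are illustrative.

```python
MAX_PASTE_BYTES = 10 * 1024 * 1024   # hard limit: 10 MB
INLINE_LIMIT_BYTES = 1024 * 1024     # above this, use a presigned upload


def upload_route(content_length: int) -> str:
    """Return 'inline' for small pastes and 'presigned' for large ones."""
    if content_length <= 0:
        raise ValueError("content_length must be positive")
    if content_length > MAX_PASTE_BYTES:
        raise ValueError("paste exceeds the 10 MB limit")
    return "inline" if content_length <= INLINE_LIMIT_BYTES else "presigned"


print(upload_route(2048))
print(upload_route(5_242_880))
```

Rejecting oversized requests before issuing an upload URL keeps the limit enforced on the server even when the client skips its own check.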
Why not run a daily expiration cron?
- Crons miss while they sleep. A paste whose expires_at falls between cron runs is served stale until the next run.
- Lazy expiration on read catches everything instantly.
- S3 lifecycle policies are managed by AWS; no operational burden.
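A cron only adds value if you must purge metadata rows from Postgres on a schedule. Even then, prefer a delayed-message approach (a queue or scheduler that delivers a delete task at expires_at; Kafka has no native delayed delivery, so it would need a delay topic built on top).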
Why not use the same design as URL Shortener?
Key differences from URL shortener:
- Pastebin reads return up to 10 MB of bytes, not a 32-byte HTTP redirect. Bandwidth dominates.
- Read pattern is flat (no big hot keys), so the cache strategy is CDN-only, not Redis.
- Storage growth is gigabytes per day, not megabytes; lifecycle policies and tiered storage matter.
- No 100:1 read:write skew; closer to 10:1.
Misidentifying these differences and copying the URL shortener design is the most common mistake. Slow down. Re-derive the access pattern.
Public visibility and abuse
Public pastes can be used for malware, dumps of stolen credentials, spam. Real Pastebin partners with Have I Been Pwned and runs ML scrapers. For the interview, mention:
- Rate limiting per IP/account.
- Asynchronous content scan (regex for known credential formats; ML for malware).
- A take_down flag on the metadata that the read path checks.
- Reporting endpoint for users to flag pastes.
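The regex half of the asynchronous content scan might start from something like the sketch below. The two patterns (AWS access key IDs and PEM private-key headers) are common examples chosen for illustration, not a complete rule set, and the test string uses the placeholder key from AWS documentation.

```python
import re
from typing import List

# Illustrative patterns only; a production scanner would carry many more rules.
CREDENTIAL_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_for_credentials(content: str) -> List[str]:
    """Return the names of all credential patterns found in the paste."""
    return sorted(
        name for name, pattern in CREDENTIAL_PATTERNS.items()
        if pattern.search(content)
    )


sample = "aws_key = AKIAIOSFODNN7EXAMPLE"
print(scan_for_credentials(sample))
```

A non-empty result would typically set the take_down flag or queue the paste for review rather than delete it outright, since regex matches produce false positives.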
Real-World Examples
How real systems implement this in production
Pastebin: the original, running since 2002. Stores ~250M pastes, supports 100+ syntax languages, offers paid 'Pro' tier with longer TTLs and private pastes. Uses MySQL for metadata and a custom blob store for content. Heavy use of Cloudflare for edge caching.
Trade-off: Pastebin's challenge is abuse (credential dumps, malware): they invest heavily in scanning rather than expanding features. Open paste services attract spam; design moderation in from day one.
GitHub Gist: a pastebin tied to a GitHub account. Stores gists as actual git repositories under the hood (each gist is a full repo with edit history). Supports comments, forks, embedding. Auth gates private gists.
Trade-off: Using git as the storage engine gives you free history, diffing, and cloning, but git operations are slower than a flat blob read. GitHub trades read latency for collaboration features.
Minimalist open-source pastebins (Node.js + Redis, often). No accounts, no expiration UI, no syntax highlighting beyond client-side libraries. Designed to be deployable in 5 minutes.
Trade-off: Skipping object storage and using Redis for content limits paste size to a few MB and limits durability to whatever Redis persistence gives you. Fine for internal tools, not for public services with billions of pastes.
Pastebins specialized for runnable web code (HTML/CSS/JS). Add execution sandboxes, npm package resolution, and live preview. Storage is similar (small inline metadata + larger content blob), but the runtime is the differentiator.
Trade-off: Specializing for one content type (web code) lets you add huge value (live preview, package management) at the cost of being useless for general text. Choose a niche or stay general; it's hard to be both.
Common Interview Questions
Questions you might be asked about this topic
First, the CDN absorbs most reads with a 5-minute TTL on the Cache-Control header. Misses go to the Read Service, which reads the metadata from Postgres and either returns the content inline (small) or returns a 302 to a presigned S3 URL (large). At 10 MB per read, even 100 reads per second reaching your origin is 1 GB/s of egress, so the CDN is doing the real work. Add tiered CDN caching (edge -> regional shield -> S3) to reduce S3 GET cost.
Two complementary mechanisms. (1) Lazy expiration: every read checks expires_at and returns 410 Gone if past. This is instant and authoritative. (2) S3 lifecycle policies for physical deletion: tag each object with its TTL bucket and define one lifecycle rule per bucket. Pastes with no expiration get no tag and live forever (or until manually deleted). This avoids any cron job.
Add Elasticsearch as a separate read-side index. A Kafka consumer subscribes to paste creation events, fetches the content from S3, and indexes it asynchronously. Search hits Elasticsearch, returns paste IDs, then fetches metadata from Postgres for display. This separates the search workload from the hot read path entirely. Index lag of a few minutes is acceptable for search.
Two paths. (1) The presigned URL itself has a 15-minute expiration; after that, S3 rejects any upload attempted with that URL. (2) The metadata row is created with status='pending' before the upload URL is issued; a sweeper job deletes pending rows older than 1 hour and triggers an S3 DeleteObject for safety. The user retries by requesting a new presigned URL.
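The sweeper's selection logic can be sketched as a pure function over metadata rows. It assumes each row is a `(paste_id, status, created_at)` tuple with timezone-aware timestamps; the `'active'` status in the usage example is a hypothetical name for finalized pastes.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

PENDING_MAX_AGE = timedelta(hours=1)


def stale_pending_ids(
    rows: Iterable[Tuple[str, str, datetime]], now: datetime
) -> List[str]:
    """IDs of pending uploads older than the cutoff, safe to delete."""
    cutoff = now - PENDING_MAX_AGE
    return [
        paste_id
        for paste_id, status, created_at in rows
        if status == "pending" and created_at < cutoff
    ]


now = datetime(2026, 4, 26, 12, 0, tzinfo=timezone.utc)
rows = [
    ("aaa", "pending", now - timedelta(hours=2)),     # abandoned upload
    ("bbb", "pending", now - timedelta(minutes=10)),  # still within its window
    ("ccc", "active", now - timedelta(days=3)),       # finalized paste
]
print(stale_pending_ids(rows, now))
```

The 1-hour cutoff is well past the 15-minute URL expiry, so a row is only swept once its upload can no longer succeed.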
Three layers. (1) Hard size limit (10 MB) enforced both client-side (UX) and server-side (sign the declared Content-Length into the presigned PUT URL, or switch to a presigned POST whose policy carries a content-length-range condition). (2) Asynchronous scan: a Lambda triggered by S3 ObjectCreated event runs the content through a malware scanner and updates a take_down flag in Postgres. The read path checks the flag. (3) Rate limit per IP/account at the API gateway: 100 pastes per hour for anonymous, configurable for authenticated users.
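The per-client limit from layer (3) can be illustrated with a sliding-window limiter. This is an in-memory, single-process sketch for illustration; a real API gateway would use its built-in limiter or a shared store. The class name and the IP-address keys are assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per client within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted events

    def allow(self, client_key: str, now: float) -> bool:
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# 100 pastes per hour for anonymous clients, keyed by IP address.
limiter = SlidingWindowLimiter(limit=100, window_seconds=3600)
accepted = sum(limiter.allow("203.0.113.7", float(t)) for t in range(150))
print(accepted)
```

Only accepted requests are recorded, so a client that keeps hammering the endpoint while blocked does not extend its own penalty.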
Interview Tips
How to discuss this topic effectively
First sentence in the interview: 'Pastes can be up to 10 MB, so the access pattern is bandwidth-bound, not QPS-bound.' This shifts the discussion to S3, lifecycle policies, and CDNs immediately.
When the interviewer asks about caching, say 'CDN at the edge, no application Redis. Reads are flat; there are no hot keys to memoize.' This shows you reasoned from the access pattern.
Bring up presigned URLs proactively for any upload > 1 MB. It's the single most important pattern for systems that handle user files.
Mention S3 lifecycle policies for expiration. Most candidates default to a cron job; lifecycle policies are simpler, cheaper, and more reliable.
Avoid copying the URL shortener answer wholesale. Pastebin's read pattern is fundamentally different and your design should reflect that.
Common Mistakes
Pitfalls to avoid in interviews
Storing paste content in the database alongside metadata
Multi-megabyte rows in Postgres ruin replication, backups, and cache locality. Always split: metadata in the database, content in object storage with the database holding the S3 key as a pointer.
Proxying large uploads through application servers
A 10 MB paste through your servers means 80 megabits of inbound transfer per upload, server memory buffering, and timeout risk. Use presigned S3 URLs so the client uploads directly to S3 and your server returns immediately with the metadata record.
Running a cron job to delete expired pastes
Crons run on intervals; pastes can be served past their expires_at between runs. Use lazy expiration on read (check expires_at, return 410 Gone) plus S3 lifecycle policies for physical deletion. No cron needed.
Adding a Redis cache for paste content
Pastebin reads are flat: most pastes are read once. A cache helps only when there's a hot key. The CDN already handles the few viral pastes. Skipping Redis here is a sign you understood the access pattern.
Updating view_count on every read with a synchronous SQL UPDATE
Hot pastes generate thousands of writes per second to a single row, killing throughput and creating lock contention. Publish a fire-and-forget event, batch in a consumer, and apply view increments every 10 seconds. The hot read path stays read-only.
