System Design Article
Design Instagram (Photo Sharing)
Difficulty: Medium
Design a photo sharing service like Instagram with 500M daily active users uploading 100M photos a day, served as personalized feeds at sub-200 ms p99. The interview centerpiece is the news feed: fan-out on write versus fan-out on read, the celebrity problem, and the hybrid pull-on-read model that real Instagram uses. We also cover photo upload pipelines (presigned URLs, multi-resolution generation, CDN), the metadata data model, and how to scale follow graphs that go from a few friends to hundreds of millions of followers.
Design Instagram (Photo Sharing)
Design a photo sharing service like Instagram with 500M daily active users uploading 100M photos a day, served as personalized feeds at sub-200 ms p99. The interview centerpiece is the news feed: fan-out on write versus fan-out on read, the celebrity problem, and the hybrid pull-on-read model that real Instagram uses. We also cover photo upload pipelines (presigned URLs, multi-resolution generation, CDN), the metadata data model, and how to scale follow graphs that go from a few friends to hundreds of millions of followers.
797 views
18
Requirements
Functional Requirements
- Upload a photo with a caption, optional location and tags.
- View a personalized feed of photos from people you follow, ranked by recency (and, in extended scope, ML ranking).
- Follow / unfollow other users.
- Like and comment on photos.
- View any user's profile (their photos in reverse chronological order).
- Search for users by username (handle).
Out of Scope (state explicitly)
- Stories, Reels, IGTV (each is its own design problem).
- Direct messages (use the Chat System case study).
- Hashtag pages (similar to feeds, mention the difference).
- Ads / promoted posts.
Non-Functional Requirements
- Scale: 500M DAU, 100M photos uploaded per day, 4.5B feed reads per day.
- Read-heavy: 100:1 feed reads vs uploads.
- Low feed latency: p99 < 200 ms.
- High availability: 99.99%. The product is the feed; if it's down the app is dead.
- Eventually consistent likes / comments / follower counts: a small lag is fine.
- Photo durability: 11 nines (S3 standard). Photos must never be lost.
Back-of-the-Envelope Estimation
Users and Activity
---------- User and traffic estimation ----------
DAU: 500M
Uploads / DAU / day: 0.2 (1 upload every 5 days)
Uploads per day: 500M * 0.2 = 100M
Uploads per second (avg): 100M / 86400 ~= 1,200 /sec
Uploads per second (peak 3x): ~3,500 /sec
Feed views / DAU / day: 9 (open the app a few times)
Feed reads per day: 500M * 9 = 4.5B
Feed reads per second (avg): 4.5B / 86400 ~= 52,000 /sec
Feed reads per second (peak): ~150,000 /secStorage
---------- Photo storage ----------
Photo size (compressed JPEG): ~200 KB original + multiple resolutions
+ thumbnail (150x150): ~10 KB
+ small (320x320): ~30 KB
+ medium (640x640): ~80 KB
+ large (1080x1080): ~200 KB (the upload)
+ original (2048x2048): ~500 KB
Total per photo (5 versions): ~820 KB ~= 1 MB
Photos per day: 100M * 1 MB = 100 TB/day
Photos per year: 100 TB * 365 = 36.5 PB/year
Over 5 years: ~180 PB
Replication (3x in S3): 550 PB total storedThis is the dominant cost. 5 years of growth at $0.023/GB/month for hot storage = ~$12M/month. Real Instagram uses tiered storage, regional replication, and aggressive compression; we'll do the same.
Metadata Storage
---------- Metadata ----------
Per-photo row: ~500 bytes (id, owner, caption, location, timestamps, counts)
Per day: 100M * 500 = 50 GB metadata/day
5 years: 50 GB * 365 * 5 = 90 TB metadataFits in a sharded SQL or wide-column store. Trivial compared to the photo bytes.
Bandwidth
---------- Bandwidth ----------
Upload bandwidth: 1,200 uploads/sec * 200 KB = 240 MB/s = ~2 Gbps
Feed read bandwidth: Each feed view fetches ~10 photos at ~100 KB (thumbnail/small)
52,000 reads/sec * 10 photos * 100 KB = 52 GB/s = ~415 GbpsServing 415 Gbps from origin servers is impossible. The CDN is doing 95%+ of the work.
Cache
- Active feed cache (in-memory in a Redis cluster):
Active users (~50M most active): 50M
Feed entries cached (50 IDs per user): 50 * 50M = 2.5B IDs
Per entry (post_id + score): ~40 bytes
Total: ~100 GB across the clusterFits comfortably in 30-50 Redis nodes.
High-Level Design
---------- High-level architecture ----------
+----------+
| Client |
+----------+
|
v
+---------------------+
| CDN (Cloudfront) | <- serves photos globally
+---------------------+
|
v
+---------------------+
| API Gateway |
+---------------------+
/ | \
v v v
+---------+ +---------+ +-------------+
| Upload | | Feed | | User/Graph |
| Service | | Service | | Service |
+---------+ +---------+ +-------------+
| | |
v v v
+---------+ +---------+ +-------------+
| S3 | | Redis | | Postgres / |
| (blobs) | | (feeds, | | Cassandra |
+---------+ | counts) | | (users, |
| +---------+ | graph, |
v ^ | posts) |
+---------+ | +-------------+
| Kafka | | |
|(events) |---+ v
+---------+ +----------+
| | Search |
v | (Elastic)|
+-----------+ +----------+
| Transcode |
| Workers |
+-----------+API Design
// Step 1: get a presigned upload URL
POST /api/v1/photos/upload-url
Authorization: Bearer <token>
{
"file_size": 2400000,
"content_type": "image/jpeg"
}
// Response
{
"upload_url": "https://s3.../uploads/<uuid>?X-Amz-Signature=...",
"photo_id": "01HW3M9...", // ULID
"upload_expires_in": 900
}
// Step 2: client uploads original directly to S3 (PUT)
// Step 3: client finalizes; this triggers transcoding
POST /api/v1/photos
{
"photo_id": "01HW3M9...",
"caption": "Sunset in Lisbon",
"location": { "lat": 38.7, "lng": -9.1 },
"tags": ["@friend1", "#sunset"]
}// Get personalized feed (paginated)
GET /api/v1/feed?limit=20&cursor=<opaque>
// Response
{
"items": [
{
"photo_id": "01HW3M9...",
"author": { "id": "u1", "handle": "alice", "avatar_url": "..." },
"caption": "Sunset in Lisbon",
"urls": {
"thumbnail": "https://cdn.../t/01HW3M9.jpg",
"small": "https://cdn.../s/01HW3M9.jpg",
"medium": "https://cdn.../m/01HW3M9.jpg",
"large": "https://cdn.../l/01HW3M9.jpg"
},
"like_count": 1234,
"comment_count": 45,
"created_at": "2026-04-26T10:00:00Z"
}
],
"next_cursor": "<opaque>"
}Photo Upload Pipeline (Asynchronous)
The upload is a multi-step process to avoid coupling the user-facing latency to image transcoding.
---------- Photo upload flow ----------
1. Client: POST /photos/upload-url (get presigned URL) ~50 ms
2. Client: PUT to S3 directly (uploads original 200 KB) ~200 ms
3. Client: POST /photos with metadata (creates metadata row) ~50 ms
4. Upload Service: enqueue Kafka event photo.created
5. Transcode Worker: pulls original from S3 resizes to 5 versions ~1-3 sec
6. Transcode Worker: uploads each to S3 under predictable keys
7. Transcode Worker: marks photo as 'ready' in metadata
8. Upload Service: enqueue Kafka event photo.ready -> fan-outUntil step 7 the photo exists in metadata but is not visible in feeds. This decouples upload UX from transcoding throughput.
Detailed Design
The two interesting components are the news feed (fan-out) and the photo storage pipeline.
Feed Generation: Fan-out Tradeoffs
This is the central interview question: when Alice posts a photo, how does it reach the feeds of her 1,000 followers? Three strategies; each has a sharp use case.
Strategy 1: Fan-out on Write (Push)
When Alice posts, write the photo_id into a precomputed feed list for each of her 1,000 followers.
---------- Fan-out on write ----------
Alice posts photo P at time T.
Look up Alice's followers: [bob, carol, dave, ... (1,000 ids)]
For each follower:
Redis ZADD feed:bob T P
Redis ZADD feed:carol T P
...
When Bob opens the app:
Redis ZREVRANGE feed:bob 0 19 -> instant; 20 photo_ids
Hydrate photo metadata in batch -> renderPros: Read latency is sub-10 ms. Bob's feed is precomputed, just fetch and display.
Cons: A user with 100M followers (a celebrity) generates 100M writes per post. Storage explodes (every follower stores a copy of every post_id). Updates are amplified: if a follower unfollows Alice, you need to remove her posts from their feed.
Strategy 2: Fan-out on Read (Pull)
When Bob opens the app, look up everyone Bob follows and merge their recent posts into a feed at read time.
---------- Fan-out on read ----------
Bob follows [alice, carol, dave, ... (300 ids)]
For each followee:
SELECT photo_id, created_at FROM photos
WHERE owner = ? AND created_at > <Bob's last read>
ORDER BY created_at DESC LIMIT 20
Merge results into a single timeline
Return top 20Pros: No storage amplification. Celebrities are no different from normal users. Easy to support 'I want to add a freshly-followed user's posts to my feed retroactively.'
Cons: Read is expensive. If Bob follows 300 users, that's 300 database queries per feed open. At 150K feed opens/sec that's 45M database queries/sec. Unscalable.
Strategy 3: Hybrid Fan-out (the real answer)
Use fan-out on write for normal users. Use fan-out on read for celebrities (above some follower threshold like 1M).
---------- Hybrid fan-out ----------
Alice posts a photo:
if Alice.follower_count < 1M:
fan-out on write to all followers
else:
record post in her own timeline only (no fan-out)
Bob opens the app:
feed = Redis ZREVRANGE feed:bob 0 19 (precomputed normal posts)
celebs = list of celebrities Bob follows (~5)
for celeb in celebs:
posts = recent posts from celeb (pull from celeb's timeline)
merge into feed
return top 20This caps fan-out write cost at the 1M threshold, while keeping the read path fast (only ~5 extra pulls for celebrities, not 300).
Defending the Hybrid in an Interview
The magic words: 'pure fan-out on write doesn't work because of the celebrity problem - a single Justin Bieber post would generate 100M writes. Pure fan-out on read doesn't work either because the read fan-out is too expensive at scale. The hybrid bounds both costs.'
Feed Storage in Redis
We store each user's feed as a sorted set keyed by user_id, scored by post timestamp:
Key: feed:<user_id>
Value: ZSET of (post_id, timestamp_score)
TTL: 7 days (rebuild on demand for inactive users)
Size: ~500 entries per active user
Memory: 50M users * 500 entries * ~30 bytes = ~750 GB across clusterTrim the ZSET to 500 entries to bound memory. Inactive users (no app open in 30 days) get evicted; on next login, rebuild from a fan-out-on-read scan.
Photo Storage Pipeline
Why presigned URLs?
The alternative is uploading through your application servers, which means:
- 200 KB photo * 3,500 uploads/sec = 700 MB/s of inbound traffic per server cluster.
- Each upload occupies a connection slot for ~200 ms.
- The server is now in the data path of every photo, which is wasteful.
With presigned URLs:
- Client uploads directly to S3 over a connection that S3 absorbs.
- Application server returns the URL in ~10 ms and is free for the next request.
Why multiple resolutions?
Mobile clients need a small thumbnail in the feed list (10 KB) and a larger version when the user taps to expand (200 KB). Loading a 1 MB original in the feed wastes bandwidth and slows scroll.
Generate 5 versions during transcoding; let the client pick based on screen size and DPI.
Transcode worker pool
- Kafka topic
photo.createdtriggers transcode workers. - Each worker pulls the original, resizes via libvips (faster than ImageMagick at scale), uploads each variant to S3 under a deterministic key like
photos/01HW3M9/medium.jpg. - Worker pool autoscales by Kafka lag; at 1,200 uploads/sec average and ~2 sec per transcode, we need ~2,400 worker concurrency. With 1 worker per CPU core, that's ~600 4-core machines.
- After all variants are uploaded, the worker writes
status='ready'to the photo metadata and emitsphoto.ready.
CDN
Photos are served from S3 through Cloudfront (or Akamai). CDN cache key is the S3 URL. Set Cache-Control: public, max-age=31536000, immutable because photos under their canonical URL never change. (User edits create a new variant under a new URL.)
Data Model
We split into three storage systems for three different access patterns.
Postgres (sharded): users, follows, photos metadata
-- users
CREATE TABLE users (
id BIGINT PRIMARY KEY, -- Snowflake ID
handle VARCHAR(32) UNIQUE NOT NULL,
name VARCHAR(64),
bio TEXT,
avatar_url VARCHAR(255),
created_at TIMESTAMPTZ NOT NULL
);
-- photos metadata
CREATE TABLE photos (
id BIGINT PRIMARY KEY, -- Snowflake (sortable by time)
owner_id BIGINT NOT NULL REFERENCES users(id),
caption TEXT,
location GEOGRAPHY,
status VARCHAR(16) NOT NULL, -- pending | ready | deleted
created_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_photos_owner_time ON photos (owner_id, created_at DESC);
-- follows (the graph)
CREATE TABLE follows (
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_followers ON follows (followee_id, created_at DESC);Shard users, photos, follows by user_id (or follower_id for follows). The follows table is the largest (500M users * average 200 follows = 100B rows).
Cassandra: likes and comments (high-write, append-only)
CREATE TABLE likes (
photo_id bigint,
user_id bigint,
created_at timestamp,
PRIMARY KEY ((photo_id), user_id) -- partition by photo_id
) WITH CLUSTERING ORDER BY (user_id ASC);
CREATE TABLE comments (
photo_id bigint,
comment_id bigint,
user_id bigint,
text text,
created_at timestamp,
PRIMARY KEY ((photo_id), comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);Likes can hit thousands per second on a viral photo; Cassandra handles per-partition writes well.
Redis: feeds, counters, online presence
feed:<user_id>-> ZSET of post_ids by timestamp (precomputed feed).count:photo:<id>:likes-> integer (with periodic flush to Postgres).count:photo:<id>:comments-> integer.seen:<user_id>-> highest post_id Bob has seen (for 'new posts' badge).
S3: photo bytes
Bucket: instagram-photos-prod
photos/<photo_id>/original.jpg (~500 KB)
photos/<photo_id>/large.jpg (~200 KB, 1080x1080)
photos/<photo_id>/medium.jpg (~80 KB, 640x640)
photos/<photo_id>/small.jpg (~30 KB, 320x320)
photos/<photo_id>/thumb.jpg (~10 KB, 150x150)Lifecycle: move originals to S3 IA after 90 days (originals are rarely re-fetched once variants exist). Variants stay in S3 Standard.
Scaling and Bottlenecks
Hot Photo: a celebrity post hits 10M likes in an hour
- Cassandra likes table is partitioned by
photo_id; this single partition gets 3K writes/sec for an hour. - Mitigation: Cassandra handles thousands of writes per partition. If it gets worse, write through a Redis HyperLogLog for approximate counts and persist to Cassandra in batches.
Feed Hotspot: viral content
- Edge cache the popular post's media for everyone (CDN handles this naturally).
- The feed itself is per-user; no single feed key is a hotspot.
Follower-graph Scaling
- 100B follow rows is too big for a single shard. Shard by
follower_idso 'who do I follow' queries hit one shard. 'Who follows me?' (for fan-out on write) requires a denormalized inverse table sharded byfollowee_id. Maintain both with a write-through pattern.
Cross-region availability
- Multi-region active-active: 5 regions (NA, EU, APAC, etc.).
- Photos: S3 Cross-Region Replication. CDN routes users to the nearest origin.
- Metadata: each region writes to its primary, with async cross-region replication. Writes are stickied to the user's home region for consistency.
- Feed cache: regional Redis clusters; a user moving regions warms a new cache lazily.
When fan-out queues back up
If transcoding falls behind, photos appear in feeds without thumbnails. Mitigations:
- Autoscale workers off Kafka lag.
- Generate the thumbnail synchronously during upload (small enough), defer the larger variants asynchronously.
- Show a 'processing' placeholder in the client, swap when ready.
Trade-offs and Alternatives
Hybrid fan-out vs pure pull
We could push the celebrity threshold to 0 (everything pull) and rely on aggressive caching of each celebrity's recent posts. Real Instagram historically used variations of this. The hybrid has a slightly more complex code path but better tail latency for normal users with many followers.
Why not GraphQL?
The feed query has known shape; REST with batched hydration is simpler and cacheable. GraphQL shines when client needs vary widely; for Instagram's app-only world, a tight REST surface wins.
Why not a dedicated graph database (Neo4j, JanusGraph)?
Follows is a simple two-column relation; the queries are 1-hop ('who do I follow?', 'who follows me?'). Sharded SQL is faster and cheaper than a graph database for this. Graph databases shine on multi-hop traversals (friend-of-friend), which Instagram doesn't expose.
Why not Postgres for likes / comments?
Likes have very high write rate per popular photo (thousands per sec). Postgres on a single row would lock-contest. Cassandra (or DynamoDB) shards by photo_id and handles per-partition writes much better.
Strong vs eventual consistency on follower count
We show 'followed by 3.4M people' in the UI. That number is eventually consistent (incremented from a Kafka stream into a counter). A user might see 3.4M while Alice just hit 3,400,001 - fine.
Counter-based view: separate timeline service
Real Instagram has a dedicated Timeline Service (sometimes called 'Pixel') whose only job is feed assembly. We've collapsed it into the Feed Service for clarity, but in practice there's enough complexity (re-ranking, deduplication, ML scoring) to warrant its own team and SLO.
Real-World Examples
How real systems implement this in production
Visual discovery platform with a similar pin-based feed. Uses a heavily customized fan-out-on-write for normal users and a pull-based 'home feed' that re-ranks at read time using ML. Stores images on S3 with regional CDN distribution; metadata in MySQL with manual sharding via Vitess.
Trade-off: Pinterest spends more compute on read-time ML ranking than Instagram historically did because the discovery use case (find new things) requires deeper personalization than a chronological friend feed. The cost is higher feed latency budget (300 ms vs 200 ms).
Same multi-resolution media pipeline but for video, with HLS segments instead of JPEG variants. Critically, TikTok feeds are NOT fan-out from people you follow; the For You Page is purely ML-recommended from a global pool. Eliminates the celebrity problem entirely.
Trade-off: TikTok's recommendation-only feed avoids fan-out complexity but requires a massive ML inference platform (described in the TikTok case study). Trades systems complexity for ML complexity.
Stories model with 24-hour TTL. Uses fan-out on write more aggressively because content is ephemeral - storage growth is bounded. Unlike Instagram, Snapchat does not need to maintain a forever-archive of every post.
Trade-off: Ephemeral content means smaller storage costs and simpler retention but no monetizable history. Snapchat trades long-term value for product simplicity and a different user contract.
Federated social network where each instance runs its own database and connects via ActivityPub. Fan-out happens via inter-server federation (push events to remote followers' instances). No central feed service.
Trade-off: Federation eliminates centralized scaling problems but introduces inter-instance consistency and discovery problems. A celebrity on a small instance can take down the instance through federation traffic. Decentralization has its own scaling pathologies.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client POSTs to a Likes endpoint, which writes to Cassandra (partition by photo_id) and increments a Redis counter. The friend's next feed read pulls the count from Redis (eventually consistent with Cassandra). A background job flushes Redis counters to Postgres metadata every minute for durability and analytics. The like itself is durable in Cassandra immediately; the displayed count may lag by seconds.
Their precomputed feed has the latest 500 posts from their 4,950 normal-user followees (fan-out on write). On feed read, we additionally pull the latest posts from each of the 50 celebrities they follow and merge by timestamp. This is 50 extra reads per feed open, which is acceptable. Cap the celebrity follow count if needed; in practice celebrity follows are a small fraction of any user's graph.
Immediately after metadata write, enqueue a Kafka 'photo.created' event. A fan-out worker reads it, looks up the author's followers, and writes the photo_id into each follower's Redis feed ZSET. For 1M followers, this takes ~30 seconds. For latency-sensitive followees, bump priority. For celebrities, skip fan-out and rely on pull-on-read instead. Show 'processing' for ~1 second in the author's app for confirmation.
Three levers. (1) Tiered CDN: edge -> regional shield -> origin. Misses at edge fill the shield, which absorbs subsequent edge misses. (2) Pre-warm: when a hot post is detected, push its variants to a few CDN regions proactively. (3) Compress more aggressively: use AVIF or WebP for variants where supported (50% size reduction); fewer bytes per miss reduces origin egress.
Soft delete first: set status='deleted' on the photo metadata. The feed read filters out deleted photos. Asynchronously enqueue a hard-delete: remove from the author's profile, from all followers' Redis feeds (or rely on TTL eviction if quick), and schedule the S3 objects for deletion via lifecycle policy after a 30-day grace period. The CDN's cache eventually expires the URLs (or we issue an explicit invalidation for high-profile deletes).
Interview Tips
How to discuss this topic effectively
Lead with 'this is read-heavy and the feed is the centerpiece' so the interviewer knows you understand the product. Then drive immediately to fan-out.
When asked about fan-out, never commit to pure write or pure read. Say 'hybrid with a celebrity threshold' from the start. Then walk through the math.
Mention the celebrity problem by name. Saying 'a Justin Bieber post would create 100M writes' is the magic phrase that signals you've thought past textbook fan-out.
Always separate the upload pipeline from the feed pipeline in the diagram. Two distinct flows: presigned URL for writes, precomputed list for reads.
Cite real numbers: 500M DAU, 100M uploads/day, 100 KB photos. Numbers anchor every later decision (sharding, caching, CDN sizing).
Common Mistakes
Pitfalls to avoid in interviews
Picking pure fan-out on write without addressing celebrities
Pure fan-out on write means a user with 100M followers generates 100M writes per post. This melts your storage and write pipeline. Always cap fan-out at a follower threshold and pull-on-read for users above that threshold.
Storing photos in the database instead of S3
100 TB/day of photos in a database destroys backup, replication, and IOPS. Photos go in S3 (or equivalent object store); only the metadata (id, owner, caption, S3 key) goes in the database.
Doing transcoding synchronously during upload
Synchronous transcoding adds 1-3 seconds to upload latency and ties application throughput to image processing CPU. Push to a Kafka queue, transcode in workers, mark the photo 'ready' when done. The user gets immediate feedback that the upload succeeded.
Storing a single 'photo URL' instead of multiple resolutions
Mobile feeds need 10 KB thumbnails, not 1 MB originals. Generate 4-5 resolutions during transcoding and let the client pick based on screen DPI. Bandwidth savings dominate any storage cost.
Forgetting that the follower graph has 100B rows and needs sharding
500M users with 200 average follows is 100 billion rows. Shard by follower_id for forward queries and maintain a denormalized inverse table sharded by followee_id for fan-out. Both indexes are needed; neither alone is enough.
