System Design Article

Design Instagram (Photo Sharing)

Difficulty: Medium

Design a photo sharing service like Instagram with 500M daily active users uploading 100M photos a day, served as personalized feeds at sub-200 ms p99. The interview centerpiece is the news feed: fan-out on write versus fan-out on read, the celebrity problem, and the hybrid pull-on-read model that real Instagram uses. We also cover photo upload pipelines (presigned URLs, multi-resolution generation, CDN), the metadata data model, and how to scale follow graphs that go from a few friends to hundreds of millions of followers.

Design Instagram (Photo Sharing)

System Design

Medium

design-instagram

case-study

social-content-platforms

photo-sharing

fan-out-on-write

fan-out-on-read

hybrid-fan-out

celebrity-problem

feed-ranking

media-storage

thumbnail-generation

cdn

social-media

system-design

intermediate

free

797 views

Requirements

Functional Requirements

Upload a photo with a caption, optional location and tags.
View a personalized feed of photos from people you follow, ranked by recency (and, in extended scope, ML ranking).
Follow / unfollow other users.
Like and comment on photos.
View any user's profile (their photos in reverse chronological order).
Search for users by username (handle).

Out of Scope (state explicitly)

Stories, Reels, IGTV (each is its own design problem).
Direct messages (use the Chat System case study).
Hashtag pages (similar to feeds, mention the difference).
Ads / promoted posts.

Non-Functional Requirements

Scale: 500M DAU, 100M photos uploaded per day, 4.5B feed reads per day.
Read-heavy: 100:1 feed reads vs uploads.
Low feed latency: p99 < 200 ms.
High availability: 99.99%. The product is the feed; if it's down the app is dead.
Eventually consistent likes / comments / follower counts: a small lag is fine.
Photo durability: 11 nines (S3 standard). Photos must never be lost.

Back-of-the-Envelope Estimation

Users and Activity

Text

---------- User and traffic estimation ----------
DAU:                            500M
Uploads / DAU / day:            0.2 (1 upload every 5 days)
Uploads per day:                500M * 0.2 = 100M
Uploads per second (avg):       100M / 86400 ~= 1,200 /sec
Uploads per second (peak 3x):   ~3,500 /sec

Feed views / DAU / day:         9 (open the app a few times)
Feed reads per day:             500M * 9 = 4.5B
Feed reads per second (avg):    4.5B / 86400 ~= 52,000 /sec
Feed reads per second (peak):   ~150,000 /sec

Storage

Text

---------- Photo storage ----------
Photo size (compressed JPEG):    ~200 KB original + multiple resolutions
  + thumbnail (150x150):         ~10 KB
  + small (320x320):             ~30 KB
  + medium (640x640):            ~80 KB
  + large (1080x1080):           ~200 KB (the upload)
  + original (2048x2048):        ~500 KB
Total per photo (5 versions):    ~820 KB ~= 1 MB

Photos per day:                  100M * 1 MB = 100 TB/day
Photos per year:                  100 TB * 365 = 36.5 PB/year
Over 5 years:                    ~180 PB

Replication (3x in S3):          550 PB total stored

This is the dominant cost. 5 years of growth at $0.023/GB/month for hot storage = ~$12M/month. Real Instagram uses tiered storage, regional replication, and aggressive compression; we'll do the same.

Metadata Storage

Text

---------- Metadata ----------
Per-photo row:    ~500 bytes (id, owner, caption, location, timestamps, counts)
Per day:           100M * 500 = 50 GB metadata/day
5 years:           50 GB * 365 * 5 = 90 TB metadata

Fits in a sharded SQL or wide-column store. Trivial compared to the photo bytes.

Bandwidth

Text

---------- Bandwidth ----------
Upload bandwidth:     1,200 uploads/sec * 200 KB = 240 MB/s = ~2 Gbps
Feed read bandwidth:  Each feed view fetches ~10 photos at ~100 KB (thumbnail/small)
                      52,000 reads/sec * 10 photos * 100 KB = 52 GB/s = ~415 Gbps

Serving 415 Gbps from origin servers is impossible. The CDN is doing 95%+ of the work.

Cache

Active feed cache (in-memory in a Redis cluster):

Text

Active users (~50M most active):       50M
Feed entries cached (50 IDs per user): 50 * 50M = 2.5B IDs
Per entry (post_id + score):           ~40 bytes
Total:                                  ~100 GB across the cluster

Fits comfortably in 30-50 Redis nodes.

High-Level Design

Text

---------- High-level architecture ----------
            +----------+
            |  Client  |
            +----------+
                 |
                 v
        +---------------------+
        |   CDN (Cloudfront)  |  <- serves photos globally
        +---------------------+
                 |
                 v
        +---------------------+
        |    API Gateway      |
        +---------------------+
          /     |       \
         v      v        v
  +---------+ +---------+ +-------------+
  | Upload  | | Feed    | | User/Graph  |
  | Service | | Service | | Service     |
  +---------+ +---------+ +-------------+
     |           |             |
     v           v             v
  +---------+ +---------+ +-------------+
  |   S3    | | Redis   | | Postgres /  |
  | (blobs) | | (feeds, | | Cassandra   |
  +---------+ | counts) | | (users,     |
     |        +---------+ |  graph,     |
     v          ^         |  posts)     |
  +---------+   |         +-------------+
  | Kafka   |   |               |
  |(events) |---+               v
  +---------+              +----------+
     |                     |  Search  |
     v                     | (Elastic)|
  +-----------+            +----------+
  | Transcode |
  |  Workers  |
  +-----------+

API Design

Jsonc

// Step 1: get a presigned upload URL
POST /api/v1/photos/upload-url
Authorization: Bearer <token>

{
    "file_size": 2400000,
    "content_type": "image/jpeg"
}

// Response
{
    "upload_url": "https://s3.../uploads/<uuid>?X-Amz-Signature=...",
    "photo_id": "01HW3M9...",      // ULID
    "upload_expires_in": 900
}

// Step 2: client uploads original directly to S3 (PUT)

// Step 3: client finalizes; this triggers transcoding
POST /api/v1/photos
{
    "photo_id": "01HW3M9...",
    "caption": "Sunset in Lisbon",
    "location": { "lat": 38.7, "lng": -9.1 },
    "tags": ["@friend1", "#sunset"]
}

Jsonc

// Get personalized feed (paginated)
GET /api/v1/feed?limit=20&cursor=<opaque>

// Response
{
    "items": [
        {
            "photo_id": "01HW3M9...",
            "author": { "id": "u1", "handle": "alice", "avatar_url": "..." },
            "caption": "Sunset in Lisbon",
            "urls": {
                "thumbnail": "https://cdn.../t/01HW3M9.jpg",
                "small":     "https://cdn.../s/01HW3M9.jpg",
                "medium":    "https://cdn.../m/01HW3M9.jpg",
                "large":     "https://cdn.../l/01HW3M9.jpg"
            },
            "like_count": 1234,
            "comment_count": 45,
            "created_at": "2026-04-26T10:00:00Z"
        }
    ],
    "next_cursor": "<opaque>"
}

Photo Upload Pipeline (Asynchronous)

The upload is a multi-step process to avoid coupling the user-facing latency to image transcoding.

Text

---------- Photo upload flow ----------
1. Client: POST /photos/upload-url            (get presigned URL)         ~50 ms
2. Client: PUT to S3 directly                 (uploads original 200 KB)   ~200 ms
3. Client: POST /photos with metadata         (creates metadata row)      ~50 ms
4. Upload Service: enqueue Kafka event        photo.created
5. Transcode Worker: pulls original from S3   resizes to 5 versions       ~1-3 sec
6. Transcode Worker: uploads each to S3       under predictable keys
7. Transcode Worker: marks photo as 'ready'   in metadata
8. Upload Service: enqueue Kafka event        photo.ready -> fan-out

Until step 7 the photo exists in metadata but is not visible in feeds. This decouples upload UX from transcoding throughput.

Detailed Design

The two interesting components are the news feed (fan-out) and the photo storage pipeline.

Feed Generation: Fan-out Tradeoffs

This is the central interview question: when Alice posts a photo, how does it reach the feeds of her 1,000 followers? Three strategies; each has a sharp use case.

Strategy 1: Fan-out on Write (Push)

When Alice posts, write the photo_id into a precomputed feed list for each of her 1,000 followers.

Text

---------- Fan-out on write ----------
Alice posts photo P at time T.
Look up Alice's followers: [bob, carol, dave, ... (1,000 ids)]
For each follower:
  Redis ZADD feed:bob T P
  Redis ZADD feed:carol T P
  ...

When Bob opens the app:
  Redis ZREVRANGE feed:bob 0 19  -> instant; 20 photo_ids
  Hydrate photo metadata in batch -> render

Pros: Read latency is sub-10 ms. Bob's feed is precomputed, just fetch and display.

Cons: A user with 100M followers (a celebrity) generates 100M writes per post. Storage explodes (every follower stores a copy of every post_id). Updates are amplified: if a follower unfollows Alice, you need to remove her posts from their feed.

Strategy 2: Fan-out on Read (Pull)

When Bob opens the app, look up everyone Bob follows and merge their recent posts into a feed at read time.

Text

---------- Fan-out on read ----------
Bob follows [alice, carol, dave, ... (300 ids)]
For each followee:
  SELECT photo_id, created_at FROM photos
    WHERE owner = ? AND created_at > <Bob's last read>
    ORDER BY created_at DESC LIMIT 20
Merge results into a single timeline
Return top 20

Pros: No storage amplification. Celebrities are no different from normal users. Easy to support 'I want to add a freshly-followed user's posts to my feed retroactively.'

Cons: Read is expensive. If Bob follows 300 users, that's 300 database queries per feed open. At 150K feed opens/sec that's 45M database queries/sec. Unscalable.

Strategy 3: Hybrid Fan-out (the real answer)

Use fan-out on write for normal users. Use fan-out on read for celebrities (above some follower threshold like 1M).

Text

---------- Hybrid fan-out ----------
Alice posts a photo:
  if Alice.follower_count < 1M:
    fan-out on write to all followers
  else:
    record post in her own timeline only (no fan-out)

Bob opens the app:
  feed = Redis ZREVRANGE feed:bob 0 19          (precomputed normal posts)
  celebs = list of celebrities Bob follows       (~5)
  for celeb in celebs:
    posts = recent posts from celeb              (pull from celeb's timeline)
    merge into feed
  return top 20

This caps fan-out write cost at the 1M threshold, while keeping the read path fast (only ~5 extra pulls for celebrities, not 300).

Defending the Hybrid in an Interview

The magic words: 'pure fan-out on write doesn't work because of the celebrity problem - a single Justin Bieber post would generate 100M writes. Pure fan-out on read doesn't work either because the read fan-out is too expensive at scale. The hybrid bounds both costs.'

Feed Storage in Redis

We store each user's feed as a sorted set keyed by user_id, scored by post timestamp:

Text

Key:    feed:<user_id>
Value:  ZSET of (post_id, timestamp_score)
TTL:    7 days (rebuild on demand for inactive users)
Size:   ~500 entries per active user
Memory: 50M users * 500 entries * ~30 bytes = ~750 GB across cluster

Trim the ZSET to 500 entries to bound memory. Inactive users (no app open in 30 days) get evicted; on next login, rebuild from a fan-out-on-read scan.

Photo Storage Pipeline

Why presigned URLs?

The alternative is uploading through your application servers, which means:

200 KB photo * 3,500 uploads/sec = 700 MB/s of inbound traffic per server cluster.
Each upload occupies a connection slot for ~200 ms.
The server is now in the data path of every photo, which is wasteful.

With presigned URLs:

Client uploads directly to S3 over a connection that S3 absorbs.
Application server returns the URL in ~10 ms and is free for the next request.

Why multiple resolutions?

Mobile clients need a small thumbnail in the feed list (10 KB) and a larger version when the user taps to expand (200 KB). Loading a 1 MB original in the feed wastes bandwidth and slows scroll.

Generate 5 versions during transcoding; let the client pick based on screen size and DPI.

Transcode worker pool

Kafka topic photo.created triggers transcode workers.
Each worker pulls the original, resizes via libvips (faster than ImageMagick at scale), uploads each variant to S3 under a deterministic key like photos/01HW3M9/medium.jpg.
Worker pool autoscales by Kafka lag; at 1,200 uploads/sec average and ~2 sec per transcode, we need ~2,400 worker concurrency. With 1 worker per CPU core, that's ~600 4-core machines.
After all variants are uploaded, the worker writes status='ready' to the photo metadata and emits photo.ready.

CDN

Photos are served from S3 through Cloudfront (or Akamai). CDN cache key is the S3 URL. Set Cache-Control: public, max-age=31536000, immutable because photos under their canonical URL never change. (User edits create a new variant under a new URL.)

Data Model

We split into three storage systems for three different access patterns.

Postgres (sharded): users, follows, photos metadata

SQL

-- users
CREATE TABLE users (
    id          BIGINT PRIMARY KEY,        -- Snowflake ID
    handle      VARCHAR(32) UNIQUE NOT NULL,
    name        VARCHAR(64),
    bio         TEXT,
    avatar_url  VARCHAR(255),
    created_at  TIMESTAMPTZ NOT NULL
);

-- photos metadata
CREATE TABLE photos (
    id          BIGINT PRIMARY KEY,         -- Snowflake (sortable by time)
    owner_id    BIGINT NOT NULL REFERENCES users(id),
    caption     TEXT,
    location    GEOGRAPHY,
    status      VARCHAR(16) NOT NULL,       -- pending | ready | deleted
    created_at  TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_photos_owner_time ON photos (owner_id, created_at DESC);

-- follows (the graph)
CREATE TABLE follows (
    follower_id BIGINT NOT NULL,
    followee_id BIGINT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_followers ON follows (followee_id, created_at DESC);

Shard users, photos, follows by user_id (or follower_id for follows). The follows table is the largest (500M users * average 200 follows = 100B rows).

Cassandra: likes and comments (high-write, append-only)

Text

CREATE TABLE likes (
    photo_id    bigint,
    user_id     bigint,
    created_at  timestamp,
    PRIMARY KEY ((photo_id), user_id)        -- partition by photo_id
) WITH CLUSTERING ORDER BY (user_id ASC);

CREATE TABLE comments (
    photo_id    bigint,
    comment_id  bigint,
    user_id     bigint,
    text        text,
    created_at  timestamp,
    PRIMARY KEY ((photo_id), comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);

Likes can hit thousands per second on a viral photo; Cassandra handles per-partition writes well.

Redis: feeds, counters, online presence

feed:<user_id> -> ZSET of post_ids by timestamp (precomputed feed).
count:photo:<id>:likes -> integer (with periodic flush to Postgres).
count:photo:<id>:comments -> integer.
seen:<user_id> -> highest post_id Bob has seen (for 'new posts' badge).

S3: photo bytes

Text

Bucket: instagram-photos-prod
  photos/<photo_id>/original.jpg   (~500 KB)
  photos/<photo_id>/large.jpg      (~200 KB, 1080x1080)
  photos/<photo_id>/medium.jpg     (~80 KB,  640x640)
  photos/<photo_id>/small.jpg      (~30 KB,  320x320)
  photos/<photo_id>/thumb.jpg      (~10 KB,  150x150)

Lifecycle: move originals to S3 IA after 90 days (originals are rarely re-fetched once variants exist). Variants stay in S3 Standard.

Scaling and Bottlenecks

Hot Photo: a celebrity post hits 10M likes in an hour

Cassandra likes table is partitioned by photo_id; this single partition gets 3K writes/sec for an hour.
Mitigation: Cassandra handles thousands of writes per partition. If it gets worse, write through a Redis HyperLogLog for approximate counts and persist to Cassandra in batches.

Feed Hotspot: viral content

Edge cache the popular post's media for everyone (CDN handles this naturally).
The feed itself is per-user; no single feed key is a hotspot.

Follower-graph Scaling

100B follow rows is too big for a single shard. Shard by follower_id so 'who do I follow' queries hit one shard. 'Who follows me?' (for fan-out on write) requires a denormalized inverse table sharded by followee_id. Maintain both with a write-through pattern.

Cross-region availability

Multi-region active-active: 5 regions (NA, EU, APAC, etc.).
Photos: S3 Cross-Region Replication. CDN routes users to the nearest origin.
Metadata: each region writes to its primary, with async cross-region replication. Writes are stickied to the user's home region for consistency.
Feed cache: regional Redis clusters; a user moving regions warms a new cache lazily.

When fan-out queues back up

If transcoding falls behind, photos appear in feeds without thumbnails. Mitigations:

Autoscale workers off Kafka lag.
Generate the thumbnail synchronously during upload (small enough), defer the larger variants asynchronously.
Show a 'processing' placeholder in the client, swap when ready.

Trade-offs and Alternatives

Hybrid fan-out vs pure pull

We could push the celebrity threshold to 0 (everything pull) and rely on aggressive caching of each celebrity's recent posts. Real Instagram historically used variations of this. The hybrid has a slightly more complex code path but better tail latency for normal users with many followers.

Why not GraphQL?

The feed query has known shape; REST with batched hydration is simpler and cacheable. GraphQL shines when client needs vary widely; for Instagram's app-only world, a tight REST surface wins.

Why not a dedicated graph database (Neo4j, JanusGraph)?

Follows is a simple two-column relation; the queries are 1-hop ('who do I follow?', 'who follows me?'). Sharded SQL is faster and cheaper than a graph database for this. Graph databases shine on multi-hop traversals (friend-of-friend), which Instagram doesn't expose.

Why not Postgres for likes / comments?

Likes have very high write rate per popular photo (thousands per sec). Postgres on a single row would lock-contest. Cassandra (or DynamoDB) shards by photo_id and handles per-partition writes much better.

Strong vs eventual consistency on follower count

We show 'followed by 3.4M people' in the UI. That number is eventually consistent (incremented from a Kafka stream into a counter). A user might see 3.4M while Alice just hit 3,400,001 - fine.

Counter-based view: separate timeline service

Real Instagram has a dedicated Timeline Service (sometimes called 'Pixel') whose only job is feed assembly. We've collapsed it into the Feed Service for clarity, but in practice there's enough complexity (re-ranking, deduplication, ML scoring) to warrant its own team and SLO.

Real-World Examples

How real systems implement this in production

Visual discovery platform with a similar pin-based feed. Uses a heavily customized fan-out-on-write for normal users and a pull-based 'home feed' that re-ranks at read time using ML. Stores images on S3 with regional CDN distribution; metadata in MySQL with manual sharding via Vitess.

Trade-off: Pinterest spends more compute on read-time ML ranking than Instagram historically did because the discovery use case (find new things) requires deeper personalization than a chronological friend feed. The cost is higher feed latency budget (300 ms vs 200 ms).

TikTok

Same multi-resolution media pipeline but for video, with HLS segments instead of JPEG variants. Critically, TikTok feeds are NOT fan-out from people you follow; the For You Page is purely ML-recommended from a global pool. Eliminates the celebrity problem entirely.

Trade-off: TikTok's recommendation-only feed avoids fan-out complexity but requires a massive ML inference platform (described in the TikTok case study). Trades systems complexity for ML complexity.

Snapchat

Stories model with 24-hour TTL. Uses fan-out on write more aggressively because content is ephemeral - storage growth is bounded. Unlike Instagram, Snapchat does not need to maintain a forever-archive of every post.

Trade-off: Ephemeral content means smaller storage costs and simpler retention but no monetizable history. Snapchat trades long-term value for product simplicity and a different user contract.

Mastodon (federated)

Federated social network where each instance runs its own database and connects via ActivityPub. Fan-out happens via inter-server federation (push events to remote followers' instances). No central feed service.

Trade-off: Federation eliminates centralized scaling problems but introduces inter-instance consistency and discovery problems. A celebrity on a small instance can take down the instance through federation traffic. Decentralization has its own scaling pathologies.

Quick Interview Phrases

Key terms to use in your answer

fan-out on write vs fan-out on read

hybrid fan-out

celebrity problem

presigned upload URL

asynchronous transcoding

timeline service

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens from when a user taps the Like button to when their friend sees the updated count.

Client POSTs to a Likes endpoint, which writes to Cassandra (partition by photo_id) and increments a Redis counter. The friend's next feed read pulls the count from Redis (eventually consistent with Cassandra). A background job flushes Redis counters to Postgres metadata every minute for durability and analytics. The like itself is durable in Cassandra immediately; the displayed count may lag by seconds.

How would you handle a user who follows 5,000 people, including 50 celebrities?

How do you ensure that a newly uploaded photo appears in followers' feeds within a few seconds?

Your CDN cache hit rate is 95%. The 5% miss is causing origin overload. How do you fix it?

How would you handle photo deletion?

Interview Tips

How to discuss this topic effectively

Lead with 'this is read-heavy and the feed is the centerpiece' so the interviewer knows you understand the product. Then drive immediately to fan-out.

When asked about fan-out, never commit to pure write or pure read. Say 'hybrid with a celebrity threshold' from the start. Then walk through the math.

Mention the celebrity problem by name. Saying 'a Justin Bieber post would create 100M writes' is the magic phrase that signals you've thought past textbook fan-out.

Always separate the upload pipeline from the feed pipeline in the diagram. Two distinct flows: presigned URL for writes, precomputed list for reads.

Cite real numbers: 500M DAU, 100M uploads/day, 100 KB photos. Numbers anchor every later decision (sharding, caching, CDN sizing).

Common Mistakes

Pitfalls to avoid in interviews

Picking pure fan-out on write without addressing celebrities

Pure fan-out on write means a user with 100M followers generates 100M writes per post. This melts your storage and write pipeline. Always cap fan-out at a follower threshold and pull-on-read for users above that threshold.

Storing photos in the database instead of S3

100 TB/day of photos in a database destroys backup, replication, and IOPS. Photos go in S3 (or equivalent object store); only the metadata (id, owner, caption, S3 key) goes in the database.

Doing transcoding synchronously during upload

Synchronous transcoding adds 1-3 seconds to upload latency and ties application throughput to image processing CPU. Push to a Kafka queue, transcode in workers, mark the photo 'ready' when done. The user gets immediate feedback that the upload succeeded.

Storing a single 'photo URL' instead of multiple resolutions

Mobile feeds need 10 KB thumbnails, not 1 MB originals. Generate 4-5 resolutions during transcoding and let the client pick based on screen DPI. Bandwidth savings dominate any storage cost.

Forgetting that the follower graph has 100B rows and needs sharding

500M users with 200 average follows is 100 billion rows. Shard by follower_id for forward queries and maintain a denormalized inverse table sharded by followee_id for fan-out. Both indexes are needed; neither alone is enough.

Back to System Design