System Design Article

Design YouTube (Video Platform)

Difficulty: Medium

Design a video platform like YouTube with 2 billion users, 500 hours of video uploaded every minute, and 1 billion hours watched per day. The interview centerpiece is the video pipeline: chunked uploads, parallel transcoding to 8 resolutions and 3 codecs, HLS/DASH adaptive streaming over a global CDN, and the metadata service that ties it all together. We also cover recommendations (the secondary feed problem), comment scaling, view-counter accuracy, and how YouTube serves well over 100 Tbps of egress without melting the internet.

Requirements

Functional Requirements

Upload a video (up to several hours, dozens of GB).
Watch a video with adaptive quality (auto-switch between 144p and 4K based on bandwidth).
Like, dislike, comment on videos.
Subscribe to channels; get a subscription feed.
Search videos by title, description, channel.
Recommendations ('next' video, home feed).
View counts (with anti-fraud).

Out of Scope (state explicitly)

Live streaming (different problem; mention briefly).
Monetization, ads, copyright/Content-ID system.
YouTube Shorts (similar to TikTok; covered in TikTok case study).
Premium subscription, music, kids.

Non-Functional Requirements

Scale: 2B users, 500 hours uploaded per minute, 1B hours watched per day.
Read-heavy: orders of magnitude more watch time than upload.
Latency: video starts playing in < 2 seconds (time-to-first-frame); segment requests see < 200 ms RTT to the CDN edge so the player's buffer stays ahead of playback.
Highly available: 99.99%. YouTube outages make global news.
Durability: 11 nines. Lost videos = lost users.
Eventually consistent counters (likes, views) within seconds.

Back-of-the-Envelope Estimation

Upload

Text

---------- Upload volume ----------
Upload rate:                500 hours / minute = 8.3 hours/sec
Average video length:        10 min
Uploads per minute:          500 hr * 60 / 10 = 3000 videos
Uploads per second:          ~50 videos/sec
Uploads per second peak:     ~150 videos/sec (3x)

Bitrate at upload:           ~5-10 Mbps for 1080p source
Ingest bandwidth:            500 hr * 3600 sec / 60 sec = 30,000 video-sec per sec; * 7 Mbps ~= 210 Gbps sustained

Transcoding output

Each uploaded video is transcoded to multiple bitrates and codecs:

Text

---------- Per-video transcoding output ----------
Resolutions: 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 4K = 8 resolutions
Codecs:      H.264 (universal), VP9, AV1 = 3 codecs
Variants:    8 * 3 = 24 outputs per video

Approximate sizes (10 min video, average bitrates):
144p:    ~30 MB
360p:    ~80 MB
720p:    ~250 MB
1080p:   ~500 MB
4K:      ~2.5 GB
Sum across all variants: ~6 GB per 10 min source

Storage

Text

---------- Storage growth ----------
Per day:    500 hr/min * 1440 min/day = 720,000 hours/day
Per day:    720,000 hr * 6 = 4.32M ten-minute videos/day
Storage:    4.32M * 6 GB ~= 26 PB / day of finished video assets
Per year:   ~9.5 EB / year
5 years:    ~47 EB just for videos

With replication (3x):     ~140 EB
(Upper bound: assumes every upload gets all 24 variants, including 4K.)

Long tail observation: 95% of videos have < 1000 views.
Tiered storage:
- Hot (top 5%):  S3 Standard (or equivalent) + CDN edge cache.
- Warm (next 20%): S3 IA, fetched on demand.
- Cold (bottom 75%): Glacier-like; fetch is slow but cheap.

Egress

Text

---------- Egress bandwidth ----------
Watched hours/day:    1B
Average bitrate:      ~3 Mbps (mix of mobile + desktop)
Daily egress bytes:   1B * 3600 sec * 3 Mbps / 8 = 1.35 EB / day
Sustained egress:     ~125 Tbps
Peak (2x):            ~250 Tbps

Essentially all egress is served from the CDN; origin sees a tiny fraction (< 1%).
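
The egress arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the constants are the article's assumed inputs, and the function names are made up for illustration.

```python
# Back-of-the-envelope egress check; inputs are the article's estimates.
WATCH_HOURS_PER_DAY = 1_000_000_000
AVG_BITRATE_BPS = 3_000_000          # ~3 Mbps average across devices
SECONDS_PER_DAY = 86_400


def daily_egress_bytes(watch_hours: int, bitrate_bps: int) -> float:
    """Total bytes delivered per day."""
    return watch_hours * 3600 * bitrate_bps / 8


def sustained_egress_bps(watch_hours: int, bitrate_bps: int) -> float:
    """Average bits per second if viewing were spread evenly over the day."""
    return daily_egress_bytes(watch_hours, bitrate_bps) * 8 / SECONDS_PER_DAY


if __name__ == "__main__":
    eb = daily_egress_bytes(WATCH_HOURS_PER_DAY, AVG_BITRATE_BPS) / 1e18
    tbps = sustained_egress_bps(WATCH_HOURS_PER_DAY, AVG_BITRATE_BPS) / 1e12
    print(f"{eb:.2f} EB/day, {tbps:.0f} Tbps sustained, {2 * tbps:.0f} Tbps at 2x peak")
```

Changing the average bitrate is the quickest way to see how sensitive the total is: the result scales linearly with it.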

High-Level Design

Text

---------- High-level architecture ----------
     +----------+
     |  Client  |
     +----------+
          |
          v
  +-------------+
  |  CDN edge   |  <- 99% of segment fetches served here
  +-------------+
     |     |
     v     v
  +----------+   +-----------------+
  | Origin   |   |  Metadata API   |
  | Storage  |   +-----------------+
  | (S3)     |          |
  +----------+   +------+------+
     ^           |             |
     |           v             v
+----+-----+ +----------+ +----------+
| Transcode| | Postgres | |  Search  |
| Workers  | | (videos, | | (Elastic)|
+----------+ |  users)  | +----------+
     ^       +----------+
     |             ^
     |             |
+----+-----+    +----------+
| Kafka    |<---| Recommend|
| (events) |    | Service  |
+----------+    +----------+
     ^
     |
+----------+
| Upload   |
| Service  |
+----------+
     ^
     |
  Client (chunked PUTs)

API Design

Jsonc

// 1. Initiate upload
POST /api/v1/videos/upload-init
{ "file_size": 5000000000, "content_type": "video/mp4", "title": "My talk" }
// Response
{
    "video_id": "01HW...",
    "upload_session": "sess_abc",
    "chunk_size": 8388608,             // 8 MB
    "upload_url_pattern": "https://s3.../uploads/<video_id>/chunk_{n}?X-Amz-Signature=..."
}

// 2. Client PUTs chunks in parallel to S3 directly

// 3. Finalize upload
POST /api/v1/videos/upload-complete
{ "video_id": "01HW...", "upload_session": "sess_abc", "chunks": 597 }
// Response
{ "video_id": "01HW...", "status": "processing" }

// 4. Watch a video (returns the master manifest)
GET /api/v1/videos/:id/manifest.m3u8
// Response: HLS master playlist with variants for each bitrate

// 5. Client picks a bitrate, fetches segment manifests, then segments
GET https://cdn.../v/01HW.../1080p/segment_001.ts

Watch Path (the hot read)

Client GETs video metadata + master manifest.
Client picks initial bitrate based on bandwidth heuristic, fetches variant manifest.
Client streams 6-second segments from CDN, switching variants up/down based on real-time bandwidth measurement.
View event sent fire-and-forget to View Counter Service after the user watches > 30 seconds (anti-fraud threshold).
Player asynchronously fetches recommendations for 'up next' from Recommend Service.

Text

---------- Time-to-first-frame budget (target < 2 sec) ----------
DNS + connect to CDN:           50 ms
Fetch master manifest:          80 ms
Fetch variant manifest:         80 ms
Fetch first segment:           500 ms (depends on bitrate)
Video decode + first frame:    100 ms
Total:                        ~810 ms (well under 2 sec)

Detailed Design

The two interesting components are the video upload + transcoding pipeline and the adaptive streaming + CDN strategy.

Upload Pipeline

Why chunked uploads?

Video files are huge (1 GB - 50 GB+). Single-PUT uploads:

Time out on slow connections.
Cannot resume on failure (start over from byte 0).
Are capped by what a single TCP connection can carry (congestion control throttles it on lossy or high-latency links).

Chunked uploads (multipart upload in S3 terminology):

8-100 MB per chunk; client uploads chunks in parallel.
Resumable: a failed chunk just retries.
Faster: parallel TCP connections fill more bandwidth.

Resumability

Client persists upload_session and chunks_uploaded locally. On reconnect, asks server which chunks are missing and retries them. S3 multipart upload is built for this.
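
A minimal client-side sketch of chunk planning and resume, assuming fixed-size chunks and a server that reports which chunk indexes it has already received. The names `Chunk`, `plan_chunks`, and `chunks_to_retry` are hypothetical and not part of any real SDK.

```python
from typing import Iterable, List, NamedTuple


class Chunk(NamedTuple):
    index: int      # 0-based chunk number
    start: int      # first byte offset (inclusive)
    end: int        # offset one past the last byte (exclusive)


def plan_chunks(file_size: int, chunk_size: int) -> List[Chunk]:
    """Split [0, file_size) into fixed-size chunks; the last one may be short."""
    if file_size <= 0 or chunk_size <= 0:
        raise ValueError("file_size and chunk_size must be positive")
    return [
        Chunk(i, start, min(start + chunk_size, file_size))
        for i, start in enumerate(range(0, file_size, chunk_size))
    ]


def chunks_to_retry(plan: List[Chunk], acknowledged: Iterable[int]) -> List[Chunk]:
    """Return the chunks the server has not confirmed, in upload order."""
    done = set(acknowledged)
    return [c for c in plan if c.index not in done]


if __name__ == "__main__":
    plan = plan_chunks(5_000_000_000, 8 * 1024 * 1024)
    # Suppose the server reports every chunk except 3 and 10 as received.
    received = [i for i in range(len(plan)) if i not in (3, 10)]
    print([c.index for c in chunks_to_retry(plan, received)])
```

Because the plan is a pure function of file size and chunk size, the client can rebuild it after a restart and only needs to persist the session ID.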

Transcoding Pipeline

This is the most interesting part of YouTube's backend.

The naive approach (don't do this)

After upload completes, run a single ffmpeg job that produces all 24 variants. For a 60-min source, this takes hours per video on one machine. With 3000 uploads/min, you need an absurd amount of compute, and a single failure restarts the whole thing.

The chunked transcoding approach

Text

---------- Transcoding pipeline ----------
1. Splitter: splits source video into 30-second GOP-aligned chunks.
   - 60 min video -> 120 chunks.
2. Each chunk goes into Kafka topic transcode.chunk.
3. Transcode Workers (autoscaled fleet, on the order of 100K+ instances):
   - Each pulls a chunk, transcodes to ALL 24 variants in parallel ffmpeg processes.
   - Writes outputs to S3 under deterministic keys.
   - Emits transcode.chunk.done.
4. Stitcher: when all 120 chunks * 24 variants are done, assembles the manifests.
   - Master playlist (.m3u8) lists variants.
   - Variant playlist lists chunks (URLs in CDN).
5. Marks video status='ready' in metadata.
6. Emits video.ready event for fan-out (notify subscribers).

Key insight: transcode in parallel by chunks, not by video. A 60-min video uses 120 workers concurrently and finishes in the time of one chunk transcode (~1-2 min) instead of hours. The cost is the same total compute, but the wall clock is bounded.
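
The stitcher's "are we done yet?" check can be sketched as a small tracker. This assumes one completion event per (chunk, variant) output and at-least-once delivery from the queue, so duplicates must be harmless. `StitchTracker` is a hypothetical name, and a production version would persist this state rather than hold it in memory.

```python
from typing import Set, Tuple


class StitchTracker:
    """Tracks which (chunk, variant) outputs have finished for one video."""

    def __init__(self, num_chunks: int, variants: Tuple[str, ...]):
        self.num_chunks = num_chunks
        self.variants = set(variants)
        self.expected = num_chunks * len(self.variants)
        self.done: Set[Tuple[int, str]] = set()

    def on_output_done(self, chunk_index: int, variant: str) -> bool:
        """Record one completion event.

        Returns True exactly once: when the last missing output arrives.
        Duplicate events are ignored and return False.
        """
        if not (0 <= chunk_index < self.num_chunks) or variant not in self.variants:
            raise ValueError(f"unexpected output: chunk={chunk_index} variant={variant}")
        key = (chunk_index, variant)
        if key in self.done:
            return False
        self.done.add(key)
        return len(self.done) == self.expected
```

Returning True only on the transition to "complete" means the caller can trigger manifest assembly without a separate "already stitched" flag.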

GOP alignment

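Each chunk must start on a keyframe and contain whole Groups of Pictures (GOPs) so it can be decoded independently. The splitter cuts at the existing keyframe nearest each 30-second boundary, which is a cheap stream copy. Only when the source has very sparse keyframes does it need to re-encode around the cut to force one.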
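
One way to pick GOP-aligned boundaries is to snap each ideal 30-second cut to the nearest existing keyframe. The sketch below assumes the keyframe timestamps have already been extracted from the source (for example by a probing tool); `choose_cut_points` is a hypothetical helper, not part of any library.

```python
from bisect import bisect_left
from typing import List, Sequence


def choose_cut_points(keyframes: Sequence[float], duration: float,
                      target: float = 30.0) -> List[float]:
    """Pick chunk boundaries at existing keyframes closest to multiples of `target`.

    `keyframes` must be sorted timestamps in seconds and start at 0.0.
    Returns boundaries [0.0, ..., duration]; every chunk starts on a keyframe.
    """
    if not keyframes or keyframes[0] != 0.0:
        raise ValueError("first keyframe must be at t=0")
    cuts = [0.0]
    ideal = target
    while ideal < duration:
        i = bisect_left(keyframes, ideal)
        # Candidates: the keyframe just before and the one at or after the ideal cut.
        candidates = keyframes[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda t: abs(t - ideal))
        if best > cuts[-1]:          # sparse keyframes may yield no new cut
            cuts.append(best)
        ideal += target
    if cuts[-1] < duration:
        cuts.append(duration)
    return cuts
```

With sparse keyframes the chunks come out uneven, which is acceptable: chunk length only affects how evenly work spreads across workers, not correctness.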

Codec selection

H.264: universal compatibility, with hardware decode on virtually every device. The default.
VP9: ~30% smaller than H.264 at same quality. Supported on most modern browsers.
AV1: ~50% smaller than H.264. Slow to encode (10x H.264). Saves bandwidth at scale; only worth it for popular videos.

Real YouTube reserves AV1 for its most-watched videos because of the encoding cost. Long-tail videos (95% of uploads but only about 5% of watch hours) get H.264 only.

Adaptive Bitrate Streaming

HLS (HTTP Live Streaming)

Master playlist (.m3u8) lists variants:

Text

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=256x144
v/01HW.../144p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
v/01HW.../720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080
v/01HW.../1080p/playlist.m3u8

Variant playlist lists segments (chunks):

Text

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
segment_001.ts
#EXTINF:6.0,
segment_002.ts
...
#EXT-X-ENDLIST

Client downloads a segment, measures throughput, decides whether to step up/down for the next segment.
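
On the producing side, the stitcher has to render these playlists. A minimal sketch of master playlist generation follows, assuming each variant lives in a directory named after its rendition next to the master file. `Variant` and `build_master_playlist` are hypothetical names.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    name: str          # e.g. "720p"; used as the directory name
    bandwidth: int     # peak bits per second advertised to the player
    width: int
    height: int


def build_master_playlist(variants: List[Variant]) -> str:
    """Render an HLS master playlist with one entry per variant, lowest bitrate first."""
    lines = ["#EXTM3U"]
    for v in sorted(variants, key=lambda v: v.bandwidth):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v.bandwidth},RESOLUTION={v.width}x{v.height}"
        )
        lines.append(f"{v.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_master_playlist([
        Variant("720p", 2_500_000, 1280, 720),
        Variant("144p", 400_000, 256, 144),
    ]))
```

Relative URIs keep the playlist independent of the CDN hostname, so the same file works from origin and from any edge.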

MPEG-DASH

Same concept, different format (XML manifest, .mpd). Codec-agnostic, with good support for DRM and multiple audio tracks. YouTube primarily uses DASH; HLS is the iOS/Safari path.

Why both?

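Safari on iOS plays HLS natively, so HLS is the path of least resistance there. Chrome plays neither format natively; JavaScript players built on Media Source Extensions handle both, and DASH is the more common choice for newer codecs. YouTube serves whichever the client requests. Both can reference the same underlying transcoded segments when those are packaged in a shared container such as fragmented MP4 (CMAF); only the manifest format differs.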

Client-side bitrate adaptation

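Python

class AbrSelector:
    """Throughput-based adaptive bitrate selector (simplified)."""

    def __init__(self, variants):
        # Variants sorted ascending by advertised bandwidth (bps).
        self.variants = sorted(variants, key=lambda v: v.bandwidth)
        self.current = 0             # index into self.variants
        self.bandwidth_estimate = 0  # bps

    def on_segment_complete(self, segment_size_bytes, download_time_sec):
        if download_time_sec <= 0:
            return  # ignore bogus timings instead of dividing by zero
        self.bandwidth_estimate = (segment_size_bytes * 8) / download_time_sec
        # Pick the highest variant whose bandwidth <= 0.8 * estimate (safety margin);
        # fall back to the lowest variant when nothing fits.
        budget = self.bandwidth_estimate * 0.8
        self.current = max(
            (i for i, v in enumerate(self.variants) if v.bandwidth <= budget),
            default=0,
        )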
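
A single-sample estimate reacts to every network hiccup, which can make the player flip between variants. One common mitigation, shown here as a sketch rather than what any particular player does, is to smooth the measurements with an exponentially weighted moving average. `SmoothedBandwidth` and the default `alpha` of 0.3 are assumptions chosen for illustration.

```python
class SmoothedBandwidth:
    """Exponentially weighted moving average of measured throughput (bps)."""

    def __init__(self, alpha: float = 0.3):
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.estimate_bps = 0.0

    def update(self, segment_size_bytes: int, download_time_sec: float) -> float:
        """Fold one segment download into the estimate and return it."""
        if download_time_sec <= 0:
            return self.estimate_bps          # ignore bogus timing samples
        sample = segment_size_bytes * 8 / download_time_sec
        if self.estimate_bps == 0.0:
            self.estimate_bps = sample        # first sample seeds the average
        else:
            self.estimate_bps = (
                self.alpha * sample + (1 - self.alpha) * self.estimate_bps
            )
        return self.estimate_bps
```

A higher `alpha` follows the network more closely; a lower one gives steadier quality at the cost of reacting more slowly to a real drop in bandwidth.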

CDN Strategy

Edge caching

Video segments are immutable (URL includes a content hash or version). Set Cache-Control: public, max-age=31536000, immutable. The CDN can then cache them for a full year without revalidation, subject only to eviction.

Tiered cache

Edge (close to user): caches segments seen recently. Hit rate ~90% for popular content.
Regional shield: catches edge misses. Hit rate ~99% combined.
Origin (S3): serves only requests that miss both tiers. With the regional shield in place, origin sees about 1% of traffic or less, even when edges are cold.
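
How the two hit rates combine is easy to get wrong, so here is the arithmetic as a small sketch. It assumes the shield's hit rate is measured only on requests that missed the edge; `origin_fraction` is a hypothetical helper.

```python
def origin_fraction(edge_hit: float, shield_hit: float) -> float:
    """Fraction of requests that miss both cache tiers and reach origin.

    `shield_hit` is the hit rate measured on edge misses only.
    """
    for rate in (edge_hit, shield_hit):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    return (1.0 - edge_hit) * (1.0 - shield_hit)


if __name__ == "__main__":
    frac = origin_fraction(edge_hit=0.90, shield_hit=0.90)
    print(f"combined hit rate {1 - frac:.0%}, origin sees {frac:.0%}")
```

Under these assumptions a 90% edge hit rate plus a shield that catches 90% of edge misses gives the ~99% combined figure.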

Pre-warming

When a video is predicted to be popular (the uploader has many subscribers, or similar videos went viral), pre-push the most-requested variants (480p, 720p, 1080p H.264) to all edges. This is cheaper than serving the first million views as cache misses.

Cold tail strategy

Long-tail videos (< 1000 views/day) get NO pre-warming. First view is a cache miss; subsequent views in that region cache normally. Acceptable because cold-tail watch time is small.

View Counter Service

View counts at YouTube scale require:

High write throughput (hundreds of thousands of view events/sec at peak).
Anti-fraud (don't count refresh-spammers, bots).
Eventually consistent display (within minutes is fine).

Design:

Player sends view event after watching > 30 seconds.
Event lands in Kafka.
View Counter consumer:
  a. Dedupe per (video_id, user_id, day) using a Bloom filter or Redis SET.
  b. Anti-fraud: rate-limit per IP (more than 10 views/min from a single IP is suspicious).
  c. Increment a Redis counter; periodically flush to Postgres.
Display 'about 1.2M views' (rounded) for counts > 1000 to absorb minor inaccuracies.
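
The consumer logic in the design above can be sketched with in-memory stand-ins. A real deployment would keep the dedupe set and counters in Redis and read events from Kafka; plain Python containers are used here so the example runs on its own. `ViewCounter` and its method names are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Set, Tuple


class ViewCounter:
    """In-memory stand-in for the view-event consumer."""

    def __init__(self, max_views_per_ip_per_min: int = 10):
        self.limit = max_views_per_ip_per_min
        self.seen: Set[Tuple[str, str, str]] = set()          # (video, user, day)
        self.ip_hits: Dict[str, Deque[float]] = defaultdict(deque)
        self.counts: Dict[str, int] = defaultdict(int)

    def _ip_allowed(self, ip: str, now: float) -> bool:
        hits = self.ip_hits[ip]
        while hits and now - hits[0] >= 60.0:   # drop hits older than one minute
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def handle(self, video_id: str, user_id: str, ip: str, day: str,
               now: Optional[float] = None) -> bool:
        """Process one view event; return True if it was counted."""
        now = time.time() if now is None else now
        key = (video_id, user_id, day)
        if key in self.seen:                 # already counted for this day
            return False
        if not self._ip_allowed(ip, now):    # too many views from this IP
            return False
        self.seen.add(key)
        self.counts[video_id] += 1
        return True
```

Dedupe runs before the rate limit so that replayed events for an already-counted view do not eat into an IP's budget.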

Recommendation Service

The full ML pipeline is its own beast (covered in the Recommendation Systems advanced lesson). For YouTube design, the architectural notes:

Recommendations are precomputed per user offline (batch ML pipeline).
Stored as Redis lists keyed by user_id.
Re-ranked at read time with online signals (recently watched, current session).
Cold-start for new users: trending videos by region.
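
The simplest form of the read-time step is a filter over the precomputed list with a trending fallback. The sketch below assumes the three inputs have already been fetched (for example from the Redis list and a per-region trending cache). `rerank` is a hypothetical function, and a real re-ranker would score candidates rather than only filter them.

```python
from typing import Iterable, List, Sequence


def rerank(precomputed: Sequence[str], recently_watched: Iterable[str],
           trending: Sequence[str], limit: int = 10) -> List[str]:
    """Filter the offline list with session signals; pad with trending videos.

    New users have an empty `precomputed` list, so they get trending only.
    """
    if limit <= 0:
        return []
    watched = set(recently_watched)
    result: List[str] = []
    for video_id in list(precomputed) + list(trending):
        if video_id in watched or video_id in result:
            continue                      # skip already-seen videos and duplicates
        result.append(video_id)
        if len(result) == limit:
            break
    return result
```

Keeping this step cheap matters because it runs on every home-feed and 'up next' request, while the expensive model runs offline.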

Data Model

Postgres (sharded by video_id): video metadata

SQL

CREATE TABLE videos (
    id              BIGINT PRIMARY KEY,
    channel_id      BIGINT NOT NULL,
    title           VARCHAR(100) NOT NULL,
    description     TEXT,
    duration_sec    INTEGER,
    status          VARCHAR(16),       -- uploading | processing | ready | deleted | takedown
    visibility      VARCHAR(16),       -- public | unlisted | private
    upload_at       TIMESTAMPTZ,
    published_at    TIMESTAMPTZ,
    view_count      BIGINT DEFAULT 0,  -- denormalized; eventually consistent
    like_count      BIGINT DEFAULT 0,
    s3_master_key   VARCHAR(255)        -- pointer to manifest in S3
);

Cassandra: comments, likes

Text

CREATE TABLE video_comments (
    video_id    bigint,
    comment_id  bigint,
    user_id     bigint,
    body        text,
    parent_id   bigint,
    created_at  timestamp,
    PRIMARY KEY ((video_id), comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);

CREATE TABLE video_likes (
    video_id  bigint,
    user_id   bigint,
    value     tinyint,         -- +1 like, -1 dislike, 0 neutral
    PRIMARY KEY ((video_id), user_id)
);

S3: video segments and manifests

Text

Bucket: youtube-videos-prod
  videos/<video_id>/master.m3u8
  videos/<video_id>/master.mpd
  videos/<video_id>/144p/playlist.m3u8
  videos/<video_id>/144p/segment_001.ts
  videos/<video_id>/144p/segment_002.ts
  videos/<video_id>/720p/segment_001.ts
  videos/<video_id>/4k/segment_001.ts
  ...

Storage class lifecycle:

0-7 days: Standard (high access expected).
7-90 days: Standard-IA (access drops sharply).
90+ days for cold tail: Glacier Instant Retrieval (still servable on cache miss, just slower).

Redis: hot counts, dedup, recommendations

view_count:<video_id> -> integer (live, flushed to Postgres every 60 sec).
view_dedup:<video_id>:<day> -> Bloom filter or SET of user_ids.
recs:<user_id> -> LIST of recommended video_ids (TTL 24h).

Scaling and Bottlenecks

Viral video: 100M views in a day

CDN absorbs essentially 100%. Each edge serves a small slice; segment hit rate is ~99.99% for hot content.
View counter handles 100M events/day = 1,200/sec average. Trivial for Kafka + Redis.
Comment writes: hot video can have 10K comments/min. Cassandra absorbs partition writes.

Live event: 100M concurrent viewers

(Live streaming is technically out of scope, but commonly asked.)

Different ingest path: live transcoder generates segments in real time with very low latency (~3 sec).
CDN edge caches segments for the chunk duration (6 sec). With a 6-sec TTL and 100M concurrent viewers, each segment is requested ~100M times, and almost all of those requests are served from edge caches.
Use a hierarchical CDN: edges fan in to regional caches that fan in to origin.

Storage cost is the dominant cost

Tiered storage: 95% of bytes go to IA or Glacier within 90 days.
Aggressive deduplication: identical uploads (re-uploads of the same source) detected by perceptual hash and served from a single set of segments.
Newer codecs (AV1) reduce bytes per video at the cost of encode CPU. Use for the popular 5%; H.264 only for the long tail.

Transcoding fleet sizing

Ingest: ~50 videos/sec average, 150/sec peak. Each video chunk transcodes in ~1-2 min. Average video has ~20 chunks (10 min / 30 sec per chunk).

Text

---------- Transcode worker math ----------
50 videos/sec * 20 chunks * 24 variants = 24,000 chunk-variant transcodes/sec
Each takes ~30 sec on one core (one variant at 1x speed)
Needed concurrency: 24,000 * 30 = 720,000 core-seconds/sec = 720,000 busy cores
Fleet size: 720K cores / 4 cores per box = 180,000 boxes (rough order)

That's a massive fleet, which is why YouTube spends serious money on transcoding. Mitigations: cheaper per-pixel encoders, hardware-accelerated transcoding (VP9 ASICs, AV1 ASICs), processing only popular content into expensive codecs.
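
The fleet math is an application of Little's law (busy servers = arrival rate times service time). A sketch follows, assuming each chunk-variant transcode keeps one core busy for about 30 seconds, which is the reading under which 720K cores / 4 cores per box holds. `transcode_fleet` is a hypothetical helper.

```python
import math


def transcode_fleet(videos_per_sec: float, chunks_per_video: int, variants: int,
                    core_seconds_per_transcode: float, cores_per_box: int) -> dict:
    """Steady-state fleet size: busy cores = arrival rate * service time."""
    transcodes_per_sec = videos_per_sec * chunks_per_video * variants
    busy_cores = transcodes_per_sec * core_seconds_per_transcode
    return {
        "transcodes_per_sec": transcodes_per_sec,
        "busy_cores": busy_cores,
        "boxes": math.ceil(busy_cores / cores_per_box),
    }


if __name__ == "__main__":
    print(transcode_fleet(50, 20, 24, 30, 4))
    # Dropping to 8 variants (H.264 only) for the long tail shrinks the fleet threefold.
    print(transcode_fleet(50, 20, 8, 30, 4))
```

The variant count is the most sensitive input, which is why restricting expensive codecs to popular content is the first mitigation listed above.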

Trade-offs and Alternatives

Why HLS + DASH instead of progressive download?

Progressive download (a single MP4 file) doesn't adapt to bandwidth. Buffering on a slow connection means waiting indefinitely. Adaptive streaming switches down to 360p so the user keeps watching, even on a 1 Mbps connection.

Why so many resolutions?

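More variants = better adaptation = fewer rebuffers. 8 resolutions is a lot, and each adds ~12% to storage on average, although the top rungs dominate: by the sizes estimated earlier, 4K alone is larger than all the sub-1080p rungs combined. Bitrate ladder optimization is its own field. Real YouTube tunes ladders per content type (cartoon vs nature documentary).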

Why chunked transcoding instead of GPU per video?

GPU transcoding is faster per machine but doesn't parallelize one video across multiple GPUs cleanly. Chunking lets you use thousands of CPU cores simultaneously, completing a 1-hour video in 1-2 minutes wall-clock. GPUs are used for AV1 (where they're 10x faster than CPU) but the architectural pattern is still chunked.

Comments at scale: Cassandra vs Postgres

Viral video comments hit thousands of writes/sec on a single video. Cassandra partitioned by video_id absorbs this with append-only writes. A single Postgres shard would run into write contention. The cost: no JOINs on comments, so we hydrate user info separately.

View count accuracy

We undercount slightly (Bloom filter false positives, anti-fraud filtering). For a 1B-view video, undercounting by 0.1% doesn't matter; we round display anyway. For monetization (per-view payout) we'd need exact counts via deduplicated event logs.

Why not BitTorrent / P2P delivery?

P2P video distribution exists (PeerTube uses WebTorrent) but:

Mobile devices don't seed (battery, data plan).
Users don't tolerate the latency variability.
Operating costs of CDN are predictable; P2P quality of service isn't.

Real YouTube has explored P2P for live streaming (where redundancy matters) but stuck with CDN for VOD.

Single canonical bitrate ladder vs per-video tuning

A fixed ladder (144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 4K at fixed bitrates) is operationally simple. Per-video ladders (using ML to pick the optimal bitrates per content type) save 20-30% bandwidth. Real YouTube does some per-video tuning; the canonical ladder is the fallback.

Real-World Examples

How real systems implement this in production

Netflix

Video-on-demand platform serving 250M+ subscribers with a similar pipeline: upload, transcode to many variants, distribute via CDN. Critical difference: Netflix uploads come from studios (not users), so per-title encoding can be ML-optimized for hours per title, not minutes.

Trade-off: Netflix optimizes per-title bitrate ladders to save 20-30% bandwidth, which justifies the longer encode time. YouTube cannot afford that level of effort on every upload because user uploads are too frequent. It comes down to encoder optimization vs throughput: Netflix wins on quality, YouTube wins on volume.

Twitch

Live video platform with very low latency (<3 sec stream-to-viewer for the lowest-latency mode). Uses HLS but with smaller segment sizes (2 sec) and aggressive prefetch. Transcoding happens in real time at ingest.

Trade-off: Lower latency means smaller segments and less buffer headroom. Twitch trades robustness on bad networks (more rebuffer events) for the live interactivity that's its core product. Live and VOD have different optimization targets.

Vimeo

Premium video host focused on creators and businesses. Same pipeline (chunked upload, multi-variant transcode, CDN distribution) but with extra emphasis on quality (better default ladders) and customization (custom players, no ads).

Trade-off: Vimeo trades scale (millions of users vs YouTube's billions) for quality and customization. Smaller fleet, fewer constraints, more polished UX. Same architecture, different optimization knobs.

TikTok

Short-form video (15-60 sec) with a different access pattern: most videos are watched many times in a short window, then forgotten. Transcoding is simpler (shorter clips, fewer variants). Recommendation is far more aggressive (the For You Page is the entire product).

Trade-off: TikTok's bounded video length simplifies the pipeline (no need to tier old long videos) but shifts complexity to the recommendation engine. Long-form video (YouTube) and short-form (TikTok) have different scaling pathologies.

Quick Interview Phrases

Key terms to use in your answer

chunked upload with multipart

GOP-aligned chunked transcoding

adaptive bitrate streaming

HLS and DASH manifests

tiered CDN cache

long-tail tiered storage

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens from the moment a user uploads a 5 GB video to when their friend can watch it.

(1) Client requests upload-init; gets a session and chunked upload URLs. (2) Client uploads ~600 chunks of 8 MB in parallel to S3 directly via presigned URLs. (3) Client posts upload-complete; status flips to 'processing'. (4) Splitter splits source into 30-sec GOP-aligned chunks; emits Kafka events. (5) Transcode workers transcode each chunk to 24 variants in parallel, write outputs to S3. (6) Stitcher assembles manifests after all chunks done. (7) Status flips to 'ready'. (8) Friend opens the video; metadata API returns the manifest URL; player streams adaptive segments from CDN. End-to-end: minutes for upload + 1-3 min for transcode + immediate playback once ready.

How do you handle a viral video that gets 100M views in 24 hours?

How would you ensure low time-to-first-frame on a slow mobile network?

How do you detect view fraud (botting, refresh-spamming)?

How would you support live streaming on top of this design?

Interview Tips

How to discuss this topic effectively

First sentence: 'Storage and egress are the two cost centers, and they're both dominated by long-tail behavior.' This frames the entire conversation around what actually matters at YouTube scale.

Always describe the chunked transcoding pipeline. It's the canonical example of horizontal-scale batch work and demonstrates you've thought about parallelism.

Mention HLS vs DASH explicitly and pick both (different clients need different formats). Saying just 'I'll stream the video' is an instant downgrade.

Bring up tiered storage for the cold tail. 95% of videos have < 1000 views; storing them in S3 Standard is wasteful. This shows cost awareness.

Decouple view count from playback. Saying 'the player increments a counter on play' fails the scale test. Always: fire-and-forget event, asynchronous counter.

Common Mistakes

Pitfalls to avoid in interviews

Doing all transcoding for a video on a single machine

A 60-minute video has 120 chunks; transcoded to 24 variants on one machine, that's hours per video. Split into 30-second chunks, transcode each in parallel across thousands of workers, then stitch the manifest. Wall-clock drops from hours to minutes.

Serving the full video file as a single download

Progressive downloads don't adapt to bandwidth changes. On a slow connection the user waits indefinitely or gives up. Adaptive bitrate streaming (HLS/DASH) lets the player switch resolutions per segment, keeping playback going.

Storing all videos in S3 Standard regardless of view count

95% of videos have < 1000 lifetime views but consume 75% of storage. Tier them to S3 IA after 7 days and Glacier after 90 days. Bandwidth cost when an old video gets viewed is small compared to constant storage cost for billions of cold videos.

Counting views synchronously by incrementing a database column

View counts hit thousands per second on viral videos. Synchronous counters suffer lock contention and limit throughput. Send fire-and-forget events to Kafka, dedupe and increment in Redis, periodically flush to durable storage. Display 'about 1.2M' to absorb eventual consistency.

Forgetting that the CDN does 99% of egress

At 100+ Tbps egress, your origin couldn't possibly serve directly. The CDN with edge + regional shield absorbs essentially all traffic. Origin sees < 1%. Designs that route playback through your application servers don't work at video scale.
