System Design

0 lessons
70 system designs
2 behavioral interviews
28 community items

70 articles
System Design

SQL vs NoSQL - Choosing the Right Database

SQL vs NoSQL is the most common storage decision in system design interviews. SQL databases give you ACID guarantees, joins, and a fixed relational schema; NoSQL databases give you flexible schemas, horizontal scaling, and specialized data models. This lesson teaches you the four NoSQL families, the real engineering trade-offs, and a clear decision framework so you can defend your database choice in any interview.

sql
nosql
database
acid
data-modeling
horizontal-scaling
system-design
beginner

525

15

Easy
System Design

Database Indexing & Query Optimization

Indexes turn O(N) full-table scans into O(log N) lookups, but every index costs storage and slows writes. This lesson teaches how B-tree and hash indexes work, when to use composite or covering indexes, how to read an EXPLAIN plan, and the common indexing mistakes that cause production outages. By the end you can defend any indexing decision in an interview and diagnose a slow query in production.

database-indexing
query-optimization
sql
btree
performance
database
system-design
beginner

332

10

Easy
System Design

Database Replication (Leader-Follower, Multi-Leader)

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

database-replication
leader-follower
consistency
availability
distributed-systems
failover
system-design
intermediate

204

3

Medium
System Design

Database Sharding & Partitioning Strategies

Sharding splits a database into many smaller pieces (shards) so writes and storage can scale across servers. The hard part is not the splitting; it is choosing a shard key that avoids hot shards, supporting cross-shard queries, and rebalancing as the data grows. This lesson covers the four sharding strategies, how to pick a shard key, the operational realities of resharding, and when sharding is the wrong answer.

data-partitioning
partitioning
database
horizontal-scaling
consistent-hashing
sql
system-design
intermediate

282

5

Medium
System Design

Blob Storage, Object Stores & CDNs

Databases are the wrong place to store large unstructured files - photos, videos, backups, logs. Object stores like S3 give you cheap, durable, virtually unlimited storage for blobs, while CDNs cache that content at edge locations close to users. This lesson covers the object-storage data model, multipart upload, storage classes, presigned URLs, and how a CDN turns a globally slow origin into a globally fast experience. By the end you can design the media layer for any social, video, or e-commerce system.

blob-storage
object-storage
cdn
content-delivery-network
s3
caching
system-design
intermediate

698

20

Medium
System Design
Premium

Data Warehousing, Data Lakes & OLAP vs OLTP

OLTP databases are built for fast single-row reads and writes; analytical queries against them choke. This lesson covers why analytics needs its own storage stack: column-oriented warehouses, lake formats, and lakehouse engines that scan billions of rows in seconds. You'll learn the OLTP versus OLAP trade-off, dimensional modeling (star schema), ETL versus ELT, change data capture, and how a modern data platform separates compute from storage so you can query petabytes for the cost of a coffee.

data-warehouse
data-lake
olap
oltp
etl
sql
system-design
advanced

503

14

Hard
System Design

Caching Fundamentals (Write-Through, Write-Back, Write-Around)

A cache is a small, fast store that holds copies of data so the next request does not pay the cost of fetching it from the source of truth. This lesson covers what a cache is, where it lives in a stack, the five read and write patterns you will be asked about (cache-aside, read-through, write-through, write-back, write-around), eviction policies, and the failure modes (stampedes, hot keys, stale data) that bite real systems. By the end you can pick a caching strategy and defend it in an interview.
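
To make the read and write patterns concrete, here is a minimal in-memory sketch of a cache-aside read path with write-through and write-around writes. The `Store` and `CachedStore` classes are hypothetical names invented for this illustration, plain dicts stand in for a real cache and database, and write-back is left out because it needs a background flush.

```python
class Store:
    """Hypothetical source of truth; counts reads so cache hits are visible."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value


class CachedStore:
    """Cache-aside reads plus two write policies over a Store (illustrative only)."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, key):
        # Cache-aside: check the cache first, fall back to the store on a miss.
        if key in self.cache:
            return self.cache[key]
        value = self.store.read(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write_through(self, key, value):
        # Write-through: update the store and the cache in the same operation.
        self.store.write(key, value)
        self.cache[key] = value

    def write_around(self, key, value):
        # Write-around: write only to the store and drop any cached copy,
        # so the next read repopulates the cache.
        self.store.write(key, value)
        self.cache.pop(key, None)
```

With write-around the first read after a write is a miss; with write-through it is a hit, at the cost of caching data that may never be read.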

caching
cache-aside
write-through
write-back
write-around
lru
ttl
performance
system-design
beginner

802

11

Easy
System Design

Distributed Caching (Redis, Memcached)

A single-node cache eventually runs out of RAM, CPU, or network. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.

caching
redis
memcached
consistent-hashing
distributed-systems
replication
failover
system-design
intermediate

806

6

Medium
System Design

Cache Invalidation Strategies & Consistency

There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. This lesson tackles the first one. We cover TTL-based, write-driven, and event-driven invalidation; the canonical race conditions (lost-update, double-write inconsistency, stale-after-failover); the consistency models a cache can offer; and the patterns that real systems (Facebook, Stripe, AWS) use to keep cached data trustworthy. By the end you can pick an invalidation strategy, defend it under interviewer pressure, and explain exactly why your cache will not silently serve yesterday's data.

caching
cache-invalidation
consistency
ttl
distributed-systems
race-conditions
system-design
intermediate
premium

578

12

Medium
System Design

Horizontal vs Vertical Scaling

When traffic grows, you have two choices: make the box bigger (vertical) or add more boxes (horizontal). This lesson lays out the cost, complexity, and ceiling of each approach, why stateless services scale horizontally with almost no thought, why stateful services require sharding or replication, and how real teams pick a default. By the end you can answer 'how would you scale this?' with a defensible answer instead of an instinct.

scalability
horizontal-scaling
vertical-scaling
stateless-services
system-design
beginner

534

10

Easy
System Design

Load Balancing Algorithms & Patterns

A load balancer is the traffic cop in front of every horizontally scaled service. This lesson covers the four scheduling algorithms you need to know (round-robin, least-connections, weighted, hash), the difference between Layer 4 and Layer 7 load balancing, how health checks pull dead nodes out of rotation, the role of sticky sessions and connection draining, and the tools (NGINX, HAProxy, ELB/ALB, Envoy) that implement all of this. By the end you can pick the right algorithm for a workload and explain to an interviewer exactly how a request finds its way from the load balancer to a healthy backend.

load-balancing
round-robin
least-connections
sticky-sessions
health-checks
layer-4
layer-7
nginx
haproxy
system-design
beginner

998

14

Easy
System Design

Reverse Proxy & API Gateway

A reverse proxy sits at the edge of your infrastructure and terminates client connections so backends never see them directly. An API gateway is a reverse proxy with opinions: authentication, rate limiting, request transformation, and per-route policies. This lesson covers what each does, when one is enough and when you need the other, the canonical features (TLS termination, response caching, request shaping, JWT validation, circuit breaking), and the tools that implement them (NGINX, Envoy, Kong, AWS API Gateway, Apigee). By the end you can place either in a real architecture and articulate the boundary between them in an interview.

reverse-proxy
api-gateway
nginx
envoy
kong
tls
rate-limiting
system-design
intermediate
premium

1.1k

21

Medium
System Design

Auto-Scaling, Elasticity & Capacity Planning

Auto-scaling lets your fleet grow when traffic surges and shrink when it ebbs, so you pay for the load you actually have. This lesson covers reactive metric-based scaling, predictive (schedule-based) scaling, and the gotchas that turn auto-scaling into auto-outage: warm-up time, scale-down storms, downstream throttling, and cost runaway. We also walk through capacity planning: how to estimate the fleet size you need from QPS, latency targets, and headroom, before relying on the scaler to fix mistakes at 3 a.m. By the end you can configure an auto-scaling policy with confidence and explain to an interviewer why simply 'putting it on auto-scale' is not the actual answer.
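
The capacity-planning estimate can be sketched with Little's law (requests in flight = arrival rate x average latency). The function name, the per-instance concurrency figure, and the 60% target utilization below are assumptions chosen for illustration, not numbers from the lesson.

```python
import math


def fleet_size(peak_qps, avg_latency_s, concurrency_per_instance,
               target_utilization=0.6):
    """Estimate how many instances are needed to serve peak traffic.

    target_utilization leaves headroom: 0.6 means each instance is planned
    to run at 60% of its concurrency limit at peak.
    """
    # Little's law: average number of requests in flight at peak.
    in_flight = peak_qps * avg_latency_s
    # Concurrency each instance is allowed to carry after reserving headroom.
    usable_per_instance = concurrency_per_instance * target_utilization
    return math.ceil(in_flight / usable_per_instance)


# Hypothetical workload: 20,000 QPS at 50 ms average latency,
# 32 concurrent requests per instance, 40% headroom.
instances = fleet_size(20_000, 0.05, 32, target_utilization=0.6)
```

A number like this gives you the minimum and maximum bounds to hand to the auto-scaler, rather than letting the scaler discover them during an incident.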

auto-scaling
elasticity
capacity-planning
kubernetes-hpa
aws-asg
scalability
system-design
intermediate
premium

779

20

Medium
System Design

CAP Theorem & Trade-offs

The textbook statement of the CAP theorem is that a distributed data store can guarantee at most two of Consistency, Availability, and Partition tolerance. This lesson cuts through the textbook version with the practical engineer's reading: partitions are non-negotiable, so the real choice is between consistency and availability when the network breaks. We cover what each property actually means, why CAP is misleading without PACELC, and how real systems (MongoDB, DynamoDB, Cassandra, Spanner) place themselves on the spectrum. By the end you can defend a system's CAP choice in an interview without falling into the common 'I picked CA' trap.

cap-theorem
distributed-systems
consistency
availability
partition-tolerance
system-design
beginner
free

1.1k

4

Easy
System Design

Consistency Models (Strong, Eventual, Causal)

Consistency models are the contract between a distributed data store and its clients about what they can and cannot observe. This lesson walks the spectrum from strict serializability at the strong end to eventual consistency at the relaxed end, with stops at linearizability, sequential consistency, causal consistency, read-your-writes, monotonic reads, and monotonic writes. We focus on what each model promises, what bugs it prevents, what it costs in latency and availability, and which production systems implement it. By the end you can name the model your system needs and explain why - the senior-level move that interviewers reward.

consistency
strong-consistency
eventual-consistency
causal-consistency
distributed-systems
cap-theorem
system-design
intermediate
free

911

4

Medium
System Design

Consistent Hashing & Data Distribution

Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.
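
A small experiment shows why `hash(key) % N` is a problem. The sketch below compares modulo placement with a hash ring that uses virtual nodes when a fifth node joins a four-node cluster. The node names, the key set, the MD5-based `stable_hash` helper, and the choice of 100 virtual nodes are all assumptions made for this illustration.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_node(key: str, nodes: list) -> str:
    """Naive placement: hash(key) % N."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Hash ring with virtual nodes; a key belongs to the next point clockwise."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
old_nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
new_nodes = old_nodes + ["cache-e"]

modulo_moved = moved_fraction(
    lambda k: modulo_node(k, old_nodes),
    lambda k: modulo_node(k, new_nodes),
    keys,
)
ring_before, ring_after = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(ring_before.node_for, ring_after.node_for, keys)
```

Going from 4 to 5 nodes, modulo placement is expected to remap about four in five keys, while the ring is expected to move roughly one in five, and every key that moves lands on the new node.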

consistent-hashing
data-partitioning
distributed-systems
distributed-cache
database-sharding
system-design
intermediate
free

696

17

Medium
System Design
Premium

Leader Election & Consensus (Raft, Paxos)

Leader election is how a distributed cluster picks one node to be in charge so the others can stop arguing. This lesson covers the consensus problem (including the FLP impossibility result), Paxos in concept, Raft in detail (leader election + log replication + safety), the role of quorum, and the operational pitfalls of split brain and network partitions. We also tour the systems that ship Raft, Paxos, or a close relative (such as ZooKeeper's Zab) in production: etcd, ZooKeeper, Consul, CockroachDB, MongoDB, Spanner. By the end you can explain why so many modern distributed databases have a consensus protocol at their core, and you can sketch Raft on a whiteboard.

leader-election
raft
paxos
distributed-systems
consensus
consistency
fault-tolerance
system-design
advanced
premium

965

31

Hard
System Design
Premium

Distributed Transactions (2PC, Saga Pattern)

When a single business operation spans multiple services or databases, you cannot rely on a single ACID transaction. This lesson covers the two dominant patterns for keeping consistency across services: Two-Phase Commit (2PC) for synchronous, atomic, blocking transactions, and the Saga pattern (orchestration vs choreography) for long-running asynchronous workflows with compensating actions. We also cover Three-Phase Commit, idempotency keys, the outbox pattern, and the trade-offs that explain why 2PC is rare in microservices and Sagas are everywhere. By the end you can pick the right pattern for an order checkout, a money transfer, or a multi-step booking flow.

distributed-transactions
two-phase-commit
saga-pattern
distributed-systems
consistency
acid
microservices
system-design
advanced
premium

855

24

Hard
System Design

Message Queues (Kafka, RabbitMQ, SQS)

Message queues let one service hand work to another without waiting, smoothing traffic spikes, decoupling services, and surviving downstream outages. This lesson covers the two queue families (broker-based like RabbitMQ and SQS vs log-based like Kafka), the delivery semantics (at-most-once, at-least-once, exactly-once), the operational essentials (DLQs, consumer groups, backpressure, ordering), and the trade-offs that decide between Kafka, RabbitMQ, and SQS for any given workload. By the end you can pick a queue and defend the choice with the per-property reasoning interviewers reward.

message-queue
kafka
rabbitmq
sqs
async-processing
pub-sub
distributed-systems
system-design
intermediate
free

932

7

Medium
System Design

Event-Driven Architecture & Pub/Sub

Event-driven architecture (EDA) is a style where services communicate by emitting and reacting to immutable events instead of calling each other directly. This lesson covers the publish/subscribe pattern, the difference between event notification and event-carried state transfer, the role of an event bus, and how EDA reshapes coupling, scalability, and consistency. We compare it with request/response, walk through real implementations on Kafka, Kinesis, EventBridge, and SNS, and end with the operational pitfalls (event versioning, ordering, schema drift, observability) that bite teams who adopt EDA without preparation.

event-driven
pub-sub
kafka
message-queue
async-processing
distributed-systems
system-design
intermediate
premium

388

7

Medium
System Design
Premium

Stream Processing (Kafka Streams, Flink)

Stream processing is the discipline of computing on continuous, unbounded data as it arrives, instead of in periodic batches. This lesson covers the core stream-processing primitives: stateful operators, event time vs processing time, watermarks, windowing (tumbling, sliding, session), exactly-once semantics, and stateful checkpointing. We compare the leading engines (Kafka Streams, Apache Flink, Spark Structured Streaming) and walk through real production patterns: real-time analytics, fraud detection, ML feature pipelines, and CDC-driven materialized views. By the end you can sketch a Flink pipeline on a whiteboard and defend the windowing and checkpointing choices.

stream-processing
kafka
flink
event-driven
async-processing
distributed-systems
system-design
advanced
premium

949

28

Hard
System Design

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail - and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems gracefully degrade instead of catastrophically crash (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.

fault-tolerance
redundancy
failover
circuit-breaker
reliability
availability
distributed-systems
system-design
intermediate
free

510

11

Medium
System Design

Monitoring, Logging, Alerting & SLAs

Observability is what lets you know whether your system is working before customers do. This lesson covers the three pillars (metrics, logs, traces), the SRE-grade definitions of SLI / SLO / SLA, and the operational practices that turn raw telemetry into actionable alerts (RED method, USE method, error budgets, alert fatigue control). We tour the standard production stack (Prometheus, Grafana, OpenTelemetry, ELK, Datadog) and the pitfalls that cause teams to either drown in alerts or miss real incidents. By the end you can design an observability strategy and defend it in an interview against the question 'how would you know if this system was broken?'.

monitoring
alerting
logging
tracing
sla
slo
reliability
system-design
intermediate
premium

474

4

Medium
System Design

Design a URL Shortener (TinyURL)

Design a URL shortening service like TinyURL or bit.ly that maps a long URL to a 7-character code, redirects clicks in under 50 ms, and survives a 100:1 read-to-write ratio. This lesson walks through capacity estimation, the choice between counter-based and hash-based key generation, the database split between a key store and an analytics store, and the caching strategy that lets a single mid-tier service handle 10K redirects per second on commodity hardware.
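
For the counter-based option, one common approach is to base62-encode a monotonically increasing integer. The sketch below is a minimal version under that assumption; the alphabet ordering and the zero-padding to a fixed 7 characters are illustrative choices, not a description of how TinyURL or bit.ly generate codes.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)          # 62
CODE_LENGTH = 7               # 62**7 = 3,521,614,606,208 possible codes


def encode(counter: int) -> str:
    """Turn a counter value into a fixed-width base62 code."""
    if not 0 <= counter < BASE ** CODE_LENGTH:
        raise ValueError("counter does not fit in a 7-character code")
    chars = []
    for _ in range(CODE_LENGTH):
        counter, remainder = divmod(counter, BASE)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Inverse of encode: recover the counter from a base62 code."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return value
```

Sequential counters produce guessable codes, which is one reason the hash-based alternative comes up in the same discussion.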

design-tinyurl
url-shortener
case-study
social-content-platforms
base62-encoding
read-heavy
consistent-hashing
caching
cdn
system-design
beginner
free

690

4

Easy
System Design

Design Pastebin

Design a service like Pastebin or GitHub Gist where users dump up to 10 MB of text and share a link. The interview twist over a URL shortener: pastes are big, so you store them in object storage (S3) and only keep metadata in your database. This lesson covers the metadata vs blob split, expiration via S3 lifecycle policies, presigned URLs for direct uploads, syntax highlighting strategy, and how to handle the read pattern when most pastes are read once and never again.

design-pastebin
case-study
social-content-platforms
blob-storage
cdn
expiration-policy
presigned-urls
read-heavy
system-design
beginner
free

601

15

Easy
System Design

Design Instagram (Photo Sharing)

Design a photo sharing service like Instagram with 500M daily active users uploading 100M photos a day, served as personalized feeds at sub-200 ms p99. The interview centerpiece is the news feed: fan-out on write versus fan-out on read, the celebrity problem, and the hybrid pull-on-read model that real Instagram uses. We also cover photo upload pipelines (presigned URLs, multi-resolution generation, CDN), the metadata data model, and how to scale follow graphs that go from a few friends to hundreds of millions of followers.

design-instagram
case-study
social-content-platforms
photo-sharing
fan-out-on-write
fan-out-on-read
hybrid-fan-out
celebrity-problem
feed-ranking
media-storage
thumbnail-generation
cdn
social-media
system-design
intermediate
free

797

18

Medium
System Design

Design Twitter / X (Social Feed)

Design a microblogging service like Twitter or X with 250M daily active users posting 500M tweets a day, served as a personalized timeline at sub-200 ms p99. The interview centerpiece is the home timeline: hybrid fan-out at the celebrity boundary, write amplification math, and how Twitter built Manhattan and the Timeline Service to make 250M people see fresh tweets within seconds. We also cover trending topics, the search index, retweet semantics, and how Twitter handles 50,000 tweets per second when a major event happens.

design-twitter
case-study
social-content-platforms
fan-out-on-write
fan-out-on-read
hybrid-fan-out
celebrity-problem
timeline-service
feed-ranking
trending-topics
social-media
system-design
intermediate
premium

1.1k

31

Medium
System Design

Design Reddit (Forum / Voting)

Design a community-driven forum like Reddit with 50M daily active users, 500K subreddits, and the famous hot/top/best ranking algorithms that decide which posts you see. The interview centerpiece is the ranking system: how to score posts in real time as votes pour in, how to make the front page personalized without per-user fan-out, and how to render nested comment trees at sub-200 ms when a popular thread has 10,000 nested replies. We also cover voting fraud detection, the difference between hot and Wilson score, and the tiered cache that makes 50K reads per second on the front page survive a viral post.
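
The Wilson score mentioned above is usually taken to mean the lower bound of the Wilson confidence interval on the upvote fraction. The sketch below implements that standard formula; the function name and the 95% confidence level (`z = 1.96`) are assumptions for illustration, and it is not claimed to match Reddit's production code.

```python
import math


def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the true upvote fraction."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


# A single upvote scores lower than 100 up / 10 down, even though its raw
# upvote fraction is higher, because there is far less evidence behind it.
one_vote = wilson_lower_bound(1, 0)
many_votes = wilson_lower_bound(100, 10)
```

Unlike a time-decayed hot score, this value depends only on the vote counts, so it suits "best" orderings where age should not matter.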

design-reddit
case-study
social-content-platforms
voting-systems
hot-ranking
wilson-score
nested-comments
subreddits
feed-ranking
social-media
system-design
intermediate
premium

909

24

Medium
System Design

Design YouTube (Video Platform)

Design a video platform like YouTube with 2 billion users, 500 hours of video uploaded every minute, and 1 billion hours watched per day. The interview centerpiece is the video pipeline: chunked uploads, parallel transcoding to 8 resolutions and 3 codecs, HLS/DASH adaptive streaming over a global CDN, and the metadata service that ties it all together. We also cover recommendations (the secondary feed problem), comment scaling, view-counter accuracy, and how YouTube serves 200 Tbps of egress without melting the internet.

design-youtube
case-study
social-content-platforms
video-streaming
video-transcoding
adaptive-bitrate-streaming
hls
dash-streaming
video-cdn
media-storage
recommendation-system
system-design
intermediate
premium

1.1k

18

Medium
System Design
Premium

Design TikTok (Short-Form Video)

Design TikTok with 1.5B monthly active users, 100M short videos uploaded daily, and the For You Page that decides which video plays next for every viewer in under 100 ms. Unlike Instagram and Twitter, TikTok has no follower-driven feed - the For You Page is pure ML recommendation from a global pool. The interview centerpiece is the recommendation system architecture: candidate retrieval, two-tower models, online ranking with engagement signals, and how to keep video pre-loaded so the next swipe is instant. We also cover content moderation at scale, edge caching for the long-form-of-short-form access pattern, and why TikTok's product choice eliminated the celebrity fan-out problem entirely.

design-tiktok
case-study
social-content-platforms
for-you-page
short-form-video
recommendation-system
engagement-signals
content-moderation
edge-caching
video-cdn
video-streaming
social-media
system-design
advanced
premium

669

12

Hard
System Design
Premium

Design Facebook News Feed

Design Facebook's News Feed for 2 billion daily active users where every feed open reads from a personalized, ML-ranked timeline assembled from thousands of candidate posts in real time. Unlike Instagram's chronological precomputed feed or TikTok's pure recommendation, Facebook blends a friend graph, group memberships, page follows, and ads into one ranked stream via the legendary EdgeRank-and-successor algorithms. The interview centerpiece is the aggregator pattern: parallel candidate retrieval from many sources, real-time feature lookup, ML scoring, and online filtering, all under a 200 ms p99 budget. We also cover real-time updates (push notifications when a friend posts), edge ranking signals, and how Meta keeps the feed fresh with no precomputed timeline.

design-facebook-newsfeed
case-study
social-content-platforms
feed-ranking
edge-ranking
edgerank
aggregation-service
real-time-updates
fan-out-on-read
recommendation-system
social-media
system-design
advanced
premium

854

26

Hard
System Design

Design a Chat System (WhatsApp)

Design a real-time chat system like WhatsApp serving 2B users sending 100B messages per day with sub-second delivery, presence indicators, and read receipts. The interview centerpiece is the persistent WebSocket connection layer: how many connections per server, how to route a message to a recipient who may be on a different server, and how to guarantee delivery when the recipient is offline. We cover the message delivery state machine (sent, delivered, read), the connection routing layer that maps user_id to a chat server, the message store for offline delivery, and presence/typing indicators that operate at a higher write rate than messages themselves.

design-chat-system
case-study
messaging-communication
chat
websockets
real-time
presence
delivery-receipts
at-least-once
fan-out
session-affinity
system-design
intermediate
free

664

16

Medium
System Design

Design a Notification Service

Design a multi-channel notification service that delivers 10B push, email, and SMS notifications per day across four independent provider networks (APNs, FCM, SendGrid, Twilio) with priority queues, per-user rate limits, and idempotent retries. The interview centerpiece is the fan-out from a single application event to multiple channels and providers, each with its own rate limits, failure modes, and delivery semantics. We cover priority queues for transactional vs marketing traffic, retry policies with exponential backoff, deduplication of repeated triggers, user preference enforcement, and the device token lifecycle that quietly invalidates tens of millions of tokens per day.

design-notification-service
case-study
messaging-communication
push-notifications
email
sms
priority-queue
rate-limiting
idempotency
fan-out
retry-policy
dead-letter-queue
system-design
intermediate
premium

946

29

Medium
System Design

Design an Email Service (Gmail)

Design an email service like Gmail handling 1.8B users storing 500EB of email, accepting ~300B inbound messages per day from the public SMTP network while filtering 90%+ as spam, and serving full-text search over a user's entire inbox in sub-200ms. The interview centerpiece is the asymmetric architecture: SMTP is an untrusted public protocol with hostile traffic patterns (spam, phishing, sender forgery) that needs heavy gateway-side filtering, while the user-facing IMAP/web layer needs cheap reads, pagination of huge mailboxes, and per-user inverted indexes for search. We cover the SMTP MX gateway, the spam pipeline (SPF/DKIM/DMARC + ML), the per-user inverted index for search, and how mailboxes scale when one user holds 50GB of email.

design-email-service
case-study
messaging-communication
email
smtp
spam-filtering
spf-dkim-dmarc
inverted-index
full-text-search
blob-storage
attachment-dedup
system-design
intermediate
premium

926

9

Medium
System Design
Premium

Design Video Conferencing (Zoom)

Design a real-time video conferencing system like Zoom that supports 1-on-1 calls and meetings of up to 1000 participants with sub-200ms glass-to-glass latency, adapts to user bandwidth, and runs reliably across mobile networks. The interview centerpiece is the choice of media topology: peer-to-peer mesh (small calls), MCU mixing (centralized, expensive), or SFU forwarding (the modern standard). We cover the WebRTC stack (signaling vs media planes, ICE/STUN/TURN), simulcast and SVC for adaptive quality, recording pipelines, and how to keep latency low when participants span multiple continents.

design-video-conferencing
case-study
messaging-communication
video-conferencing
webrtc
sfu
mcu
rtp
simulcast
svc
ice-stun-turn
low-latency-media
real-time
system-design
advanced
premium

547

11

Hard
System Design
Premium

Design Discord (Real-time Communities)

Design Discord, a real-time community platform with 200M monthly active users, organized into 'guilds' (servers) of up to 500K members each, with persistent text channels storing trillions of messages and live voice channels with sub-100ms latency. The interview centerpiece is the dual architecture: a sharded text-message store (Cassandra/ScyllaDB) with billions of messages per guild and per-channel ordering, plus a real-time voice infrastructure with regional voice servers and custom UDP transport. We cover guild sharding by Snowflake ID, the Elixir/Erlang gateway that holds millions of WebSocket connections, presence at the guild scale, and how Discord migrated from MongoDB to Cassandra to ScyllaDB as message volume crossed trillions.

design-discord
case-study
messaging-communication
discord
guild-architecture
websocket-gateway
cassandra
scylladb
voice-channels
presence-fan-out
elixir
system-design
advanced
premium

944

21

Hard
System Design

Design Typeahead / Autocomplete

Design a typeahead/autocomplete service like Google Search's suggestion bar that returns the top 10 ranked completions for a query prefix in under 100ms p99, scaling to 5B searches per day with a multi-billion-entry suggestion index. The interview centerpiece is the data structure choice (trie vs sorted strings vs ngram index) and the offline pipeline that ranks suggestions by frequency, recency, personalization, and click-through rate. We cover the trie with precomputed top-K per node, edge n-gram indexes for typo tolerance, the MapReduce/Spark batch pipeline that rebuilds suggestions nightly, and the per-region edge cache that absorbs 99% of traffic.
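
A trie with precomputed top-K per node can be sketched as follows. The class names and sample phrases are hypothetical, the counts stand in for whatever ranking score the offline pipeline produces, and the sketch assumes each phrase is inserted once during an offline build.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # (count, phrase) pairs, best first, at most k entries


class SuggestionTrie:
    """Trie where every node stores the top-k completions below it."""

    def __init__(self, k=10):
        self.k = k
        self.root = TrieNode()

    def insert(self, phrase: str, count: int) -> None:
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            # Maintain the per-node top-k list at build time so that
            # lookups never have to walk the subtree.
            node.top.append((count, phrase))
            node.top.sort(key=lambda entry: (-entry[0], entry[1]))
            del node.top[self.k:]

    def suggest(self, prefix: str) -> list:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]


trie = SuggestionTrie(k=2)
for phrase, count in [("system design", 90), ("system of a down", 50),
                      ("sql", 70), ("syntax", 10)]:
    trie.insert(phrase, count)
```

A lookup costs one step per prefix character regardless of how many phrases share the prefix, which is what makes a tight latency budget plausible.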

design-typeahead
case-study
search-discovery
autocomplete
trie
edge-ngrams
ranking
top-k-precomputation
edge-caching
personalization
system-design
intermediate
free

712

23

Medium
System Design

Design a Web Crawler

Design a distributed web crawler that fetches 5 billion pages per month from the public web while respecting robots.txt, applying per-host politeness limits, deduplicating URLs and content across a 50PB corpus, and feeding the indexer pipeline downstream. The interview centerpiece is the URL frontier: a priority-aware queue of pending URLs sharded by host so politeness rules can be enforced per domain, plus content deduplication via hashing and shingling. We cover the fetcher worker pool, DNS caching, content extraction, the bloom-filter URL seen set, and how to handle hostile sites (large pages, redirect loops, slow responses, deliberate spam).
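
The bloom-filter URL seen set can be sketched in a few lines. The sizing below (65,536 bits, 7 hash functions) and the double-hashing scheme derived from one SHA-256 digest are assumptions for illustration; a production crawler would size the filter from its expected URL count and acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter(num_bits=1 << 16, num_hashes=7)
for i in range(1000):
    seen.add(f"https://example.com/page/{i}")
```

A "not seen" answer is always correct, so the crawler never skips a new URL by mistake; a "seen" answer is occasionally wrong, which costs one missed page rather than a duplicate fetch.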

design-web-crawler
case-study
search-discovery
web-crawler
url-frontier
politeness
robots-txt
bloom-filter
shingling
minhash
content-dedup
distributed-fetching
system-design
intermediate
premium

626

6

Medium
System Design
Premium

Design a Search Engine

Design a web-scale search engine that indexes 50B documents and serves 100K queries per second with sub-200ms p99 latency, ranking results by relevance (BM25), authority (PageRank), and personalization. The interview centerpiece is the inverted index sharded across thousands of nodes with scatter-gather query execution, plus the multi-stage ranking pipeline (cheap candidate generation, expensive learning-to-rank rerank). We cover document parsing and tokenization, the offline indexing pipeline (MapReduce or Spark), term-partitioned vs document-partitioned sharding, query understanding and expansion, snippet generation, and how to keep the index fresh as the web changes.

design-search-engine
case-study
search-discovery
search-engine
inverted-index
bm25
pagerank
scatter-gather
learned-to-rank
tf-idf
tokenization
near-real-time-indexing
system-design
advanced
premium

516

7

Hard
System Design

Design Nearby / Location Service (Yelp)

Design a 'nearby' service like Yelp that returns the top businesses within a search radius of the user's location, ranking by distance, rating, and category, scaling to 200M monthly users querying 100M businesses. The interview centerpiece is the geospatial index: how to find 'all businesses within 5 km of (lat, lng)' efficiently. We compare bounding-box scans, geohashes, quadtrees, R-trees, and PostGIS GiST indexes; we recommend geohash + secondary index for write-heavy systems and quadtree/R-tree for read-heavy ones. We cover business storage and search, review ranking, the infrequent-update vs frequent-query asymmetry, and how to handle the long tail of remote regions.

design-nearby-service
case-study
search-discovery
nearby-search
geospatial-index
geohash
quadtree
r-tree
postgis
yelp
location-based-services
spatial-indexing
system-design
intermediate
premium

171

4

Medium
System Design

Design a Rate Limiter

Design a distributed rate limiter that protects an API platform from abuse and uneven load while staying fast and accurate at 1B requests per day. The interview centerpiece is choosing among the five canonical algorithms (fixed window, sliding window log, sliding window counter, token bucket, leaky bucket) and explaining how to make the chosen one atomic across a Redis cluster. We cover where to place the limiter (edge, gateway, in-process), per-IP vs per-user vs per-API-key keys, returning 429 with Retry-After, the hot key problem, and fail-open vs fail-closed under cache outages.
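A minimal single-process sketch of the token bucket, one of the five algorithms listed above. It does not show the Redis-side atomicity the case study discusses; the class name, parameters, and the caller-supplied clock are assumptions made to keep the example deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_rate` tokens/second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would answer 429 with a Retry-After header


bucket = TokenBucket(capacity=2, refill_rate=1.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

In a distributed deployment the read-refill-decrement sequence has to run as one atomic step per key, which is why the blurb pairs the algorithm choice with a Redis cluster.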

design-rate-limiter
case-study
ecommerce-marketplace
rate-limiter
token-bucket
leaky-bucket
sliding-window
fixed-window
lua-script
throttling
redis
api-gateway
system-design
intermediate
free

737

21

Medium
System Design

Design an E-Commerce Platform (Amazon)

Design an Amazon-scale e-commerce platform that lets 200M monthly users browse 100M SKUs, add items to a cart, check out, and have orders fulfilled from regional warehouses. The interview centerpiece is the order lifecycle: how to reserve inventory atomically while a customer is on the checkout page, how to chain cart-to-payment-to-fulfillment as a saga with compensating actions, and how to make checkout idempotent so a flaky network never charges a customer twice. We also cover catalog browse at scale, multi-warehouse fulfillment routing, and the asymmetric read/write workload that makes aggressive catalog caching the right call.

design-ecommerce
case-study
ecommerce-marketplace
amazon
shopping-cart
checkout-flow
inventory-management
optimistic-locking
saga-pattern
fulfillment
idempotency
system-design
intermediate
premium

651

8

Medium
System Design

Design a Ticketing System (Ticketmaster)

Design a Ticketmaster-style ticketing platform that sells reserved seats for concerts and sports events, with the central challenge being a flash onsale where 1M users compete for 50K seats in five minutes. The interview centerpiece is the seat reservation lock: each unique seat (Section A, Row 12, Seat 7) cannot be split or sub-bucketed like fungible inventory, so contention is unavoidable. We cover seat-level pessimistic holds with TTL, the virtual waiting room that randomizes queue position to absorb flash demand fairly, anti-bot defenses, dynamic pricing tiers, and the read-replica explosion that interactive seat maps cause.

design-ticketing-system
case-study
ecommerce-marketplace
ticketmaster
seat-reservation
flash-sale
virtual-waiting-room
pessimistic-locking
websockets
system-design
intermediate
premium

998

29

Medium
System Design
Premium

Design a Payment System (Stripe)

Design a Stripe-style payment platform that processes 100M payments per day across 50 currencies and dozens of payment methods, where the central requirement is financial correctness: never charge a customer twice, never lose a payment, always reconcile to the cent. The interview centerpiece is the trio of idempotency keys, the payment intent state machine, and the immutable double-entry ledger - together they make the system safe in the face of network failures, partial outages, and adversarial retries. We also cover webhook delivery with signing and exponential backoff, PCI scope minimization through tokenization, multi-region availability, and the reconciliation jobs that compare our ledger to the bank's settlement files every night.

design-payment-system
case-study
ecommerce-marketplace
stripe
payment-system
idempotency
double-entry-ledger
reconciliation
webhooks
pci
system-design
advanced
premium

1k

32

Hard
System Design

Design a Key-Value Store (DynamoDB)

Design a Dynamo-style distributed key-value store that scales linearly to thousands of nodes, stays available during partitions, and offers tunable consistency through a quorum (N, W, R). The interview centerpiece is the trio that makes this work at scale: consistent hashing with virtual nodes for partitioning, N/W/R quorums for replication and consistency, and vector clocks for resolving concurrent writes. We cover the gossip protocol for membership, Merkle trees for anti-entropy, hinted handoff for transient failures, sloppy quorum for write availability during partitions, and the LSM-tree storage engine that powers each node.
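A small sketch of the N/W/R rule behind the tunable consistency described above: a read quorum is guaranteed to intersect the most recent write quorum only when R + W > N. The function name is hypothetical.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return r + w > n


strong = quorums_overlap(n=3, w=2, r=2)  # overlapping quorums
fast = quorums_overlap(n=3, w=1, r=1)    # lower latency, reads may be stale
```

With N = 3, choosing W = 2 and R = 2 tolerates one unavailable replica on both paths while keeping the overlap; W = 1 and R = 1 trades that overlap for latency.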

design-key-value-store
case-study
infrastructure-storage
dynamodb
key-value-store
consistent-hashing
vector-clocks
gossip-protocol
merkle-tree
lsm-tree
quorum
hinted-handoff
system-design
intermediate
premium

457

13

Medium
System Design

Design a Distributed Cache (Redis)

Design a Redis-style in-memory distributed cache that serves billions of GET/SET operations per day at sub-millisecond latency, with sharding across hundreds of nodes and explicit eviction when memory fills. The interview centerpiece is the eviction-and-partitioning combination: how LRU and LFU choose what to drop, and how a cluster picks which node owns each key without a central coordinator. We compare client-side hashing, proxy-based partitioning (twemproxy), and Redis Cluster's hash-slot model; we cover cache-aside as the dominant access pattern, replica failover, optional persistence, and the sub-ms latency budget that makes this design fundamentally different from the durable KV store covered in the previous case study.
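A minimal sketch of LRU eviction as named above, assuming a single node and an exact recency order kept in an `OrderedDict`. Production caches often approximate LRU by sampling rather than tracking exact order; the class and method names here are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # miss: a cache-aside caller would load from the database
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # touch "a", so "b" becomes the eviction candidate
cache.set("c", 3)   # evicts "b"
```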

design-distributed-cache
case-study
infrastructure-storage
redis
memcached
lru
eviction-policy
consistent-hashing
cache-aside
in-memory-store
system-design
intermediate
premium

1k

28

Medium
System Design
Premium

Design Object Storage (S3)

Design an S3-style object storage service that stores trillions of immutable blobs ranging from 1 KB to 5 TB at eleven nines of durability and a fraction of the cost of triple replication. The interview centerpiece is the trio that makes this economical: erasure coding (typically 12 data shards plus 4 parity shards) instead of full replicas; a separate metadata service that maps object keys to chunk locations; and multi-part upload that lets a 5 TB object stream from many sources in parallel. We also cover the bucket/object namespace, lifecycle policies that move cold objects to colder tiers, immutability with versioning, pre-signed URLs for direct client transfer, and the move from eventual to strong read-after-write consistency that AWS shipped in 2020.
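The cost claim above can be checked with quick arithmetic. The sketch below compares the raw-storage multiplier of the 12 + 4 erasure-coding layout with triple replication, modelling replication as one data shard plus two extra copies; the function name is hypothetical.

```python
def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte."""
    return (data_shards + parity_shards) / data_shards


erasure_coded = storage_overhead(12, 4)  # 16 / 12, roughly 1.33x
replicated = storage_overhead(1, 2)      # 3 full copies, 3.0x
```

The 12 + 4 layout survives the loss of any 4 of its 16 shards, while triple replication survives the loss of 2 of its 3 copies, so the erasure-coded layout stores less than half the raw bytes for a larger failure budget.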

design-object-storage
case-study
infrastructure-storage
s3
object-storage
erasure-coding
metadata-service
multi-part-upload
immutability
system-design
advanced
premium

1.1k

35

Hard
System Design
Premium

Design a Distributed File System (GFS/HDFS)

Design a Google-File-System or HDFS-style distributed file system that stores petabytes across commodity hardware, optimized for batch analytics workloads where files are large (gigabytes), reads are sequential, and writes are append-mostly. The interview centerpiece is the leader-based architecture: one strongly-consistent master node holds the entire file namespace and chunk locations in memory, while many chunkservers store the actual data in 64-128 MB chunks replicated three times across racks. We cover the lease-based primary-replica protocol that lets the master stay out of the data path, the heartbeat-and-chunk-report mechanism that keeps cluster state fresh, and the federation strategy for scaling beyond a single master's memory.

design-distributed-file-system
case-study
infrastructure-storage
gfs
hdfs
distributed-file-system
chunk-server
namenode
leader-based
system-design
advanced
premium

1k

31

Hard
System Design

Design a Content Delivery Network

Design a Cloudflare/Akamai/Fastly-style content delivery network that offloads 95%+ of static traffic from origin servers, brings latency from hundreds of milliseconds down to single-digit milliseconds, and absorbs DDoS attacks at the edge. The interview centerpiece is the cache hierarchy and routing: hundreds of edge POPs reached via anycast so each user lands at the nearest one, a regional shield layer that consolidates fetches, and the origin only seeing the long tail of misses. We cover cache key design with Vary headers, the TTL lifecycle and purge model, stale-while-revalidate for resilience under origin outages, and the moves CDNs make to keep dynamic content fast (programmable edge functions, smart routing).

design-cdn
case-study
infrastructure-storage
cdn
edge-caching
origin-shield
anycast
cache-invalidation
stale-while-revalidate
ddos-protection
system-design
intermediate
premium

865

15

Medium
System Design

Design Uber / Lyft (Ride-Sharing)

Design a ride-sharing service like Uber that matches a rider's request to a nearby driver in under 5 seconds, streams driver locations every 4 seconds, computes ETAs, and applies surge pricing in real time at 1M concurrent active drivers and 100K rides/min globally. The interview centerpiece is the dispatch path: how to find the nearest available driver, hold them briefly, and confirm the match without race conditions. We compare geohash, S2, and H3 for the driver index and recommend H3 hex grid for ride-sharing because hex neighbors are equidistant. We cover the trip state machine, surge multipliers per cell, and how location updates fan out without melting the network.

design-uber
case-study
ride-sharing-and-maps
ride-sharing
uber
lyft
driver-dispatch
ride-matching
h3-hex-grid
geospatial
geospatial-index
geohash
s2-cells
surge-pricing
trip-state-machine
websockets
kafka
real-time-systems
system-design
intermediate
premium

285

9

Medium
System Design
Premium

Design Google Maps

Design Google Maps: a global mapping service that renders the Earth from 256x256 tiles, computes the shortest driving route in under 200 ms, and folds live traffic into routing for 1B users issuing 5B route requests per day. The interview centerpiece is the routing engine: why Dijkstra is too slow on a continent-scale graph and how Contraction Hierarchies (CH) precompute shortcuts so the live query explores only a small fraction of the graph. We cover the tile pyramid (zoom 0-20, ~1 trillion possible tiles at zoom 20), how live traffic from 100M Android phones updates edge weights every minute, and how to keep navigation latency under 1 second when re-routing.
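The tile-pyramid figure above follows from simple arithmetic, assuming the common scheme in which zoom 0 is a single tile and each further zoom level splits every tile into four. The function names are hypothetical.

```python
def tiles_at_zoom(zoom: int) -> int:
    """Tiles in one zoom level: a 2^zoom by 2^zoom grid."""
    return 4 ** zoom


def tiles_up_to_zoom(max_zoom: int) -> int:
    """Total tiles across zoom levels 0..max_zoom (geometric series)."""
    return (4 ** (max_zoom + 1) - 1) // 3


zoom_20 = tiles_at_zoom(20)      # 1,099,511,627,776, about 1.1 trillion
pyramid = tiles_up_to_zoom(20)   # 1,466,015,503,701 across all 21 levels
```

The deepest level alone holds about three quarters of the whole pyramid, which is why tile counts are usually quoted for the maximum zoom.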

design-google-maps
case-study
ride-sharing-and-maps
google-maps
graph-algorithms
dijkstra
a-star
contraction-hierarchies
routing-engine
map-tiles
tile-rendering
real-time-traffic
cdn
geospatial
h3-hex-grid
system-design
advanced
premium

584

10

Hard
System Design

Design Food Delivery (DoorDash)

Design a food delivery service like DoorDash that links three actors (customer, restaurant, courier) with an end-to-end SLA of under 40 minutes per order at 10M orders per day across 500K restaurants. The interview centerpiece is the courier dispatch problem, which is fundamentally different from ride-sharing: it is a 3-leg trip (courier-to-restaurant, wait for food, restaurant-to-customer) and the platform routinely batches multiple orders onto one courier to cut cost. We compare Uber's 1:1 matching to DoorDash's many-to-1 batching, design the ETA composition (prep time + assignment time + drive time + handoff), and walk through the order state machine that coordinates three independent humans.

design-food-delivery
case-study
ride-sharing-and-maps
food-delivery
doordash
courier-dispatch
batched-dispatch
vehicle-routing-problem
eta-prediction
three-sided-marketplace
geospatial
h3-hex-grid
ride-sharing
kafka
system-design
intermediate
premium

1k

17

Medium
System Design

Design a Unique ID Generator

Design a service that generates globally unique, roughly time-sortable 64-bit IDs at 1M IDs per second across hundreds of application servers, without coordination on the hot path. The interview centerpiece is the trade-off between uniqueness, ordering, size, and coordination cost. We compare UUIDv4 (random, no coordination, 128 bits, no ordering), database auto-increment (single point of contention), Twitter Snowflake (64 bits, time-ordered, requires worker_id assignment and clock discipline), Instagram's per-shard hybrid, and ULID/KSUID. We deep-dive into Snowflake: bit layout, clock skew handling, leader election for worker IDs, and the dreaded clock-rollback bug.
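A minimal sketch of a Snowflake-style generator, assuming the commonly cited layout of 41 timestamp bits, 10 worker bits, and 12 sequence bits under an unused sign bit. The class name, the default epoch, and the choice to raise on clock rollback or sequence exhaustion (rather than wait) are illustrative assumptions, and the clock is passed in so the behavior is deterministic.

```python
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id: int, epoch_ms: int = 1_600_000_000_000) -> None:
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms  # hypothetical custom epoch
        self.last_ms = -1
        self.sequence = 0

    def next_id(self, now_ms: int) -> int:
        if now_ms < self.last_ms:
            # Clock rollback: refusing is safer than risking a duplicate ID.
            raise RuntimeError("clock moved backwards")
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & SEQUENCE_MASK
            if self.sequence == 0:
                # 4096 IDs already issued this millisecond; a real
                # generator would wait for the next millisecond instead.
                raise RuntimeError("sequence exhausted for this millisecond")
        else:
            self.sequence = 0
        self.last_ms = now_ms
        timestamp = now_ms - self.epoch_ms
        return (
            (timestamp << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self.sequence
        )


gen = SnowflakeGenerator(worker_id=5, epoch_ms=1000)
first = gen.next_id(1001)
second = gen.next_id(1001)  # same millisecond: sequence increments
third = gen.next_id(1002)   # new millisecond: sequence resets
```

Because the timestamp occupies the high bits, IDs from one worker sort by creation time, and IDs from different workers sort by time up to the clock skew between them.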

design-unique-id-generator
case-study
unique-specialized
snowflake-id
uuid
ulid
ksuid
distributed-id
clock-skew
leader-election
zookeeper
instagram
twitter
system-design
intermediate
premium

184

5

Medium
System Design
Premium

Design Google Docs (Collaborative Editing)

Design a real-time collaborative document editor like Google Docs that serves 1B+ users, lets many of them co-edit the same document with sub-200 ms latency, never loses a keystroke, and converges to the same state across all clients regardless of network conditions. The interview centerpiece is concurrency control: how to merge two users' simultaneous edits without conflicts. We compare Operational Transformation (OT, used by Google Docs) and Conflict-free Replicated Data Types (CRDT, used by Figma, Notion, Linear), explain the convergence problem (TP1, TP2 properties), walk through cursor presence, and design the document storage as an append-only operation log compacted into snapshots.

design-collaborative-editor
case-study
unique-specialized
google-docs
collaborative-editing
operational-transform
crdt
real-time-sync
websockets
presence
vector-clock
event-sourcing
consistency-models
system-design
advanced
premium

267

4

Hard
System Design
Premium

Design a Stock Exchange

Design a stock exchange like NASDAQ that matches buy and sell orders for thousands of symbols at sub-100-microsecond latency, handles 200K orders per second per symbol at peak, and produces a deterministic, replayable trade history with regulatory audit guarantees. The interview centerpiece is the matching engine: a deliberately single-threaded, in-memory order book that processes orders sequentially in price-time priority. We design the order book data structures (price-indexed levels with FIFO queues), the gateway path (ultra-low-latency parsing and rate limiting), the event-sourced persistence (every order and trade as an append-only event), and how to scale by sharding per symbol.

design-stock-exchange
case-study
unique-specialized
stock-exchange
matching-engine
order-book
limit-order
fix-protocol
low-latency
ultra-low-latency
event-sourcing
deterministic-replay
single-threaded
leader-election
system-design
advanced
premium

775

15

Hard
System Design

Authentication & Authorization (OAuth2, JWT, RBAC)

Authentication answers 'who are you?'. Authorization answers 'what are you allowed to do?'. Most systems get both wrong in subtle ways: rolling their own crypto, treating JWTs as a session store, copying RBAC into every service, or never thinking about how to revoke a leaked credential. This lesson covers the standard building blocks: password storage with adaptive hashing, session vs token authentication, OAuth2 and OIDC flows, JWTs and their honest trade-offs, RBAC vs ABAC vs ReBAC, multi-tenant authorization at scale, machine-to-machine auth (API keys, mTLS, workload identity), and the operational concerns (key rotation, revocation, audit). The goal is to leave you able to design and defend the auth architecture for any system, from a single product to a federated multi-tenant platform.

authentication
authorization
oauth2
jwt
rbac
system-design
advanced
premium
security

295

7

Medium
System Design

Data Pipelines & ETL/ELT

Data pipelines move data from operational systems (your transactional databases, event logs, third-party APIs) into analytical systems (warehouses, lakes, search indexes, ML feature stores). The 'shape' of the pipeline (ETL vs ELT, batch vs incremental, push vs pull) determines latency, cost, and how painful schema changes will be. This lesson covers the architectural choices: ingestion patterns, transformation engines (dbt, Spark, Beam), orchestration (Airflow, Dagster, Prefect), data quality, lineage, and the standard production layout (raw / staging / mart). It also covers the failure modes you must design for: late-arriving data, idempotency, backfills, schema evolution, and the silent corruption that comes from not testing your pipelines.

data-pipelines
etl
elt
data-engineering
system-design
advanced
premium
orchestration

161

1

Medium
System Design

Microservices vs Monolith: When to Choose What

Microservices are not a maturity badge. Monoliths are not a code smell. The honest interview answer is that architecture is a continuum (monolith, modular monolith, services, microservices) and the right point on it is set by team size, deployment frequency, and the cost of distribution, not by what the cool kids at Netflix did. This lesson walks through the trade-offs concretely: latency tax, operational overhead, organizational coupling (Conway's Law), data consistency, and the migration paths that work. By the end you can defend either choice for a given product without reaching for buzzwords.

microservices
monolith
microservices-architecture
system-design
advanced
premium
distributed-systems

358

5

Medium
System Design

The System Design Interview Framework (RESHADED)

A system design interview is 45-60 minutes to design something the interviewer has been thinking about for years. Without a framework you will spend the first 20 minutes flailing, the next 20 deep in one corner, and the last 20 watching the interviewer try to redirect you. The RESHADED framework (Requirements, Estimation, Schema / API, High-level design, Architecture deep dive, Edge cases, Done / wrap-up) gives you a defensible structure that maps to how senior engineers actually think. This lesson walks through every stage with concrete tactics: the questions to ask in Requirements, the back-of-envelope numbers to estimate, the layer to draw first in HLD, the components to deep-dive into, and how to read the interviewer's signals to know what they want next. By the end you can walk into any system design interview with a known opener and a sequence of moves that work for any prompt.

system-design-interview
interview-strategy
framework
reshaded
system-design
advanced
premium

342

9

Easy
System Design
Premium

Batch vs Stream Processing (Lambda/Kappa)

Batch processing computes results over a finite, bounded dataset. Stream processing computes results continuously over an unbounded, ever-arriving dataset. The two paradigms have different latency, cost, correctness, and operational profiles, and choosing wrong is one of the most expensive architectural mistakes a senior engineer can make. This lesson covers the mental model (bounded vs unbounded data, event time vs processing time, watermarks, windows), the two classical reference architectures (Lambda and Kappa), the modern unified models (Beam, Flink), and the production realities of exactly-once semantics, late data, replays, and operational complexity. The goal is to leave you able to choose batch, streaming, or a hybrid for any system, and to defend the choice in an interview.

stream-processing
batch-processing
lambda-architecture
kappa-architecture
system-design
advanced
premium
data-intensive-systems

449

4

Hard
System Design
Premium

Encryption at Rest/Transit & Data Privacy (GDPR)

Encryption protects data from unauthorized access; privacy regulations (GDPR, CCPA, HIPAA, PCI-DSS) determine what data you may collect, how you must protect it, who can see it, and how you must respond to user requests. The two intersect: regulations mandate encryption in many cases, and encryption is the technical foundation for most privacy controls. This lesson covers the standard primitives (TLS 1.3 for transit, AES-GCM and envelope encryption for rest), key management (KMS, HSM, key rotation), application-level encryption (per-tenant keys, field-level encryption, deterministic encryption for searchability), the privacy-engineering layer (data classification, minimization, retention, right-to-be-forgotten), and the operational realities (key compromise, crypto-shredding, BYOK, audit logs). The goal is to leave you able to design a system that is encryption-correct, privacy-compliant, and operationally honest about its trade-offs.

encryption
data-privacy
gdpr
kms
envelope-encryption
system-design
advanced
premium
security

947

12

Hard
System Design
Premium

Event Sourcing & CQRS

Event Sourcing stores every change to your application state as an immutable event, and the current state is what you get when you replay them. CQRS splits the read and write paths so each can be optimized independently. Together they unlock auditability, time travel, and read/write scaling that traditional CRUD cannot. They also introduce eventual consistency, schema evolution pain, and a steep operational learning curve. This lesson teaches the mechanics, the implementation patterns (event store, snapshots, projections, sagas), and the honest answer to when these patterns are worth the cost (financial ledgers, audit-heavy domains, complex business workflows) and when they are over-engineering (a typical SaaS CRUD app).

event-sourcing
cqrs
event-driven-architecture
system-design
advanced
premium
distributed-systems

167

4

Hard
System Design

Back-of-the-Envelope Estimation & Capacity Planning

Back-of-the-envelope estimation is the math you do in three minutes to ground a system design in numbers. It is what tells you whether your single Postgres instance can handle the load (no), how much storage you need over five years (probably more than you think), and how much CDN bandwidth you are about to commit to (probably more than that). This lesson covers the standard latency / throughput / size / bandwidth numbers every engineer should have memorized, the unit conversions and order-of-magnitude reasoning that keep you fast, the templates for QPS, storage, and bandwidth estimation, capacity planning beyond steady state (peak vs average, headroom, growth, regional, seasonal), and the rough cost arithmetic that turns 'we need more servers' into a defensible business case. The goal is to leave you able to walk into any interview or design review and produce useful numbers in three minutes flat.

estimation
capacity-planning
interview-strategy
back-of-envelope
system-design
advanced
premium

405

9

Medium
System Design
Premium

DDoS Protection, WAF & Security Best Practices

DDoS attacks try to exhaust your bandwidth, your TCP stack, your application capacity, or your downstream dependencies. A WAF (web application firewall) tries to block exploit traffic before it reaches your code. Together with rate limiting, bot management, anti-abuse tooling, and a hardened application layer, they form the defensive perimeter that real production systems live behind. This lesson covers the layered defense: edge / CDN scrubbing for L3/L4 floods, rate limiting and bot detection for L7 abuse, WAF rules for OWASP-class exploits, the OWASP Top 10 with concrete mitigations, secure development practices (input validation, output encoding, secrets management, dependency hygiene), incident response, and the operational realities of running this stack (false positives, vendor selection, escalation, post-mortems). The goal is to leave you able to design and defend the security perimeter for any user-facing system.

ddos
waf
security
rate-limiting
owasp
system-design
advanced
premium

498

12

Hard
System Design
Premium

ML System Design (Feature Store, Model Serving)

An ML system in production is mostly a data system with a model in the middle. The model is the smallest, most-discussed, and least-troublesome part. The hard parts are training data pipelines, feature freshness and parity between training and serving, the feature store that enforces that parity, model deployment and rollback, online and offline evaluation, and the operational concern that the model silently degrades as the world drifts. This lesson covers the canonical reference architecture: training pipeline, feature store with online and offline halves, model registry, serving infrastructure, monitoring, and the feedback loop. It is the senior-level mental model for designing 'add ML to product X' without falling into the standard traps.

ml-system-design
feature-store
model-serving
mlops
system-design
advanced
premium
data-intensive-systems

1k

33

Hard
System Design
Premium

Service Mesh, Sidecar & Service Discovery

Once you have more than a handful of services, the cross-cutting concerns (mTLS, retries, circuit breaking, load balancing, traffic shifting, observability) start to dominate. Doing them in every service in every language is a maintenance nightmare. The sidecar pattern moves these concerns into a co-located proxy that runs next to your service, and a service mesh is the control plane that programs every sidecar in your fleet from one place. This lesson covers how a mesh actually works (data plane vs control plane, Envoy as the de-facto data plane, Istio and Linkerd as control planes), how service discovery underpins it, and the very real cost (latency tax, complexity, on-call burden) so you know when a mesh helps and when it is over-engineering.

service-mesh
sidecar-pattern
service-discovery
envoy
circuit-breaker
system-design
advanced
premium

1.1k

32

Hard
System Design
Premium

Recommendation Systems Architecture

A recommendation system at scale is a multi-stage funnel: candidate generation narrows millions of items to a few thousand, light ranking trims to a few hundred, heavy ranking scores those, and a re-ranking stage applies business and policy constraints. Each stage has a different latency budget, a different model, and a different operational profile. This lesson covers the canonical architecture (retrieval + ranking + re-ranking), the core algorithmic families (collaborative filtering, content-based, two-tower neural retrieval, sequential models), the embedding store and vector ANN serving stack, the cold-start problem, ranking objectives and the metrics that measure them, and the rollout / monitoring discipline that keeps the system honest. The goal is to leave you able to design the recommendation system for any consumer product and defend every layer's choices.

recommendation-systems
ranking
embedding
vector-search
system-design
advanced
premium
data-intensive-systems

773

15

Hard
System Design

Serverless Architecture & FaaS

Serverless does not mean 'no servers'. It means the cloud provider runs the servers, scales them to zero when idle, and bills you per request rather than per running hour. Functions-as-a-Service (Lambda, Cloud Functions, Cloud Run, Azure Functions) is the most visible flavor. The pattern is genuinely powerful for spiky workloads, glue code, and small teams who want to skip the infrastructure tax. It is genuinely a bad fit for steady high-throughput services, latency-critical paths, and stateful systems. This lesson covers how serverless actually executes (cold starts, warm pools, concurrency limits), the architectural patterns it enables, the patterns it breaks, and the honest cost model.

serverless
faas
event-driven-architecture
system-design
advanced
premium

792

3

Medium
System Design
Premium

Multi-Region, Multi-Tenant Architecture

Going from one region to many is one of the largest architectural commitments a company can make. The motivations are real (latency for global users, regulatory data residency, disaster recovery, regional uptime SLOs) and so are the costs (cross-region replication latency, conflict resolution, deployment complexity, blast-radius management, double or triple infrastructure spend). Multi-tenancy adds another orthogonal axis: how do you share the same infrastructure safely across hundreds or thousands of customers without one of them noisy-neighboring everyone else? This lesson covers active-active vs active-passive deployments, the data layer (replication, conflict handling, GDPR-style data residency), DNS and traffic routing, deployment topology, and the tenancy patterns (silo, pool, bridge) along with when each is the right answer.

multi-region
multi-tenant
system-design
advanced
premium
distributed-systems

1k

8

Hard
System Design
Premium

Search Indexing at Scale (Elasticsearch)

Search at scale is two systems in one: an indexing pipeline that ingests, transforms, and stores documents into an inverted index (and increasingly a vector index), and a query path that distributes searches across shards, scores results, and merges them under tight latency budgets. Elasticsearch and OpenSearch are the dominant production engines, and almost every large product runs one. This lesson covers the architecture: how Lucene segments and inverted indexes work, how Elasticsearch shards and replicates them, the tokenization and analyzer pipeline that determines what a 'match' means, the query coordinator -> shard fan-out -> merge flow, hybrid search (lexical + vector), reindexing strategies, and the operational realities (hot shards, mapping explosions, garbage collection pauses, write amplification). The goal is to leave you able to design and operate search for any catalog from a million to billions of documents.

search
elasticsearch
lucene
inverted-index
system-design
advanced
premium
data-intensive-systems

295

9

Hard

Behavioral Interviews

2 articles
Behavioral Interview
Premium

Navigating Technical Trade-offs

Trade-off questions are the senior-engineering judgement probe. They test whether you can weigh competing technical priorities, articulate the criteria that drove your choice, own the path you took including its costs, and distinguish real trade-offs from false choices that better engineering would dissolve. This lesson defines trade-off literacy across the canonical axes (consistency vs availability, build vs buy, simplicity vs flexibility, speed vs safety, cost vs latency), walks through the explicit-criteria framework strong candidates use to make trade-offs visible, covers the technical-debt framing that scores best in interviews, and provides fully worked model STAR answers for the prompts you will hear most. After this lesson you will be able to take any consequential technical choice from your career and tell the story so the rubric reads judgement, calibration, and ownership simultaneously.

behavioral
behavioral-interview
trade-offs
decision-making
technical-depth
system-design
scalability
interview-prep
interview-strategy
senior-interviews
story-banking

424

8

Hard
Behavioral Interview
Premium

System Design Decision Stories

System design decision questions are the staff-and-above architecture probe. They test whether you can shape a design that compounds correctly over years, demonstrate second-order thinking about how decisions interact, balance forward-looking design with iterative delivery, and tell a story that operates at the right altitude for staff scale. This lesson defines what counts as a scale-shaping decision (architectural choices whose costs and benefits compound), walks through how to present design decisions in narrative form rather than whiteboard form, covers the second-order-thinking moves that distinguish staff stories from senior stories, addresses when to over-engineer versus when to ship-and-iterate, and provides fully worked model STAR answers for the prompts you will hear most. After this lesson you will be able to take any consequential architectural decision from your career and tell the story so the rubric reads design judgement, second-order thinking, and operating at staff altitude.

behavioral
behavioral-interview
system-design
scalability
distributed-systems
decision-making
trade-offs
technical-depth
interview-prep
interview-strategy
senior-interviews
leadership-interview

789

25

Hard

Community

28 items
Article

CAP, PACELC, and the Trade-off People Misquote

CAP is a real theorem about a narrow edge case. PACELC is the framing that captures the trade-off teams actually make in production.

cap-theorem
consistency
availability
distributed-systems
system-design

1k

25

4.4 (9)

May 8, 2026

by @calebhadid

Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

system-design
interview-prep
distributed-systems
monitoring
reliability

730

9

4.3 (11)

Apr 30, 2026

by @chloesaeed

Interview Experience

Designing a Feed in 45 Minutes at a Mid-Size SaaS

A senior system design round at a mid-size B2B SaaS where the prompt was a generic activity feed but 45 minutes forced me to commit to a fan-out strategy in the first ten minutes.

system-design
system-design-interview
news-feed
scalability
senior-interviews

1.1k

26

4.3 (9)

Apr 25, 2026

by @liamsuzuki

Article

Caching Strategies: Write-Through, Write-Behind, and When Each Fits

Write-through is the safe default. Write-behind is the option for write-heavy paths. Cache-aside is what most teams actually use, and that is fine.

caching
write-through
write-back
cache-aside
system-design

226

6

4.1 (11)

Apr 23, 2026

by @vikramross

Article

RBAC vs ABAC vs ReBAC, Explained

RBAC, ABAC, and ReBAC are different shapes for different rules, not stages of maturity. Pick by the shape of your access policy, and most real systems end up a thoughtful hybrid.

rbac
authorization
security
system-design
api-design

445

2

4.0 (9)

Apr 9, 2026

by @lucasmoreau

Question Bundle
$12.99

Backend Loop Questions That Actually Test System Design

Five backend coding questions where the surface is a function but the real signal is your system-design instincts. None of them want the cleverest algorithm; all of them want the right data model and the right failure mode.

Python
backend
system-design
interview-prep
coding-interview

438

13

Mar 24, 2026

by @arjunrivera

Interview Experience

My Google L4 Interview Experience

A round-by-round account of my Google L4 software engineer loop, from recruiter screen to team match, ending in an offer.

google
interview-prep
coding-interview
system-design
behavioral

574

4

4.3 (15)

Mar 23, 2026

by @ezb1981

Article

Pagination Strategies: Offset, Cursor, and Keyset

Offset is the default that breaks under load. Keyset is what you want for most lists. Cursor is keyset wearing a public costume. Pick deliberately, not by ORM defaults.

pagination
api-design
rest-api
backend
system-design

378

5

4.2 (13)

Mar 21, 2026

by @amaragupta

Article

Building a Notification Service From Scratch

Delivery is the easy part. Preferences, dedup, throttling, and timezone-aware digests are where notification services succeed or generate complaints.

notification-service
fan-out
queue
system-design
message-queue

1k

8

4.2 (13)

Mar 18, 2026

by @sofiacollins

Article

API Gateway vs BFF vs Reverse Proxy

Three terms, three distinct concerns, three different owners. Most teams collapse them and end up with one thing pretending to be all three.

api-gateway
reverse-proxy
microservices
system-design
api-design

448

5

Mar 17, 2026

by @marcusreddy

Article

The Saga Pattern: When Distributed Transactions Aren't an Option

Why 2PC is rarely available, what a saga actually is, and the compensation design rules that separate working sagas from stuck ones.

saga-pattern
distributed-systems
two-phase-commit
microservices
system-design

606

12

4.2 (12)

Mar 12, 2026

by @meibennett

Interview Experience

The Sysdesign Round Where I Talked Myself Out of an Offer

I drew a clean diagram, then over-explained every tradeoff until the interviewer no longer trusted any of them. A postmortem on a defensible answer that still got rejected.

system-design
system-design-interview
interview-prep
interview-strategy
senior-interviews

877

24

4.2 (9)

Mar 8, 2026

by @gracebanda

Article

SSR, CSR, SSG, ISR: Pick the Right One

Four rendering strategies, four cost profiles. Pick by data freshness and personalization needs, not by which acronym sounds most modern.

frontend
performance
react
system-design

788

25

4.3 (11)

Feb 20, 2026

by @sophiegarcia

Article

Event-Driven Architecture and the Three Failure Modes

Lost messages, out-of-order delivery, duplicate processing. EDA buys decoupling and replay; the price is three failure modes you must operate.

event-driven
message-queue
kafka
distributed-systems
system-design

907

21

4.3 (12)

Feb 18, 2026

by @kavyanovak

Article

Microservices vs Monolith: An Honest Comparison

Modular monolith is the right default for most teams. Microservices earn their cost only past a specific organizational scale, and the bar is higher than the literature suggests.

microservices
monolith
system-design
trade-offs
backend

558

17

Feb 14, 2026

by @laylabauer

Article

Rate Limiting: Token Bucket vs Sliding Window

Token bucket is the right default. Sliding window log is correct but expensive. Fixed window is the algorithm I would not ship.

rate-limiting
token-bucket
sliding-window
api-design
system-design

198

2

4.2 (12)

Feb 11, 2026

by @adityadesai

Interview Experience

System Design Interview at Stripe

A senior backend system design round at Stripe centered on idempotent webhooks, the failure mode I missed, and how the interviewer pushed me from a clean diagram to a defensible one.

stripe
system-design
system-design-interview
interview-prep
senior-interviews

1.1k

34

Feb 8, 2026

by @mianair

Article

REST vs GraphQL vs RPC: Pick the Fit, Not the Trend

Three protocols, three call shapes. The wrong choice is fixable; indecision is not. Pick by caller, dominant call shape, and how much HTTP caching matters.

rest
graphql
grpc
api-design
system-design

1k

9

4.2 (13)

Jan 22, 2026

by @quinnsuzuki

Interview Experience

Cloudflare System Design: The Edge-Latency Question

A senior backend system design round at Cloudflare anchored on p99 latency at the edge, where the interviewer pushed past the obvious answers until I had to commit to a defensible number budget.

system-design
system-design-interview
distributed-systems
cdn
senior-interviews

232

2

4.2 (10)

Jan 15, 2026

by @oliviafoster

Article

Consistent Hashing Explained with a 200-Line Toy

A working Python toy of the ring, with virtual nodes, the bounded-movement test that proves the algorithm earns its complexity, and the cases where I would not reach for it.

consistent-hashing
hashing
distributed-systems
system-design
partitioning

299

6

4.4 (10)

Jan 8, 2026

by @ryanjoshi

Article

Idempotency Keys: The Pattern Stripe Taught Everyone

The key itself is the trivial part. The lifecycle, the storage, the body fingerprint, and the TTL are where production teams trip.

idempotency
stripe
api-design
system-design
reliability

577

4

4.1 (12)

Dec 31, 2025

by @chloekelly

Question Bundle
$12.99

Senior Engineer Design Questions I Actually Use

Four open-ended design prompts I ask in senior engineer loops. There is no clean LeetCode answer; I am listening for how the candidate frames the tradeoff, when they push back, and whether they can ship a v1 before optimizing.

JavaScript
senior
system-design
interview-prep
trade-offs

487

5

4.2 (10)

Dec 26, 2025

by @meerapowell

Interview Experience

Coinbase System Design Round: What "Crypto-Native" Meant

A senior backend system design round at Coinbase where the generic exchange-order-book prompt was actually grading deposit confirmations, double-spend windows, and the cold-wallet boundary.

system-design
system-design-interview
distributed-systems
interview-prep
senior-interviews

764

7

Dec 24, 2025

by @ryancastillo

Question Bundle
$14.99

Staff+ Tradeoff Questions With No Right Answer

Four staff-plus prompts where the interviewer is testing whether you can hold two answers in your head and pick the right one for a specific context. The Python is intentionally thin: this is about judgment, not syntax.

Python
staff-engineer
system-design
trade-offs
interview-prep

282

9

4.2 (10)

Dec 24, 2025

by @ethandubois

Interview Experience

Shopify Senior Engineer Loop: Take-Home Plus Architecture

A Shopify senior backend loop centered on a take-home, an architecture deep dive on what I built, and a Life Story round.

interview-prep
system-design
api-design
coding-interview
behavioral

898

14

4.3 (13)

Dec 15, 2025

by @emmadiallo

Article

CDN 101: Edge Caches, Origin Shields, and Cache Keys

The cache key matters more than the TTL. Origin shield is a cheap config win. Most CDN incidents are key bugs, not capacity bugs.

cdn
caching
origin-shield
http
system-design

1.1k

18

Dec 10, 2025

by @nadiaali

Interview Experience

Atlassian Senior SWE Loop: The Roadmap Round

How a roadmap-and-product round at Atlassian sank an otherwise solid senior backend loop, and what I would prep next time.

interview-prep
behavioral
system-design
career
failure

904

15

4.4 (11)

Dec 8, 2025

by @davidmorgan

Interview Experience

Meta E5 Backend, Phone Screen to Offer

A full-loop account of my Meta E5 backend interview, from cold-applying through team match, with the rounds and the calibration I missed.

meta
interview-prep
system-design
coding-interview
behavioral

466

4

Dec 2, 2025

by @jameszhang