Distributed Systems

0 lessons
15 system designs
1 behavioral interview
10 community items

distributed-systems

System Design

15 articles
System Design

Database Replication (Leader-Follower, Multi-Leader)

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

database-replication
leader-follower
consistency
availability
distributed-systems
failover
system-design
intermediate

204

3

Medium
System Design

Distributed Caching (Redis, Memcached)

A single-node cache eventually runs out of RAM, CPU, or network bandwidth. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.

caching
redis
memcached
consistent-hashing
distributed-systems
replication
failover
system-design
intermediate

806

6

Medium
System Design

Cache Invalidation Strategies & Consistency

There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. This lesson tackles the first one. We cover TTL-based, write-driven, and event-driven invalidation; the canonical race conditions (lost-update, double-write inconsistency, stale-after-failover); the consistency models a cache can offer; and the patterns that real systems (Facebook, Stripe, AWS) use to keep cached data trustworthy. By the end you can pick an invalidation strategy, defend it under interviewer pressure, and explain exactly why your cache will not silently serve yesterday's data.

caching
cache-invalidation
consistency
ttl
distributed-systems
race-conditions
system-design
intermediate
premium

578

12

Medium
System Design

CAP Theorem & Trade-offs

In its textbook form, the CAP theorem says a distributed data store can guarantee at most two of Consistency, Availability, and Partition tolerance. This lesson cuts through that version with the practical engineer's reading: partitions are non-negotiable, so the real choice is between consistency and availability when the network breaks. We cover what each property actually means, why CAP is misleading without PACELC, and how real systems (MongoDB, DynamoDB, Cassandra, Spanner) place themselves on the spectrum. By the end you can defend a system's CAP choice in an interview without falling into the common 'I picked CA' trap.

cap-theorem
distributed-systems
consistency
availability
partition-tolerance
system-design
beginner
free

1.1k

4

Easy
System Design

Consistency Models (Strong, Eventual, Causal)

Consistency models are the contract between a distributed data store and its clients about what they can and cannot observe. This lesson walks the spectrum from strict serializability at the strong end to eventual consistency at the relaxed end, with stops at linearizability, sequential, causal, read-your-writes, monotonic reads, and monotonic writes. We focus on what each model promises, what bugs it prevents, what it costs in latency and availability, and which production systems implement it. By the end you can name the model your system needs and explain why - the senior-level move that interviewers reward.

consistency
strong-consistency
eventual-consistency
causal-consistency
distributed-systems
cap-theorem
system-design
intermediate
free

911

4

Medium
System Design

Consistent Hashing & Data Distribution

Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached clients, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.

consistent-hashing
data-partitioning
distributed-systems
distributed-cache
database-sharding
system-design
intermediate
free

696

17

Medium
System Design
Premium

Leader Election & Consensus (Raft, Paxos)

Leader election is how a distributed cluster picks one node to be in charge so the others can stop arguing. This lesson covers the consensus problem and the FLP impossibility result, Paxos in concept, Raft in detail (leader election + log replication + safety), the role of quorum, and the operational pitfalls of split brain and network partitions. We also tour the systems that ship Raft, Paxos, or a close relative in production: etcd, ZooKeeper (Zab), Consul, CockroachDB, MongoDB, Spanner. By the end you can explain why so many modern distributed databases have a consensus protocol at their core, and you can sketch Raft on a whiteboard.

leader-election
raft
paxos
distributed-systems
consensus
consistency
fault-tolerance
system-design
advanced
premium

965

31

Hard
System Design
Premium

Distributed Transactions (2PC, Saga Pattern)

When a single business operation spans multiple services or databases, you cannot rely on a single ACID transaction. This lesson covers the two dominant patterns for keeping consistency across services: Two-Phase Commit (2PC) for synchronous, atomic, blocking transactions, and the Saga pattern (orchestration vs choreography) for long-running asynchronous workflows with compensating actions. We also cover Three-Phase Commit, idempotency keys, the outbox pattern, and the trade-offs that explain why 2PC is rare in microservices and Sagas are everywhere. By the end you can pick the right pattern for an order checkout, a money transfer, or a multi-step booking flow.

distributed-transactions
two-phase-commit
saga-pattern
distributed-systems
consistency
acid
microservices
system-design
advanced
premium

855

24

Hard
System Design

Message Queues (Kafka, RabbitMQ, SQS)

Message queues let one service hand work to another without waiting, smoothing traffic spikes, decoupling services, and surviving downstream outages. This lesson covers the two queue families (broker-based like RabbitMQ and SQS vs log-based like Kafka), the delivery semantics (at-most-once, at-least-once, exactly-once), the operational essentials (DLQs, consumer groups, backpressure, ordering), and the trade-offs that decide between Kafka, RabbitMQ, and SQS for any given workload. By the end you can pick a queue and defend the choice with the per-property reasoning interviewers reward.

message-queue
kafka
rabbitmq
sqs
async-processing
pub-sub
distributed-systems
system-design
intermediate
free

932

7

Medium
System Design

Event-Driven Architecture & Pub/Sub

Event-driven architecture (EDA) is a style where services communicate by emitting and reacting to immutable events instead of calling each other directly. This lesson covers the publish/subscribe pattern, the difference between event notification and event-carried state transfer, the role of an event bus, and how EDA reshapes coupling, scalability, and consistency. We compare it with request/response, walk through real implementations on Kafka, Kinesis, EventBridge, and SNS, and end with the operational pitfalls (event versioning, ordering, schema drift, observability) that bite teams who adopt EDA without preparation.

event-driven
pub-sub
kafka
message-queue
async-processing
distributed-systems
system-design
intermediate
premium

388

7

Medium
System Design
Premium

Stream Processing (Kafka Streams, Flink)

Stream processing is the discipline of computing on continuous, unbounded data as it arrives, instead of in periodic batches. This lesson covers the core stream-processing primitives: stateful operators, event time vs processing time, watermarks, windowing (tumbling, sliding, session), exactly-once semantics, and stateful checkpointing. We compare the leading engines (Kafka Streams, Apache Flink, Spark Structured Streaming) and walk through real production patterns: real-time analytics, fraud detection, ML feature pipelines, and CDC-driven materialized views. By the end you can sketch a Flink pipeline on a whiteboard and defend the windowing and checkpointing choices.

stream-processing
kafka
flink
event-driven
async-processing
distributed-systems
system-design
advanced
premium

949

28

Hard
System Design

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail, and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems degrade gracefully instead of crashing catastrophically (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.

fault-tolerance
redundancy
failover
circuit-breaker
reliability
availability
distributed-systems
system-design
intermediate
free

510

11

Medium
System Design

Microservices vs Monolith: When to Choose What

Microservices are not a maturity badge. Monoliths are not a code smell. The honest interview answer is that architecture is a continuum (monolith, modular monolith, services, microservices) and the right point on it is set by team size, deployment frequency, and the cost of distribution, not by what the cool kids at Netflix did. This lesson walks through the trade-offs concretely: latency tax, operational overhead, organizational coupling (Conway's Law), data consistency, and the migration paths that work. By the end you can defend either choice for a given product without reaching for buzzwords.

microservices
monolith
microservices-architecture
system-design
advanced
premium
distributed-systems

358

5

Medium
System Design
Premium

Event Sourcing & CQRS

Event Sourcing stores every change to your application state as an immutable event, and the current state is what you get when you replay them. CQRS splits the read and write paths so each can be optimized independently. Together they unlock auditability, time travel, and read/write scaling that traditional CRUD cannot. They also introduce eventual consistency, schema evolution pain, and a steep operational learning curve. This lesson teaches the mechanics, the implementation patterns (event store, snapshots, projections, sagas), and the honest answer to when these patterns are worth the cost (financial ledgers, audit-heavy domains, complex business workflows) and when they are over-engineering (a typical SaaS CRUD app).

event-sourcing
cqrs
event-driven-architecture
system-design
advanced
premium
distributed-systems

167

4

Hard
System Design
Premium

Multi-Region, Multi-Tenant Architecture

Going from one region to many is one of the largest architectural commitments a company can make. The motivations are real (latency for global users, regulatory data residency, disaster recovery, regional uptime SLOs) and so are the costs (cross-region replication latency, conflict resolution, deployment complexity, blast-radius management, double or triple infrastructure spend). Multi-tenancy adds another orthogonal axis: how do you share the same infrastructure safely across hundreds or thousands of customers without one of them noisy-neighboring everyone else? This lesson covers active-active vs active-passive deployments, the data layer (replication, conflict handling, GDPR-style data residency), DNS and traffic routing, deployment topology, and the tenancy patterns (silo, pool, bridge) along with when each is the right answer.

multi-region
multi-tenant
system-design
advanced
premium
distributed-systems

1k

8

Hard

Behavioral Interviews

1 article
Behavioral Interview
Premium

System Design Decision Stories

System design decision questions are the staff-and-above architecture probe. They test whether you can shape a design that compounds correctly over years, demonstrate second-order thinking about how decisions interact, balance forward-looking design with iterative delivery, and tell a story that operates at the right altitude for staff scale. This lesson defines what counts as a scale-shaping decision (architectural choices whose costs and benefits compound), walks through how to present design decisions in narrative form rather than whiteboard form, covers the second-order-thinking moves that distinguish staff stories from senior stories, addresses when to over-engineer versus when to ship-and-iterate, and provides fully worked model STAR answers for the prompts you will hear most. After this lesson you will be able to take any consequential architectural decision from your career and tell the story so the rubric reads design judgement, second-order thinking, and operating at staff altitude.

behavioral
behavioral-interview
system-design
scalability
distributed-systems
decision-making
trade-offs
technical-depth
interview-prep
interview-strategy
senior-interviews
leadership-interview

789

25

Hard

Community

10 items
Article

CAP, PACELC, and the Trade-off People Misquote

CAP is a real theorem about a narrow edge case. PACELC is the framing that captures the trade-off teams actually make in production.

cap-theorem
consistency
availability
distributed-systems
system-design

1k

25

4.4 (9)

May 8, 2026

by @calebhadid

Question Bundle
$12.99

Kubernetes Pod Scheduling Mental Model Drill

A 4-question reference set on how Kubernetes places pods: filter-then-score, requests vs limits, affinity and anti-affinity, and the taint/toleration machinery the control plane uses to keep pods off unhealthy nodes.

Go
kubernetes-hpa
distributed-systems
backend
interview-prep

470

5

4.3 (12)

May 5, 2026

by CodeSnatch

Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

system-design
interview-prep
distributed-systems
monitoring
reliability

730

9

4.3 (11)

Apr 30, 2026

by @chloesaeed

Article

The Saga Pattern: When Distributed Transactions Aren't an Option

Why 2PC is rarely available, what a saga actually is, and the compensation design rules that separate working sagas from stuck ones.

saga-pattern
distributed-systems
two-phase-commit
microservices
system-design

606

12

4.2 (12)

Mar 12, 2026

by @meibennett

Article

Event-Driven Architecture and the Three Failure Modes

Lost messages, out-of-order delivery, duplicate processing. EDA buys decoupling and replay; the price is three failure modes you must operate.

event-driven
message-queue
kafka
distributed-systems
system-design

907

21

4.3 (12)

Feb 18, 2026

by @kavyanovak

Interview Experience

Cloudflare System Design: The Edge-Latency Question

A senior backend system design round at Cloudflare anchored on p99 latency at the edge, where the interviewer pushed past the obvious answers until I had to commit to a defensible number budget.

system-design
system-design-interview
distributed-systems
cdn
senior-interviews

232

2

4.2 (10)

Jan 15, 2026

by @oliviafoster

Article

Consistent Hashing Explained with a 200-Line Toy

A working Python toy of the ring, with virtual nodes, the bounded-movement test that proves the algorithm earns its complexity, and the cases where I would not reach for it.

consistent-hashing
hashing
distributed-systems
system-design
partitioning

299

6

4.4 (10)

Jan 8, 2026

by @ryanjoshi

Interview Experience

Coinbase System Design Round: What "Crypto-Native" Meant

A senior backend system design round at Coinbase where the generic exchange-order-book prompt was actually grading deposit confirmations, double-spend windows, and the cold-wallet boundary.

system-design
system-design-interview
distributed-systems
interview-prep
senior-interviews

764

7

Dec 24, 2025

by @ryancastillo

Question Bundle
$19.99

The Netflix Senior Bar-Raiser Set

Six Python questions that map to the bar-raiser part of a senior streaming-platform loop. The set rewards depth over speed: each prompt has a follow-up about failure modes, observability, or rollout.

Python
interview-prep
netflix
bar-raiser
distributed-systems

745

15

4.4 (14)

Dec 4, 2025

by @arjunrivera

Question Bundle
$14.99

Distributed Locks and Leases Deep Dive

A 5-question reference set on distributed mutual exclusion: Redis SETNX with fencing tokens, etcd leases, Raft leader election, optimistic vs pessimistic locking tradeoffs, and cross-region home-region pinning.

Java
distributed-systems
consensus
concurrency
interview-prep

701

7

Nov 24, 2025

by CodeSnatch