Distributed Systems

0 lessons

15 system designs

1 behavioral interview

10 community items

System Design

15 articles

System Design

Database Replication (Leader-Follower, Multi-Leader)

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

204

Distributed Caching (Redis, Memcached)

A single-node cache eventually runs out of RAM, CPU, or network. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.

806

Cache Invalidation Strategies & Consistency

There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. This lesson tackles the first one. We cover TTL-based, write-driven, and event-driven invalidation; the canonical race conditions (lost-update, double-write inconsistency, stale-after-failover); the consistency models a cache can offer; and the patterns that real systems (Facebook, Stripe, AWS) use to keep cached data trustworthy. By the end you can pick an invalidation strategy, defend it under interviewer pressure, and explain exactly why your cache will not silently serve yesterday's data.

578

CAP Theorem & Trade-offs

The CAP theorem is usually summarized as "of Consistency, Availability, and Partition tolerance, a distributed data store can guarantee only two". This lesson cuts through the textbook version with the practical engineer's reading: partitions are non-negotiable, so the real choice is between consistency and availability when the network breaks. We cover what each property actually means, why CAP is misleading without PACELC, and how real systems (MongoDB, DynamoDB, Cassandra, Spanner) place themselves on the spectrum. By the end you can defend a system's CAP choice in an interview without falling into the common 'I picked CA' trap.

1.1k

Consistency Models (Strong, Eventual, Causal)

Consistency models are the contract between a distributed data store and its clients about what they can and cannot observe. This lesson walks the spectrum from strict serializability at the strong end to eventual consistency at the relaxed end, with stops at linearizability, sequential consistency, causal consistency, and the session guarantees (read-your-writes, monotonic reads, monotonic writes). We focus on what each model promises, what bugs it prevents, what it costs in latency and availability, and which production systems implement it. By the end you can name the model your system needs and explain why, which is the senior-level move that interviewers reward.

911

Consistent Hashing & Data Distribution

Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.

696
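
As a rough illustration of why `hash(key) % N` is considered broken, the sketch below counts how many keys change owner when a fifth node joins, first under modulo placement and then on a hash ring with virtual nodes. The node names, the `HashRing` class, and the choice of MD5 as a stable hash are assumptions made for this example, not details taken from the lesson.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value: str) -> int:
    """Hash that is stable across processes (unlike the built-in hash())."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_owner(key: str, nodes: list) -> str:
    """Naive placement: owner index is hash(key) % N."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Hypothetical minimal ring: each node is placed at `vnodes` points."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
old_nodes = ["cache-a", "cache-b", "cache-c", "cache-d"]
new_nodes = old_nodes + ["cache-e"]

modulo_moved = moved_fraction(
    lambda k: modulo_owner(k, old_nodes),
    lambda k: modulo_owner(k, new_nodes),
    keys,
)
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"modulo: {modulo_moved:.0%} of keys moved, ring: {ring_moved:.0%} moved")
```

Going from 4 to 5 nodes, modulo placement should remap roughly four keys in five, while the ring should move only about the one-fifth share that the new node takes over. On the ring, every key that moves lands on the new node.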

System Design

Premium

Leader Election & Consensus (Raft, Paxos)

Leader election is how a distributed cluster picks one node to be in charge so the others can stop arguing. This lesson covers the consensus problem and the FLP impossibility result, Paxos in concept, Raft in detail (leader election + log replication + safety), the role of quorum, and the operational pitfalls of split brain and network partitions. We also tour the systems that run Raft, Paxos, or a close relative in production: etcd, ZooKeeper, Consul, CockroachDB, MongoDB, Spanner. By the end you can explain why so many modern distributed databases have a consensus protocol at their core, and you can sketch Raft on a whiteboard.

leader-election

raft

paxos

distributed-systems

consensus

consistency

fault-tolerance

system-design

advanced

premium

965

Hard
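
The role of quorum mentioned above comes down to a little arithmetic, sketched here under the usual majority-quorum assumption. The function names are hypothetical and chosen only for this example.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader only with votes from a majority."""
    return votes_received >= majority(cluster_size)


for size in (3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
```

Two disjoint groups of nodes cannot both hold a majority, which is why a partitioned cluster cannot elect two leaders for the same term. The same arithmetic shows that a 4-node cluster tolerates no more failures than a 3-node one, so odd cluster sizes are the common choice.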

System Design

Premium

Distributed Transactions (2PC, Saga Pattern)

When a single business operation spans multiple services or databases, you cannot rely on a single ACID transaction. This lesson covers the two dominant patterns for keeping consistency across services: Two-Phase Commit (2PC) for synchronous, atomic, blocking transactions, and the Saga pattern (orchestration vs choreography) for long-running asynchronous workflows with compensating actions. We also cover Three-Phase Commit, idempotency keys, the outbox pattern, and the trade-offs that explain why 2PC is rare in microservices and Sagas are everywhere. By the end you can pick the right pattern for an order checkout, a money transfer, or a multi-step booking flow.

distributed-transactions

two-phase-commit

saga-pattern

distributed-systems

consistency

acid

microservices

system-design

advanced

premium

855

Hard
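
A minimal sketch of an orchestrated saga, assuming each step is paired with a compensating action. The step names, `run_saga`, and the in-memory log are illustrative only. A production orchestrator would also persist its progress and make each step idempotent.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action); all three are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False
    return True


def fail_payment() -> None:
    raise RuntimeError("card declined")


log: List[str] = []
ok = run_saga(
    [
        ("reserve_inventory", lambda: None, lambda: None),
        ("charge_card", fail_payment, lambda: None),
        ("create_shipment", lambda: None, lambda: None),
    ],
    log,
)
print(ok, log)
```

Unlike 2PC, nothing is held locked while the workflow runs. The cost is that other readers can observe the intermediate state (inventory reserved, payment not yet taken) until the compensation completes.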

System Design

Message Queues (Kafka, RabbitMQ, SQS)

Message queues let one service hand work to another without waiting, smoothing traffic spikes, decoupling services, and surviving downstream outages. This lesson covers the two queue families (broker-based like RabbitMQ and SQS vs log-based like Kafka), the delivery semantics (at-most-once, at-least-once, exactly-once), the operational essentials (DLQs, consumer groups, backpressure, ordering), and the trade-offs that decide between Kafka, RabbitMQ, and SQS for any given workload. By the end you can pick a queue and defend the choice with the per-property reasoning interviewers reward.

932

Event-Driven Architecture & Pub/Sub

Event-driven architecture (EDA) is a style where services communicate by emitting and reacting to immutable events instead of calling each other directly. This lesson covers the publish/subscribe pattern, the difference between event notification and event-carried state transfer, the role of an event bus, and how EDA reshapes coupling, scalability, and consistency. We compare it with request/response, walk through real implementations on Kafka, Kinesis, EventBridge, and SNS, and end with the operational pitfalls (event versioning, ordering, schema drift, observability) that bite teams who adopt EDA without preparation.

388

System Design

Premium

Stream Processing (Kafka Streams, Flink)

Stream processing is the discipline of computing on continuous, unbounded data as it arrives, instead of in periodic batches. This lesson covers the core stream-processing primitives: stateful operators, event time vs processing time, watermarks, windowing (tumbling, sliding, session), exactly-once semantics, and stateful checkpointing. We compare the leading engines (Kafka Streams, Apache Flink, Spark Structured Streaming) and walk through real production patterns: real-time analytics, fraud detection, ML feature pipelines, and CDC-driven materialized views. By the end you can sketch a Flink pipeline on a whiteboard and defend the windowing and checkpointing choices.

stream-processing

kafka

flink

event-driven

async-processing

distributed-systems

system-design

advanced

premium

949

Hard
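
To make the tumbling-window idea concrete, here is a small sketch in plain Python rather than Kafka Streams or Flink. It assigns each event to a fixed-size window by its event time, so events that arrive out of order still land in the correct window. The function name and the `(event_time_seconds, key)` event shape are assumptions for this example, and watermarks and late-data handling are deliberately left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using event time, not arrival order."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        # Tumbling windows are fixed-size and non-overlapping.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# The last event arrives late but carries an event time in the first window.
events = [(0, "a"), (59, "a"), (60, "a"), (61, "b"), (30, "b")]
counts = tumbling_window_counts(events, window_seconds=60)
print(counts)
```

A real engine cannot wait for all input before emitting results, which is where watermarks come in: they let the engine decide when a window is complete enough to emit.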

System Design

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail - and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that make systems gracefully degrade instead of catastrophically crash (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.

510
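
One of the patterns named above, retries with backoff, can be sketched as follows. This version uses exponential backoff with full jitter. The function name, default values, and injectable `sleep` parameter are assumptions made for illustration, and a real client would normally retry only on errors known to be transient.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=5,
    base_delay=0.1,
    max_delay=2.0,
    sleep=time.sleep,
):
    """Call operation(); on failure wait a random, exponentially capped delay."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))


attempts = {"n": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, len(delays))
```

Capping both the delay and the number of attempts bounds the extra load that retries place on a dependency that is already struggling.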

Microservices vs Monolith: When to Choose What

Microservices are not a maturity badge. Monoliths are not a code smell. The honest interview answer is that architecture is a continuum (monolith, modular monolith, services, microservices) and the right point on it is set by team size, deployment frequency, and the cost of distribution, not by what the cool kids at Netflix did. This lesson walks through the trade-offs concretely: latency tax, operational overhead, organizational coupling (Conway's Law), data consistency, and the migration paths that work. By the end you can defend either choice for a given product without reaching for buzzwords.

microservices

monolith

microservices-architecture

358

System Design

Premium

Event Sourcing & CQRS

Event Sourcing stores every change to your application state as an immutable event, and the current state is what you get when you replay them. CQRS splits the read and write paths so each can be optimized independently. Together they unlock auditability, time travel, and read/write scaling that traditional CRUD cannot. They also introduce eventual consistency, schema evolution pain, and a steep operational learning curve. This lesson teaches the mechanics, the implementation patterns (event store, snapshots, projections, sagas), and the honest answer to when these patterns are worth the cost (financial ledgers, audit-heavy domains, complex business workflows) and when they are over-engineering (a typical SaaS CRUD app).

event-sourcing

cqrs

event-driven-architecture

system-design

advanced

premium

distributed-systems

167

Hard
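
The claim that current state is what you get when you replay the events can be shown with a fold over an event list. The sketch below assumes a hypothetical bank-account domain with `deposited` and `withdrawn` events. It also shows how a snapshot shortens the replay.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """Immutable fact about something that happened (hypothetical schema)."""
    kind: str
    amount: int


def apply(balance: int, event: Event) -> int:
    """Pure transition function: previous state + event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: Iterable[Event], initial: int = 0) -> int:
    """Current state is the fold of all events over the initial state."""
    return reduce(apply, events, initial)


history = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
balance = replay(history)

# A snapshot stores the state after the first two events,
# so later reads only replay the events recorded after it.
snapshot = replay(history[:2])
balance_from_snapshot = replay(history[2:], initial=snapshot)
print(balance, balance_from_snapshot)
```

Because `apply` is a pure function, a read-side projection in a CQRS setup can be rebuilt at any time by replaying the same events into a different fold.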

System Design

Premium

Multi-Region, Multi-Tenant Architecture

Going from one region to many is one of the largest architectural commitments a company can make. The motivations are real (latency for global users, regulatory data residency, disaster recovery, regional uptime SLOs) and so are the costs (cross-region replication latency, conflict resolution, deployment complexity, blast-radius management, double or triple infrastructure spend). Multi-tenancy adds another orthogonal axis: how do you share the same infrastructure safely across hundreds or thousands of customers without one of them noisy-neighboring everyone else? This lesson covers active-active vs active-passive deployments, the data layer (replication, conflict handling, GDPR-style data residency), DNS and traffic routing, deployment topology, and the tenancy patterns (silo, pool, bridge) along with when each is the right answer.

multi-region

multi-tenant

system-design

advanced

premium

distributed-systems

Hard

Behavioral Interviews

1 article

Behavioral Interview

Premium

System Design Decision Stories

System design decision questions are the staff-and-above architecture probe. They test whether you can shape a design that holds up as it compounds over years, show second-order thinking about how decisions interact, balance forward-looking design with iterative delivery, and tell the story at the altitude expected of a staff engineer. This lesson defines what counts as a scale-shaping decision (an architectural choice whose costs and benefits compound), shows how to present design decisions in narrative form rather than whiteboard form, covers the second-order-thinking moves that distinguish staff stories from senior stories, addresses when to over-engineer versus when to ship and iterate, and provides fully worked model STAR answers for the prompts you will hear most. After this lesson you will be able to take any consequential architectural decision from your career and tell the story so that it demonstrates design judgement, second-order thinking, and staff-level altitude.

behavioral

behavioral-interview

system-design

scalability

distributed-systems

decision-making

trade-offs

technical-depth

interview-prep

interview-strategy

senior-interviews

leadership-interview

789

Hard

Community

10 items

Article

CAP, PACELC, and the Trade-off People Misquote

CAP is a real theorem about a narrow edge case. PACELC is the framing that captures the trade-off teams actually make in production.

4.4 (9)

May 8, 2026

by @calebhadid

Question Bundle

$12.99

Kubernetes Pod Scheduling Mental Model Drill

A 4-question reference set on how Kubernetes places pods: filter-then-score, requests vs limits, affinity and anti-affinity, and the taint/toleration machinery Kubernetes uses to keep pods off unhealthy nodes.

470

4.3 (12)

May 5, 2026

by CodeSnatch

Interview Experience

Datadog Onsite: Five Hours of System Design

A Datadog senior backend onsite where four of the five rounds were system design, anchored on real telemetry-shaped problems.

730

4.3 (11)

Apr 30, 2026

by @chloesaeed

Article

The Saga Pattern: When Distributed Transactions Aren't an Option

Why 2PC is rarely available, what a saga actually is, and the compensation design rules that separate working sagas from stuck ones.

606

4.2 (12)

Mar 12, 2026

by @meibennett

Article

Event-Driven Architecture and the Three Failure Modes

Lost messages, out-of-order delivery, duplicate processing. EDA buys decoupling and replay; the price is three failure modes you must operate.

907

4.3 (12)

Feb 18, 2026

by @kavyanovak

Interview Experience

Cloudflare System Design: The Edge-Latency Question

A senior backend system design round at Cloudflare anchored on p99 latency at the edge, where the interviewer pushed past the obvious answers until I had to commit to a defensible number budget.

system-design

system-design-interview

distributed-systems

cdn

senior-interviews

232

4.2 (10)

Jan 15, 2026

by @oliviafoster

Article

Consistent Hashing Explained with a 200-Line Toy

A working Python toy of the ring, with virtual nodes, the bounded-movement test that proves the algorithm earns its complexity, and the cases where I would not reach for it.

299

4.4 (10)

Jan 8, 2026

by @ryanjoshi

Interview Experience

Coinbase System Design Round: What "Crypto-Native" Meant

A senior backend system design round at Coinbase where the generic exchange-order-book prompt was actually grading deposit confirmations, double-spend windows, and the cold-wallet boundary.

system-design

system-design-interview

distributed-systems

interview-prep

senior-interviews

764

Dec 24, 2025

by @ryancastillo

Question Bundle

$19.99

The Netflix Senior Bar-Raiser Set

Six Python questions that map to the bar-raiser part of a senior streaming-platform loop. The set rewards depth over speed: each prompt has a follow-up about failure modes, observability, or rollout.

745

4.4 (14)

Dec 4, 2025

by @arjunrivera

Question Bundle

$14.99

Distributed Locks and Leases Deep Dive

A 5-question reference set on distributed mutual exclusion: Redis SETNX with fencing tokens, etcd leases, Raft leader election, optimistic vs pessimistic locking trade-offs, and cross-region home-region pinning.

701

Nov 24, 2025

by CodeSnatch