Availability

System Design

3 articles

Database Replication (Leader-Follower, Multi-Leader)

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

CAP Theorem & Trade-offs

The CAP theorem, in its textbook form, says a distributed data store can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. This lesson cuts through the textbook version with the practical engineer's reading: partitions are non-negotiable, so the real choice is between consistency and availability when the network breaks. We cover what each property actually means, why CAP is misleading without PACELC, and how real systems (MongoDB, DynamoDB, Cassandra, Spanner) place themselves on the spectrum. By the end you can defend a system's CAP choice in an interview without falling into the common 'I picked CA' trap.

Fault Tolerance, Redundancy & Failover

Fault tolerance is the property that lets a system keep working when components fail - and at any reasonable scale, components are always failing. This lesson covers the building blocks: redundancy (active-active, active-passive), failure detection (health checks, heartbeats), failover (automatic, manual), and the patterns that let systems degrade gracefully instead of crashing catastrophically (circuit breakers, retries with backoff, bulkheads, timeouts). We finish with the operational disciplines that turn architecture into reality: chaos engineering, runbooks, blast-radius analysis, and disaster recovery (RTO/RPO). By the end you can design a system that survives the failure modes interviewers love to throw at you.
