System Design Article

Database Replication (Leader-Follower, Multi-Leader)

Difficulty: Medium

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

System Design
/

Database Replication (Leader-Follower, Multi-Leader)

Database Replication (Leader-Follower, Multi-Leader)

Replication keeps copies of your data on multiple servers so you can survive failures, scale reads, and serve users from the nearest region. This lesson covers the three replication topologies (leader-follower, multi-leader, leaderless), the mechanics of synchronous and asynchronous replication, the consistency surprises that come with replication lag, and how to design failover and conflict resolution. By the end you can pick a topology and defend it in an interview, and recognize the bug class behind 'I just wrote it but the read says it does not exist'.

System Design
Medium
database-replication
leader-follower
consistency
availability
distributed-systems
failover
system-design
intermediate

204 views

3

Why Replicate?

Replication serves three different goals, often at the same time:

  1. Availability - if one server crashes, another can take over.
  2. Read scalability - many followers can serve read traffic in parallel.
  3. Geographic locality - place a replica near each region so user reads are fast.

What replication does not automatically buy you is write scalability. In leader-follower replication, every write still funnels through one leader. To scale writes you need sharding (a separate lesson) or multi-leader replication.

The Three Topologies

1. Leader-Follower (Primary-Replica)

One node is the leader. All writes go to the leader. The leader streams its change log to one or more followers, which apply the changes in the same order. Reads can go to either leader or followers.

Text
---------- Leader-Follower topology ----------
                  writes
   client ---------------> [ leader ]
                              | (replication stream)
                              v
                          [ follower ] [ follower ] [ follower ]
        reads ---------------> any of: leader or followers

This is the default in PostgreSQL, MySQL, MongoDB replica sets, Redis, and most managed cloud databases. It is the simplest topology that gives you failover and read scaling.

Pros: simple, strong consistency on the leader, well-understood failover. Cons: writes do not scale beyond one leader, followers can lag, failover is a non-trivial dance.

2. Multi-Leader (Active-Active)

Two or more nodes accept writes. Each leader replicates its changes to the others. Useful when:

  • Multiple data centers must each accept writes locally for low latency.
  • You need to keep writing during a partition between regions.
  • Offline-capable clients (mobile apps) act as 'leaders' that sync when online.
Text
---------- Multi-leader topology (two regions) ----------
  region us-east             region eu-west
   client                      client
     |                           |
     v                           v
  [ leader ]  <-- replication --> [ leader ]
     |                           |
     v                           v
  [ followers ]               [ followers ]

Engines: Cassandra (every node is a leader within its replication group), CockroachDB, MySQL Group Replication, BDR for PostgreSQL, Aurora Multi-Master.

Pros: writes happen locally in every region, survives whole-region failures. Cons: write conflicts between leaders are unavoidable and need a resolution strategy.

3. Leaderless (Dynamo-style)

No special leader. Every write is sent to several replicas in parallel; the client waits until W of them acknowledge. Reads go to several replicas; the client waits for R to respond and picks the latest version. With N replicas, the inequality R + W > N gives you strong consistency for any single key.

Engines: Cassandra (also leaderless at the partition level), Riak, Amazon DynamoDB internally, Voldemort.

Pros: no special node to fail over, tunable consistency per request, scales horizontally. Cons: every read can hit multiple nodes (latency cost), conflicts are common and need resolution (vector clocks, CRDTs).

Synchronous vs Asynchronous Replication

The second axis: when does the leader confirm the write to the client?

Asynchronous

  • Leader writes locally and acknowledges the client immediately.
  • Followers receive the change later (milliseconds to seconds).
  • Risk: if the leader crashes before any follower receives the write, that write is lost.
  • Benefit: fast writes, low latency, the standard mode for most setups.

Synchronous

  • Leader waits until at least one follower has durably stored the write before acknowledging.
  • Risk: if the follower is slow or down, the leader stalls. One failed follower can cause an outage.
  • Benefit: guaranteed durability across machines.

Semi-Synchronous (the practical middle ground)

  • One follower is synchronous, the rest are async. If the sync follower lags, it is silently demoted and another follower takes its place.
  • Used in MySQL semi-sync, PostgreSQL synchronous_commit = remote_apply with multiple potential sync followers, MongoDB write concern w: majority.

A practical configuration: write-concern majority in MongoDB or synchronous_standby_names with quorum in PostgreSQL gives durable, low-latency writes across an odd number of replicas.

Replication Lag and Consistency Anomalies

The single hardest concept in replication is replication lag: the time between a write landing on the leader and the same write becoming visible on a follower. Lag is usually milliseconds, but under load it can stretch to seconds or even minutes.

Lag creates three classic anomalies. Memorize them - they appear in interviews and bug reports.

Anomaly 1: Stale Read

Text
---------- Stale read after write ----------
T0  client writes 'name=Alice' to leader        OK
T1  client immediately reads from a follower    returns the old name

The write committed, but the follower has not received it yet.

Mitigation - read-your-writes consistency:

  • Route reads of a user's own data to the leader for a short window after they write.
  • Track the latest replication position (LSN, oplog timestamp) per session and only read from a follower that has caught up to it.
  • For account-critical data, simply read from the leader.

Anomaly 2: Monotonic Reads Violated

Text
---------- Monotonic read violation ----------
T0  client reads from follower A    sees version v5
T1  client reads from follower B    sees version v4 (older!)

The user appears to time-travel because consecutive reads went to different followers.

Mitigation: pin a session to the same follower, or pick a follower whose log position is at least as new as the previous read.

Anomaly 3: Causal Inconsistency

A reply appears before the original message because the original arrived to a slower follower second. Mitigation: vector clocks, causal tokens, or sequencing all related operations through the same partition.

Failover: Promoting a Follower

When the leader fails, the system must promote a follower to take its place. This sounds simple and is famously not.

The dance:

  1. Detect failure - heartbeat / health check timeout. False positives waste your time; long timeouts increase outage duration.
  2. Choose a new leader - typically the follower with the most up-to-date replication log. Built-in tools: Patroni for Postgres, MySQL Orchestrator, MongoDB's automatic election within a replica set.
  3. Reconfigure clients - update connection strings or DNS so traffic flows to the new leader.
  4. Recover the old leader - when it comes back, it must rejoin as a follower and reconcile any writes that did not make it to the new leader.

Hazards:

  • Lost writes - asynchronous replication means recent writes on the failed leader may never have reached the new one. Some writes are silently lost. Customers see 'I'm sure I clicked save'.
  • Split brain - the old leader and the new leader both think they are leader, both accept writes. Fencing tokens, leases, and STONITH ('Shoot The Other Node In The Head') are mitigations.
  • Cascading overload - the new leader inherits all traffic and may be slower than the old, triggering further failures.

This is why automated failover should be tested in production-like environments quarterly. It is not a tool you set and forget.

Pseudocode for read-your-writes routing

async function readUserProfile(userId, ctx) {
    // If this session wrote in the last 30s, read from leader
    if (ctx.lastWriteAt && Date.now() - ctx.lastWriteAt < 30_000) {
        return await leaderDb.query('SELECT * FROM users WHERE id = $1', [userId]);
    }
    return await followerDb.query('SELECT * FROM users WHERE id = $1', [userId]);
}

async function updateUserProfile(userId, patch, ctx) {
    const result = await leaderDb.query('UPDATE users SET ... WHERE id = $1', [userId]);
    ctx.lastWriteAt = Date.now();
    return result;
}

Conflict Resolution in Multi-Leader Setups

With two leaders, the same row can be updated in two places at the same time. The replication stream brings the conflicting versions together and the system must decide which wins.

StrategyHow it worksWhen to use
Last-Write-Wins (LWW)Compare timestamps, newest wins. Loses writes silently when clocks drift.Caches, sessions, anywhere data loss is recoverable.
Application-level mergeSurface the conflict, let business logic merge (e.g., shopping cart union).Carts, collaborative documents.
CRDTs (Conflict-Free Replicated Data Types)Math-guaranteed mergeable types: counters, sets, sequences.Real-time collab, offline mobile sync, Redis Enterprise CRDB.
Operational Transform (OT)Re-base each operation against concurrent ones.Google Docs, Notion, collaborative editors.

The key insight: multi-leader without a conflict strategy will silently corrupt your data. Always choose one before deploying.

How Replication Streams Work Under the Hood

  • PostgreSQL: streams the WAL (Write-Ahead Log) byte-for-byte. Followers replay WAL records to rebuild state. Logical replication (Postgres 10+) decodes the WAL into row-level changes for cross-version or cross-cluster replication.
  • MySQL: streams the binlog in row, statement, or mixed format. Async by default. Group Replication adds a Paxos-like protocol for multi-leader.
  • MongoDB: every write is appended to the oplog capped collection on the primary; secondaries tail it. Elections use the Raft protocol.
  • Cassandra: every coordinator forwards each write to N replicas. No leader; consistency is a per-request choice via R, W, and N.
  • Amazon Aurora: storage layer is shared across instances; replicas read the same storage volume rather than replaying a log. Replication lag is microseconds, not milliseconds.

Knowing these specifics is what separates a junior answer from a staff-level answer.

Real-World Examples

How real systems implement this in production

GitHub MySQL outage (2018)

A 43-second network partition caused MySQL Orchestrator to promote a stale follower in a remote DC as the new leader. When the original leader rejoined, the clusters had diverged, requiring 24 hours of manual reconciliation.

Trade-off: Aggressive automated failover protects against single-node outages but can multiply the blast radius of network partitions. Many teams now favor a slightly slower, human-in-the-loop failover for cross-region scenarios.

Amazon Aurora

Aurora pushes replication into the storage layer: compute replicas read from the same distributed storage volume that is replicated 6-ways across 3 availability zones. Read replica lag is around 100us because there is no log replay.

Trade-off: The shared storage architecture is more expensive per byte than vanilla Postgres or MySQL, and Aurora is locked to AWS. The latency win matters most for read-heavy workloads with strict freshness needs.

Cassandra multi-datacenter

Cassandra deploys leaderless replication within a DC and asynchronous replication between DCs. With LOCAL_QUORUM in each region you get fast local writes and EACH_QUORUM gives cross-region durability when needed.

Trade-off: Conflicts are resolved by last-write-wins on per-cell timestamps, which works for time series but can silently drop concurrent updates in OLTP-style workloads.

Quick Interview Phrases

Key terms to use in your answer

leader-follower replication
replication lag
read-your-writes
split brain
synchronous commit
quorum-based election

Common Interview Questions

Questions you might be asked about this topic

Health check fails after timeout. The orchestrator (Patroni, repmgr) inspects each follower's WAL position, picks the most up-to-date one, and promotes it. Clients reconnect via a new connection string or virtual IP. The old leader, when it returns, must rewind to the new leader's WAL position via pg_rewind and rejoin as a follower. Mention the risk of lost async writes and how you'd reduce it with synchronous_standby_names.

Interview Tips

How to discuss this topic effectively

1

Always state the topology and the sync mode in the same sentence: 'leader-follower with semi-synchronous replication to one local follower and async to the cross-region follower'. That precision is what staff-level interviewers look for.

2

When the interviewer mentions a multi-region requirement, immediately raise the question of write topology. 'Do we need single-leader for simplicity, or multi-leader for local writes per region?' shows you understand the architectural fork.

3

Bring up replication lag proactively. Say 'we'll need read-your-writes consistency for the user's own data; for everyone else's data, eventual consistency is fine'. Interviewers love this because it shows you've shipped real systems.

4

Discuss failover as a process, not a magic feature. Mention quorum, fencing tokens, and the 'old leader rejoins' problem. This is where senior engineers earn their stripes.

5

Know one concrete tool per engine: Patroni or repmgr for Postgres, Orchestrator for MySQL, replica sets for MongoDB. Naming a tool signals operational experience.

Common Mistakes

Pitfalls to avoid in interviews

Assuming replication automatically scales writes

In leader-follower replication, every write goes through one leader. Replication scales reads and improves availability, but write throughput is bounded by the leader. To scale writes you need multi-leader replication or sharding.

Treating async replication as 'safe enough'

Async replication can lose recently-acknowledged writes if the leader fails before any follower received them. For ledger or payment data, configure semi-synchronous replication or write-concern majority so the write is durable on at least one peer before acking.

Using last-write-wins for everything in multi-leader setups

LWW relies on synchronized clocks and silently drops one of two concurrent writes. For cart contents, document edits, or any state where data loss is unacceptable, use CRDTs, application merges, or operational transforms.

Ignoring split brain when designing failover

Without quorum-based elections and fencing, a network partition can produce two leaders that diverge. At reunion the cluster must discard one side's writes. Always require a majority for elections and fence the old leader before promoting.

Routing all read traffic to followers without thinking

Followers serve stale data. Reads that need to see the user's own recent write must go to the leader (or wait for the follower to catch up). Build read routing that's aware of session write history, not a global 'reads-go-to-followers' switch.