System Design Article
Design a Key-Value Store (DynamoDB)
Difficulty: Medium
Design a Dynamo-style distributed key-value store that scales linearly to thousands of nodes, stays available during partitions, and offers tunable consistency through a quorum (N, W, R). The interview centerpiece is the trio that makes this work at scale: consistent hashing with virtual nodes for partitioning, N/W/R quorums for replication and consistency, and vector clocks for resolving concurrent writes. We cover the gossip protocol for membership, Merkle trees for anti-entropy, hinted handoff for transient failures, sloppy quorum for write availability during partitions, and the LSM-tree storage engine that powers each node.
Design a Key-Value Store (DynamoDB)
Design a Dynamo-style distributed key-value store that scales linearly to thousands of nodes, stays available during partitions, and offers tunable consistency through a quorum (N, W, R). The interview centerpiece is the trio that makes this work at scale: consistent hashing with virtual nodes for partitioning, N/W/R quorums for replication and consistency, and vector clocks for resolving concurrent writes. We cover the gossip protocol for membership, Merkle trees for anti-entropy, hinted handoff for transient failures, sloppy quorum for write availability during partitions, and the LSM-tree storage engine that powers each node.
457 views
13
Requirements
Functional Requirements
- GET (key) -> value with optional consistency level (one, quorum, all).
- PUT (key, value) that durably stores or updates the value.
- DELETE (key) that removes the value and propagates the deletion.
- Tunable consistency per request: callers choose between fast/eventually consistent and slower/strongly consistent.
- Versioning: support concurrent writes with conflict detection (vector clocks); the application or store decides the resolution policy.
- Auto scale: add nodes to the cluster without downtime; existing data redistributes minimally.
- Multi-data-center: optionally replicate across data centers with configurable replication factor per DC.
Out of Scope (state explicitly)
- Range queries / secondary indexes (this is a pure KV store; range support is a separate feature).
- Strong cross-key transactions (we offer single-key atomicity only).
- Schema management (values are opaque blobs to the store).
- Client-side caching (client libraries optional).
Non-Functional Requirements
- Highly available: 99.99%, even during node failures and network partitions. Lean AP per CAP.
- Tunable consistency: callers pay the cost of consistency only when they want it.
- Low latency: p99 < 10 ms for GET at consistency=ONE; p99 < 30 ms at consistency=QUORUM.
- Linear scale: doubling nodes ~doubles throughput; throughput grows with cluster size, not node size.
- Durable: no acknowledged write is lost under single-node failure; quorum writes survive a minority partition.
- Operationally simple: no manual rebalancing, no leader elections per partition, no specialized hardware.
Back-of-the-Envelope Estimation
Capacity
---------- Storage and traffic ----------
Keys: 1B
Average value size: ~1 KB
Replication factor (N): 3
Logical data: ~1 TB
Replicated data: ~3 TB
Daily ops (R + W): ~10B
Average ops/sec: ~120K
Peak ops/sec: ~500K
Cluster nodes (each ~16 GB RAM, 1 TB SSD): ~50 to startWith 50 nodes each holding ~60 GB and ~10K ops/sec, headroom for 5x growth before adding nodes.
Network and Replication Overhead
---------- Replication overhead ----------
Every PUT writes to N=3 nodes; a PUT at QUORUM (W=2) waits for 2 acks but still writes 3.
Replication multiplier: 3x writes; 1x reads
Inter-node bandwidth at 500K writes/sec * 1 KB * 3 replicas: ~1.5 GB/sec across the cluster
Per-node replication traffic: ~30 MB/sec at 50 nodes (manageable)Hot Key Risk
---------- Hot key estimation ----------
With virtual nodes (256 vnodes per physical node), a single popular key still lives on one physical node.
If one key receives 10K ops/sec, that's 10% of one node's budget; we live with it or precompute / cache hot reads.
For truly hot keys we replicate beyond N (e.g., N=5) and let R=1 reads fan out.High-Level Design
---------- Architecture overview ----------
+-----------+
| Client |
+-----------+
|
v (any node accepts requests; called the coordinator)
+------------------+
| Coordinator | (one of the cluster nodes)
+------------------+
|
v (forward to N replicas via consistent hash)
+-------- ring of nodes (consistent hashing) --------+
| n1 n2 n3 n4 n5 ... n50 |
| N replicas = next clockwise nodes |
+-----------------------------------------------------+
|
v (per-node)
+------------------+
| Storage engine | (LSM tree on SSD)
+------------------+Every node is symmetric: any node can be the coordinator for any request. There is no master. Membership is gossiped peer-to-peer.
Public API
// Client SDK style
put(key, value, consistency=QUORUM, version=null)
get(key, consistency=ONE)
delete(key, consistency=QUORUM)
// Wire-level (gRPC)
rpc Put(PutRequest) returns (PutResponse);
rpc Get(GetRequest) returns (GetResponse);
rpc Delete(DeleteRequest) returns (DeleteResponse);
message PutRequest {
bytes key = 1;
bytes value = 2;
Consistency consistency = 3;
repeated VectorClockEntry vclock = 4; // for concurrent-write detection
}
enum Consistency { ONE = 0; QUORUM = 1; ALL = 2; }Request Flow
- Client sends PUT to any node (the coordinator). Smart clients pick a node that owns the key's range; dumb clients hash anywhere.
- Coordinator computes the key's preference list (N nodes responsible for the key).
- Coordinator sends the write to all N concurrently and waits for W acks before responding.
- Each replica appends to its commit log, updates the LSM memtable, and acks.
- If a replica is unreachable, the coordinator stores a hinted handoff for it on a substitute node.
Detailed Design
The two interesting components are consistent hashing with virtual nodes and the N/W/R quorum with vector clocks for conflict resolution.
Consistent Hashing with Virtual Nodes
A naive sharding scheme (key -> hash(key) mod N) breaks badly when N changes; nearly every key remaps. Consistent hashing maps both keys and nodes onto a ring (e.g., a 64-bit hash space). A key is owned by the first node clockwise from it.
---------- Consistent hash ring ----------
Hash space: 0 .. 2^64-1
Nodes: n1 at hash(n1)=2^60, n2 at 2^62, n3 at 2^63 ...
Key 'foo': hash('foo') = 2^61.5; first clockwise node is n2; n2 owns 'foo'.
When we add n4 at 2^61, only keys in the range (2^60, 2^61] move to n4 (small fraction).Why Virtual Nodes
With one position per physical node, the ring is uneven (some nodes own much larger ranges than others) and adding a node moves all of one neighbor's data. Virtual nodes (vnodes) split each physical node into many positions on the ring (typically 128 or 256).
Effects:
- The ring becomes statistically uniform; each physical node owns roughly the same total range.
- Adding a new physical node steals small slivers from many nodes, not one big chunk from one neighbor.
- Heterogeneous capacity is easy: a beefier node gets more vnodes, owning a larger slice.
Preference List
For key K, the preference list is the next N distinct physical nodes clockwise on the ring. This is the replica set for that key.
---------- Preference list example ----------
Key 'foo' with N=3:
position on ring -> next vnode owned by n2 (primary)
-> next vnode owned by n5 (replica)
-> next vnode owned by n11 (replica)
Preference list = [n2, n5, n11]If n5 is down, we skip to n12 as the substitute (sloppy quorum); we'll hand off the data to n5 when it returns.
N, W, R Quorums
N = number of replicas in the preference list. W = number of acks required before a write is considered successful. R = number of replies required before a read returns.
If W + R > N, every read is guaranteed to see at least one replica that has the latest write (strong consistency, with caveats around vector clocks).
---------- Common configurations ----------
N=3, W=1, R=1 -> fast, eventually consistent (Dynamo-style 'always writeable')
N=3, W=2, R=2 -> quorum reads/writes, strongly consistent (W+R > N)
N=3, W=3, R=1 -> writes always see all replicas, fast reads, slow writes
N=3, W=1, R=3 -> fast writes, slow reads, strongThe default for most workloads is W=R=2 (quorum on both sides). Latency is bounded by the slower of two replicas (often well below the third's tail latency).
Vector Clocks for Conflict Detection
When two clients concurrently update the same key without coordination, both writes can succeed. Without vector clocks the store cannot tell which is newer (timestamps are unreliable across hosts). Vector clocks tag every write with a per-coordinator counter.
---------- Vector clock example ----------
Write A by client1 -> vclock = {n2: 1}
Write B by client2 (concurrent, no read of A) -> vclock = {n5: 1}
Reader sees both versions: {n2: 1} and {n5: 1}. Neither dominates the other.
The reader (client) is told 'two versions exist; pick one or merge them'.If one clock dominates (every entry >= the other's), that version is newer. If neither dominates, they are concurrent.
Conflict Resolution Policies
- Application merges: client receives all sibling versions and decides (e.g., shopping cart: union the items).
- Last-write-wins (LWW): store keeps the version with the highest timestamp; loses some writes but is simple. Cassandra defaults to this.
- CRDT: values are designed to merge automatically (counters, sets); great when applicable.
Storage Engine: LSM Tree
Each node stores its data in an LSM (log-structured merge) tree:
---------- LSM tree structure ----------
Write path: append to commit log -> insert into memtable
When memtable fills: flush to immutable SSTable on disk
Background compaction merges SSTables, drops tombstones, rewrites in sorted order
Read path: check memtable, then bloom filter per SSTable, then SSTables in age orderLSM trees give excellent write throughput (sequential writes only) at the cost of read amplification (multiple SSTables per query). Bloom filters per SSTable make negative reads cheap. Compaction strategy (size-tiered vs leveled) trades write amplification for read amplification.
Hinted Handoff
When a coordinator can't reach replica R for a write, it writes to the next clockwise node X with a 'hint': 'this is for R, deliver when R returns.' X stores the hint. When R rejoins the cluster (gossip notices), X replays the hints to R.
---------- Hinted handoff flow ----------
1. Coordinator tries to write to n5; n5 unreachable.
2. Coordinator writes the data to n12 with hint metadata {target: n5, expires: now+3h}.
3. Gossip detects n5 has rejoined.
4. n12 replays hints to n5; deletes them once acked.Hinted handoff handles brief outages (seconds to minutes). For longer outages, anti-entropy (next section) catches the residual divergence.
Anti-Entropy with Merkle Trees
Replicas can drift if hints expire or are lost. Periodic anti-entropy sessions compare two replicas' contents and reconcile differences.
Naive comparison (send every key) is O(data size). Merkle trees compare in O(log N) by hashing tree levels:
---------- Merkle tree comparison ----------
Each replica builds a Merkle tree of its key range (leaves = key hashes, nodes = hash of children).
Replicas exchange tree roots; if equal, they agree on everything.
If different, recurse into the differing subtrees until you find the divergent keys.
Only the divergent keys are exchanged.This is what Riak and Cassandra call 'anti-entropy repair'. Run as a background job (typically once per week).
Gossip for Membership
Nodes don't need a master to know cluster state. Gossip propagates membership changes peer-to-peer:
---------- Gossip ----------
Every 1 s, each node picks K=3 random peers and exchanges:
- my view of the membership table (alive / suspect / down)
- my heartbeat counter
Within O(log N) seconds, the whole cluster converges on a consistent view.Nodes detected as down (via heartbeat timeout) are removed from preference lists. When they rejoin, gossip propagates that and they are added back; vnodes redistribute slightly.
Coordinator Logic (Pseudocode)
async function put(key, value, consistency = 'QUORUM') {
const N = 3;
const W = consistency === 'ONE' ? 1 : consistency === 'ALL' ? N : 2;
const replicas = preferenceList(key, N); // ring lookup
const vclock = mergeAndIncrement(key, this.nodeId);
let acks = 0;
const errors = [];
await Promise.all(replicas.map(async (node) => {
try {
await rpcPut(node, key, value, vclock);
acks++;
} catch (e) {
errors.push({ node, e });
const sub = nextHealthyAfter(node);
await rpcPutWithHint(sub, key, value, vclock, { hintFor: node });
}
}));
if (acks >= W) return { ok: true, vclock };
throw new Error(`Insufficient acks: ${acks}/${W}`);
}Data Model
Per-Node On-Disk Layout
---------- Per node ----------
/data/
commitlog/ (sequential writes, fsync per batch)
memtable (in RAM, ~256 MB before flush)
sstables/
L0-001.sst (most recent flush)
L1-001.sst
...
bloom/ (one bloom filter per SSTable)
hints/ (pending hinted handoffs to other nodes)
merkle/ (cached Merkle tree segments)In-Memory State
- Membership table (gossiped, O(N) entries).
- Vnode -> physical node map (computed from membership).
- Recent vector clocks per active key (small cache).
Per-Key State on Disk
For a key 'foo':
---------- Stored value ----------
key: 'foo'
vclock: {n2: 4, n5: 3, n11: 2}
value: <bytes>
tombstone: false
last_write_ts: 1714142400123Tombstones (delete markers) are kept until anti-entropy + compaction eventually drops them; that's why 'delete' is heavier than in a SQL store.
Scaling and Bottlenecks
Adding Nodes
When a node joins, gossip propagates its identity; the ring is recomputed; small slivers of vnodes from existing nodes migrate to the new node. Migration is rate-limited (the cluster keeps serving traffic during it). Time to add a node: minutes for membership convergence; hours for data migration depending on cluster size.
Hot Keys
A single hyper-popular key (Justin Bieber's user record on Twitter) lands on one set of N nodes. Mitigations:
- Increase N for that key (e.g., N=5 for read-hot keys). More replicas means R=1 reads can fan out wider.
- Client-side caching with short TTL.
- For truly extreme cases, replicate the value to all nodes (but lose write performance).
Multi-Region Replication
Replicate data across DCs by including replicas from each DC in the preference list. Local QUORUM (LOCAL_QUORUM) lets writes succeed locally even if the remote DC is unreachable; full-cluster QUORUM (EACH_QUORUM) requires acks from each DC.
The trade-off: LOCAL_QUORUM is fast but two DCs can diverge during a partition; EACH_QUORUM costs cross-DC RTT on every write.
Compaction Pressure
LSM compaction can monopolize disk IOPS during heavy write periods, hurting read latency. Throttle compaction to a fixed IO budget and prefer leveled compaction for read-heavy workloads.
Sloppy Quorum During Partitions
If a partition isolates a replica, hinted handoff lets writes continue (sloppy quorum). When the partition heals, hints replay and anti-entropy repairs anything missed. The trade-off: clients see eventually consistent reads during the partition (some replicas don't have the latest data).
Read Repair
On a read at QUORUM, the coordinator may receive different versions from different replicas. It returns the newest (or all siblings to the client). In the background, it pushes the newest to the lagging replicas: 'read repair'. This converges replicas without waiting for anti-entropy.
Trade-offs and Alternatives
AP vs CP
Dynamo is AP-leaning: it sacrifices strong consistency in favor of always being writeable. Spanner-style CP designs (Spanner, CockroachDB) sacrifice availability under partition for serializable transactions across keys. Pick AP for high-availability serving (sessions, configs, carts) and CP for accounting / inventory where overselling is unforgivable.
Vector Clocks vs Last-Write-Wins
Vector clocks preserve all conflicting writes for the application to merge; LWW silently drops one. Vector clocks are the right answer when conflicting writes might both contain real value (cart items merged, social connections added). LWW is fine for monotonic state (latest-known-position).
Quorum Reads vs Eventually Consistent Reads
Quorum reads (R+W > N) cost more latency but guarantee freshness; eventually consistent reads (R=1) are fast but may return stale data. Most applications can tolerate stale reads; for the few that can't (reading own writes), use R=N or read from the same coordinator that did the write.
Consistent Hashing vs Range Sharding
Consistent hashing distributes keys uniformly but range queries become expensive (must scan multiple nodes). Range sharding (by key prefix) supports range queries cheaply but suffers hot-spot issues for sequential keys (timestamps). Pick consistent hashing for pure KV; range sharding for time-series or column families.
LSM vs B-Tree Storage
LSM is write-optimized (append only, sequential writes) at the cost of read amplification. B-tree is read-optimized (single seek per read) but updates are random IO. KV stores nearly always pick LSM because the workload is high write, scan-heavy reads served from cache.
Centralized Coordinator vs Symmetric Nodes
Dynamo nodes are symmetric (any node can coordinate any request). A centralized coordinator is simpler to reason about but introduces a SPOF and a hotspot. Symmetric nodes are harder to implement (every node needs full ring state) but scale linearly and have no SPOF.
Why Not Just Use Postgres with Replication?
Postgres with streaming replication is great for SQL workloads up to ~50K writes/sec on a single primary. Beyond that, you must shard manually, lose cross-shard transactions, and inherit operational complexity that the Dynamo model handles natively. For pure KV at high scale, Dynamo-style is meaningfully simpler than DIY Postgres sharding.
Tunable Consistency vs One-Size-Fits-All
Dynamo's per-request consistency knob is the right design: different operations have different needs. A static 'always strong' system makes most operations unnecessarily slow; a static 'always eventual' system frustrates the few operations that need freshness. Per-request choice is more API surface but more flexible.
Real-World Examples
How real systems implement this in production
DynamoDB evolved from the original Dynamo paper into a managed service with single-digit-millisecond latency at any scale. It uses partition keys (hash) and optional sort keys for limited range queries within a partition; replication is across 3 AZs; strong consistency is opt-in at read time. Billing is per-request or provisioned capacity.
Trade-off: DynamoDB hides nearly all the operational complexity of the original Dynamo (no Merkle repairs to schedule, no gossip to debug) but also limits the consistency knobs (you choose 'eventually' or 'strongly' consistent reads, no fine-grained quorum control). The lesson: managed services trade flexibility for operational simplicity, which is the right call for most teams.
Cassandra is an open-source Dynamo descendant with column families on top of the KV core. It exposes the full N/W/R model (LOCAL_QUORUM, EACH_QUORUM, ALL, ONE, TWO, THREE). Heavy use at Apple, Netflix, Discord (1B+ messages stored). Reads are LWW by default unless you enable lightweight transactions.
Trade-off: Cassandra exposes more consistency knobs than DynamoDB but requires significant operational expertise (compaction tuning, repair scheduling, JVM management). The lesson: open-source Dynamo gives more control but demands more competence; teams should be honest about whether they have it.
Riak was a faithful Dynamo implementation with first-class CRDT support (counters, sets, maps) and explicit sibling exposure to applications. Heavy use at Bet365, Comcast. Less common today after Basho's bankruptcy but the design is canonical.
Trade-off: Riak's first-class CRDTs make multi-actor merging trivial (counters increment correctly across concurrent writes) but require modeling your data as CRDTs from day one. The lesson: building merge semantics into the data type is cleaner than handling siblings in application code, when your data fits a CRDT shape.
ScyllaDB is a C++ rewrite of Cassandra targeting much higher per-node throughput (often 10x). It uses a shard-per-core thread-per-core architecture and exposes the same Cassandra wire protocol. Used heavily at Discord (after their Cassandra migration) and ShareChat.
Trade-off: Scylla's per-core sharding gives massive throughput per node but makes operations (capacity planning, rebalancing) more sensitive to per-CPU contention. The lesson: when your bottleneck is per-node CPU efficiency, a ground-up rewrite in a systems language can beat a JVM-based implementation by an order of magnitude on the same hardware.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client sends PUT to any node; that node becomes the coordinator. Coordinator hashes 'foo' and computes the preference list of N=3 nodes (next 3 distinct physical nodes clockwise on the consistent-hash ring): say [n2, n5, n11]. Coordinator merges any provided vector clock with its own counter, increments its component, attaches the new vclock. Coordinator dispatches Put RPC to all 3 replicas in parallel. Each replica appends the write to its commit log (fsync), inserts into the memtable, returns ack with the vclock. Coordinator waits for W=2 acks. As soon as 2 arrive, returns success to client; the 3rd write proceeds in the background. If the 3rd replica is unreachable, coordinator writes to a substitute node with a hinted handoff record; the substitute replays to the original when gossip notes the node is back. Total p99 latency: ~10-30 ms depending on tail of the second-fastest replica.
Each client's coordinator constructs a vclock based on what it knew. Without coordination, neither vclock dominates the other (each has an entry the other lacks). Both writes succeed and persist. On a subsequent GET, the coordinator sees both versions and returns them as siblings to the client (with their respective vclocks). The client's application logic resolves: shopping cart unions item lists; preferences pick by recency. Once resolved, the client PUTs the merged value with a vclock that includes both predecessors; the next read sees a single descendant version. If the application chose LWW instead, the store would auto-pick the version with the highest timestamp at read time and silently drop the loser.
Within seconds, gossip detects missed heartbeats and marks the node 'suspect', then 'down' after a longer timeout (say 30 s). Coordinators stop sending writes to that node; instead writes intended for it go to the next clockwise healthy node with a hinted handoff metadata tag. Reads at QUORUM still succeed because R=2 only needs 2 of 3 replicas, both of which are alive. When the node returns, gossip propagates the alive event; the substitute nodes replay queued hints back to it. For drift that hints can't catch (e.g., hints expired before the node returned), a periodic Merkle-tree-based anti-entropy session compares the replicas and ships only the differing keys. End to end, an hour of downtime is invisible to clients at QUORUM consistency; reads at ONE during the window may see stale data.
Add nodes one at a time; gossip propagates membership in O(log N) seconds. Each new node gets ~256 vnodes assigned random positions on the ring; ownership of those positions moves from existing neighbors to the new node. Data migration is rate-limited (e.g., 100 MB/s per source node) so cluster throughput stays high during the join. Per-node throughput stays roughly constant because each node still owns ~the same total range; total throughput scales linearly. The coordinator's load on any single key stays the same (still N=3 replicas per key). Anti-entropy schedules adjust automatically. The biggest operational concern is gossip table size (O(N) per node), which becomes large at thousands of nodes; some implementations switch to hierarchical gossip or rack-aware gossip for very large clusters.
Three scenarios. (1) Cross-key transactions: KV stores offer single-key atomicity at best. If you need 'debit account A AND credit account B atomically', use a CP system (Spanner, CockroachDB) or design an outer saga. (2) Range queries: Dynamo's hash distribution makes 'all users created in March' an O(N nodes) scan. Use a range-sharded store or layer a search index. (3) Strong consistency that cannot tolerate quorum latency: e.g., a financial ledger where every read must see every write immediately and partition tolerance is less important than consistency; pick a CP single-leader system or one with consensus per partition (Raft groups). Generally Dynamo wins for high-availability KV serving (sessions, configs, carts, presence) and loses for transactional / analytical workloads.
Interview Tips
How to discuss this topic effectively
Lead with the trio: consistent hashing, quorums, vector clocks. Saying 'every Dynamo-style design hangs on these three primitives' frames the entire conversation.
Always mention virtual nodes when you say consistent hashing. Without vnodes, the ring is uneven and adding a node moves a huge chunk of data; vnodes are the production fix.
Use the W + R > N rule precisely. Strong consistency requires every read to overlap with the write set; explaining the math wins the consistency portion of the interview.
Distinguish hinted handoff from anti-entropy. Hinted handoff handles seconds-to-minutes outages; Merkle-tree repair handles longer drift. Both are needed.
Articulate AP vs CP and tie it to the workload. 'Dynamo is AP-leaning; Spanner is CP. Pick AP for sessions and configs; CP for inventory and ledgers' shows you know when to use which class of system.
Common Mistakes
Pitfalls to avoid in interviews
Skipping virtual nodes when explaining consistent hashing
One position per physical node creates uneven distribution and forces a single neighbor to absorb a new node's incoming traffic. Use 128-256 vnodes per physical node so the ring is statistically uniform and additions take small slivers from many nodes.
Confusing strong consistency with strong durability
W=N (write to all replicas) gives durability but not necessarily strong consistency for reads; you also need R + W > N. A read at R=1 against an N=3 cluster can still see stale data, even if writes hit all 3.
Using last-write-wins by default for application data
LWW silently drops one of two concurrent writes. For data where both writes might carry real intent (a cart's added items, social follows), use vector clocks and let the client merge siblings; LWW is acceptable only for monotonic state.
Treating gossip as instant
Gossip converges in O(log N) rounds; for a 1000-node cluster that's ~10 seconds before everyone agrees on a membership change. Reads against the freshly-removed node may briefly fail; clients should retry.
Forgetting tombstones during deletes
A delete is not free; the store keeps a tombstone (delete marker) until compaction safely drops it. Without tombstones, a deleted key could resurrect via anti-entropy from a stale replica. Plan for tombstone lifecycle (gc_grace_seconds in Cassandra).
