System Design Article
Consistent Hashing & Data Distribution
Difficulty: Medium
Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.
Consistent Hashing & Data Distribution
Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.
696 views
17
The Problem: Why Modulo Hashing Breaks
Imagine a cache cluster with 4 nodes. The naive way to assign a key to a node:
---------- Naive modulo hashing ----------
node_index = hash(key) % 4
-> key 'user:42' -> hash 17 -> 17 % 4 = 1 -> node[1]
-> key 'photo:9' -> hash 23 -> 23 % 4 = 3 -> node[3]Works fine until you add a fifth node. Now hash(key) % 5 assigns almost every key to a different node. With 5 nodes, only ~20% of keys keep their old assignment; 80% remap.
Consequences:
- The cache hit rate drops to near zero immediately after the resize.
- Every new request misses the cache, falls through to the database, and the database melts.
- For a partitioned database, you have to physically move 80% of the data to its new home. Hours of downtime, weeks of operational pain.
This is the core reason consistent hashing exists. The goal: only ~K/N keys move when a node is added or removed, where K is total keys and N is cluster size.
The Hash Ring
Consistent hashing arranges all hash values on a circular space (typically 0 to 2^32 - 1 or 2^160 - 1 depending on the hash function).
---------- The hash ring (4 nodes) ----------
0 / 2^32
|
N1 ---+--+--+--- N2
(hash=1.07B) (hash=2.14B)
/ \
keys keys
\ /
N4 ---+--+--+--- N3
(hash=4.29B) (hash=3.22B)Placement:
- Hash each node's identifier (e.g.,
hash('cache-1.example.com')) onto the ring. Nodes are now points on the ring. - To assign a key, hash it onto the same ring. The key belongs to the first node found by walking clockwise from the key's position.
Adding a node: place the new node on the ring. Only the keys between the new node and its predecessor migrate to the new node. Roughly K/N keys move.
Removing a node: its keys go to the next clockwise node. Same K/N migration.
---------- Adding a node (N5 between N1 and N2) ----------
before: ... N1 [keys 1.07B - 2.14B all on N2] N2 ...
after: ... N1 [keys 1.07B - 1.6B on N5] N5 [keys 1.6B - 2.14B on N2] N2 ...
movement: only the keys between 1.07B and 1.6B move from N2 to N5.Lookup Algorithm
class ConsistentHashRing {
constructor(nodes = []) {
this.ring = []; // sorted array of {hash, node}
for (const n of nodes) this.addNode(n);
}
addNode(node) {
const h = hash(node);
this.ring.push({ hash: h, node });
this.ring.sort((a, b) => a.hash - b.hash);
}
removeNode(node) {
const h = hash(node);
this.ring = this.ring.filter((e) => e.hash !== h);
}
getNode(key) {
if (this.ring.length === 0) return null;
const h = hash(key);
// first ring entry with hash >= h (binary search in production)
for (const entry of this.ring) {
if (entry.hash >= h) return entry.node;
}
return this.ring[0].node; // wrap around
}
}Lookup is O(log N) with binary search. Add/remove is O(log N).
The Load-Balance Problem (and Virtual Nodes)
Vanilla consistent hashing has a serious flaw: with random node placement, some nodes get a much larger arc of the ring than others. With 4 nodes, the largest could easily own 40% of the keys instead of the ideal 25%. With 100 nodes, the imbalance is even more pronounced.
Solution: virtual nodes (vnodes). Each physical node is placed on the ring at many positions (typically 100 to 200). Now the ring has 100 * N points, the average arc per vnode is much smaller, and the law of large numbers smooths out the imbalance.
---------- With virtual nodes ----------
4 physical nodes x 100 vnodes = 400 ring points
Each physical node owns ~25% of total ring (close to ideal)
Variance: ~ +/- 2% in productionA second benefit: when a physical node fails, its 100 vnode arcs go to 100 different successors. Load is spread across the surviving cluster instead of dumping onto one neighbor.
Cassandra default: 256 vnodes per physical node (num_tokens: 256). DynamoDB and Riak use similar numbers under the hood.
Replication on the Ring
For durability, each key needs to live on N replicas (typical N = 3). The standard pattern: walk clockwise from the key's hash, assign to the first node, then keep walking and assign to the next N-1 distinct physical nodes.
---------- Replication factor 3 on the ring ----------
key 'photo:42' hashes to position p
-> primary: next vnode (physical node A)
-> replica 1: next vnode owned by a different physical node (B)
-> replica 2: next vnode owned by a different physical node (C)Dynamo and Cassandra do exactly this. Reads and writes can use a quorum (R + W > N) to give linearizability per key.
When the Hash Function Matters
The hash function choice has real impact:
- MD5 / SHA-1: cryptographically uniform, slow (~100 ns per call). Good for small key counts where collisions matter; total overkill for caches.
- MurmurHash3 (Cassandra default): non-cryptographic, very fast (~10 ns), excellent distribution.
- xxHash: even faster (~5 ns), used by Memcached clients and some CDNs.
- CRC32: 32-bit, weak distribution above ~10K keys, only good for tiny clusters.
For a high-QPS cache, the hash function is on the critical path. MurmurHash3 or xxHash is the production answer.
Where Consistent Hashing Shows Up
| System | Use of consistent hashing |
|---|---|
| Memcached client (libmemcached, mcrouter) | Routes keys to one of N memcached servers; survives server churn without remapping. |
| DynamoDB | Internal partitioning of keys across thousands of storage nodes. |
| Cassandra | Token ring with vnodes; each replica owns a set of token ranges. |
| Riak | Dynamo-style ring; configurable vnodes per physical node. |
| Redis Cluster | 16384 hash slots assigned to nodes; not strictly consistent hashing but the same idea. |
| Discord | Used consistent hashing to route guild traffic to gateway shards across thousands of servers. |
| CDN edge selection | Akamai and Cloudflare route the same URL consistently to the same edge node so caches stay warm. |
| Load balancers (Nginx upstream hash, HAProxy) | Sticky session routing - same client IP always lands on the same backend. |
Alternatives and Their Trade-offs
Rendezvous Hashing (Highest Random Weight, HRW)
For each key, compute hash(key, node_id) for every node and pick the node with the highest hash. Same minimal-disruption property as consistent hashing, no ring data structure, no virtual nodes needed.
Pros: simpler code, naturally balanced (no vnodes required), good for small clusters. Cons: O(N) per lookup vs O(log N) for ring; hurts at very large cluster sizes (1000+ nodes).
Used in: Akamai's load balancers, some content-delivery scenarios.
Jump Consistent Hash (Google, 2014)
A single fast function: given a key and bucket count N, returns a bucket in O(log N) without any ring or precomputed state. Designed for systems where N changes rarely.
Pros: extremely fast, no memory state, perfect uniformity. Cons: only works for adding buckets at the end; cannot remove arbitrary buckets and stay balanced.
Used in: Google's internal systems, F1, Spanner partitioning.
Maglev Hashing (Google's load balancer, 2016)
Builds a fixed-size lookup table (~65K entries) using a permutation algorithm. Lookup is O(1) array access. When backends change, ~1% of entries reshuffle.
Pros: O(1) lookup, very even distribution, low disruption. Cons: requires precomputed table; rebuild is O(M log N) where M is table size.
Used in: Google's Maglev L4 load balancer, Cilium's eBPF-based load balancer.
Choosing the Right Scheme
| Scheme | Best for | Why |
|---|---|---|
| Consistent hashing with vnodes | Most distributed databases and caches | Standard answer; well-understood; mature implementations. |
| Rendezvous hashing | Small clusters, simple implementations | Less code; naturally balanced without vnodes. |
| Jump consistent hash | Workloads where you only grow the cluster | Lowest overhead, perfect uniformity. |
| Maglev | High-QPS L4 load balancing | O(1) lookup matters when you do millions per second. |
| Range partitioning | Range queries (timestamp scans, alphabetical lookups) | Consistent hashing destroys range locality. Use range partitioning when scans matter. |
How to Talk About This in an Interview
- Lead with the problem. 'If we use
hash(key) % N, adding a node remaps almost all keys and crashes the cache. We use consistent hashing to bound key movement to ~K/N.' - Mention virtual nodes immediately. 'Each physical node has ~150 virtual nodes on the ring to balance load and spread failure cost.'
- Quote the migration property. 'Only the keys between the new node and its predecessor move; roughly K/N total.'
- Name a real system. 'DynamoDB, Cassandra, and Memcached all use a variation of this.'
- Discuss replication. 'For replication factor 3, we walk clockwise and pick the next 3 distinct physical nodes.'
- Mention alternatives if asked. 'Rendezvous hashing is simpler; jump consistent hash is faster but only allows growing; Maglev is the L4 load-balancer choice.'
Quick Review
- Naive
hash % Nremaps almost all keys when N changes - unacceptable. - Consistent hashing places nodes and keys on a ring; each key goes to the first clockwise node.
- Adding/removing a node moves only ~K/N keys.
- Virtual nodes (100-200 per physical node) are required for balanced load.
- Replication: walk clockwise, pick the next N-1 distinct physical nodes.
- Standard in DynamoDB, Cassandra, Memcached, Riak, Discord, CDNs.
- Alternatives: rendezvous (simpler), jump (faster, grow-only), Maglev (L4-LB).
Real-World Examples
How real systems implement this in production
DynamoDB partitions keys across thousands of storage nodes using a variant of consistent hashing. Each table is divided into partitions based on the hash of the partition key; partitions are placed on the ring with virtual nodes for balance. When you add capacity, DynamoDB reshards partitions in the background; clients are unaware of the move because partitions are addressed by the hash range, not the physical node.
Trade-off: The partitioning gives you predictable single-digit-millisecond access at any scale, but it punishes hot keys hard. A single celebrity user can saturate one partition while the rest of the table sits idle. Mitigation is write-sharding (suffix the key with a random shard number) or DAX caching.
Cassandra arranges nodes on a 64-bit token ring (using Murmur3 by default). Each physical node is assigned 256 vnodes by default. Replica placement walks clockwise, picking the next N-1 distinct physical nodes (with rack-awareness). When a node joins, only ~1/N of the data streams from existing nodes to it.
Trade-off: 256 vnodes give excellent load balance and fast bootstrap (parallel streaming from 256 sources). The cost: more metadata to track and slower repair operations. Tuning `num_tokens` is one of the standard Cassandra knobs.
Memcached itself is a simple in-memory KV store with no clustering. Cluster behavior is implemented entirely in the client: each client maintains a consistent hash ring of memcached servers and routes each key to the right one. When you add or remove a memcached node, all clients update their ring and only ~K/N keys remap.
Trade-off: Client-side routing keeps the server simple and stateless but requires every client to know the cluster topology. Tools like mcrouter (Facebook) centralize the routing into a proxy layer so applications do not have to embed it.
Discord routes WebSocket connections from millions of users to gateway servers using consistent hashing on the guild (server) ID. All connections for a given guild land on the same gateway shard, allowing in-memory state (presence, channel members) to live on one machine without cross-server coordination. They scaled this from hundreds to thousands of shards using consistent hashing's incremental rebalance property.
Trade-off: Sharding by guild keeps related state co-located, simplifying the application logic. The cost: a single mega-guild (Fortnite, Minecraft) can dominate one shard's resources. Discord has documented their work to detect and split such hot guilds across multiple shards.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Start with the problem: naive `hash % N` remaps ~all keys when N changes, crashing the cache and the database. Consistent hashing places nodes and keys on a circular hash space. A key belongs to the first clockwise node from its hash position. Adding a node moves only the keys between the new node and its predecessor (~K/N keys). Mention virtual nodes (100-200 per physical) for balanced load. Name a real system (DynamoDB, Cassandra, Memcached) that uses it.
Walk clockwise from the key's hash position. The first vnode is the primary; continue walking and assign each subsequent vnode to a replica until you have N distinct physical nodes. Add a placement constraint: no two replicas in the same rack or AZ to survive failure-domain outages. For consistency, use quorum reads/writes (R + W > N). This is exactly how Dynamo and Cassandra implement it.
With consistent hashing and 200 vnodes per physical node, only the keys whose hash falls between the new node's vnodes and their predecessors move. With 4 existing nodes and 1 new, that is roughly 1/5 of total keys. Caches cool briefly for those keys; the other 4/5 stay warm and continue to serve traffic. Without consistent hashing, almost all keys would remap and the cache hit rate would drop to near zero.
Both give minimal-disruption mapping (only ~K/N keys move on cluster change). Differences: consistent hashing uses a ring data structure with virtual nodes for balance; lookup is O(log N). Rendezvous hashing computes hash(key, node) for each node and picks the highest; lookup is O(N) but no vnodes needed and naturally balanced. For small clusters (under 100 nodes), rendezvous is simpler. For large clusters, consistent hashing's O(log N) lookup wins. Real systems usually use consistent hashing because it is the more familiar pattern.
Two main options. (1) Consistent hash on user_id - every user's timeline lives on a deterministic node. Adding nodes moves only ~K/N timelines. Replicate each timeline to 3 vnodes for durability. (2) For very heavy users (celebrities) you may shard their timeline across multiple nodes, since one celebrity's fan-out can saturate a single node (the celebrity problem). Discuss how you would detect heavy keys (sample-based hot-key detection) and rebalance. Mention rack-awareness for replica placement.
Interview Tips
How to discuss this topic effectively
Lead by stating why naive `hash(key) % N` fails. The interviewer wants to hear that you understand the problem before reaching for the solution.
Always mention virtual nodes. A consistent hashing answer without vnodes is incomplete and signals you have only read the headline of the technique.
Quote the K/N migration figure. It is the single most important property and the reason the technique exists.
Name a production system: DynamoDB, Cassandra, Memcached. Bonus points for mentioning Discord's gateway-sharding use case.
If asked about scale, mention Maglev or jump consistent hash. They are the modern follow-ups for high-QPS load balancing and grow-only workloads.
Common Mistakes
Pitfalls to avoid in interviews
Forgetting virtual nodes
Vanilla consistent hashing without vnodes has wildly imbalanced load, especially at small N. Production deployments use 100-256 vnodes per physical node to bring the variance down to a few percent.
Confusing consistent hashing with sticky sessions
Consistent hashing is a way to map keys to nodes deterministically. Sticky sessions is a load-balancer feature that routes a client to the same backend. They are related (sticky sessions often use consistent hashing under the hood) but not the same concept.
Using it for range queries
Consistent hashing distributes keys uniformly across nodes by hash, which destroys range locality. If your workload includes range scans (timestamp, alphabetical), use range partitioning instead. Cassandra offers both: hash-partitioned by default, with optional ordered partitioner for range queries.
Picking the wrong hash function
CRC32 has poor distribution above 10K keys; cryptographic hashes are slow. The production sweet spot is MurmurHash3 or xxHash: fast (~10 ns) and well-distributed. Cassandra defaults to Murmur3.
Ignoring replication topology
Naively walking clockwise N times can place all replicas on adjacent vnodes - which might all be in the same rack or AZ. Production systems add a constraint that no two replicas of the same key live in the same failure domain (rack-aware, AZ-aware placement).
