System Design Article

Consistent Hashing & Data Distribution

Difficulty: Medium

Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.

System Design
/

Consistent Hashing & Data Distribution

Consistent Hashing & Data Distribution

Consistent hashing is the trick that lets distributed caches and databases add or remove nodes without remapping every key in the cluster. This lesson explains why naive `hash(key) % N` is broken, how the hash ring works, why you need virtual nodes to keep load balanced, and how real systems (DynamoDB, Cassandra, Memcached, Discord) implement it. We finish with the modern alternatives (rendezvous hashing, jump consistent hash, Maglev) and the trade-offs that make consistent hashing the answer in interviews 90% of the time.

System Design
Medium
consistent-hashing
data-partitioning
distributed-systems
distributed-cache
database-sharding
system-design
intermediate
free

696 views

17

The Problem: Why Modulo Hashing Breaks

Imagine a cache cluster with 4 nodes. The naive way to assign a key to a node:

Text
---------- Naive modulo hashing ----------
  node_index = hash(key) % 4
  -> key 'user:42'  -> hash 17 -> 17 % 4 = 1 -> node[1]
  -> key 'photo:9'  -> hash 23 -> 23 % 4 = 3 -> node[3]

Works fine until you add a fifth node. Now hash(key) % 5 assigns almost every key to a different node. With 5 nodes, only ~20% of keys keep their old assignment; 80% remap.

Consequences:

  • The cache hit rate drops to near zero immediately after the resize.
  • Every new request misses the cache, falls through to the database, and the database melts.
  • For a partitioned database, you have to physically move 80% of the data to its new home. Hours of downtime, weeks of operational pain.

This is the core reason consistent hashing exists. The goal: only ~K/N keys move when a node is added or removed, where K is total keys and N is cluster size.

The Hash Ring

Consistent hashing arranges all hash values on a circular space (typically 0 to 2^32 - 1 or 2^160 - 1 depending on the hash function).

Text
---------- The hash ring (4 nodes) ----------
                    0 / 2^32
                       |
              N1 ---+--+--+--- N2
               (hash=1.07B)  (hash=2.14B)
                /              \
           keys                 keys
                \              /
              N4 ---+--+--+--- N3
               (hash=4.29B)  (hash=3.22B)

Placement:

  1. Hash each node's identifier (e.g., hash('cache-1.example.com')) onto the ring. Nodes are now points on the ring.
  2. To assign a key, hash it onto the same ring. The key belongs to the first node found by walking clockwise from the key's position.

Adding a node: place the new node on the ring. Only the keys between the new node and its predecessor migrate to the new node. Roughly K/N keys move.

Removing a node: its keys go to the next clockwise node. Same K/N migration.

Text
---------- Adding a node (N5 between N1 and N2) ----------
  before:  ... N1 [keys 1.07B - 2.14B all on N2] N2 ...
  after:   ... N1 [keys 1.07B - 1.6B on N5] N5 [keys 1.6B - 2.14B on N2] N2 ...
  movement: only the keys between 1.07B and 1.6B move from N2 to N5.

Lookup Algorithm

class ConsistentHashRing {
    constructor(nodes = []) {
        this.ring = []; // sorted array of {hash, node}
        for (const n of nodes) this.addNode(n);
    }

    addNode(node) {
        const h = hash(node);
        this.ring.push({ hash: h, node });
        this.ring.sort((a, b) => a.hash - b.hash);
    }

    removeNode(node) {
        const h = hash(node);
        this.ring = this.ring.filter((e) => e.hash !== h);
    }

    getNode(key) {
        if (this.ring.length === 0) return null;
        const h = hash(key);
        // first ring entry with hash >= h (binary search in production)
        for (const entry of this.ring) {
            if (entry.hash >= h) return entry.node;
        }
        return this.ring[0].node; // wrap around
    }
}

Lookup is O(log N) with binary search. Add/remove is O(log N).

The Load-Balance Problem (and Virtual Nodes)

Vanilla consistent hashing has a serious flaw: with random node placement, some nodes get a much larger arc of the ring than others. With 4 nodes, the largest could easily own 40% of the keys instead of the ideal 25%. With 100 nodes, the imbalance is even more pronounced.

Solution: virtual nodes (vnodes). Each physical node is placed on the ring at many positions (typically 100 to 200). Now the ring has 100 * N points, the average arc per vnode is much smaller, and the law of large numbers smooths out the imbalance.

Text
---------- With virtual nodes ----------
  4 physical nodes x 100 vnodes = 400 ring points
  Each physical node owns ~25% of total ring (close to ideal)
  Variance: ~ +/- 2% in production

A second benefit: when a physical node fails, its 100 vnode arcs go to 100 different successors. Load is spread across the surviving cluster instead of dumping onto one neighbor.

Cassandra default: 256 vnodes per physical node (num_tokens: 256). DynamoDB and Riak use similar numbers under the hood.

Replication on the Ring

For durability, each key needs to live on N replicas (typical N = 3). The standard pattern: walk clockwise from the key's hash, assign to the first node, then keep walking and assign to the next N-1 distinct physical nodes.

Text
---------- Replication factor 3 on the ring ----------
  key 'photo:42' hashes to position p
  -> primary:    next vnode (physical node A)
  -> replica 1:  next vnode owned by a different physical node (B)
  -> replica 2:  next vnode owned by a different physical node (C)

Dynamo and Cassandra do exactly this. Reads and writes can use a quorum (R + W > N) to give linearizability per key.

When the Hash Function Matters

The hash function choice has real impact:

  • MD5 / SHA-1: cryptographically uniform, slow (~100 ns per call). Good for small key counts where collisions matter; total overkill for caches.
  • MurmurHash3 (Cassandra default): non-cryptographic, very fast (~10 ns), excellent distribution.
  • xxHash: even faster (~5 ns), used by Memcached clients and some CDNs.
  • CRC32: 32-bit, weak distribution above ~10K keys, only good for tiny clusters.

For a high-QPS cache, the hash function is on the critical path. MurmurHash3 or xxHash is the production answer.

Where Consistent Hashing Shows Up

SystemUse of consistent hashing
Memcached client (libmemcached, mcrouter)Routes keys to one of N memcached servers; survives server churn without remapping.
DynamoDBInternal partitioning of keys across thousands of storage nodes.
CassandraToken ring with vnodes; each replica owns a set of token ranges.
RiakDynamo-style ring; configurable vnodes per physical node.
Redis Cluster16384 hash slots assigned to nodes; not strictly consistent hashing but the same idea.
DiscordUsed consistent hashing to route guild traffic to gateway shards across thousands of servers.
CDN edge selectionAkamai and Cloudflare route the same URL consistently to the same edge node so caches stay warm.
Load balancers (Nginx upstream hash, HAProxy)Sticky session routing - same client IP always lands on the same backend.

Alternatives and Their Trade-offs

Rendezvous Hashing (Highest Random Weight, HRW)

For each key, compute hash(key, node_id) for every node and pick the node with the highest hash. Same minimal-disruption property as consistent hashing, no ring data structure, no virtual nodes needed.

Pros: simpler code, naturally balanced (no vnodes required), good for small clusters. Cons: O(N) per lookup vs O(log N) for ring; hurts at very large cluster sizes (1000+ nodes).

Used in: Akamai's load balancers, some content-delivery scenarios.

Jump Consistent Hash (Google, 2014)

A single fast function: given a key and bucket count N, returns a bucket in O(log N) without any ring or precomputed state. Designed for systems where N changes rarely.

Pros: extremely fast, no memory state, perfect uniformity. Cons: only works for adding buckets at the end; cannot remove arbitrary buckets and stay balanced.

Used in: Google's internal systems, F1, Spanner partitioning.

Maglev Hashing (Google's load balancer, 2016)

Builds a fixed-size lookup table (~65K entries) using a permutation algorithm. Lookup is O(1) array access. When backends change, ~1% of entries reshuffle.

Pros: O(1) lookup, very even distribution, low disruption. Cons: requires precomputed table; rebuild is O(M log N) where M is table size.

Used in: Google's Maglev L4 load balancer, Cilium's eBPF-based load balancer.

Choosing the Right Scheme

SchemeBest forWhy
Consistent hashing with vnodesMost distributed databases and cachesStandard answer; well-understood; mature implementations.
Rendezvous hashingSmall clusters, simple implementationsLess code; naturally balanced without vnodes.
Jump consistent hashWorkloads where you only grow the clusterLowest overhead, perfect uniformity.
MaglevHigh-QPS L4 load balancingO(1) lookup matters when you do millions per second.
Range partitioningRange queries (timestamp scans, alphabetical lookups)Consistent hashing destroys range locality. Use range partitioning when scans matter.

How to Talk About This in an Interview

  1. Lead with the problem. 'If we use hash(key) % N, adding a node remaps almost all keys and crashes the cache. We use consistent hashing to bound key movement to ~K/N.'
  2. Mention virtual nodes immediately. 'Each physical node has ~150 virtual nodes on the ring to balance load and spread failure cost.'
  3. Quote the migration property. 'Only the keys between the new node and its predecessor move; roughly K/N total.'
  4. Name a real system. 'DynamoDB, Cassandra, and Memcached all use a variation of this.'
  5. Discuss replication. 'For replication factor 3, we walk clockwise and pick the next 3 distinct physical nodes.'
  6. Mention alternatives if asked. 'Rendezvous hashing is simpler; jump consistent hash is faster but only allows growing; Maglev is the L4 load-balancer choice.'

Quick Review

  • Naive hash % N remaps almost all keys when N changes - unacceptable.
  • Consistent hashing places nodes and keys on a ring; each key goes to the first clockwise node.
  • Adding/removing a node moves only ~K/N keys.
  • Virtual nodes (100-200 per physical node) are required for balanced load.
  • Replication: walk clockwise, pick the next N-1 distinct physical nodes.
  • Standard in DynamoDB, Cassandra, Memcached, Riak, Discord, CDNs.
  • Alternatives: rendezvous (simpler), jump (faster, grow-only), Maglev (L4-LB).

Real-World Examples

How real systems implement this in production

Amazon DynamoDB partitioning

DynamoDB partitions keys across thousands of storage nodes using a variant of consistent hashing. Each table is divided into partitions based on the hash of the partition key; partitions are placed on the ring with virtual nodes for balance. When you add capacity, DynamoDB reshards partitions in the background; clients are unaware of the move because partitions are addressed by the hash range, not the physical node.

Trade-off: The partitioning gives you predictable single-digit-millisecond access at any scale, but it punishes hot keys hard. A single celebrity user can saturate one partition while the rest of the table sits idle. Mitigation is write-sharding (suffix the key with a random shard number) or DAX caching.

Cassandra token ring

Cassandra arranges nodes on a 64-bit token ring (using Murmur3 by default). Each physical node is assigned 256 vnodes by default. Replica placement walks clockwise, picking the next N-1 distinct physical nodes (with rack-awareness). When a node joins, only ~1/N of the data streams from existing nodes to it.

Trade-off: 256 vnodes give excellent load balance and fast bootstrap (parallel streaming from 256 sources). The cost: more metadata to track and slower repair operations. Tuning `num_tokens` is one of the standard Cassandra knobs.

Memcached client routing (libmemcached)

Memcached itself is a simple in-memory KV store with no clustering. Cluster behavior is implemented entirely in the client: each client maintains a consistent hash ring of memcached servers and routes each key to the right one. When you add or remove a memcached node, all clients update their ring and only ~K/N keys remap.

Trade-off: Client-side routing keeps the server simple and stateless but requires every client to know the cluster topology. Tools like mcrouter (Facebook) centralize the routing into a proxy layer so applications do not have to embed it.

Discord gateway sharding

Discord routes WebSocket connections from millions of users to gateway servers using consistent hashing on the guild (server) ID. All connections for a given guild land on the same gateway shard, allowing in-memory state (presence, channel members) to live on one machine without cross-server coordination. They scaled this from hundreds to thousands of shards using consistent hashing's incremental rebalance property.

Trade-off: Sharding by guild keeps related state co-located, simplifying the application logic. The cost: a single mega-guild (Fortnite, Minecraft) can dominate one shard's resources. Discord has documented their work to detect and split such hot guilds across multiple shards.

Quick Interview Phrases

Key terms to use in your answer

consistent hashing
virtual nodes
hash ring
minimal key movement
replica set on the ring
rendezvous hashing

Common Interview Questions

Questions you might be asked about this topic

Start with the problem: naive `hash % N` remaps ~all keys when N changes, crashing the cache and the database. Consistent hashing places nodes and keys on a circular hash space. A key belongs to the first clockwise node from its hash position. Adding a node moves only the keys between the new node and its predecessor (~K/N keys). Mention virtual nodes (100-200 per physical) for balanced load. Name a real system (DynamoDB, Cassandra, Memcached) that uses it.

Interview Tips

How to discuss this topic effectively

1

Lead by stating why naive `hash(key) % N` fails. The interviewer wants to hear that you understand the problem before reaching for the solution.

2

Always mention virtual nodes. A consistent hashing answer without vnodes is incomplete and signals you have only read the headline of the technique.

3

Quote the K/N migration figure. It is the single most important property and the reason the technique exists.

4

Name a production system: DynamoDB, Cassandra, Memcached. Bonus points for mentioning Discord's gateway-sharding use case.

5

If asked about scale, mention Maglev or jump consistent hash. They are the modern follow-ups for high-QPS load balancing and grow-only workloads.

Common Mistakes

Pitfalls to avoid in interviews

Forgetting virtual nodes

Vanilla consistent hashing without vnodes has wildly imbalanced load, especially at small N. Production deployments use 100-256 vnodes per physical node to bring the variance down to a few percent.

Confusing consistent hashing with sticky sessions

Consistent hashing is a way to map keys to nodes deterministically. Sticky sessions is a load-balancer feature that routes a client to the same backend. They are related (sticky sessions often use consistent hashing under the hood) but not the same concept.

Using it for range queries

Consistent hashing distributes keys uniformly across nodes by hash, which destroys range locality. If your workload includes range scans (timestamp, alphabetical), use range partitioning instead. Cassandra offers both: hash-partitioned by default, with optional ordered partitioner for range queries.

Picking the wrong hash function

CRC32 has poor distribution above 10K keys; cryptographic hashes are slow. The production sweet spot is MurmurHash3 or xxHash: fast (~10 ns) and well-distributed. Cassandra defaults to Murmur3.

Ignoring replication topology

Naively walking clockwise N times can place all replicas on adjacent vnodes - which might all be in the same rack or AZ. Production systems add a constraint that no two replicas of the same key live in the same failure domain (rack-aware, AZ-aware placement).