System Design Article
Distributed Caching (Redis, Memcached)
Difficulty: Medium
A single-node cache eventually runs out of RAM, CPU, or network. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.
Distributed Caching (Redis, Memcached)
A single-node cache eventually runs out of RAM, CPU, or network. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.
806 views
6
Why Distribute the Cache?
A single cache node has three hard ceilings:
- Memory - even a fat machine maxes around 1 TB of usable RAM, and Redis recommends keeping each instance under 50 GB to keep replication and failover snappy.
- CPU - Redis is single-threaded for command execution; Memcached scales across cores but eventually saturates the NIC.
- Network - one 25 Gbps NIC carries about 3 GB/s. A million 3 KB cache reads per second is already pushing the limit.
Distributed caching is what you reach for once any of these ceilings is in sight. The idea is simple: many small cache nodes form a cluster, each holding a slice of the keyspace, and clients route each request to the right slice.
---------- Distributed cache cluster ----------
client - key 'user:42' --> [ shard 0 (slot 5298) ]
client - key 'cart:99' --> [ shard 1 (slot 12041) ]
client - key 'feed:7' --> [ shard 2 (slot 9821) ]
| | |
+-- replication ---> +-- replication ---> +-- replicationRedis vs Memcached: the Two Workhorses
Ninety-nine percent of distributed cache designs start with one of these two. Pick the right tool first; tuning matters less than the choice.
| Dimension | Redis | Memcached |
|---|---|---|
| Data types | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog, geo, bitmaps | Strings only |
| Persistence | Optional RDB snapshots + AOF append log | None - pure RAM |
| Replication | Built-in leader-follower | Not built-in (use a proxy or app-level sharding) |
| Threading | Single-threaded command loop (Redis 6+ has multi-threaded I/O) | Multi-threaded, scales across cores |
| Memory model | Custom allocator (jemalloc), hot/cold separation | Slab allocator, fixed-size chunks |
| Cluster mode | Native (Redis Cluster) with 16384 slots | Client-side sharding only |
| Eviction | LRU, LFU, random, TTL, no-eviction (8 policies) | LRU within slab class |
| Pub/Sub & Streams | Yes | No |
| Best for | Rich data structures, durable cache, leaderboards, queues, sessions | Pure key-value, simplest possible cache, max throughput per byte |
Rule of thumb: choose Redis when you need any of - data structures beyond strings, persistence, replication, pub/sub, or atomic multi-step operations (Lua scripting). Choose Memcached when you only need a fast distributed GET/SET/DELETE and want to maximize cores per dollar.
Partitioning: Spreading Keys Across Nodes
With N nodes, the cache layer must answer one question for every request: 'which node owns this key?' Three strategies, in increasing order of sophistication.
1. Modulo Hashing (the naive approach)
Hash the key, take hash(key) mod N, that is the node.
The problem: when N changes (a node added or removed), mod N changes for almost every key. You invalidate the entire cache in one operation. For a cluster serving a million reads per second at 90% hit rate, dropping to 0% hit rate sends 900K extra requests per second to the database. The database does not survive.
2. Consistent Hashing (Memcached, DynamoDB, Cassandra)
Hash both nodes and keys onto the same circular space (0 to 2^32). Each key is owned by the next node clockwise. Adding or removing a node only re-maps the keys between that node and its predecessor - roughly 1/N of the keyspace.
---------- Consistent hash ring ----------
node A (hash 100)
/ \
keys here keys here
\ /
node B (hash 1700)
/ \
keys here keys here
\ /
node C (hash 3000)
/ \
keys here keys here
\ /
(loops back to A at 2^32)Virtual nodes: with only N physical nodes, the ring is uneven; some nodes own much more space. Add 100 to 200 virtual node tokens per physical node to even out the distribution. This is how Cassandra's vnodes and Memcached's libketama work.
3. Slot-Based Sharding (Redis Cluster)
Redis Cluster pre-divides the keyspace into a fixed 16384 slots. Each key maps to a slot via CRC16(key) mod 16384. Each node owns a range of slots. Adding a node copies a contiguous slot range from existing nodes; the keyspace mapping is explicit and visible to clients.
---------- Redis Cluster slots ----------
node A: slots 0 - 5460
node B: slots 5461 - 10922
node C: slots 10923 - 16383Clients keep a local copy of the slot-to-node map. On a slot move, the cluster sends a MOVED response with the new node address; the client updates its map and retries. Resharding is online - the cluster moves slots one at a time without downtime.
Pseudocode: routing a request
import { createCluster } from 'redis';
const cluster = createCluster({
rootNodes: [
{ socket: { host: 'cache-0.example.com', port: 6379 } },
{ socket: { host: 'cache-1.example.com', port: 6379 } },
{ socket: { host: 'cache-2.example.com', port: 6379 } },
],
});
await cluster.connect();
// The client computes CRC16(key) mod 16384, looks up the slot owner,
// and sends the request to the right node automatically.
await cluster.set('user:42', JSON.stringify(profile));
const raw = await cluster.get('user:42');Replication and Failover
Partitioning gets you capacity and throughput. Replication gets you availability.
Replication topology
In Redis Cluster, each shard is a small leader-follower group. Writes go to the leader; reads can go to leader or follower (with READONLY).
---------- One Redis shard ----------
[ leader ]
/ \
v v
[ replica ] [ replica ]
(in another AZ) (in another AZ)Replication is asynchronous by default. The leader streams its command log; replicas apply it. Lag is sub-millisecond on a healthy cluster.
Failover
When a leader becomes unreachable, the cluster runs an election among the surviving leaders to promote one of the failed leader's replicas. The protocol: each shard's other-shard leaders vote, requiring a majority. The election takes 1 to 30 seconds depending on cluster-node-timeout.
Hazards:
- Lost writes: writes acknowledged by the old leader but not yet replicated are gone. Configure
min-replicas-to-write 1 min-replicas-max-lag 10to refuse writes when no replica is in sync. - Split brain: a partitioned old leader keeps accepting writes for
cluster-node-timeoutseconds before realizing it is isolated. Those writes are discarded when it rejoins. - Cascading failover: a slow replica gets promoted, becomes the new leader, can't keep up, fails again. Always size replicas at the same hardware as leaders.
Cluster Membership: The Gossip Protocol
Redis Cluster nodes maintain a mesh - each node opens a TCP connection to every other node and exchanges periodic 'PING' messages on a separate cluster bus port (10000 + Redis port). The PING carries a digest of the sender's view of the cluster: who is alive, who is suspect, what slots each node owns.
When a node misses heartbeats for cluster-node-timeout ms (default 15s), peers mark it PFAIL (probably failed). When a majority of leaders agree, it is promoted to FAIL. Failover then runs.
Gossip is robust under partial failures: even if 30% of inter-node messages drop, the cluster still converges. The cost is metadata overhead: a 100-node cluster gossips around 1 Mbps of cluster-bus traffic per node. Usually fine, occasionally surprising.
Multi-Region Distributed Caching
A cache that lives in one region forces every cross-region request to pay the WAN round trip (50 to 150 ms). Two patterns solve this.
Pattern 1: Per-region cluster + cache-aside
Each region runs its own Redis Cluster. The application talks only to the local cluster. Cache invalidation events are broadcast across regions via a pub/sub bus (Kafka, SNS, Redis Streams). Each regional cache subscribes and deletes the key locally.
Pros: simple, low-latency reads in every region, no cross-region cache traffic on the hot path. Cons: brief cross-region inconsistency between the write and the invalidation propagation (typically 100 to 500 ms).
Pattern 2: Active-active replication (Redis Enterprise CRDB, ElastiCache Global Datastore)
Multi-leader Redis with conflict-free replicated data types. Writes apply locally and replicate asynchronously across regions; CRDTs ensure deterministic convergence.
Pros: writes are local in every region; survives whole-region failures. Cons: only available in commercial Redis (Redis Enterprise) or as a managed option (AWS); CRDT semantics restrict you to commutative operations on most data types.
Operational Pitfalls
Hot shards (uneven load)
When one key takes 50% of traffic, the shard owning it saturates while others are idle. Symptoms: one Redis node at 100% CPU, others at 5%; latency spikes only on hot keys.
Mitigations:
- Key splitting: split the hot key into N shards via a salt (
top-products:0,top-products:1, ...,top-products:N). Clients pick a shard at random or round-robin. Aggregate at write time. - Local cache in front of the cluster for the top-N hottest keys with a 5 to 30s TTL. Cuts cluster load on the hot keys to near zero.
- Server-side replication: run extra replicas of the hot shard and route reads across them.
Fat keys
A single Redis key holding a 50 MB value (a giant cached JSON blob, an unbounded list). One operation now takes 50 ms instead of 100 us, blocking the single-threaded event loop and stalling all other commands on that node.
Mitigations:
- Cap value sizes (a few hundred KB max).
- Split lists/hashes into chunks with paginated keys.
- Run
redis-cli --bigkeysperiodically and alert on outliers.
Slow commands
KEYS *, SMEMBERS hugeset, SUNION over million-element sets all run on the single Redis thread and can lock it up for seconds. The request timing out is the easy outcome; the harder one is every other client on the node timing out too.
Mitigations:
- Replace
KEYSwithSCAN(cursor-based, non-blocking). - Move expensive aggregations to the application or a read replica.
- Set
slowlog-log-slower-than 10000(microseconds) and alert on entries.
Real-World Examples
How real systems implement this in production
Twitter operates one of the world's largest Memcached deployments: thousands of servers serving the timeline cache, holding hundreds of TB of working set across multiple datacenters. They use client-side consistent hashing (libketama) for partitioning and a custom proxy (twemcache) for slab class tuning.
Trade-off: At extreme throughput, Memcached's multi-threaded architecture and predictable memory layout beat Redis on raw operations per dollar.
Pinterest runs a Redis fleet of 5000+ instances handling 50 million operations per second across 700+ shards. They built a custom proxy layer (Mantle) to abstract the cluster topology from application code and to handle failover automatically.
Trade-off: At scale, the operational tooling around the cluster (proxies, monitoring, automated rebalancing) is as important as the cluster itself.
ElastiCache offers Redis Cluster with automatic failover, cross-AZ replication, and (in 'Global Datastore' mode) cross-region replication with sub-second lag. Engineers pick node sizes and replica counts; AWS handles patching, failover, and snapshot backups.
Trade-off: Managed services let you trade dollars for operational pain - a fair trade for most teams that are not Pinterest-scale.
Discord caches recent messages per channel in a Memcached-like in-memory store fronting Cassandra. They size each cache node so the working set of the top 1000 channels fits in RAM, and rely on cache-aside for the long tail.
Trade-off: It is rarely worth caching the long tail; concentrate cache memory on the head of the access distribution.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Per-region Redis Cluster: 6 shards, each with a primary and one replica in another AZ. At 100K QPS / 6 shards = ~17K ops/sec per shard, well within Redis limits. Use cache-aside with 30-minute TTL for sessions; invalidate on logout via `DEL`. Cross-region: do NOT replicate sessions; users are sticky to a region via DNS, and a re-login on region failover is acceptable. Total memory: 100M sessions x 1 KB = 100 GB per region. Pick `r6g.large` nodes (16 GB each) - 6 shards x 16 GB = 96 GB usable. Mention monitoring: hit rate, eviction count, p99 latency, replication lag.
Detection: peers stop receiving heartbeats from the primary for `cluster-node-timeout` ms (default 15s) and mark it PFAIL. When a majority of other primaries agree, the node is marked FAIL and an election begins. Election: replicas of the failed primary request votes from other primaries; the one with the highest replication offset (most up-to-date) wins. Promotion: the new primary takes over the slot range, gossips the change. Total: typically 5 to 30 seconds. Mention that recently-acknowledged writes that did not replicate are lost, and that `min-replicas-to-write` reduces this risk by refusing writes when no replica is in sync.
Sentinel: a single Redis primary plus replicas, with separate Sentinel processes monitoring health and orchestrating failover. No sharding; max throughput is bounded by one primary. Use when total working set fits in one node and you only need HA, not horizontal scale. Cluster: multi-shard, each shard has its own primary and replicas, no separate orchestrator (gossip handles it). Use when you need both HA and horizontal scale. Sentinel is simpler to operate but does not scale; Cluster scales but requires hash-tag awareness for multi-key ops.
Cannot fit in one node. Shard with Redis Cluster across 16-32 nodes, each holding 16-32 GB of working set. Replication factor 2 -> 32-64 nodes total. Use hash tags to colocate edges of the same vertex on the same shard so that adjacency lookups are single-shard. Add an in-process LRU for the top 1000 hottest vertices to absorb hot-key traffic. Persistence: AOF on, snapshot every 6 hours, so a node restart warms in minutes rather than rebuilding from the source. Mention monitoring: per-shard memory, eviction rate, slow log.
Step 1: identify the symptom - is it one node or all? Run `INFO commandstats` to see which commands dominate. Step 2: check for hot keys with `redis-cli --hotkeys` or `MONITOR` (briefly, in production it is expensive). Step 3: check for slow commands in `SLOWLOG GET`. Step 4: check for fat keys with `redis-cli --bigkeys`. Common findings: a `KEYS *` introduced in a new code path, an unbounded list growing past 1M elements, or a hot key from a viral post. Fixes: replace `KEYS` with `SCAN`, paginate the list, split or replicate the hot key. Long-term: alert on per-shard CPU divergence so the next hot shard is caught at 50% rather than 95%.
Interview Tips
How to discuss this topic effectively
State your sharding algorithm by name: 'consistent hashing with 200 vnodes per node' or 'Redis Cluster's 16384 slot model'. Naming the mechanism is a quick credibility win.
Always pair partitioning with replication in the same sentence. 'I would shard 6 ways with replication factor 2' is a complete answer; 'I would use Redis' is not.
Bring up hot keys before the interviewer asks. Mention key splitting, in-process caching, and read-replica fan-out as the three standard mitigations.
Pick Redis as your default unless the problem is clearly pure-string max-throughput. Most interview problems benefit from Redis's data structures (sorted sets for leaderboards, hashes for objects).
For multi-region, default to per-region clusters with cross-region invalidation events. Active-active CRDB is a more advanced answer that is correct only when local writes per region are required.
Common Mistakes
Pitfalls to avoid in interviews
Using modulo hashing instead of consistent hashing
Modulo hashing remaps almost every key when N changes, invalidating the entire cache and overwhelming the database. Always use consistent hashing or a slot-based scheme so that adding or removing a node only re-maps roughly 1/N of keys.
Treating Redis as a durable database
Even with AOF persistence, Redis can lose seconds of writes if the leader crashes before replicating. For data that cannot be lost, write to a durable store first and use Redis only as a cache. Configure `min-replicas-to-write` if you need stronger durability guarantees in Redis itself.
Running giant values or unbounded lists in Redis
A 50 MB value or a million-element list blocks the single-threaded event loop for tens of milliseconds, stalling every other request on that node. Cap value sizes, paginate large structures, and run `redis-cli --bigkeys` periodically.
Using `KEYS *` in production
`KEYS` scans the entire keyspace synchronously and locks Redis for seconds at scale. Use `SCAN` with a cursor and small `COUNT` to walk keys non-blockingly. Same goes for `SMEMBERS` on huge sets - prefer `SSCAN`.
Forgetting that Redis Cluster requires hash tags for multi-key operations
Multi-key commands (MSET, transactions, Lua scripts) must hit a single shard. Use hash tags - the curly-brace syntax `{user42}:profile` and `{user42}:cart` - to force related keys onto the same slot.
