System Design Article

Distributed Caching (Redis, Memcached)

Difficulty: Medium

A single-node cache eventually runs out of RAM, CPU, or network. Distributed caching spreads keys across many nodes so total capacity and throughput scale horizontally. This lesson covers how Redis and Memcached partition data, replicate it for availability, fail over when nodes die, and how to choose between them. By the end you can design a multi-node cache layer for a real workload, defend the topology in an interview, and recognize the bug class behind 'why is one cache node maxed at 100% CPU while the others are idle?'.

Distributed Caching (Redis, Memcached)

System Design

Medium

caching

redis

memcached

consistent-hashing

distributed-systems

replication

failover

system-design

intermediate

806 views

Why Distribute the Cache?

A single cache node has three hard ceilings:

Memory - even a fat machine maxes around 1 TB of usable RAM, and Redis recommends keeping each instance under 50 GB to keep replication and failover snappy.
CPU - Redis is single-threaded for command execution; Memcached scales across cores but eventually saturates the NIC.
Network - one 25 Gbps NIC carries about 3 GB/s. A million 3 KB cache reads per second is already pushing the limit.

Distributed caching is what you reach for once any of these ceilings is in sight. The idea is simple: many small cache nodes form a cluster, each holding a slice of the keyspace, and clients route each request to the right slice.

Text

---------- Distributed cache cluster ----------
  client - key 'user:42' --> [ shard 0 (slot 5298) ]
  client - key 'cart:99' --> [ shard 1 (slot 12041) ]
  client - key 'feed:7'  --> [ shard 2 (slot 9821) ]
            |                    |                    |
            +-- replication ---> +-- replication ---> +-- replication

Redis vs Memcached: the Two Workhorses

Ninety-nine percent of distributed cache designs start with one of these two. Pick the right tool first; tuning matters less than the choice.

Dimension	Redis	Memcached
Data types	Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog, geo, bitmaps	Strings only
Persistence	Optional RDB snapshots + AOF append log	None - pure RAM
Replication	Built-in leader-follower	Not built-in (use a proxy or app-level sharding)
Threading	Single-threaded command loop (Redis 6+ has multi-threaded I/O)	Multi-threaded, scales across cores
Memory model	Custom allocator (jemalloc), hot/cold separation	Slab allocator, fixed-size chunks
Cluster mode	Native (Redis Cluster) with 16384 slots	Client-side sharding only
Eviction	LRU, LFU, random, TTL, no-eviction (8 policies)	LRU within slab class
Pub/Sub & Streams	Yes	No
Best for	Rich data structures, durable cache, leaderboards, queues, sessions	Pure key-value, simplest possible cache, max throughput per byte

Rule of thumb: choose Redis when you need any of - data structures beyond strings, persistence, replication, pub/sub, or atomic multi-step operations (Lua scripting). Choose Memcached when you only need a fast distributed GET/SET/DELETE and want to maximize cores per dollar.

Partitioning: Spreading Keys Across Nodes

With N nodes, the cache layer must answer one question for every request: 'which node owns this key?' Three strategies, in increasing order of sophistication.

1. Modulo Hashing (the naive approach)

Hash the key, take hash(key) mod N, that is the node.

The problem: when N changes (a node added or removed), mod N changes for almost every key. You invalidate the entire cache in one operation. For a cluster serving a million reads per second at 90% hit rate, dropping to 0% hit rate sends 900K extra requests per second to the database. The database does not survive.

2. Consistent Hashing (Memcached, DynamoDB, Cassandra)

Hash both nodes and keys onto the same circular space (0 to 2^32). Each key is owned by the next node clockwise. Adding or removing a node only re-maps the keys between that node and its predecessor - roughly 1/N of the keyspace.

Text

---------- Consistent hash ring ----------
        node A (hash 100)
           /        \
     keys here       keys here
           \        /
        node B (hash 1700)
           /        \
     keys here       keys here
           \        /
        node C (hash 3000)
           /        \
     keys here       keys here
           \        /
        (loops back to A at 2^32)

Virtual nodes: with only N physical nodes, the ring is uneven; some nodes own much more space. Add 100 to 200 virtual node tokens per physical node to even out the distribution. This is how Cassandra's vnodes and Memcached's libketama work.

3. Slot-Based Sharding (Redis Cluster)

Redis Cluster pre-divides the keyspace into a fixed 16384 slots. Each key maps to a slot via CRC16(key) mod 16384. Each node owns a range of slots. Adding a node copies a contiguous slot range from existing nodes; the keyspace mapping is explicit and visible to clients.

Text

---------- Redis Cluster slots ----------
  node A: slots     0 - 5460
  node B: slots  5461 - 10922
  node C: slots 10923 - 16383

Clients keep a local copy of the slot-to-node map. On a slot move, the cluster sends a MOVED response with the new node address; the client updates its map and retries. Resharding is online - the cluster moves slots one at a time without downtime.

Pseudocode: routing a request

JavaScript

Python

import { createCluster } from 'redis';

const cluster = createCluster({
    rootNodes: [
        { socket: { host: 'cache-0.example.com', port: 6379 } },
        { socket: { host: 'cache-1.example.com', port: 6379 } },
        { socket: { host: 'cache-2.example.com', port: 6379 } },
    ],
});

await cluster.connect();
// The client computes CRC16(key) mod 16384, looks up the slot owner,
// and sends the request to the right node automatically.
await cluster.set('user:42', JSON.stringify(profile));
const raw = await cluster.get('user:42');

Replication and Failover

Partitioning gets you capacity and throughput. Replication gets you availability.

Replication topology

In Redis Cluster, each shard is a small leader-follower group. Writes go to the leader; reads can go to leader or follower (with READONLY).

Text

---------- One Redis shard ----------
           [ leader ]
            /      \
           v        v
       [ replica ] [ replica ]
      (in another AZ) (in another AZ)

Replication is asynchronous by default. The leader streams its command log; replicas apply it. Lag is sub-millisecond on a healthy cluster.

Failover

When a leader becomes unreachable, the cluster runs an election among the surviving leaders to promote one of the failed leader's replicas. The protocol: each shard's other-shard leaders vote, requiring a majority. The election takes 1 to 30 seconds depending on cluster-node-timeout.

Hazards:

Lost writes: writes acknowledged by the old leader but not yet replicated are gone. Configure min-replicas-to-write 1 min-replicas-max-lag 10 to refuse writes when no replica is in sync.
Split brain: a partitioned old leader keeps accepting writes for cluster-node-timeout seconds before realizing it is isolated. Those writes are discarded when it rejoins.
Cascading failover: a slow replica gets promoted, becomes the new leader, can't keep up, fails again. Always size replicas at the same hardware as leaders.

Cluster Membership: The Gossip Protocol

Redis Cluster nodes maintain a mesh - each node opens a TCP connection to every other node and exchanges periodic 'PING' messages on a separate cluster bus port (10000 + Redis port). The PING carries a digest of the sender's view of the cluster: who is alive, who is suspect, what slots each node owns.

When a node misses heartbeats for cluster-node-timeout ms (default 15s), peers mark it PFAIL (probably failed). When a majority of leaders agree, it is promoted to FAIL. Failover then runs.

Gossip is robust under partial failures: even if 30% of inter-node messages drop, the cluster still converges. The cost is metadata overhead: a 100-node cluster gossips around 1 Mbps of cluster-bus traffic per node. Usually fine, occasionally surprising.

Multi-Region Distributed Caching

A cache that lives in one region forces every cross-region request to pay the WAN round trip (50 to 150 ms). Two patterns solve this.

Pattern 1: Per-region cluster + cache-aside

Each region runs its own Redis Cluster. The application talks only to the local cluster. Cache invalidation events are broadcast across regions via a pub/sub bus (Kafka, SNS, Redis Streams). Each regional cache subscribes and deletes the key locally.

Pros: simple, low-latency reads in every region, no cross-region cache traffic on the hot path. Cons: brief cross-region inconsistency between the write and the invalidation propagation (typically 100 to 500 ms).

Pattern 2: Active-active replication (Redis Enterprise CRDB, ElastiCache Global Datastore)

Multi-leader Redis with conflict-free replicated data types. Writes apply locally and replicate asynchronously across regions; CRDTs ensure deterministic convergence.

Pros: writes are local in every region; survives whole-region failures. Cons: only available in commercial Redis (Redis Enterprise) or as a managed option (AWS); CRDT semantics restrict you to commutative operations on most data types.

Operational Pitfalls

Hot shards (uneven load)

When one key takes 50% of traffic, the shard owning it saturates while others are idle. Symptoms: one Redis node at 100% CPU, others at 5%; latency spikes only on hot keys.

Mitigations:

Key splitting: split the hot key into N shards via a salt (top-products:0, top-products:1, ..., top-products:N). Clients pick a shard at random or round-robin. Aggregate at write time.
Local cache in front of the cluster for the top-N hottest keys with a 5 to 30s TTL. Cuts cluster load on the hot keys to near zero.
Server-side replication: run extra replicas of the hot shard and route reads across them.

Fat keys

A single Redis key holding a 50 MB value (a giant cached JSON blob, an unbounded list). One operation now takes 50 ms instead of 100 us, blocking the single-threaded event loop and stalling all other commands on that node.

Mitigations:

Cap value sizes (a few hundred KB max).
Split lists/hashes into chunks with paginated keys.
Run redis-cli --bigkeys periodically and alert on outliers.

Slow commands

KEYS *, SMEMBERS hugeset, SUNION over million-element sets all run on the single Redis thread and can lock it up for seconds. The request timing out is the easy outcome; the harder one is every other client on the node timing out too.

Mitigations:

Replace KEYS with SCAN (cursor-based, non-blocking).
Move expensive aggregations to the application or a read replica.
Set slowlog-log-slower-than 10000 (microseconds) and alert on entries.

Real-World Examples

How real systems implement this in production

Twitter Memcached at scale

Twitter operates one of the world's largest Memcached deployments: thousands of servers serving the timeline cache, holding hundreds of TB of working set across multiple datacenters. They use client-side consistent hashing (libketama) for partitioning and a custom proxy (twemcache) for slab class tuning.

Trade-off: At extreme throughput, Memcached's multi-threaded architecture and predictable memory layout beat Redis on raw operations per dollar.

Pinterest Redis at 50M ops/sec

Pinterest runs a Redis fleet of 5000+ instances handling 50 million operations per second across 700+ shards. They built a custom proxy layer (Mantle) to abstract the cluster topology from application code and to handle failover automatically.

Trade-off: At scale, the operational tooling around the cluster (proxies, monitoring, automated rebalancing) is as important as the cluster itself.

AWS ElastiCache for Redis

ElastiCache offers Redis Cluster with automatic failover, cross-AZ replication, and (in 'Global Datastore' mode) cross-region replication with sub-second lag. Engineers pick node sizes and replica counts; AWS handles patching, failover, and snapshot backups.

Trade-off: Managed services let you trade dollars for operational pain - a fair trade for most teams that are not Pinterest-scale.

Discord billion-message cache

Discord caches recent messages per channel in a Memcached-like in-memory store fronting Cassandra. They size each cache node so the working set of the top 1000 channels fits in RAM, and rely on cache-aside for the long tail.

Trade-off: It is rarely worth caching the long tail; concentrate cache memory on the head of the access distribution.

Quick Interview Phrases

Key terms to use in your answer

consistent hashing

Redis Cluster slots

leader-follower replication

automatic failover

hot shard rebalancing

cluster-node-timeout

Common Interview Questions

Questions you might be asked about this topic

Design the cache layer for a 100K-QPS user-session workload across three regions.

Per-region Redis Cluster: 6 shards, each with a primary and one replica in another AZ. At 100K QPS / 6 shards = ~17K ops/sec per shard, well within Redis limits. Use cache-aside with 30-minute TTL for sessions; invalidate on logout via `DEL`. Cross-region: do NOT replicate sessions; users are sticky to a region via DNS, and a re-login on region failover is acceptable. Total memory: 100M sessions x 1 KB = 100 GB per region. Pick `r6g.large` nodes (16 GB each) - 6 shards x 16 GB = 96 GB usable. Mention monitoring: hit rate, eviction count, p99 latency, replication lag.

What happens during a Redis Cluster failover and how long does it take?

Compare Redis Cluster with sentinel-based replication. When would you use each?

How would you cache a 500 GB social graph?

Your application latency just doubled and Redis CPU is at 95% on one node. Walk me through your debug.

Interview Tips

How to discuss this topic effectively

State your sharding algorithm by name: 'consistent hashing with 200 vnodes per node' or 'Redis Cluster's 16384 slot model'. Naming the mechanism is a quick credibility win.

Always pair partitioning with replication in the same sentence. 'I would shard 6 ways with replication factor 2' is a complete answer; 'I would use Redis' is not.

Bring up hot keys before the interviewer asks. Mention key splitting, in-process caching, and read-replica fan-out as the three standard mitigations.

Pick Redis as your default unless the problem is clearly pure-string max-throughput. Most interview problems benefit from Redis's data structures (sorted sets for leaderboards, hashes for objects).

For multi-region, default to per-region clusters with cross-region invalidation events. Active-active CRDB is a more advanced answer that is correct only when local writes per region are required.

Common Mistakes

Pitfalls to avoid in interviews

Using modulo hashing instead of consistent hashing

Modulo hashing remaps almost every key when N changes, invalidating the entire cache and overwhelming the database. Always use consistent hashing or a slot-based scheme so that adding or removing a node only re-maps roughly 1/N of keys.

Treating Redis as a durable database

Even with AOF persistence, Redis can lose seconds of writes if the leader crashes before replicating. For data that cannot be lost, write to a durable store first and use Redis only as a cache. Configure `min-replicas-to-write` if you need stronger durability guarantees in Redis itself.

Running giant values or unbounded lists in Redis

A 50 MB value or a million-element list blocks the single-threaded event loop for tens of milliseconds, stalling every other request on that node. Cap value sizes, paginate large structures, and run `redis-cli --bigkeys` periodically.

Using `KEYS *` in production

`KEYS` scans the entire keyspace synchronously and locks Redis for seconds at scale. Use `SCAN` with a cursor and small `COUNT` to walk keys non-blockingly. Same goes for `SMEMBERS` on huge sets - prefer `SSCAN`.

Forgetting that Redis Cluster requires hash tags for multi-key operations

Multi-key commands (MSET, transactions, Lua scripts) must hit a single shard. Use hash tags - the curly-brace syntax `{user42}:profile` and `{user42}:cart` - to force related keys onto the same slot.

Back to System Design