System Design Article
Design a Distributed Cache (Redis)
Difficulty: Medium
Design a Redis-style in-memory distributed cache that serves billions of GET/SET operations per day at sub-millisecond latency, with sharding across hundreds of nodes and explicit eviction when memory fills. The interview centerpiece is the eviction-and-partitioning combination: how LRU and LFU choose what to drop, and how a cluster picks which node owns each key without a central coordinator. We compare client-side hashing, proxy-based partitioning (twemproxy), and Redis Cluster's hash-slot model; we cover cache-aside as the dominant access pattern, replica failover, optional persistence, and the sub-ms latency budget that makes this design fundamentally different from the durable KV store covered in the previous case study.
Design a Distributed Cache (Redis)
Design a Redis-style in-memory distributed cache that serves billions of GET/SET operations per day at sub-millisecond latency, with sharding across hundreds of nodes and explicit eviction when memory fills. The interview centerpiece is the eviction-and-partitioning combination: how LRU and LFU choose what to drop, and how a cluster picks which node owns each key without a central coordinator. We compare client-side hashing, proxy-based partitioning (twemproxy), and Redis Cluster's hash-slot model; we cover cache-aside as the dominant access pattern, replica failover, optional persistence, and the sub-ms latency budget that makes this design fundamentally different from the durable KV store covered in the previous case study.
1,080 views
28
Requirements
Functional Requirements
- GET (key) -> value or miss at sub-millisecond p99 latency.
- SET (key, value, ttl) that stores the value with optional time-to-live.
- DELETE (key) that removes the value.
- Common data types: strings, lists, hashes, sets, sorted sets (Redis-style; Memcached is strings only).
- TTL expiration: keys with a TTL are automatically removed when expired.
- Eviction when full: when memory is exhausted, evict according to a configured policy (LRU, LFU, TTL).
- Pub/Sub (optional but common): publishers send messages on channels; subscribers receive them.
- Cluster mode: keys partitioned across N nodes; client requests routed to the owning node automatically.
Out of Scope (state explicitly)
- ACID transactions across keys (we offer single-key atomicity; multi-key Redis MULTI works only on the same shard).
- Secondary indexes / range queries (not a database).
- Schema enforcement (values are opaque).
- Long-term durability as the primary store of record (cache-aside is the dominant pattern).
Non-Functional Requirements
- Latency: p99 < 1 ms for GET on a hot key; p99 < 2 ms for SET.
- Throughput: 1M ops/sec across the cluster; 50K-100K ops/sec per node.
- Availability: 99.99%; replica failover under 5 seconds.
- Memory efficiency: minimize per-key overhead; aim for ~50-100 B of structural overhead per string key.
- Predictable eviction: when full, eviction kicks in without dragging tail latency above the 1 ms budget.
- Operationally simple: cluster topology changes (add/remove node) should be online with bounded data movement.
Back-of-the-Envelope Estimation
Capacity
---------- Memory and traffic ----------
Keys: 100M
Average value size: ~500 B
Logical data: ~50 GB
Replication factor: 2 (primary + 1 replica)
Replicated data: ~100 GB
Daily ops (GET + SET): ~100 billion
Average ops/sec: ~1.1M
Peak ops/sec: ~3M
Nodes (each 16 GB RAM): ~16 with headroomNetwork
---------- Network ----------
Per-op wire size: ~150 B (1 GET + value + framing)
1M ops/sec: ~150 MB/sec aggregate
Per-node: ~10 MB/sec (well within 10 Gbps NICs)
Replication traffic (writes only): ~100 MB/sec total clusterEviction Pressure
---------- Eviction characterization ----------
Working set: ~80% of allocated memory
New keys per second: ~50K
Evictions per second: ~50K when full (one in / one out)
Time budget per eviction: ~10 us (no impact at 100K ops/sec)High-Level Design
---------- Architecture overview ----------
+-----------+
| Client |
+-----------+
|
v (smart client knows hash slot map; sends to correct node)
+-------- cluster of nodes (hash slots) ------+
| n1 n2 n3 ... n16 |
| each node owns a contiguous slot range |
| each node has a replica (n1' n2' n3' ...) |
+---------------------------------------------+
|
v (per-node)
+------------------+
| Single-threaded |
| event loop | (Redis-style; one core handles all I/O + ops)
+------------------+
|
v
+------------------+
| In-memory dict |
| + LRU/LFU lists |
+------------------+
|
v (optional)
+------------------+
| RDB / AOF disk |
+------------------+Public API (Redis-style protocol, RESP)
---------- Wire protocol example ----------
*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n
response: +OK\r\n
*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n
response: $3\r\nbar\r\nClients commonly use SDKs (ioredis, redis-py) that negotiate cluster topology and route requests automatically.
Request Flow
- Client computes
slot = CRC16(key) mod 16384. - Client looks up which node owns that slot from its cached topology.
- Client sends the command to that node over a persistent TCP connection.
- Node processes in its single-threaded event loop, returns the response.
- If a SET, node also forwards the write to the replica asynchronously.
Detailed Design
The two interesting components are the eviction policies and the cluster partitioning model.
Eviction Policies
When used_memory >= maxmemory, the cache must drop something to make room. The policy determines what.
| Policy | Drops | Best for | Memory overhead |
|---|---|---|---|
| noeviction | nothing (returns OOM error on writes) | Strict cache where stale > miss | None |
| allkeys-lru | least recently used | Skewed read workloads (Pareto) | ~16 B per key for LRU list |
| allkeys-lfu | least frequently used | Long-lived popularity bias | ~24 B per key for counters |
| allkeys-random | random | Truly uniform access | None |
| volatile-lru / volatile-lfu | LRU/LFU among keys with a TTL | Mixed cache + persistent | per-key |
| volatile-ttl | shortest-TTL keys first | Sliding-window data | per-key |
Why LFU often beats LRU
LRU favors recency: a key just touched moves to the front. LFU favors frequency: a key touched 1000 times beats one touched once even if the once was recent. Real production traces (Facebook's Memcached studies) show that on heavy-tailed access distributions, LFU has 5-15% higher hit rate than LRU because it doesn't get fooled by one-time scans (analytics queries pulling rare keys exactly once shouldn't evict the popular ones).
The tradeoff: LFU needs a frequency counter per key, costing more memory; Redis 4+ uses a probabilistic counter (8 bits) plus aging to keep overhead small.
Approximate LRU
Maintaining exact LRU (a doubly linked list with O(1) move-to-front) costs 16 bytes per key for pointers. For a billion-key cache that's 16 GB of pure overhead. Redis uses approximate LRU: sample 5 random keys, evict the oldest among them. Repeated over many evictions, this converges close to true LRU at a fraction of the memory cost.
---------- Approximate LRU (Redis style) ----------
When eviction needed:
candidates = sample N=5 random keys
evict the one with the oldest 'last access' timestamp
Repeat until under maxmemory.Active vs Passive Expiration
Keys with TTL are removed two ways:
- Passive: on read, if the key is expired, return miss and delete it.
- Active: a background loop samples 20 random keys with TTL every 100 ms; deletes expired ones; if more than 25% were expired, repeats immediately.
Without active expiration, expired keys would leak memory until accidentally read. Active keeps the in-memory footprint close to 'live' size.
Cluster Partitioning Models
Three mainstream approaches; each has its sweet spot.
Client-Side Hashing
Client hashes the key (consistent hashing or simple CRC) and picks the node directly. No coordinator.
---------- Client-side hashing ----------
+ No central component to fail or scale
+ Lowest latency (one direct hop)
- Each client must know full topology; topology updates require client redeploy or a config push
- Adding/removing nodes requires careful client rollout to avoid splitsMemcached clients historically used this pattern (libmemcached's ketama).
Proxy-Based (Twemproxy / mcrouter)
A stateless proxy layer sits between clients and cache nodes. Clients connect to the proxy; proxy routes to the right node.
---------- Proxy-based partitioning ----------
+ Clients are simple; topology changes don't require client updates
+ Proxy can do batching, fan-out, replica failover
- Extra hop adds ~0.2-0.5 ms
- Proxy itself needs HA (run multiple, load balance)Facebook runs mcrouter in front of Memcached; Twitter open-sourced twemproxy.
Redis Cluster (Hash Slots)
Redis Cluster pre-allocates 16,384 hash slots; each node owns a subset. Clients learn the slot map via gossip and route directly. When a node moves slots (rebalance), clients receive a MOVED redirect on the next request and update their map.
---------- Hash slots ----------
slot = CRC16(key) mod 16384
Each node owns a contiguous range of slots (e.g., n1: 0-1023, n2: 1024-2047, ...)
Resharding moves slots between nodes one at a time, online.The slot indirection makes resharding much smoother than raw consistent hashing: you rebalance in slot-sized increments without rehashing every key. The trade-off is the fixed slot count; you can't have more than 16384 shards (which is way more than anyone needs in practice).
Hash Tags for Multi-Key Operations
Redis MULTI/EXEC and Lua scripts only work when all keys touch the same node. Use a hash tag in curly braces to force keys onto the same slot:
---------- Hash tag example ----------
Key 'cart:{user42}:items' and 'cart:{user42}:meta' both hash by 'user42' alone
Both land on the same slot, same node
MULTI / Lua across them works atomicallyUse this for related keys that need atomic multi-key operations (a user's session + cart, an order's items + total).
Replication and Failover
Each primary has 1 (or more) replicas. Writes go to the primary; replicas pull asynchronously.
---------- Replication flow ----------
1. Client SET on primary; primary responds OK after applying.
2. Replica streams the command from the primary's replication backlog.
3. If primary fails: cluster gossip detects (within 1-5 s); replicas race to become the new primary; the most up-to-date replica wins (compare replication offsets).
4. Clients receive MOVED redirect to the new primary on next request.Replication is async by default: a write that the primary acked may be lost if the primary dies before replicating. For stricter durability, use WAIT N timeout to block until N replicas have the write (synchronous-ish replication with timeout).
Memory Layout
A Redis dict (hash table) with chained collisions stores key -> value pointers. Keys are interned strings; values can be any of the supported types. Per-key overhead is roughly:
---------- Per-key memory ----------
dict entry overhead: ~32 B
key string + length: len + 16 B
value header: ~16 B
LRU/LFU bits: ~8-16 B
TTL: ~16 B (if set)
Total overhead per key: ~80-100 B before payloadFor billion-key caches, this overhead matters: 100 B * 1B keys = 100 GB just for structure. Memcached has slightly lower overhead (~50 B) by being more frugal but supports only strings.
Persistence (Optional)
Two modes:
- RDB snapshots: fork the process, dump in-memory state to a file. Cheap CPU cost (copy-on-write); periodic (every 5 min). Loses up to 5 min of writes on crash.
- AOF (append-only file): every write is appended to a log; rewritten periodically. Loses at most 1 second of writes (with
everysecfsync). Larger disk footprint.
For pure cache use, neither persistence is needed (cache miss is recoverable). For cache-as-store usage (session store), AOF + RDB combo is common.
Calling Pattern: Cache-Aside
import Redis from 'ioredis';
const cache = new Redis.Cluster([{ host: 'cache', port: 6379 }]);
async function getUserProfile(userId) {
const cacheKey = `user:${userId}`;
const cached = await cache.get(cacheKey);
if (cached) return JSON.parse(cached);
const profile = await db.queryUser(userId);
if (profile) {
await cache.setex(cacheKey, 300, JSON.stringify(profile)); // TTL 5 min
}
return profile;
}This is the dominant pattern: cache-aside with explicit TTL; the cache holds copies, the database is the source of truth.
Data Model
Per-Node In-Memory Layout
---------- Single node ----------
Main dict: key -> value (object header + payload)
Expiry dict: key -> expiry_ms (parallel hash table for keys with TTL)
LRU/LFU metadata: 8-24 bits per key, packed into the object header
Client buffer ring (output): per-connection
Replication backlog: ring buffer of recent commands for replicas to catch upCluster Topology State
---------- Topology ----------
slot_owners: array[16384] -> node_id of owning primary
node_table: node_id -> {host, port, role(primary/replica), epoch, replicates_of}
Gossiped on cluster bus port (separate from client port).Clients cache the slot_owners map; on MOVED redirect, they refresh their local copy.
Pub/Sub State
---------- Pub/Sub channels ----------
channels: channel_name -> set of subscriber_connections
patterns: glob_pattern -> set of subscriber_connectionsPub/Sub is fire-and-forget; subscribers offline at publish time miss the message.
Scaling and Bottlenecks
Single-threaded Bottleneck
Redis is single-threaded for command processing. A single node tops out at ~100K ops/sec on commodity hardware (more with pipelining). To scale beyond that, shard. Cluster is the standard way; multi-threaded variants exist (KeyDB, Dragonfly) but trade off some Redis semantics.
Hot Keys
A single very hot key serializes through one node's event loop. Mitigations:
- Local client cache: clients hold a small LRU cache of the hottest keys with very short TTL (e.g., 1 s). Reduces round trips dramatically.
- Read replicas: route reads of hot keys to replicas; SET-heavy keys can't use this trick.
- Key splitting: shard the value across N keys (e.g., counter:1, counter:2, ..., counter:N) and aggregate. Adds complexity; only worth it for the truly extreme cases.
Large Values
A single 100 MB value can monopolize the event loop during transfer (single-threaded I/O). Mitigations:
- Set per-key size limits at the application layer.
- Stream large values in chunks via separate keys.
- For media or blobs, use object storage (S3) instead and cache the URL or metadata.
Memory Fragmentation
libc's malloc fragments memory over time (allocate-free patterns leave gaps). Redis tracks mem_fragmentation_ratio = rss / used_memory; values > 1.5 indicate heavy fragmentation. Solution: jemalloc (less prone to fragmentation) and periodic restart or MEMORY PURGE on jemalloc.
Eviction Storm
When memory hits maxmemory, eviction runs on every write. If eviction can't keep up with the write rate, the node returns OOM. Watch the eviction rate; if it sustains > 10% of write rate, the cache is undersized. Add nodes or reduce working set.
Failover Latency
For hot caches, even 5 s of failover is felt by users (5 s of cache misses). Faster failover requires:
- Tighter heartbeat intervals (more network chatter).
- Short replica catch-up (smaller backlog).
- Anti-pattern: one cluster per region rather than cross-region (cross-region failover takes longer).
Cross-Region Replication
Redis Cluster doesn't natively support cross-region replication; teams typically run independent regional clusters with application-level write fan-out, or use Redis Enterprise's CRDB feature. Eventual consistency across regions is the rule.
Trade-offs and Alternatives
Approximate LRU vs Exact LRU
Exact LRU costs O(1) updates with a doubly linked list but 16 B per key in pointers. Approximate LRU samples 5 keys per eviction; nearly the same hit rate empirically at zero pointer overhead. Approximate is the right call for memory-tight caches; the empirical hit-rate gap is < 1% on typical workloads.
LRU vs LFU
LRU treats every access equally; LFU weighs frequency. On Pareto-distributed access (a few keys massively popular), LFU keeps the popular ones longer and gives 5-15% higher hit rate. Use LFU as the default for production; LRU when access is roughly uniform or when you don't want frequency to matter.
Cluster vs Standalone
Standalone is simpler (one node, one process, one config). Cluster adds operational complexity (gossip, slot rebalancing, hash tags) but scales horizontally. Use standalone for caches under ~50 GB and ~100K ops/sec; switch to cluster beyond that.
Persistence vs Pure In-Memory
Pure in-memory is simplest and fastest; cache miss is recoverable from the source. Persistence (RDB or AOF) trades disk I/O cost (10-30% throughput hit for AOF with everysec) for faster recovery after restart. Use persistence when warm-up time matters (large caches) or when the cache is the source of truth (session store).
Synchronous vs Asynchronous Replication
Async (default) is fast but loses recent writes on primary failure; sync (WAIT N timeout) costs cross-replica RTT per write. Most caches accept the small data loss window for the latency win; use WAIT only for caches that hold session state where losing a few minutes of writes is harmful.
Why Not Just Add a Replica to Postgres?
Postgres replicas serve reads but every read still goes through Postgres's relational query path. Cache hits at 0.5 ms per op vs 5-50 ms for Postgres replicas. For high-throughput read-mostly workloads, an explicit cache layer is dramatically cheaper than scaling Postgres replicas.
Memcached vs Redis
Memcached: simpler, multi-threaded, slightly lower overhead per key, strings only. Redis: rich data structures (lists, sets, sorted sets, hashes, streams), single-threaded, optional persistence. Choose Memcached when you only need string KV at the highest possible per-node throughput; choose Redis when you need data structures (rate limiters, leaderboards, queues), pub/sub, or scripting.
Real-World Examples
How real systems implement this in production
Redis Cluster ships with the open-source distribution and powers caches at Twitter, Snap, Pinterest, and many others. Redis Enterprise adds CRDB (active-active cross-region replication via CRDTs), tiered storage (hot in RAM, warm on flash), and managed operations. Single-threaded core remains; throughput per node is ~100K ops/sec.
Trade-off: Redis's single-threaded design gives predictable latency and simple correctness reasoning at the cost of per-node throughput. The lesson: a deliberately constrained execution model (one thread, one event loop) trades raw throughput for operational simplicity, which has been the right call for most users.
Facebook runs Memcached at unprecedented scale: multiple TB per region, billions of keys, millions of ops/sec. They built mcrouter as a routing layer in front (handles failover, sharding, replication). Memcached's slab allocator avoids fragmentation entirely; its multi-threaded event loop scales to multiple cores per node.
Trade-off: Memcached's strict feature set (string KV only, no persistence, no scripting) keeps it lean and very fast but pushes complexity to the application layer (no atomic counters, no rate limiting primitives). The lesson: when you only need cache semantics, the leaner tool wins on throughput per dollar; when you need data structures, Redis is worth its higher per-node overhead.
ElastiCache offers managed Redis and Memcached with one-click cluster mode, automatic failover, and integration with the rest of AWS. Most teams pick managed cache to avoid running the cluster themselves.
Trade-off: ElastiCache hides nearly all the operational pain of running clusters at the cost of vendor lock-in and limited tuning (you can't run custom modules, can't tweak some kernel settings). The lesson: managed cache is the right default for most teams; the cost premium is dwarfed by the engineering hours saved on operations.
Dragonfly is a multi-threaded Redis-compatible server that uses shared-nothing per-thread sharding within one process. Reports 25x throughput per node vs Redis on the same hardware. Backed by VC funding; gaining adoption at companies needing higher per-node density.
Trade-off: Dragonfly's per-thread sharding gets dramatically higher throughput but breaks some Redis semantics (notably, MULTI is per-thread; cross-thread atomicity needs explicit handling). The lesson: throwing more cores at a problem can be cheaper than scaling out, but only if you're willing to give up some of the constraints that made the original design predictable.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client computes slot = CRC16(key) mod 16384 and looks up the owning node from its cached topology map. Client sends the SET command over a persistent TCP connection to that node. Node's single-threaded event loop reads the command from the socket, parses RESP, executes the SET (allocates the value, inserts into the dict, updates the LRU/LFU metadata), and writes +OK to the response buffer. If memory is at maxmemory, eviction kicks in BEFORE the insert: sample 5 random keys, drop the worst per the configured policy (LRU/LFU), repeat until under threshold. Asynchronously, the primary replicates the command to its replica via the replication backlog. Total p99: < 1 ms for the client; replication completes within a few hundred microseconds afterward.
Cluster nodes gossip heartbeats every ~1 s. After ~5 s of missed heartbeats, peers mark the primary FAIL. The primary's replica with the highest replication offset (most up-to-date) initiates a failover: it requests votes from a majority of primaries; if granted, it promotes itself, takes over the failed node's slots, and gossips its new role. Total time from failure to new primary serving: typically 5-10 s. Clients sending to the old primary's address get connection errors and retry; if they hit a node that knows the new topology, they receive a MOVED redirect with the new address; client refreshes its slot map. Writes acked by the old primary but not yet replicated are lost (small window because async replication is fast). For stricter durability, use WAIT 1 timeout in the SET command to block until the replica also acked.
Real cache workloads are typically Pareto-distributed: a small fraction of keys account for the majority of accesses (e.g., the 10 most popular product pages drive 80% of cache lookups). LRU treats a one-time scan (an analytics job that touches 100K rare keys exactly once) as equivalent to a popular key just touched: both move to the front. The scan can flush the popular keys out. LFU weighs by access count; the scan keys have count=1 and get evicted first when memory pressure returns, preserving the popular keys. Empirical studies (Facebook, Twitter) show LFU beats LRU by 5-15% on real production traces. The cost is per-key counter overhead (~1 byte for Redis's probabilistic counter); the hit-rate win typically dwarfs the counter cost. Use LRU when access is genuinely uniform or when frequency tracking is undesired.
A hot key (10K reads/sec on one node when the node's budget is 100K ops/sec) is 10% of one node's CPU. Several layered mitigations. (1) Client-side micro-cache: SDK holds an LRU of the hottest keys with a 1-second TTL; the network round trip is skipped entirely. Reduces hot-key load 10x for read-heavy keys. (2) Read from replicas: route GETs of hot keys to the primary's replica (cluster supports READONLY mode on replicas); spreads load across primary + replicas. (3) Key splitting: for write-hot counters or aggregates, shard the value into N sub-keys (counter:s1, counter:s2, ..., counter:sN), each on a different slot; aggregate on read; trades atomicity for parallelism. (4) Pre-aggregate: if the hot key is computed from many sources (e.g., a leaderboard), compute it offline and refresh periodically rather than on every write.
Several cases. (1) Small workloads (under ~50 GB and 100K ops/sec): a standalone primary + replica is simpler, faster to deploy, and avoids cluster's hash-tag rules. (2) Need for cross-shard transactions: cluster's atomic operations are limited to one slot. If you need atomic operations across many keys without hash-tag co-location, a standalone instance handles MULTI across all keys natively. (3) Multi-region active-active: cluster is single-region; cross-region replication requires either application-level fan-out or Redis Enterprise's CRDB. For multi-region needs, evaluate alternatives (DynamoDB Global Tables, etc.). (4) Hard durability needs: caches that absolutely cannot lose any data need either WAIT-with-replica or shouldn't be in a cache at all; consider a CP store with strong consistency (Spanner, CockroachDB) for source-of-truth data. Generally cluster is right for high-throughput in-region caching; other patterns fit other shapes.
Interview Tips
How to discuss this topic effectively
Lead with the latency budget. Saying 'sub-millisecond p99 forces in-memory and rules out almost everything else' frames why this design is so different from a durable KV store.
Bring up approximate LRU explicitly. Exact LRU costs 16 bytes per key in pointers; approximate trades a fraction of a percent of hit rate for billions of bytes of structural memory.
Distinguish the three partitioning models (client-side, proxy, hash slots) and pick hash slots as the modern default. Proxy still has its place for very-large fleets; client-side is mostly historical.
Mention hash tags when discussing multi-key operations. Without them, MULTI and Lua only work on a single shard; with them, you can guarantee co-location for related keys.
Default to LFU over LRU for read-heavy production workloads. Most candidates say LRU because it's the textbook answer; LFU's hit-rate advantage on Pareto distributions is a senior signal.
Common Mistakes
Pitfalls to avoid in interviews
Using exact LRU pointers in your design
Exact LRU costs 16 B per key in linked-list pointers. For a billion-key cache that's 16 GB of pure overhead. Approximate LRU (sample 5 keys, evict the oldest) gives the same hit rate within 1% at near-zero overhead and is what Redis actually uses.
Forgetting that Redis is single-threaded for command execution
A single Redis node tops out at ~100K ops/sec because all commands run on one event loop. To scale, shard with cluster mode (or use a multi-threaded variant like KeyDB or Dragonfly with the trade-offs that brings).
Assuming MULTI works across keys on a cluster
MULTI/EXEC and Lua scripts require all keys to live on the same shard. Use hash tags `{tag}` to force related keys onto the same slot when you need atomicity; otherwise multi-key transactions will fail with CROSSSLOT errors.
Treating cache as durable and skipping the source-of-truth database
Pure in-memory caches lose data on restart. Even with AOF persistence, the cache should not be the only copy of business-critical state; use cache-aside with the database as the durable source. Reserve cache-as-store usage for session-style data where loss is recoverable.
Choosing LRU for every workload
LRU is the default but not always best. On heavy-tailed access patterns (a few keys hugely popular), LFU keeps the popular ones longer and gives 5-15% higher hit rate. Pick LFU for read-heavy production workloads with skewed access; LRU for uniform access.
