System Design Article

Design a Distributed Cache (Redis)

Difficulty: Medium

Design a Redis-style in-memory distributed cache that serves billions of GET/SET operations per day at sub-millisecond latency, with sharding across hundreds of nodes and explicit eviction when memory fills. The interview centerpiece is the eviction-and-partitioning combination: how LRU and LFU choose what to drop, and how a cluster picks which node owns each key without a central coordinator. We compare client-side hashing, proxy-based partitioning (twemproxy), and Redis Cluster's hash-slot model; we cover cache-aside as the dominant access pattern, replica failover, optional persistence, and the sub-ms latency budget that makes this design fundamentally different from the durable KV store covered in the previous case study.

Design a Distributed Cache (Redis)

System Design

Medium

design-distributed-cache

case-study

infrastructure-storage

redis

memcached

lru

eviction-policy

consistent-hashing

cache-aside

in-memory-store

system-design

intermediate

premium

1,080 views

Requirements

Functional Requirements

GET (key) -> value or miss at sub-millisecond p99 latency.
SET (key, value, ttl) that stores the value with optional time-to-live.
DELETE (key) that removes the value.
Common data types: strings, lists, hashes, sets, sorted sets (Redis-style; Memcached is strings only).
TTL expiration: keys with a TTL are automatically removed when expired.
Eviction when full: when memory is exhausted, evict according to a configured policy (LRU, LFU, TTL).
Pub/Sub (optional but common): publishers send messages on channels; subscribers receive them.
Cluster mode: keys partitioned across N nodes; client requests routed to the owning node automatically.

Out of Scope (state explicitly)

ACID transactions across keys (we offer single-key atomicity; multi-key Redis MULTI works only on the same shard).
Secondary indexes / range queries (not a database).
Schema enforcement (values are opaque).
Long-term durability as the primary store of record (cache-aside is the dominant pattern).

Non-Functional Requirements

Latency: p99 < 1 ms for GET on a hot key; p99 < 2 ms for SET.
Throughput: 1M ops/sec across the cluster; 50K-100K ops/sec per node.
Availability: 99.99%; replica failover under 5 seconds.
Memory efficiency: minimize per-key overhead; aim for ~50-100 B of structural overhead per string key.
Predictable eviction: when full, eviction kicks in without dragging tail latency above the 1 ms budget.
Operationally simple: cluster topology changes (add/remove node) should be online with bounded data movement.

Back-of-the-Envelope Estimation

Capacity

Text

---------- Memory and traffic ----------
Keys:                   100M
Average value size:     ~500 B
Logical data:           ~50 GB
Replication factor:     2 (primary + 1 replica)
Replicated data:        ~100 GB
Daily ops (GET + SET):  ~100 billion
Average ops/sec:        ~1.1M
Peak ops/sec:           ~3M
Nodes (each 16 GB RAM): ~16 with headroom

Network

Text

---------- Network ----------
Per-op wire size:        ~150 B (1 GET + value + framing)
1M ops/sec:              ~150 MB/sec aggregate
Per-node:                ~10 MB/sec (well within 10 Gbps NICs)
Replication traffic (writes only): ~100 MB/sec total cluster

Eviction Pressure

Text

---------- Eviction characterization ----------
Working set:             ~80% of allocated memory
New keys per second:     ~50K
Evictions per second:    ~50K when full (one in / one out)
Time budget per eviction: ~10 us (no impact at 100K ops/sec)

High-Level Design

Text

---------- Architecture overview ----------
   +-----------+
   |  Client   |
   +-----------+
        |
        v (smart client knows hash slot map; sends to correct node)
   +-------- cluster of nodes (hash slots) ------+
   |  n1   n2   n3 ... n16                       |
   |  each node owns a contiguous slot range     |
   |  each node has a replica (n1' n2' n3' ...)  |
   +---------------------------------------------+
        |
        v (per-node)
   +------------------+
   | Single-threaded  |
   | event loop       |   (Redis-style; one core handles all I/O + ops)
   +------------------+
        |
        v
   +------------------+
   | In-memory dict   |
   | + LRU/LFU lists  |
   +------------------+
        |
        v (optional)
   +------------------+
   | RDB / AOF disk   |
   +------------------+

Public API (Redis-style protocol, RESP)

Text

---------- Wire protocol example ----------
*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n
response: +OK\r\n

*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n
response: $3\r\nbar\r\n

Clients commonly use SDKs (ioredis, redis-py) that negotiate cluster topology and route requests automatically.

Request Flow

Client computes slot = CRC16(key) mod 16384.
Client looks up which node owns that slot from its cached topology.
Client sends the command to that node over a persistent TCP connection.
Node processes in its single-threaded event loop, returns the response.
If a SET, node also forwards the write to the replica asynchronously.

Detailed Design

The two interesting components are the eviction policies and the cluster partitioning model.

Eviction Policies

When used_memory >= maxmemory, the cache must drop something to make room. The policy determines what.

Policy	Drops	Best for	Memory overhead
noeviction	nothing (returns OOM error on writes)	Strict cache where stale > miss	None
allkeys-lru	least recently used	Skewed read workloads (Pareto)	~16 B per key for LRU list
allkeys-lfu	least frequently used	Long-lived popularity bias	~24 B per key for counters
allkeys-random	random	Truly uniform access	None
volatile-lru / volatile-lfu	LRU/LFU among keys with a TTL	Mixed cache + persistent	per-key
volatile-ttl	shortest-TTL keys first	Sliding-window data	per-key

Why LFU often beats LRU

LRU favors recency: a key just touched moves to the front. LFU favors frequency: a key touched 1000 times beats one touched once even if the once was recent. Real production traces (Facebook's Memcached studies) show that on heavy-tailed access distributions, LFU has 5-15% higher hit rate than LRU because it doesn't get fooled by one-time scans (analytics queries pulling rare keys exactly once shouldn't evict the popular ones).

The tradeoff: LFU needs a frequency counter per key, costing more memory; Redis 4+ uses a probabilistic counter (8 bits) plus aging to keep overhead small.

Approximate LRU

Maintaining exact LRU (a doubly linked list with O(1) move-to-front) costs 16 bytes per key for pointers. For a billion-key cache that's 16 GB of pure overhead. Redis uses approximate LRU: sample 5 random keys, evict the oldest among them. Repeated over many evictions, this converges close to true LRU at a fraction of the memory cost.

Text

---------- Approximate LRU (Redis style) ----------
When eviction needed:
  candidates = sample N=5 random keys
  evict the one with the oldest 'last access' timestamp
Repeat until under maxmemory.

Active vs Passive Expiration

Keys with TTL are removed two ways:

Passive: on read, if the key is expired, return miss and delete it.
Active: a background loop samples 20 random keys with TTL every 100 ms; deletes expired ones; if more than 25% were expired, repeats immediately.

Without active expiration, expired keys would leak memory until accidentally read. Active keeps the in-memory footprint close to 'live' size.

Cluster Partitioning Models

Three mainstream approaches; each has its sweet spot.

Client-Side Hashing

Client hashes the key (consistent hashing or simple CRC) and picks the node directly. No coordinator.

Text

---------- Client-side hashing ----------
+ No central component to fail or scale
+ Lowest latency (one direct hop)
- Each client must know full topology; topology updates require client redeploy or a config push
- Adding/removing nodes requires careful client rollout to avoid splits

Memcached clients historically used this pattern (libmemcached's ketama).

Proxy-Based (Twemproxy / mcrouter)

A stateless proxy layer sits between clients and cache nodes. Clients connect to the proxy; proxy routes to the right node.

Text

---------- Proxy-based partitioning ----------
+ Clients are simple; topology changes don't require client updates
+ Proxy can do batching, fan-out, replica failover
- Extra hop adds ~0.2-0.5 ms
- Proxy itself needs HA (run multiple, load balance)

Facebook runs mcrouter in front of Memcached; Twitter open-sourced twemproxy.

Redis Cluster (Hash Slots)

Redis Cluster pre-allocates 16,384 hash slots; each node owns a subset. Clients learn the slot map via gossip and route directly. When a node moves slots (rebalance), clients receive a MOVED redirect on the next request and update their map.

Text

---------- Hash slots ----------
slot = CRC16(key) mod 16384
Each node owns a contiguous range of slots (e.g., n1: 0-1023, n2: 1024-2047, ...)
Resharding moves slots between nodes one at a time, online.

The slot indirection makes resharding much smoother than raw consistent hashing: you rebalance in slot-sized increments without rehashing every key. The trade-off is the fixed slot count; you can't have more than 16384 shards (which is way more than anyone needs in practice).

Hash Tags for Multi-Key Operations

Redis MULTI/EXEC and Lua scripts only work when all keys touch the same node. Use a hash tag in curly braces to force keys onto the same slot:

Text

---------- Hash tag example ----------
Key 'cart:{user42}:items' and 'cart:{user42}:meta' both hash by 'user42' alone
Both land on the same slot, same node
MULTI / Lua across them works atomically

Use this for related keys that need atomic multi-key operations (a user's session + cart, an order's items + total).

Replication and Failover

Each primary has 1 (or more) replicas. Writes go to the primary; replicas pull asynchronously.

Text

---------- Replication flow ----------
1. Client SET on primary; primary responds OK after applying.
2. Replica streams the command from the primary's replication backlog.
3. If primary fails: cluster gossip detects (within 1-5 s); replicas race to become the new primary; the most up-to-date replica wins (compare replication offsets).
4. Clients receive MOVED redirect to the new primary on next request.

Replication is async by default: a write that the primary acked may be lost if the primary dies before replicating. For stricter durability, use WAIT N timeout to block until N replicas have the write (synchronous-ish replication with timeout).

Memory Layout

A Redis dict (hash table) with chained collisions stores key -> value pointers. Keys are interned strings; values can be any of the supported types. Per-key overhead is roughly:

Text

---------- Per-key memory ----------
dict entry overhead:        ~32 B
key string + length:        len + 16 B
value header:               ~16 B
LRU/LFU bits:               ~8-16 B
TTL:                        ~16 B (if set)
Total overhead per key:     ~80-100 B before payload

For billion-key caches, this overhead matters: 100 B * 1B keys = 100 GB just for structure. Memcached has slightly lower overhead (~50 B) by being more frugal but supports only strings.

Persistence (Optional)

Two modes:

RDB snapshots: fork the process, dump in-memory state to a file. Cheap CPU cost (copy-on-write); periodic (every 5 min). Loses up to 5 min of writes on crash.
AOF (append-only file): every write is appended to a log; rewritten periodically. Loses at most 1 second of writes (with everysec fsync). Larger disk footprint.

For pure cache use, neither persistence is needed (cache miss is recoverable). For cache-as-store usage (session store), AOF + RDB combo is common.

Calling Pattern: Cache-Aside

JavaScript

Python

import Redis from 'ioredis';
const cache = new Redis.Cluster([{ host: 'cache', port: 6379 }]);

async function getUserProfile(userId) {
    const cacheKey = `user:${userId}`;
    const cached = await cache.get(cacheKey);
    if (cached) return JSON.parse(cached);
    const profile = await db.queryUser(userId);
    if (profile) {
        await cache.setex(cacheKey, 300, JSON.stringify(profile));   // TTL 5 min
    }
    return profile;
}

This is the dominant pattern: cache-aside with explicit TTL; the cache holds copies, the database is the source of truth.

Data Model

Per-Node In-Memory Layout

Text

---------- Single node ----------
Main dict:        key -> value (object header + payload)
Expiry dict:      key -> expiry_ms (parallel hash table for keys with TTL)
LRU/LFU metadata: 8-24 bits per key, packed into the object header
Client buffer ring (output): per-connection
Replication backlog: ring buffer of recent commands for replicas to catch up

Cluster Topology State

Text

---------- Topology ----------
slot_owners:  array[16384] -> node_id of owning primary
node_table:   node_id -> {host, port, role(primary/replica), epoch, replicates_of}
Gossiped on cluster bus port (separate from client port).

Clients cache the slot_owners map; on MOVED redirect, they refresh their local copy.

Pub/Sub State

Text

---------- Pub/Sub channels ----------
channels: channel_name -> set of subscriber_connections
patterns: glob_pattern -> set of subscriber_connections

Pub/Sub is fire-and-forget; subscribers offline at publish time miss the message.

Scaling and Bottlenecks

Single-threaded Bottleneck

Redis is single-threaded for command processing. A single node tops out at ~100K ops/sec on commodity hardware (more with pipelining). To scale beyond that, shard. Cluster is the standard way; multi-threaded variants exist (KeyDB, Dragonfly) but trade off some Redis semantics.

Hot Keys

A single very hot key serializes through one node's event loop. Mitigations:

Local client cache: clients hold a small LRU cache of the hottest keys with very short TTL (e.g., 1 s). Reduces round trips dramatically.
Read replicas: route reads of hot keys to replicas; SET-heavy keys can't use this trick.
Key splitting: shard the value across N keys (e.g., counter:1, counter:2, ..., counter:N) and aggregate. Adds complexity; only worth it for the truly extreme cases.

Large Values

A single 100 MB value can monopolize the event loop during transfer (single-threaded I/O). Mitigations:

Set per-key size limits at the application layer.
Stream large values in chunks via separate keys.
For media or blobs, use object storage (S3) instead and cache the URL or metadata.

Memory Fragmentation

libc's malloc fragments memory over time (allocate-free patterns leave gaps). Redis tracks mem_fragmentation_ratio = rss / used_memory; values > 1.5 indicate heavy fragmentation. Solution: jemalloc (less prone to fragmentation) and periodic restart or MEMORY PURGE on jemalloc.

Eviction Storm

When memory hits maxmemory, eviction runs on every write. If eviction can't keep up with the write rate, the node returns OOM. Watch the eviction rate; if it sustains > 10% of write rate, the cache is undersized. Add nodes or reduce working set.

Failover Latency

For hot caches, even 5 s of failover is felt by users (5 s of cache misses). Faster failover requires:

Tighter heartbeat intervals (more network chatter).
Short replica catch-up (smaller backlog).
Anti-pattern: one cluster per region rather than cross-region (cross-region failover takes longer).

Cross-Region Replication

Redis Cluster doesn't natively support cross-region replication; teams typically run independent regional clusters with application-level write fan-out, or use Redis Enterprise's CRDB feature. Eventual consistency across regions is the rule.

Trade-offs and Alternatives

Approximate LRU vs Exact LRU

Exact LRU costs O(1) updates with a doubly linked list but 16 B per key in pointers. Approximate LRU samples 5 keys per eviction; nearly the same hit rate empirically at zero pointer overhead. Approximate is the right call for memory-tight caches; the empirical hit-rate gap is < 1% on typical workloads.

LRU vs LFU

LRU treats every access equally; LFU weighs frequency. On Pareto-distributed access (a few keys massively popular), LFU keeps the popular ones longer and gives 5-15% higher hit rate. Use LFU as the default for production; LRU when access is roughly uniform or when you don't want frequency to matter.

Cluster vs Standalone

Standalone is simpler (one node, one process, one config). Cluster adds operational complexity (gossip, slot rebalancing, hash tags) but scales horizontally. Use standalone for caches under ~50 GB and ~100K ops/sec; switch to cluster beyond that.

Persistence vs Pure In-Memory

Pure in-memory is simplest and fastest; cache miss is recoverable from the source. Persistence (RDB or AOF) trades disk I/O cost (10-30% throughput hit for AOF with everysec) for faster recovery after restart. Use persistence when warm-up time matters (large caches) or when the cache is the source of truth (session store).

Synchronous vs Asynchronous Replication

Async (default) is fast but loses recent writes on primary failure; sync (WAIT N timeout) costs cross-replica RTT per write. Most caches accept the small data loss window for the latency win; use WAIT only for caches that hold session state where losing a few minutes of writes is harmful.

Why Not Just Add a Replica to Postgres?

Postgres replicas serve reads but every read still goes through Postgres's relational query path. Cache hits at 0.5 ms per op vs 5-50 ms for Postgres replicas. For high-throughput read-mostly workloads, an explicit cache layer is dramatically cheaper than scaling Postgres replicas.

Memcached vs Redis

Memcached: simpler, multi-threaded, slightly lower overhead per key, strings only. Redis: rich data structures (lists, sets, sorted sets, hashes, streams), single-threaded, optional persistence. Choose Memcached when you only need string KV at the highest possible per-node throughput; choose Redis when you need data structures (rate limiters, leaderboards, queues), pub/sub, or scripting.

Real-World Examples

How real systems implement this in production

Redis (open source + Redis Enterprise)

Redis Cluster ships with the open-source distribution and powers caches at Twitter, Snap, Pinterest, and many others. Redis Enterprise adds CRDB (active-active cross-region replication via CRDTs), tiered storage (hot in RAM, warm on flash), and managed operations. Single-threaded core remains; throughput per node is ~100K ops/sec.

Trade-off: Redis's single-threaded design gives predictable latency and simple correctness reasoning at the cost of per-node throughput. The lesson: a deliberately constrained execution model (one thread, one event loop) trades raw throughput for operational simplicity, which has been the right call for most users.

Memcached (Facebook scale)

Facebook runs Memcached at unprecedented scale: multiple TB per region, billions of keys, millions of ops/sec. They built mcrouter as a routing layer in front (handles failover, sharding, replication). Memcached's slab allocator avoids fragmentation entirely; its multi-threaded event loop scales to multiple cores per node.

Trade-off: Memcached's strict feature set (string KV only, no persistence, no scripting) keeps it lean and very fast but pushes complexity to the application layer (no atomic counters, no rate limiting primitives). The lesson: when you only need cache semantics, the leaner tool wins on throughput per dollar; when you need data structures, Redis is worth its higher per-node overhead.

AWS ElastiCache (Redis and Memcached)

ElastiCache offers managed Redis and Memcached with one-click cluster mode, automatic failover, and integration with the rest of AWS. Most teams pick managed cache to avoid running the cluster themselves.

Trade-off: ElastiCache hides nearly all the operational pain of running clusters at the cost of vendor lock-in and limited tuning (you can't run custom modules, can't tweak some kernel settings). The lesson: managed cache is the right default for most teams; the cost premium is dwarfed by the engineering hours saved on operations.

Dragonfly (BSD-licensed Redis-compatible)

Dragonfly is a multi-threaded Redis-compatible server that uses shared-nothing per-thread sharding within one process. Reports 25x throughput per node vs Redis on the same hardware. Backed by VC funding; gaining adoption at companies needing higher per-node density.

Trade-off: Dragonfly's per-thread sharding gets dramatically higher throughput but breaks some Redis semantics (notably, MULTI is per-thread; cross-thread atomicity needs explicit handling). The lesson: throwing more cores at a problem can be cheaper than scaling out, but only if you're willing to give up some of the constraints that made the original design predictable.

Quick Interview Phrases

Key terms to use in your answer

approximate LRU sampling

Redis Cluster hash slots with MOVED redirect

single-threaded event loop per node

cache-aside with explicit TTL

hash tags for atomic multi-key ops

active and passive expiration

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens to a SET command in your distributed cache.

Client computes slot = CRC16(key) mod 16384 and looks up the owning node from its cached topology map. Client sends the SET command over a persistent TCP connection to that node. Node's single-threaded event loop reads the command from the socket, parses RESP, executes the SET (allocates the value, inserts into the dict, updates the LRU/LFU metadata), and writes +OK to the response buffer. If memory is at maxmemory, eviction kicks in BEFORE the insert: sample 5 random keys, drop the worst per the configured policy (LRU/LFU), repeat until under threshold. Asynchronously, the primary replicates the command to its replica via the replication backlog. Total p99: < 1 ms for the client; replication completes within a few hundred microseconds afterward.

How do you handle a node failure in your cluster?

Why pick LFU over LRU for your default eviction policy?

How do you keep a hot key from saturating one node?

When would you NOT use Redis Cluster?

Interview Tips

How to discuss this topic effectively

Lead with the latency budget. Saying 'sub-millisecond p99 forces in-memory and rules out almost everything else' frames why this design is so different from a durable KV store.

Bring up approximate LRU explicitly. Exact LRU costs 16 bytes per key in pointers; approximate trades a fraction of a percent of hit rate for billions of bytes of structural memory.

Distinguish the three partitioning models (client-side, proxy, hash slots) and pick hash slots as the modern default. Proxy still has its place for very-large fleets; client-side is mostly historical.

Mention hash tags when discussing multi-key operations. Without them, MULTI and Lua only work on a single shard; with them, you can guarantee co-location for related keys.

Default to LFU over LRU for read-heavy production workloads. Most candidates say LRU because it's the textbook answer; LFU's hit-rate advantage on Pareto distributions is a senior signal.

Common Mistakes

Pitfalls to avoid in interviews

Using exact LRU pointers in your design

Exact LRU costs 16 B per key in linked-list pointers. For a billion-key cache that's 16 GB of pure overhead. Approximate LRU (sample 5 keys, evict the oldest) gives the same hit rate within 1% at near-zero overhead and is what Redis actually uses.

Forgetting that Redis is single-threaded for command execution

A single Redis node tops out at ~100K ops/sec because all commands run on one event loop. To scale, shard with cluster mode (or use a multi-threaded variant like KeyDB or Dragonfly with the trade-offs that brings).

Assuming MULTI works across keys on a cluster

MULTI/EXEC and Lua scripts require all keys to live on the same shard. Use hash tags `{tag}` to force related keys onto the same slot when you need atomicity; otherwise multi-key transactions will fail with CROSSSLOT errors.

Treating cache as durable and skipping the source-of-truth database

Pure in-memory caches lose data on restart. Even with AOF persistence, the cache should not be the only copy of business-critical state; use cache-aside with the database as the durable source. Reserve cache-as-store usage for session-style data where loss is recoverable.

Choosing LRU for every workload

LRU is the default but not always best. On heavy-tailed access patterns (a few keys hugely popular), LFU keeps the popular ones longer and gives 5-15% higher hit rate. Pick LFU for read-heavy production workloads with skewed access; LRU for uniform access.

Back to System Design