System Design Article

Design a Chat System (WhatsApp)

Difficulty: Medium

Design a real-time chat system like WhatsApp serving 2B users sending 100B messages per day with sub-second delivery, presence indicators, and read receipts. The interview centerpiece is the persistent WebSocket connection layer: how many connections per server, how to route a message to a recipient who may be on a different server, and how to guarantee delivery when the recipient is offline. We cover the message delivery state machine (sent, delivered, read), the connection routing layer that maps user_id to a chat server, the message store for offline delivery, and presence/typing indicators that operate at a higher write rate than messages themselves.


Requirements

Functional Requirements

  1. One-to-one messaging: Alice sends a text message to Bob; Bob receives it within 1 second when online, or as soon as he reconnects when offline.
  2. Group messaging: send a message to a group of up to 256 members; each member sees it in their conversation list.
  3. Delivery receipts: sender sees the message state transition through sent (server received it), delivered (recipient device received it), read (recipient opened the conversation).
  4. Presence: see whether a contact is online, offline, or last seen X minutes ago.
  5. Typing indicators: see Alice is typing... while she composes a message.
  6. Message history: view past conversations on any new device.

Out of Scope (state explicitly)

  • End-to-end encryption (WhatsApp uses the Signal protocol; treat it as a black box that wraps the payload).
  • Voice and video calls (use the Video Conferencing case study).
  • Media attachments (similar pipeline to Instagram photo upload; we focus on text messages).
  • Message search (could plug in Elasticsearch later).

Non-Functional Requirements

  1. Scale: 2B users, 500M concurrent connections at peak, 100B messages per day.
  2. Latency: p99 message delivery < 1 second for online recipients.
  3. Availability: 99.99%. The product is the connection; if it drops, users notice instantly.
  4. Durability: messages must not be lost. At-least-once delivery is acceptable; the client deduplicates by message_id.
  5. Ordered delivery within a conversation: messages in conversation X must arrive in send order.
  6. Eventual consistency for receipts: a 2-3 second lag on delivered and read is fine.

Back-of-the-Envelope Estimation

Users, Connections, Messages

Text
---------- User and traffic estimation ----------
Total users:                          2B
Monthly active users:                 1.5B
Daily active users (DAU):             1B
Concurrent connections at peak:       500M (50% of DAU online together)

Messages per DAU per day:             100
Total messages per day:               1B * 100 = 100B
Messages per second (avg):            100B / 86400 ~= 1.16M /sec
Messages per second (peak 3x):        ~3.5M /sec

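The headline number is 500M concurrent WebSocket connections at peak. At ~10M connections per chat server (an aggressive ceiling for a tuned Linux box with epoll and event-driven I/O), that is 500M / 10M = ~50 chat servers as a bare minimum. Running every server that close to its limit leaves no headroom for failover or reconnect storms, so a real fleet budgets far fewer connections per server (at 10K each, the same load needs ~50,000 servers), all behind L4 load balancers.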

Storage

Text
---------- Message storage ----------
Average message size:                 100 bytes (text + metadata)
Messages per day:                     100B
Raw storage per day:                  100B * 100 = 10 TB/day
After compression (3x):               ~3.3 TB/day
Per year:                             3.3 TB * 365 ~= 1.2 PB/year
With 3x replication:                  ~3.6 PB/year

Retention:                            keep indefinitely for active accounts; drop after 30 days for inactive accounts

Bandwidth

Text
---------- Bandwidth ----------
Inbound (sender -> server):           1.16M msg/s * 200 B (with framing) = 232 MB/s
Outbound (server -> recipients):       1.16M msg/s * ~1.5 (groups average) = 1.7M deliveries/s
Delivery bandwidth:                    1.7M * 200 B = 340 MB/s
Plus presence/typing:                  ~10x message rate = ~3.5 GB/s outbound

Presence and typing indicators dominate bandwidth, which is why they need careful throttling.
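The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that only re-derives the article's own inputs; the function and parameter names are illustrative, and units are decimal (1 TB = 10^12 bytes).

```python
SECONDS_PER_DAY = 86_400


def estimate_traffic(dau=1_000_000_000, msgs_per_user_per_day=100, peak_factor=3):
    """Return (messages/day, average msgs/sec, peak msgs/sec)."""
    per_day = dau * msgs_per_user_per_day
    avg_per_sec = per_day / SECONDS_PER_DAY
    return per_day, avg_per_sec, avg_per_sec * peak_factor


def estimate_storage_bytes(msgs_per_day, msg_bytes=100, compression=3, replication=3):
    """Return (raw bytes/day, compressed bytes/day, replicated bytes/year)."""
    raw_per_day = msgs_per_day * msg_bytes
    compressed_per_day = raw_per_day / compression
    replicated_per_year = compressed_per_day * 365 * replication
    return raw_per_day, compressed_per_day, replicated_per_year


def estimate_bandwidth(avg_msgs_per_sec, frame_bytes=200, fanout=1.5):
    """Return (inbound bytes/sec, deliveries/sec, outbound bytes/sec)."""
    inbound = avg_msgs_per_sec * frame_bytes
    deliveries = avg_msgs_per_sec * fanout
    return inbound, deliveries, deliveries * frame_bytes
```

Running the numbers without intermediate rounding gives slightly different figures than the tables (for example ~347 MB/s of delivery bandwidth rather than 340 MB/s), which is within the precision a back-of-the-envelope estimate is meant to have.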

High-Level Design

Text
---------- High-level architecture ----------
   +-----------+               +-----------+
   | Client A  |               | Client B  |
   +-----------+               +-----------+
         |                            |
         |  WSS (persistent)          |
         v                            v
   +-------------------------------------+
   |    L4 Load Balancer (sticky)        |
   +-------------------------------------+
         |                            |
         v                            v
   +-----------+               +-----------+
   |Chat Server|               |Chat Server|
   |   (1)     |   <-------->  |   (2)     |
   +-----------+   Pub/Sub     +-----------+
         |       (Redis/Kafka)         |
         v                            v
   +-------------------------------------+
   |       Session Registry (Redis)      |
   |  user_id -> chat_server_id          |
   +-------------------------------------+
         |                            |
         v                            v
   +-------------------------------------+
   |   Message Store (Cassandra)         |
   |  partition by conversation_id       |
   +-------------------------------------+
         |
         v
   +-------------------------------------+
   |  Inbox / Offline Queue (Kafka)      |
   +-------------------------------------+

API Design

Messages flow over a single bidirectional WebSocket connection. We define a small protocol on top of WSS frames; JSON shown for clarity (production WhatsApp uses a binary protobuf variant).

Jsonc
// Client -> Server: send a message
{
    "type": "send",
    "client_msg_id": "01HW3M9...",     // ULID, idempotency key
    "conversation_id": "conv_abc123",
    "recipient_ids": ["u_bob"],         // multiple for groups
    "body": "hello",
    "sent_at": 1714128000000
}

// Server -> Client: ack with server-assigned message_id
{
    "type": "ack",
    "client_msg_id": "01HW3M9...",
    "server_msg_id": "msg_xyz789",
    "server_ts": 1714128000123
}

// Server -> recipient client: deliver
{
    "type": "deliver",
    "server_msg_id": "msg_xyz789",
    "conversation_id": "conv_abc123",
    "sender_id": "u_alice",
    "body": "hello",
    "server_ts": 1714128000123
}

// Recipient -> Server: receipt
{
    "type": "receipt",
    "server_msg_id": "msg_xyz789",
    "state": "delivered"               // or "read"
}

// Server -> sender: receipt update
{
    "type": "receipt_update",
    "server_msg_id": "msg_xyz789",
    "state": "delivered",
    "by_user": "u_bob",
    "at": 1714128000456
}
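
To make the frame shapes concrete, here is a minimal sketch of how a chat server might validate a `send` frame and derive the `ack` and `deliver` frames from it. The helper names are hypothetical, and the sketch assumes the sender's identity comes from the authenticated WebSocket session rather than from the frame, since the `send` frame above carries no sender field.

```python
import json

REQUIRED_SEND_FIELDS = ("client_msg_id", "conversation_id", "recipient_ids", "body", "sent_at")


def parse_send_frame(raw: str) -> dict:
    """Decode a `send` frame and reject it if it is malformed."""
    frame = json.loads(raw)
    if frame.get("type") != "send":
        raise ValueError("expected a frame of type 'send'")
    missing = [name for name in REQUIRED_SEND_FIELDS if name not in frame]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not frame["recipient_ids"]:
        raise ValueError("recipient_ids must not be empty")
    return frame


def build_ack(frame: dict, server_msg_id: str, server_ts: int) -> dict:
    """Echo the client's idempotency key next to the canonical server ID."""
    return {
        "type": "ack",
        "client_msg_id": frame["client_msg_id"],
        "server_msg_id": server_msg_id,
        "server_ts": server_ts,
    }


def build_deliver(frame: dict, sender_id: str, server_msg_id: str, server_ts: int) -> dict:
    """Build the frame pushed to each recipient; sender_id comes from the session."""
    return {
        "type": "deliver",
        "server_msg_id": server_msg_id,
        "conversation_id": frame["conversation_id"],
        "sender_id": sender_id,
        "body": frame["body"],
        "server_ts": server_ts,
    }
```

The `deliver` frame deliberately omits `client_msg_id` and the client's `sent_at`: recipients only ever see server-assigned identifiers and timestamps.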

Client Reconnection (HTTP fallback)

For the initial connection and history sync we still need a few REST endpoints:

Jsonc
// Authenticate and get a chat server URL
POST /api/v1/chat/connect
{ "device_id": "...", "auth_token": "..." }

// Response: tells the client which chat server to open a WS to
{
    "ws_url": "wss://chat-37.example.com/ws",
    "session_token": "..."
}

// Pull messages received while offline (or for a fresh device)
GET /api/v1/conversations/<id>/messages?after=<msg_id>&limit=100

Message Send Flow (online recipient)

Text
---------- Online delivery flow ----------
1. Client A sends `send` frame over WS to Chat Server 1
2. Chat Server 1 generates server_msg_id (Snowflake)
3. Chat Server 1 writes to Cassandra (durable) ~ 5 ms
4. Chat Server 1 sends `ack` back to Client A
5. Chat Server 1 looks up Bob in Session Registry
   -> finds Bob is on Chat Server 2
6. Chat Server 1 publishes to Redis Pub/Sub channel `srv:2`
7. Chat Server 2 receives the message via subscription
8. Chat Server 2 sends `deliver` frame to Client B over WS
9. Client B sends `receipt` (delivered) back
10. Chat Server 2 forwards receipt to Chat Server 1
11. Chat Server 1 sends `receipt_update` to Client A

End-to-end latency: ~50-200 ms in the same region, dominated by network RTT.

Message Send Flow (offline recipient)

Text
---------- Offline delivery flow ----------
1-4. Same as online flow.
5. Session Registry lookup returns NO active server for Bob.
6. Chat Server 1 writes the message to Bob's offline queue (Kafka topic partitioned by user_id).
7. When Bob reconnects:
   a. Bob's new chat server pulls from Bob's Kafka partition.
   b. Sends each pending message as a `deliver` frame.
   c. Bob acks; receipts flow back as in the online case.

Detailed Design

The two interesting components are the WebSocket gateway / session routing and the message delivery state machine.

WebSocket Gateway and Cross-Server Routing

Why persistent connections (not HTTP polling)?

A push model requires the server to initiate communication when a new message arrives. HTTP polling at 1 Hz means 500M users * 1 request/sec = 500M requests/sec just to check for nothing. WebSockets keep one connection open per user; the server pushes only when something happens.

Per-server connection capacity

A tuned Linux box (epoll, SO_REUSEPORT, increased file descriptor limits, large socket buffers) holds ~10M idle TCP connections in ~64 GB of memory. The math:

Text
---------- Per-server connection budget ----------
Kernel TCP socket overhead:           ~2 KB
Userspace per connection (buffers,    ~4 KB
  WS framing state, last activity):
Total per connection:                 ~6 KB
10M connections:                       ~60 GB
Leaves ~4 GB for the actual chat process

The throughput budget is separate. Each server pushes ~50K messages/sec at peak; CPU is dominated by TLS termination (often offloaded to the load balancer).
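
The connection budget is simple enough to check in code. This sketch uses the per-connection figures from the table above and decimal units; the function names are illustrative.

```python
def connection_memory_bytes(connections, kernel_bytes=2_000, userspace_bytes=4_000):
    """Memory needed to hold the given number of idle connections."""
    return connections * (kernel_bytes + userspace_bytes)


def max_connections(ram_bytes, reserved_bytes, per_connection_bytes=6_000):
    """Connections that fit once memory for the chat process is set aside."""
    return (ram_bytes - reserved_bytes) // per_connection_bytes
```

With 64 GB of RAM and 4 GB reserved, `max_connections(64_000_000_000, 4_000_000_000)` comes out at 10,000,000, matching the table. Doubling the userspace footprint to 8 KB per connection would cut that to 6M, which is why per-connection buffers deserve scrutiny.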

Sticky load balancing

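The load balancer must keep a connection on the same chat server (no rebalancing mid-connection). L4 balancers do this naturally because an established TCP connection stays pinned to one backend. Reconnects from the same client should also try to land on the same server, via consistent hashing on (user_id, device_id). An L4 balancer cannot see those fields, so this placement belongs in the connect endpoint, which picks the server and returns its ws_url; when the hash holds, the session registry entry does not need to change.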

Session Registry (the routing table)

Map each connected user to their chat server.

Text
Key:    session:<user_id>:<device_id>
Value:  chat_server_id, connected_at, last_seen
TTL:    120 seconds (refreshed by heartbeat)

Backed by Redis Cluster (sharded by user_id). On WebSocket open, the chat server writes the entry; on heartbeat (every 30s), the TTL is refreshed; on disconnect, the entry is deleted.

Size: 500M concurrent connections * 100 bytes = 50 GB in Redis, fits in ~10 nodes.
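
The registry's lifecycle (write on open, refresh on heartbeat, delete on disconnect, expire on silence) can be sketched without Redis. The class below is an in-memory illustration with an injectable clock; in the design above the same behavior would come from Redis keys with a TTL, and all names here are hypothetical.

```python
import time


class SessionRegistry:
    """Maps (user_id, device_id) to a chat server, with heartbeat-refreshed TTL."""

    def __init__(self, ttl_seconds=120, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (user_id, device_id) -> (chat_server_id, expires_at)

    def register(self, user_id, device_id, chat_server_id):
        """Called when the WebSocket opens."""
        self._entries[(user_id, device_id)] = (chat_server_id, self._clock() + self._ttl)

    def heartbeat(self, user_id, device_id):
        """Refresh the TTL; returns False if the entry already expired."""
        entry = self._live_entry(user_id, device_id)
        if entry is None:
            return False
        self._entries[(user_id, device_id)] = (entry[0], self._clock() + self._ttl)
        return True

    def lookup(self, user_id, device_id):
        """Return the chat server for a live session, or None."""
        entry = self._live_entry(user_id, device_id)
        return entry[0] if entry else None

    def remove(self, user_id, device_id):
        """Called on clean disconnect."""
        self._entries.pop((user_id, device_id), None)

    def _live_entry(self, user_id, device_id):
        key = (user_id, device_id)
        entry = self._entries.get(key)
        if entry is None:
            return None
        if entry[1] <= self._clock():
            del self._entries[key]  # lazily expire, as a TTL store would
            return None
        return entry
```

A heartbeat that returns False tells the chat server to re-register the session rather than assume it is still routable.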

Cross-server delivery: Pub/Sub between chat servers

When Server 1 needs to push a message to Bob on Server 2, there are three candidate patterns:

| Pattern | How it works | Trade-off |
| --- | --- | --- |
| Direct connection | Server 1 opens a TCP connection to Server 2 and sends directly | A full mesh needs N(N-1)/2 links: fine for tens of servers, unmanageable at ~50K servers (~1.25B links) |
| Redis Pub/Sub | Each chat server subscribes to its own channel srv:<id>; Server 1 publishes to srv:2 | Single hop, ~5 ms; Redis is the bottleneck but pub/sub is cheap |
| Kafka | Each chat server consumes a topic partitioned by server_id | Higher latency (50-100 ms), higher throughput, durable |

For a chat system, Redis Pub/Sub is the standard answer. It's fast (sub-10 ms), simple, and Redis Cluster scales horizontally. The downside is no durability: if Server 2 misses a publish (because it crashed mid-delivery), the message must come from the durable store. That's fine because we always write to Cassandra first before publishing; the publish is the fast-path.

Heartbeats and connection liveness

WebSockets do not detect a dead connection on their own (a network drop looks identical to an idle connection). Each side sends a ping frame every 30 seconds; if no pong within 60 seconds, the connection is closed and resources reclaimed.
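
A minimal sketch of the bookkeeping behind this rule follows. It assumes "no pong within 60 seconds" means more than 60 seconds since the last pong (or since the connection opened); the class and method names are hypothetical, and actually sending ping frames and closing sockets is left to the caller.

```python
import time


class LivenessTracker:
    """Tracks which connections need a ping and which should be closed."""

    def __init__(self, ping_interval=30.0, pong_timeout=60.0, clock=time.monotonic):
        self._ping_interval = ping_interval
        self._pong_timeout = pong_timeout
        self._clock = clock
        self._last_ping = {}  # conn_id -> time the last ping was sent
        self._last_pong = {}  # conn_id -> time the last pong arrived

    def opened(self, conn_id):
        now = self._clock()
        self._last_ping[conn_id] = now
        self._last_pong[conn_id] = now

    def pong_received(self, conn_id):
        if conn_id in self._last_pong:
            self._last_pong[conn_id] = self._clock()

    def due_for_ping(self):
        """Return connections to ping now and mark them as pinged."""
        now = self._clock()
        due = sorted(c for c, t in self._last_ping.items() if now - t >= self._ping_interval)
        for conn_id in due:
            self._last_ping[conn_id] = now
        return due

    def reap_dead(self):
        """Return connections silent for longer than the timeout and forget them."""
        now = self._clock()
        dead = sorted(c for c, t in self._last_pong.items() if now - t > self._pong_timeout)
        for conn_id in dead:
            del self._last_ping[conn_id]
            del self._last_pong[conn_id]
        return dead
```

A periodic task would call `due_for_ping()` and `reap_dead()` every few seconds; reaped connections should also have their session registry entry removed.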

Message Delivery State Machine

Each message moves through a small state machine. The sender's UI shows different ticks based on the state.

Text
---------- Message states ----------
      Client A          Server          Client B
          |               |                |
  send -->| [client_msg]  |                |
          | client picks ULID id           |
          |               |                |
          |   send frame  |                |
          | ------------> |                |
          |               | persist to     |
          |               | Cassandra      |
          |   ack         |                |
          | <------------ |                |
          | (state=sent,  |                |
          |  one tick)    |                |
          |               |  deliver       |
          |               | -------------->|
          |               |   receipt      |
          |               |   (delivered)  |
          |               | <--------------|
          |  receipt_upd  |                |
          | <------------ |                |
          | (state=       |                |
          |  delivered,   |                |
          |  two ticks)   |                |
          |               | open conv:     |
          |               |   receipt      |
          |               |   (read)       |
          |               | <--------------|
          |  receipt_upd  |                |
          | <------------ |                |
          | (state=read,  |                |
          |  blue ticks)  |                |

At-least-once delivery and idempotency

The network can drop any frame. To guarantee no message loss, both sides retry until acknowledged. Duplicates are inevitable; we deduplicate by:

  • Sender retries the same client_msg_id. Server checks (sender_id, client_msg_id); if the message already exists, return the same server_msg_id without re-persisting.
  • Receiver dedups by server_msg_id in a small per-conversation set.
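
Both dedup rules can be sketched in a few lines. The classes below are illustrative in-memory stand-ins (the names are hypothetical): the server side keys on (sender_id, client_msg_id), and the receiver keeps a bounded set of recent server_msg_id values per conversation.

```python
from collections import OrderedDict


class IdempotentMessageLog:
    """Server side: a retried send returns the original server_msg_id."""

    def __init__(self):
        self._server_ids = {}  # (sender_id, client_msg_id) -> server_msg_id
        self.messages = []     # persisted (server_msg_id, sender_id, body) rows

    def persist(self, sender_id, client_msg_id, body):
        """Return (server_msg_id, newly_persisted)."""
        key = (sender_id, client_msg_id)
        if key in self._server_ids:
            return self._server_ids[key], False
        server_msg_id = len(self.messages) + 1
        self.messages.append((server_msg_id, sender_id, body))
        self._server_ids[key] = server_msg_id
        return server_msg_id, True


class ReceiverDedup:
    """Client side: remember recently seen IDs per conversation."""

    def __init__(self, max_ids=1000):
        self._max_ids = max_ids
        self._seen = {}  # conversation_id -> OrderedDict of server_msg_id

    def is_new(self, conversation_id, server_msg_id):
        seen = self._seen.setdefault(conversation_id, OrderedDict())
        if server_msg_id in seen:
            return False
        seen[server_msg_id] = True
        if len(seen) > self._max_ids:
            seen.popitem(last=False)  # evict the oldest ID to bound memory
        return True
```

The bound on the receiver's set is a trade-off: a duplicate that arrives after its ID has been evicted would be shown twice, so the window should comfortably exceed the retry horizon.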
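
Why server-assigned IDs (Snowflake)?

Server-side IDs give us one agreed order within a conversation (sortable timestamp + machine bits + sequence) that every participant sees identically. Client clocks drift and are untrusted; we cannot use them for ordering. Snowflake IDs minted on different chat servers are only as ordered as those servers' clocks, so a strict per-conversation order additionally requires that a single server (or sequencer) assigns the IDs for a given conversation. The client also sends a client_msg_id (ULID) for idempotency, but the canonical ID is the server's.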
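
For reference, a Snowflake-style generator fits in a few lines. This sketch assumes the common 41/10/12 bit split (timestamp, machine, sequence) and an arbitrary custom epoch; neither is specified by the article. Instead of blocking when the sequence overflows or the clock steps backwards, it borrows from the next millisecond so that IDs from one generator stay strictly increasing.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (an assumption)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Generates 64-bit, roughly time-sortable IDs for one machine."""

    def __init__(self, machine_id, clock_ms=None):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                now = self._last_ms  # clock went backwards: hold the last timestamp
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    now = self._last_ms + 1  # sequence exhausted: borrow the next ms
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (timestamp << (MACHINE_BITS + SEQUENCE_BITS)) | (
                self._machine_id << SEQUENCE_BITS
            ) | self._sequence


def timestamp_ms(snowflake_id):
    """Recover the embedded Unix timestamp in milliseconds."""
    return (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + EPOCH_MS
```

Sorting by ID is equivalent to sorting by (timestamp, machine, sequence), which is what makes `server_msg_id` usable as a Cassandra clustering key.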

Group Messaging Fan-Out

For a group of N members, the chat server fans the message out N-1 times: looks up each member in the Session Registry and publishes via Pub/Sub. With N capped at 256, fan-out is bounded; no celebrity problem.

For very large broadcast lists (channel-style products with 100K+ subscribers), we'd switch to a fan-out worker pool reading from a Kafka topic per channel. This is the boundary where chat ends and notification service begins.
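
The bounded fan-out can be sketched as a planning step that splits members into online and offline sets. Grouping online members by chat server, so each server receives one publish instead of one per member, is an optimization added here as an assumption; the text above describes a publish per member. `plan_fan_out` and `session_lookup` are hypothetical names.

```python
from collections import defaultdict


def plan_fan_out(sender_id, member_ids, session_lookup):
    """Split group members into per-server batches and an offline list.

    session_lookup(user_id) returns a chat_server_id, or None when offline.
    """
    per_server = defaultdict(list)
    offline = []
    for member in member_ids:
        if member == sender_id:
            continue  # the sender already has the message
        server_id = session_lookup(member)
        if server_id is None:
            offline.append(member)
        else:
            per_server[server_id].append(member)
    return dict(per_server), offline
```

Each key of the first result becomes one Pub/Sub publish, and each user in the second is written to the offline queue. With N capped at 256, the loop is at most 255 registry lookups, which can be pipelined into a single round trip.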

Presence and Typing Indicators

Presence (online/offline) is fundamentally a different write pattern than messages: it changes constantly, has tiny payloads, and is OK to lose.

Text
---------- Presence design ----------
State:           online | offline | last_seen=<timestamp>
Stored in:       Redis only (no durable store)
TTL:             90 seconds; refreshed by WS heartbeat
Fan-out:         only to subscribers (people who have an open chat with you)
Throttle:        max 1 update per user per 10 seconds

Typing indicators are even more ephemeral: they live for 5 seconds, expire automatically, and never touch durable storage. We send typing_start once when the user starts typing and let it expire (no typing_stop unless the user closes the input).

Without throttling, presence and typing dominate the message bus. WhatsApp historically batched presence updates at the gateway level (one batch per connection per 5 seconds).
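
Both rules (one presence update per user per 10 seconds, typing entries that expire after 5 seconds) reduce to timestamp checks. The sketch below is an in-memory illustration with an injectable clock; in the design above the typing state would be a Redis key with a TTL, and the class names are hypothetical.

```python
import time


class PresenceThrottle:
    """Allows at most one published presence update per user per interval."""

    def __init__(self, min_interval=10.0, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent = {}  # user_id -> time of last published update

    def should_publish(self, user_id):
        now = self._clock()
        last = self._last_sent.get(user_id)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_sent[user_id] = now
        return True


class TypingIndicators:
    """Typing state that expires on its own; no typing_stop is required."""

    def __init__(self, ttl=5.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires = {}  # conversation_id -> {user_id: expires_at}

    def typing_start(self, conversation_id, user_id):
        users = self._expires.setdefault(conversation_id, {})
        users[user_id] = self._clock() + self._ttl

    def who_is_typing(self, conversation_id):
        now = self._clock()
        users = self._expires.get(conversation_id, {})
        for user_id in [u for u, expires_at in users.items() if expires_at <= now]:
            del users[user_id]
        return sorted(users)
```

A leading-edge throttle like this drops updates that arrive inside the window, so the last state change may go unpublished; that is acceptable here only because the TTL eventually corrects any stale online state.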

Data Model

Cassandra: messages (append-only, partition by conversation)

SQL
-- Cassandra schema
CREATE TABLE messages (
    conversation_id  text,
    server_msg_id    bigint,             -- Snowflake (sortable by time)
    sender_id        text,
    body             text,
    sent_at          timestamp,
    PRIMARY KEY ((conversation_id), server_msg_id)
) WITH CLUSTERING ORDER BY (server_msg_id DESC);

-- Per-conversation reads are a single partition scan, very fast.
-- Latest 50 messages: SELECT * FROM messages WHERE conversation_id = ? LIMIT 50;

Cassandra is ideal here: time-series writes, partition by conversation, range scan by server_msg_id for history. A single popular group conversation can produce thousands of messages per second to one partition; Cassandra handles that comfortably.

Postgres (sharded): users, conversations, membership

SQL
CREATE TABLE users (
    id           BIGINT PRIMARY KEY,
    phone        VARCHAR(20) UNIQUE NOT NULL,
    name         VARCHAR(64),
    created_at   TIMESTAMPTZ NOT NULL
);

CREATE TABLE conversations (
    id           VARCHAR(32) PRIMARY KEY,
    type         VARCHAR(8) NOT NULL,         -- 'one_to_one' | 'group'
    name         VARCHAR(64),                 -- group name
    created_at   TIMESTAMPTZ NOT NULL,
    last_msg_id  BIGINT                       -- denormalized for sort
);

CREATE TABLE conversation_members (
    conversation_id VARCHAR(32) NOT NULL,
    user_id         BIGINT NOT NULL,
    joined_at       TIMESTAMPTZ NOT NULL,
    last_read_msg   BIGINT,                   -- for unread-count display
    PRIMARY KEY (conversation_id, user_id)
);
CREATE INDEX idx_member_user ON conversation_members (user_id, conversation_id);

Shard by user_id. Most queries (my conversations, mark as read) are user-scoped; the conversation table is small (1B conversations * 200 bytes = 200 GB) and easily fits sharded.

Redis: session registry, presence, unread counts

Text
---------- Redis keys ----------
session:<user_id>:<device_id>     -> {server_id, last_heartbeat}    TTL 120s
presence:<user_id>                 -> {state, last_seen}             TTL 90s
unread:<user_id>:<conv_id>        -> integer counter                no TTL
typing:<conv_id>                   -> SET of user_ids                TTL 5s
idempotent:<sender>:<client_id>   -> server_msg_id                  TTL 60s

Kafka: offline message queue

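One topic offline-messages, partitioned by recipient_user_id. Kafka retains records by time rather than until consumed, so the topic keeps 30 days of messages and users returning from a long absence still get their inbox. Because each hash partition is shared by many users, a user's delivery position (the last delivered server_msg_id) is tracked per user outside Kafka rather than as a consumer-group offset.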

Scaling and Bottlenecks

Connection storm at peak

At 9 PM local time, all of India simultaneously opens WhatsApp. Connection rate spikes from 100K/sec to 1M/sec. Mitigations:

  • Connection rate limiting at the load balancer (drop excess and return retry-after).
  • Pre-provisioned chat server capacity in the affected region (autoscaling is too slow for connection storms; minutes vs the seconds we need).
  • Backoff with jitter in the client reconnection logic; never reconnect synchronously after a network event.
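
The client-side mitigation is the easiest to get wrong, so here is a minimal sketch of exponential backoff with full jitter. The base and cap values are illustrative assumptions, not figures from the article.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Full jitter: pick uniformly between 0 and the capped exponential ceiling,
    so clients that dropped together do not retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay(n)` before its n-th retry and reset n to zero once a connection survives its first heartbeat. Without the random component, every client that lost connectivity at the same moment would retry at 1 s, 2 s, 4 s and so on in lockstep, recreating the storm on each round.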

Hot conversation: a 256-person group with 100 active typers

The Cassandra partition for that conversation gets ~1K writes/sec. Cassandra handles per-partition writes well up to ~10K/sec. Beyond that we'd need to compose the partition key with a time bucket ((conv_id, hour)).
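
A time-bucketed partition key is a one-line computation. This sketch assumes the bucket is derived from the message's server timestamp in milliseconds and is expressed as whole hours since the Unix epoch; the function name is hypothetical.

```python
MS_PER_HOUR = 3_600_000


def bucketed_partition_key(conversation_id, server_ts_ms):
    """Return (conversation_id, hour_bucket) for use as a composite partition key."""
    return conversation_id, server_ts_ms // MS_PER_HOUR
```

The cost shows up on the read path: fetching recent history may need to query the current bucket and then walk backwards through earlier buckets until enough messages are found, so the bucket width should match how much a typical conversation writes.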

Presence storm

500M users coming online together would generate 500M Redis writes for presence. Mitigations:

  • Lazy presence: don't publish an online update until someone actually requests it, and then only to that user's contacts.
  • Coalesced updates: gateway batches presence updates per 5-second window before publishing.
  • TTL-based expiry rather than explicit offline writes: online is a heartbeat refresh; offline is just the absence of a key.

Multi-region replication

A chat between two users in different regions is the hard case. Two designs:

  • Home-region routing: each user has a home region; the conversation lives in one of the participants' home regions. Cross-region delivery hops through a region-aware Pub/Sub layer (50-150 ms added latency).
  • Multi-master with conflict-free types: every message has a globally unique server_msg_id and is replicated to both regions; reads are local. This is what real WhatsApp does. Trade-off: replication lag means a sender in EU might see their message appear before the recipient in US gets it.

What breaks at 100x?

At 200B users (hypothetical), the Session Registry becomes the bottleneck. Solution: shard session lookups by user_id and route Pub/Sub through a partitioned message bus (Kafka rather than Redis Pub/Sub) accepting the higher latency.

Trade-offs and Alternatives

Why WebSockets over Server-Sent Events (SSE)?

SSE is one-way (server -> client). Chat needs bidirectional flow (the client sends messages too). With SSE, the send path needs a separate HTTP POST per message, doubling the round trips for the most frequent operation. WebSockets give us send and receive on the same connection.

Why Redis Pub/Sub over Kafka for cross-server delivery?

Kafka is durable but adds 50-100 ms latency. Redis Pub/Sub is sub-10 ms but loses messages if a subscriber is briefly disconnected. We accept that loss because we always write to Cassandra first. A message dropped by Pub/Sub never reached the Kafka offline queue (the registry reported the recipient as online), so recovery relies on the recipient syncing from the message store: on reconnect, or on a periodic poll, it fetches each conversation's messages after the last server_msg_id it has.

Why Cassandra over Postgres for messages?

Cassandra's per-partition write pattern matches per-conversation writes. Postgres at 1M writes/sec across 100B rows would need aggressive sharding and complex partition management. Cassandra was built for this exact workload (originally at Facebook for the Inbox Search). Trade-off: no joins (we denormalize), eventual consistency by default.

Why server-side IDs over client-side UUIDs?

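Client clocks drift. Random UUIDs (v4) carry no order at all, and time-based client IDs such as ULIDs are only as correct as the device clock. Sorting messages by client_ts produces wrong results when the client's clock is wrong. Server-side Snowflake IDs give us a consistent sort order plus a useful timestamp embedded in the ID.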

Why per-user offline queue over per-conversation?

When Bob comes online he wants all messages, not per-conversation streams. A single Kafka topic partitioned by user_id is one consumer per user, regardless of how many conversations they have. Per-conversation queues would mean N subscriptions for a user in N groups.

When to use group fan-out vs broadcast channel

This design caps groups at 256 members (WhatsApp's original limit, which it has since raised). Above that we should think of it as a broadcast (channel-style: WhatsApp Channels, Telegram channels). The architecture changes: a publisher writes once to a topic, a fan-out worker pool delivers to subscribers in batches. Trying to fan out a 100K-member message in real time over Pub/Sub overwhelms the message bus.

Real-World Examples

How real systems implement this in production

WhatsApp

WhatsApp runs on Erlang/OTP for the chat servers, leveraging the actor model to hold millions of concurrent connections per server (the team publicly reported more than 2M on a single box). Messages persist in a sharded Mnesia/MySQL hybrid; cross-server delivery uses an internal pub/sub. At the time of the 2014 acquisition, a team of roughly 50 people was serving 450M users, largely thanks to Erlang's process model.

Trade-off: Erlang's per-actor isolation is the single biggest reason WhatsApp scales connections per server better than competitors. The trade-off is a smaller talent pool (most engineers don't know Erlang) and harder integration with non-Erlang services.

Telegram

Telegram uses MTProto, a custom binary protocol over TCP with optional encryption. Servers are written in C++ and use a sharded data center model where each user has a 'home' DC. Cross-DC messages route through a global directory. Storage is custom (not Cassandra), optimized for fast inbox sync.

Trade-off: MTProto's custom binary framing reduces bandwidth vs WhatsApp's protobuf-over-WSS, but the custom protocol means Telegram clients can't reuse standard WebSocket libraries. Telegram trades portability for efficiency.

Signal

Signal stores essentially nothing on the server: messages are end-to-end encrypted and the server only routes opaque blobs from sender to recipient. Group state is also encrypted client-side. The server's only durable store is the encrypted message queue waiting for offline recipients.

Trade-off: Signal's privacy model means features like cross-device sync, message search, and history backup require client-side workarounds. Server simplicity comes at the cost of feature complexity.

Slack

Slack's chat is similar but optimized for workspaces (one team = one channel set). They use WebSockets via their RTM (Real-Time Messaging) gateway, with messages persisted to MySQL sharded by team_id. A single workspace with 100K users is one big shard, which has driven their ongoing migration to per-channel sharding for very large enterprises.

Trade-off: Sharding by team_id was simple early on but became a hotspot for huge workspaces. The lesson: pick a sharding key that scales with your largest entity, not your average one.

Quick Interview Phrases

Key terms to use in your answer

persistent WebSocket connections
session registry
delivery state machine (sent / delivered / read)
at-least-once with idempotency keys
cross-server pub/sub fan-out
presence is a heartbeat, not a write

Common Interview Questions

Questions you might be asked about this topic

Client A sends a `send` frame over the WebSocket with a client_msg_id. Chat Server 1 generates a Snowflake server_msg_id, persists to Cassandra (partition by conversation), acks the sender. Server 1 looks up Bob in the Session Registry (Redis); finds Bob on Server 2. Server 1 publishes to Redis Pub/Sub channel `srv:2`. Server 2 receives the publish, sends `deliver` over Bob's WebSocket. Bob's client renders the bubble and sends back a `receipt` of `delivered`. Server 2 forwards the receipt to Server 1 via Pub/Sub; Server 1 sends `receipt_update` to Alice. End-to-end: ~50-200 ms.

Interview Tips

How to discuss this topic effectively

  1. Lead with 'this is a persistent-connection problem, not a request/response problem'. That immediately frames the design around WebSockets, session affinity, and offline queues, which is the right shape.
  2. When asked about cross-server delivery, name the Session Registry explicitly. Most candidates handwave 'the server knows where Bob is'. Drawing the Redis lookup and the Pub/Sub publish step is what separates a senior answer.
  3. Always commit to at-least-once delivery and explain idempotency. Saying 'exactly-once' is a red flag; the interviewer will probe and you will fail. Real systems are at-least-once with client-side dedup.
  4. Treat presence and typing as separate problems with separate write rates. Conflating them with messages leads to designs that look right but fail under presence storms.
  5. Cite WhatsApp's 'millions of connections per server' figure (more than 2M on one box, per its Erlang engineering posts). It signals you've read the engineering blogs and aren't pulling numbers from thin air.

Common Mistakes

Pitfalls to avoid in interviews

Picking HTTP long polling instead of WebSockets

Long polling issues a fresh HTTP request after every response or timeout, and without connection reuse each one also pays a new TCP and TLS handshake. At 500M concurrent users that is hundreds of millions of redundant requests per minute. WebSockets hold one persistent connection and add only a few bytes of framing per message. Long polling is acceptable only as a fallback for clients behind WebSocket-blocking middleboxes.

Storing messages in Postgres without sharding strategy

100B messages/day means a single table grows by 100B rows/day. No Postgres instance handles that. Either shard by conversation_id with a strict access pattern (always conversation-scoped reads) or use Cassandra, which was built for time-series partitioned writes.

Claiming exactly-once delivery

Exactly-once over an unreliable network is impossible without distributed transactions. Real systems guarantee at-least-once delivery and rely on the receiver to deduplicate by message_id. The illusion of exactly-once is a property of the client, not the network.

Treating presence like a regular write

Presence changes far more often than messages and is acceptable to lose. Storing each toggle in Cassandra would melt the cluster. Use Redis with TTL-based expiry: an `online` user is one whose key was refreshed in the last 90 seconds; `offline` is just the absence of a key.

Forgetting offline delivery and relying on the recipient being online

Most messages are sent to recipients who aren't currently connected. The Session Registry returning empty must trigger a write to a per-user offline queue (Kafka topic partitioned by user_id), drained on reconnect. Without this, a message sent at 2 AM to a sleeping recipient is lost.