System Design Article
Design a Chat System (WhatsApp)
Difficulty: Medium
Design a real-time chat system like WhatsApp serving 2B users sending 100B messages per day with sub-second delivery, presence indicators, and read receipts. The interview centerpiece is the persistent WebSocket connection layer: how many connections per server, how to route a message to a recipient who may be on a different server, and how to guarantee delivery when the recipient is offline. We cover the message delivery state machine (sent, delivered, read), the connection routing layer that maps user_id to a chat server, the message store for offline delivery, and presence/typing indicators that operate at a higher write rate than messages themselves.
Requirements
Functional Requirements
- One-to-one messaging: Alice sends a text message to Bob; Bob receives it within 1 second when online, or as soon as he reconnects when offline.
- Group messaging: send a message to a group of up to 256 members; each member sees it in their conversation list.
- Delivery receipts: sender sees the message state transition through `sent` (server received it), `delivered` (recipient device received it), `read` (recipient opened the conversation).
- Presence: see whether a contact is `online`, `offline`, or `last seen X minutes ago`.
- Typing indicators: see `Alice is typing...` while she composes a message.
- Message history: view past conversations on any new device.
Out of Scope (state explicitly)
- End-to-end encryption (WhatsApp uses Signal protocol; treat it as a black box that wraps the payload).
- Voice and video calls (use the Video Conferencing case study).
- Media attachments (similar pipeline to Instagram photo upload; we focus on text messages).
- Message search (could plug in Elasticsearch later).
Non-Functional Requirements
- Scale: 2B users, 500M concurrent connections at peak, 100B messages per day.
- Latency: p99 message delivery < 1 second for online recipients.
- Availability: 99.99%. The product is the connection; if it drops, users notice instantly.
- Durability: messages must not be lost. At-least-once delivery is acceptable; the client deduplicates by message_id.
- Ordered delivery within a conversation: messages in conversation X must arrive in send order.
- Eventual consistency for receipts: a 2-3 second lag on `delivered` and `read` is fine.
Back-of-the-Envelope Estimation
Users, Connections, Messages
---------- User and traffic estimation ----------
Total users: 2B
Monthly active users: 1.5B
Daily active users (DAU): 1B
Concurrent connections at peak: 500M (50% of DAU online together)
Messages per DAU per day: 100
Total messages per day: 1B * 100 = 100B
Messages per second (avg): 100B / 86400 ~= 1.16M /sec
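Messages per second (peak 3x): ~3.5M /sec
The headline number is 500M concurrent WebSocket connections at peak. With ~10M connections per chat server (achievable on a tuned Linux box with epoll and event-driven I/O), the floor is 500M / 10M = 50 chat servers behind L4 load balancers. A production fleet is far larger than that floor, because servers run well below the connection ceiling to leave headroom for failover, regional placement, and message throughput.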
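The traffic arithmetic is easy to re-check in a few lines. The sketch below only restates the inputs from the table; the constant and function names are ours, not part of any real system. Note that dividing 500M connections by 10M connections per server yields 50 servers as a lower bound.

```python
# Inputs copied from the estimation table; everything else is derived.
DAU = 1_000_000_000
MESSAGES_PER_DAU_PER_DAY = 100
PEAK_CONCURRENT_CONNECTIONS = 500_000_000
CONNECTIONS_PER_SERVER = 10_000_000
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400


def traffic_estimate() -> dict:
    """Derive message rates and the minimum chat server count."""
    messages_per_day = DAU * MESSAGES_PER_DAU_PER_DAY
    avg_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "messages_per_day": messages_per_day,
        "avg_msgs_per_sec": avg_per_sec,
        "peak_msgs_per_sec": avg_per_sec * PEAK_FACTOR,
        # Ceiling division: a partially filled server still counts as one.
        "min_chat_servers": -(-PEAK_CONCURRENT_CONNECTIONS // CONNECTIONS_PER_SERVER),
    }
```

Changing `CONNECTIONS_PER_SERVER` to a more conservative value shows how quickly the fleet size grows once each box is run below its ceiling.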
Storage
---------- Message storage ----------
Average message size: 100 bytes (text + metadata)
Messages per day: 100B
Raw storage per day: 100B * 100 = 10 TB/day
After compression (3x): ~3.3 TB/day
Per year: 3.3 TB * 365 ~= 1.2 PB/year
With 3x replication: ~3.6 PB/year
Retention: keep forever (free), drop after 30 days for inactive
Bandwidth
---------- Bandwidth ----------
Inbound (sender -> server): 1.16M msg/s * 200 B (with framing) = 232 MB/s
Outbound (server -> recipients): 1.16M msg/s * ~1.5 (groups average) = 1.7M deliveries/s
Delivery bandwidth: 1.7M * 200 B = 340 MB/s
Plus presence/typing: ~10x message rate = ~3.5 GB/s outbound
Presence and typing indicators dominate bandwidth, which is why they need careful throttling.
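The storage and bandwidth figures can be reproduced the same way. This is a sketch using only the inputs listed in the tables; the names are illustrative. Without intermediate rounding, outbound delivery bandwidth comes to about 348 MB/s rather than 340 MB/s, and replicated storage to about 3.65 PB/year, which does not change any conclusion.

```python
MESSAGES_PER_DAY = 100_000_000_000
AVG_MESSAGE_BYTES = 100
COMPRESSION_RATIO = 3
REPLICATION_FACTOR = 3
AVG_MSGS_PER_SEC = 1_160_000
FRAME_BYTES = 200             # message plus framing
DELIVERIES_PER_MESSAGE = 1.5  # group average from the table

MB, TB, PB = 10**6, 10**12, 10**15


def storage_estimate() -> dict:
    """Daily and yearly message storage, before and after replication."""
    raw_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES
    compressed_per_year = raw_per_day / COMPRESSION_RATIO * 365
    return {
        "raw_tb_per_day": raw_per_day / TB,
        "pb_per_year": compressed_per_year / PB,
        "replicated_pb_per_year": compressed_per_year * REPLICATION_FACTOR / PB,
    }


def bandwidth_estimate() -> dict:
    """Average inbound and outbound message bandwidth (presence excluded)."""
    deliveries_per_sec = AVG_MSGS_PER_SEC * DELIVERIES_PER_MESSAGE
    return {
        "inbound_mb_per_sec": AVG_MSGS_PER_SEC * FRAME_BYTES / MB,
        "deliveries_per_sec": deliveries_per_sec,
        "outbound_mb_per_sec": deliveries_per_sec * FRAME_BYTES / MB,
    }
```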
High-Level Design
---------- High-level architecture ----------
+-----------+ +-----------+
| Client A | | Client B |
+-----------+ +-----------+
| |
| WSS (persistent) |
v v
+-------------------------------------+
| L4 Load Balancer (sticky) |
+-------------------------------------+
| |
v v
+-----------+ +-----------+
|Chat Server| |Chat Server|
| (1) | <--------> | (2) |
+-----------+ Pub/Sub +-----------+
| (Redis/Kafka) |
v v
+-------------------------------------+
| Session Registry (Redis) |
| user_id -> chat_server_id |
+-------------------------------------+
| |
v v
+-------------------------------------+
| Message Store (Cassandra) |
| partition by conversation_id |
+-------------------------------------+
|
v
+-------------------------------------+
| Inbox / Offline Queue (Kafka) |
+-------------------------------------+
API Design
Messages flow over a single bidirectional WebSocket connection. We define a small protocol on top of WSS frames; JSON shown for clarity (production WhatsApp uses a binary protobuf variant).
// Client -> Server: send a message
{
"type": "send",
"client_msg_id": "01HW3M9...", // ULID, idempotency key
"conversation_id": "conv_abc123",
"recipient_ids": ["u_bob"], // multiple for groups
"body": "hello",
"sent_at": 1714128000000
}
// Server -> Client: ack with server-assigned message_id
{
"type": "ack",
"client_msg_id": "01HW3M9...",
"server_msg_id": "msg_xyz789",
"server_ts": 1714128000123
}
// Server -> recipient client: deliver
{
"type": "deliver",
"server_msg_id": "msg_xyz789",
"conversation_id": "conv_abc123",
"sender_id": "u_alice",
"body": "hello",
"server_ts": 1714128000123
}
// Recipient -> Server: receipt
{
"type": "receipt",
"server_msg_id": "msg_xyz789",
"state": "delivered" // or "read"
}
// Server -> sender: receipt update
{
"type": "receipt_update",
"server_msg_id": "msg_xyz789",
"state": "delivered",
"by_user": "u_bob",
"at": 1714128000456
}
Client Reconnection (HTTP fallback)
For the initial connection and history sync we still need a few REST endpoints:
// Authenticate and get a chat server URL
POST /api/v1/chat/connect
{ "device_id": "...", "auth_token": "..." }
// Response: tells the client which chat server to open a WS to
{
"ws_url": "wss://chat-37.example.com/ws",
"session_token": "..."
}
// Pull messages received while offline (or for a fresh device)
GET /api/v1/conversations/<id>/messages?after=<msg_id>&limit=100
Message Send Flow (online recipient)
---------- Online delivery flow ----------
1. Client A sends `send` frame over WS to Chat Server 1
2. Chat Server 1 generates server_msg_id (Snowflake)
3. Chat Server 1 writes to Cassandra (durable) ~ 5 ms
4. Chat Server 1 sends `ack` back to Client A
5. Chat Server 1 looks up Bob in Session Registry
-> finds Bob is on Chat Server 2
6. Chat Server 1 publishes to Redis Pub/Sub channel `chat-server-2`
7. Chat Server 2 receives the message via subscription
8. Chat Server 2 sends `deliver` frame to Client B over WS
9. Client B sends `receipt` (delivered) back
10. Chat Server 2 forwards receipt to Chat Server 1
11. Chat Server 1 sends `receipt_update` to Client A
End-to-end latency: ~50-200 ms in the same region, dominated by network RTT.
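Steps 2 through 6 of the online flow can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not an implementation: plain dicts replace Cassandra, the Session Registry, and Pub/Sub; a simple counter replaces Snowflake; and `ChatCluster` and `handle_send` are hypothetical names. The point it shows is the ordering: persist first, ack second, route last.

```python
from collections import defaultdict
from itertools import count


class ChatCluster:
    """In-memory stand-ins for the durable store, registry, and message bus."""

    def __init__(self):
        self.message_store = {}                 # server_msg_id -> message (stand-in for Cassandra)
        self.session_registry = {}              # user_id -> chat_server_id (stand-in for Redis)
        self.channels = defaultdict(list)       # "srv:<id>" -> published frames (stand-in for Pub/Sub)
        self.offline_queue = defaultdict(list)  # user_id -> frames awaiting reconnect
        self._ids = count(1)                    # stand-in for a Snowflake generator

    def handle_send(self, sender_id, recipient_id, conversation_id, body):
        server_msg_id = next(self._ids)                      # step 2: assign the server ID
        message = {
            "server_msg_id": server_msg_id,
            "conversation_id": conversation_id,
            "sender_id": sender_id,
            "body": body,
        }
        self.message_store[server_msg_id] = message          # step 3: persist before anything else
        ack = {"type": "ack", "server_msg_id": server_msg_id}  # step 4: ack only after the write
        deliver = {"type": "deliver", **message}
        target = self.session_registry.get(recipient_id)     # step 5: where is the recipient?
        if target is None:
            self.offline_queue[recipient_id].append(deliver)  # no session: queue for reconnect
        else:
            self.channels[f"srv:{target}"].append(deliver)    # step 6: publish to that server
        return ack
```

Because the write happens before the publish, a lost publish costs latency but never the message itself.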
Message Send Flow (offline recipient)
---------- Offline delivery flow ----------
1-4. Same as online flow.
5. Session Registry lookup returns NO active server for Bob.
6. Chat Server 1 writes the message to Bob's offline queue (Kafka topic partitioned by user_id).
7. When Bob reconnects:
a. Bob's new chat server pulls from Bob's Kafka partition.
b. Sends each pending message as a `deliver` frame.
c. Bob acks; receipts flow back as in the online case.
Detailed Design
The two interesting components are the WebSocket gateway / session routing and the message delivery state machine.
WebSocket Gateway and Cross-Server Routing
Why persistent connections (not HTTP polling)?
A push model requires the server to initiate communication when a new message arrives. HTTP polling at 1 Hz means 500M users * 1 request/sec = 500M requests/sec just to check for nothing. WebSockets keep one connection open per user; the server pushes only when something happens.
Per-server connection capacity
A tuned Linux box (epoll, SO_REUSEPORT, increased file descriptor limits, large socket buffers) holds ~10M idle TCP connections in ~64 GB of memory. The math:
---------- Per-server connection budget ----------
Kernel TCP socket overhead: ~2 KB
Userspace per connection (buffers, WS framing state, last activity): ~4 KB
Total per connection: ~6 KB
10M connections: ~60 GB
Leaves ~4 GB for the actual chat process
The throughput budget is separate. Each server pushes ~50K messages/sec at peak; CPU is dominated by TLS termination (often offloaded to the load balancer).
Sticky load balancing
The load balancer must keep a connection on the same chat server (no rebalancing mid-connection). L4 balancers do this naturally because TCP connections are pinned. New connections from the same client should also try to land on the same server (consistent hashing on (user_id, device_id)) so the session registry doesn't need updating.
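Consistent hashing on (user_id, device_id) could look like the sketch below. It assumes the server choice is made somewhere that can see those identifiers, for example the endpoint that hands out the WebSocket URL; `HashRing`, its virtual-node count, and the server names are illustrative.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes to smooth the distribution."""

    def __init__(self, servers, vnodes=64):
        # Each server owns `vnodes` points on the ring.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, user_id: str, device_id: str) -> str:
        point = self._hash(f"{user_id}:{device_id}")
        # First ring point clockwise from the key, wrapping at the end.
        index = bisect.bisect_right(self._points, point) % len(self._ring)
        return self._ring[index][1]
```

The useful property is that removing one server only remaps the clients that were on it; everyone else reconnects to the same server as before.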
Session Registry (the routing table)
Map each connected user to their chat server.
Key: session:<user_id>:<device_id>
Value: chat_server_id, connected_at, last_seen
TTL: 120 seconds (refreshed by heartbeat)
Backed by Redis Cluster (sharded by user_id). On WebSocket open, the chat server writes the entry; on heartbeat (every 30s), the TTL is refreshed; on disconnect, the entry is deleted.
Size: 500M concurrent connections * 100 bytes = 50 GB in Redis, fits in ~10 nodes.
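The registry's lifecycle (write on open, refresh on heartbeat, expire by TTL) can be sketched in memory. This is a stand-in for the Redis-backed version, with a dict and an injectable clock in place of Redis keys and server-side TTLs; `SessionRegistry` and its methods are hypothetical names.

```python
import time


class SessionRegistry:
    """In-memory stand-in for the registry; entries expire after TTL_SECONDS."""

    TTL_SECONDS = 120

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (user_id, device_id) -> (chat_server_id, expires_at)

    def register(self, user_id, device_id, chat_server_id):
        """Called when the WebSocket opens."""
        expires_at = self._clock() + self.TTL_SECONDS
        self._entries[(user_id, device_id)] = (chat_server_id, expires_at)

    def heartbeat(self, user_id, device_id) -> bool:
        """Refresh the TTL; returns False if the entry already expired."""
        if self.lookup(user_id, device_id) is None:
            return False
        server, _ = self._entries[(user_id, device_id)]
        self._entries[(user_id, device_id)] = (server, self._clock() + self.TTL_SECONDS)
        return True

    def lookup(self, user_id, device_id):
        """Return the chat server ID, or None if absent or expired."""
        entry = self._entries.get((user_id, device_id))
        if entry is None:
            return None
        server, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[(user_id, device_id)]  # lazy expiry, mimicking a TTL
            return None
        return server

    def unregister(self, user_id, device_id):
        """Called on a clean disconnect."""
        self._entries.pop((user_id, device_id), None)
```

A `None` from `lookup` is the signal that sends the message down the offline path.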
Cross-server delivery: Pub/Sub between chat servers
When Server 1 needs to push a message to Bob on Server 2, two patterns work:
| Pattern | How it works | Trade-off |
|---|---|---|
| Direct connection | Server 1 opens a TCP connection to Server 2 and sends directly | N^2 connections between servers; a full mesh grows quadratically, so across a fleet of ~50K servers it would be ~2.5B directed pairs, which is not manageable |
| Redis Pub/Sub | Each chat server subscribes to its own channel srv:<id>; Server 1 publishes to srv:2 | Single hop, ~5 ms; Redis is the bottleneck but pub/sub is cheap |
| Kafka | Each chat server consumes a topic partitioned by server_id | Higher latency (50-100 ms), higher throughput, durable |
For a chat system, Redis Pub/Sub is the standard answer. It's fast (sub-10 ms), simple, and Redis Cluster scales horizontally. The downside is no durability: if Server 2 misses a publish (because it crashed mid-delivery), the message must come from the durable store. That's fine because we always write to Cassandra first before publishing; the publish is the fast-path.
Heartbeats and connection liveness
WebSockets do not detect a dead connection on their own (a network drop looks identical to an idle connection). Each side sends a ping frame every 30 seconds; if no pong within 60 seconds, the connection is closed and resources reclaimed.
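The ping/pong rule can be expressed as a small per-connection timer check. This sketch interprets "no pong within 60 seconds" as 60 seconds since the last pong was received; `ConnectionLiveness` and the `tick` return values are illustrative, and the gateway is assumed to call `tick` periodically.

```python
PING_INTERVAL_SECONDS = 30
PONG_TIMEOUT_SECONDS = 60


class ConnectionLiveness:
    """Tracks ping/pong timing for one connection."""

    def __init__(self, now: float):
        self.last_ping_sent = now
        self.last_pong_received = now

    def on_pong(self, now: float) -> None:
        self.last_pong_received = now

    def tick(self, now: float) -> str:
        """Return what the gateway should do: 'close', 'ping', or 'idle'."""
        if now - self.last_pong_received >= PONG_TIMEOUT_SECONDS:
            return "close"  # peer is silent: reclaim the socket
        if now - self.last_ping_sent >= PING_INTERVAL_SECONDS:
            self.last_ping_sent = now
            return "ping"
        return "idle"
```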
Message Delivery State Machine
Each message moves through a small state machine. The sender's UI shows different ticks based on the state.
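Since receipts are delivered at least once and may arrive late or out of order, it helps to treat the states as a forward-only progression. A minimal sketch, assuming a `read` receipt implies `delivered`; `MessageState`, `apply_receipt`, and `TICKS` are illustrative names.

```python
from enum import IntEnum


class MessageState(IntEnum):
    SENT = 1        # server persisted the message
    DELIVERED = 2   # recipient device received it
    READ = 3        # recipient opened the conversation


# What the sender's UI renders for each state.
TICKS = {
    MessageState.SENT: "one tick",
    MessageState.DELIVERED: "two ticks",
    MessageState.READ: "blue ticks",
}


def apply_receipt(current: MessageState, incoming: MessageState) -> MessageState:
    """States only move forward; stale or duplicate receipts are no-ops."""
    return max(current, incoming)
```

With this rule, a `delivered` receipt that arrives after `read` cannot downgrade the ticks.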
---------- Message states ----------
Client A Server Client B
| | |
send -->| [client_msg] | |
| client picks ULID id |
| | |
| send frame | |
| ------------> | |
| | persist to |
| | Cassandra |
| ack | |
| <------------ | |
| (state=sent, | |
| one tick) | |
| | deliver |
| | -------------->|
| | receipt |
| | (delivered) |
| | <--------------|
| receipt_upd | |
| <------------ | |
| (state= | |
| delivered, | |
| two ticks) | |
| | open conv: |
| | receipt |
| | (read) |
| | <--------------|
| receipt_upd | |
| <------------ | |
| (state=read, | |
| blue ticks) | |
At-least-once delivery and idempotency
The network can drop any frame. To guarantee no message loss, both sides retry until acknowledged. Duplicates are inevitable; we deduplicate by:
- Sender retries the same `client_msg_id`. Server checks `(sender_id, client_msg_id)`; if the message already exists, return the same `server_msg_id` without re-persisting.
- Receiver dedups by `server_msg_id` in a small per-conversation set.
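Both deduplication points fit in a short sketch. Here a dict stands in for the idempotency lookup and a list for the durable store; `MessageLog` and `ConversationInbox` are hypothetical names, and a real receiver would bound the size of its seen-set.

```python
from itertools import count


class MessageLog:
    """Server side: a retried send returns the original server_msg_id."""

    def __init__(self):
        self._next_id = count(1)
        self._by_client_key = {}  # (sender_id, client_msg_id) -> server_msg_id
        self.persisted = []       # stand-in for the durable message store

    def send(self, sender_id: str, client_msg_id: str, body: str) -> int:
        key = (sender_id, client_msg_id)
        existing = self._by_client_key.get(key)
        if existing is not None:
            return existing  # retry: same ID, no second write
        server_msg_id = next(self._next_id)
        self.persisted.append((server_msg_id, sender_id, body))
        self._by_client_key[key] = server_msg_id
        return server_msg_id


class ConversationInbox:
    """Client side: drop deliveries whose server_msg_id was already seen."""

    def __init__(self):
        self._seen = set()
        self.rendered = []

    def deliver(self, server_msg_id: int, body: str) -> bool:
        if server_msg_id in self._seen:
            return False  # duplicate delivery: ignore, but still ack upstream
        self._seen.add(server_msg_id)
        self.rendered.append(body)
        return True
```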
Why server-assigned IDs (Snowflake)?
Server-side IDs give us total ordering within a conversation (sortable timestamps + machine bits + sequence). Client clocks drift and are untrusted; we cannot use them for ordering. The client also sends a client_msg_id (ULID) for idempotency, but the canonical ID is the server's.
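A Snowflake-style generator packs a timestamp, a machine ID, and a per-millisecond sequence into one sortable integer. The sketch below assumes the commonly cited 41/10/12 bit split and an arbitrary custom epoch; neither is specified in this design. When the sequence is exhausted within a millisecond, this version advances its logical timestamp rather than sleeping.

```python
import threading
import time


class SnowflakeGenerator:
    EPOCH_MS = 1_700_000_000_000  # assumed custom epoch
    MACHINE_BITS = 10             # assumed layout: 41 bits time, 10 machine, 12 sequence
    SEQUENCE_BITS = 12

    def __init__(self, machine_id: int, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self._machine_id = machine_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            # Never move backwards, even if the wall clock does.
            now = max(self._clock_ms(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    now = self._last_ms + 1  # sequence wrapped: borrow the next millisecond
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self._machine_id << self.SEQUENCE_BITS)
                | self._sequence
            )
```

IDs from one generator are strictly increasing; IDs from different machines are only ordered to millisecond precision.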
Group Messaging Fan-Out
For a group of N members, the chat server fans the message out N-1 times: looks up each member in the Session Registry and publishes via Pub/Sub. With N capped at 256, fan-out is bounded; no celebrity problem.
For very large broadcast lists (channel-style products with 100K+ subscribers), we'd switch to a fan-out worker pool reading from a Kafka topic per channel. This is the boundary where chat ends and notification service begins.
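The bounded fan-out for a normal group can be sketched as a planning step that splits members into online and offline. Grouping recipients by target channel, so that each chat server receives one publish per message, is an optional optimization we assume here rather than something the design requires; `plan_fan_out` is a hypothetical helper.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 256


def plan_fan_out(sender_id, member_ids, session_registry):
    """Return ({channel: [recipient, ...]}, [offline recipients]).

    `session_registry` maps user_id -> chat_server_id for connected users.
    """
    if len(member_ids) > MAX_GROUP_SIZE:
        raise ValueError("group exceeds the fan-out cap; use a broadcast pipeline")
    publishes = defaultdict(list)
    offline = []
    for member_id in member_ids:
        if member_id == sender_id:
            continue  # the sender already has the message
        server_id = session_registry.get(member_id)
        if server_id is None:
            offline.append(member_id)  # goes to the offline queue
        else:
            publishes[f"srv:{server_id}"].append(member_id)
    return dict(publishes), offline
```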
Presence and Typing Indicators
Presence (online/offline) is fundamentally a different write pattern than messages: it changes constantly, has tiny payloads, and is OK to lose.
---------- Presence design ----------
State: online | offline | last_seen=<timestamp>
Stored in: Redis only (no durable store)
TTL: 90 seconds; refreshed by WS heartbeat
Fan-out: only to subscribers (people who have an open chat with you)
Throttle: max 1 update per user per 10 seconds
Typing indicators are even more ephemeral: they live for 5 seconds, expire automatically, and never touch durable storage. We send typing_start once when the user starts typing and let it expire (no typing_stop unless the user closes the input).
Without throttling, presence and typing dominate the message bus. WhatsApp historically batched presence updates at the gateway level (one batch per connection per 5 seconds).
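The throttle and the typing expiry are both simple time comparisons. This sketch drops presence updates that arrive inside the 10-second window, relying on TTL expiry to correct any stale state, which matches the lossy-by-design stance above; `PresenceThrottle` and `active_typers` are illustrative names.

```python
TYPING_TTL_SECONDS = 5


class PresenceThrottle:
    """Allow at most one published presence update per user per window."""

    MIN_INTERVAL_SECONDS = 10

    def __init__(self):
        self._last_published_at = {}  # user_id -> timestamp of last publish

    def should_publish(self, user_id: str, now: float) -> bool:
        last = self._last_published_at.get(user_id)
        if last is not None and now - last < self.MIN_INTERVAL_SECONDS:
            return False  # inside the window: drop the update
        self._last_published_at[user_id] = now
        return True


def active_typers(typing_started_at: dict, now: float) -> set:
    """Users whose typing_start is younger than the TTL; older ones just expire."""
    return {
        user_id
        for user_id, started_at in typing_started_at.items()
        if now - started_at < TYPING_TTL_SECONDS
    }
```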
Data Model
Cassandra: messages (append-only, partition by conversation)
-- Cassandra schema
CREATE TABLE messages (
conversation_id text,
server_msg_id bigint, -- Snowflake (sortable by time)
sender_id text,
body text,
sent_at timestamp,
PRIMARY KEY ((conversation_id), server_msg_id)
) WITH CLUSTERING ORDER BY (server_msg_id DESC);
-- Per-conversation reads are a single partition scan, very fast.
-- Latest 50 messages: SELECT * FROM messages WHERE conversation_id = ? LIMIT 50;
Cassandra is ideal here: time-series writes, partition by conversation, range scan by server_msg_id for history. A single popular group conversation can produce thousands of messages per second to one partition; Cassandra handles that comfortably.
Postgres (sharded): users, conversations, membership
CREATE TABLE users (
id BIGINT PRIMARY KEY,
phone VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(64),
created_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE conversations (
id VARCHAR(32) PRIMARY KEY,
type VARCHAR(8) NOT NULL, -- 'one_to_one' | 'group'
name VARCHAR(64), -- group name
created_at TIMESTAMPTZ NOT NULL,
last_msg_id BIGINT -- denormalized for sort
);
CREATE TABLE conversation_members (
conversation_id VARCHAR(32) NOT NULL,
user_id BIGINT NOT NULL,
joined_at TIMESTAMPTZ NOT NULL,
last_read_msg BIGINT, -- for unread-count display
PRIMARY KEY (conversation_id, user_id)
);
CREATE INDEX idx_member_user ON conversation_members (user_id, conversation_id);
Shard by user_id. Most queries (my conversations, mark as read) are user-scoped; the conversation table is small (1B conversations * 200 bytes = 200 GB) and easily fits sharded.
Redis: session registry, presence, unread counts
----------- Redis keys ----------
session:<user_id>:<device_id> -> {server_id, last_heartbeat} TTL 120s
presence:<user_id> -> {state, last_seen} TTL 90s
unread:<user_id>:<conv_id> -> integer counter no TTL
typing:<conv_id> -> SET of user_ids TTL 5s
idempotent:<sender>:<client_id> -> server_msg_id TTL 60s
Kafka: offline message queue
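One topic, offline-messages, partitioned by recipient_user_id. Messages stay available until the recipient has consumed them (a read position is stored per user), with a 30-day retention so users returning from a long absence still get their inbox. Kafka tracks offsets per consumer group and partition, and each partition carries many users, so the per-user position is an application-level cursor rather than a native Kafka offset.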
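The drain-on-reconnect behavior can be sketched with a per-user list and an acknowledgement cursor. This is an in-memory stand-in, not Kafka client code; it assumes `server_msg_id` values increase within each user's queue, and `OfflineQueue` is a hypothetical name. Frames stay queued until acked, so a connection that drops mid-drain simply sees them again.

```python
from collections import defaultdict


class OfflineQueue:
    """Per-user pending deliveries, removed only after the client acks them."""

    def __init__(self):
        self._pending = defaultdict(list)  # user_id -> frames in arrival order

    def enqueue(self, user_id: str, frame: dict) -> None:
        self._pending[user_id].append(frame)

    def pending(self, user_id: str) -> list:
        """Frames to send on reconnect; calling this does not remove anything."""
        return list(self._pending.get(user_id, []))

    def ack(self, user_id: str, server_msg_id: int) -> None:
        """Drop everything up to and including the acknowledged message."""
        frames = self._pending.get(user_id, [])
        self._pending[user_id] = [
            frame for frame in frames if frame["server_msg_id"] > server_msg_id
        ]
```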
Scaling and Bottlenecks
Connection storm at peak
At 9 PM local time, all of India simultaneously opens WhatsApp. Connection rate spikes from 100K/sec to 1M/sec. Mitigations:
- Connection rate limiting at the load balancer (drop excess and return retry-after).
- Pre-provisioned chat server capacity in the affected region (autoscaling is too slow for connection storms; minutes vs the seconds we need).
- Backoff with jitter in the client reconnection logic; never reconnect synchronously after a network event.
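Client-side backoff with jitter is the cheapest of these mitigations to get right. A minimal sketch using the "full jitter" variant, where the delay is drawn uniformly between zero and an exponentially growing ceiling; the base, cap, and function name are assumptions.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Full jitter: uniform in [0, min(cap, base * 2**attempt)], so clients that
    lost their connections at the same instant spread out their retries.
    """
    exponent = min(attempt, 16)  # keep 2**exponent small; the cap dominates anyway
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.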
Hot conversation: a 256-person group with 100 active typers
The Cassandra partition for that conversation gets ~1K writes/sec. Cassandra handles per-partition writes well up to ~10K/sec. Beyond that we'd need to compose the partition key with a time bucket ((conv_id, hour)).
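Composing the partition key with an hour bucket is a one-line computation, and readers need the matching helper to know which buckets to scan. This sketch buckets by hours since the Unix epoch using the server timestamp; the function names are illustrative.

```python
MS_PER_HOUR = 3_600_000


def partition_key(conversation_id: str, server_ts_ms: int) -> tuple:
    """(conv_id, hour) key: one partition per conversation per hour."""
    return (conversation_id, server_ts_ms // MS_PER_HOUR)


def buckets_between(start_ms: int, end_ms: int) -> list:
    """Hour buckets a history read must scan to cover [start_ms, end_ms]."""
    return list(range(start_ms // MS_PER_HOUR, end_ms // MS_PER_HOUR + 1))
```

The cost is on the read side: fetching history now touches one partition per hour in the requested range instead of a single partition.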
Presence storm
500M users coming online together would generate 500M Redis writes for presence. Mitigations:
- Lazy presence: don't store `online` until someone actually requests it (only the contacts of the online user).
- Coalesced updates: gateway batches presence updates per 5-second window before publishing.
- TTL-based expiry rather than explicit offline writes: `online` is a heartbeat refresh; `offline` is just the absence of a key.
Multi-region replication
A chat between two users in different regions is the hard case. Two designs:
- Home-region routing: each user has a home region; the conversation lives in one of the participants' home regions. Cross-region delivery hops through a region-aware Pub/Sub layer (50-150 ms added latency).
- Multi-master with conflict-free types: every message has a globally unique server_msg_id and is replicated to both regions; reads are local. This is what real WhatsApp does. Trade-off: replication lag means a sender in EU might see their message appear before the recipient in US gets it.
What breaks at 100x?
At 200B users (hypothetical), the Session Registry becomes the bottleneck. Solution: shard session lookups by user_id and route Pub/Sub through a partitioned message bus (Kafka rather than Redis Pub/Sub) accepting the higher latency.
Trade-offs and Alternatives
Why WebSockets over Server-Sent Events (SSE)?
SSE is one-way (server -> client). Chat needs bidirectional flow (the client sends messages too). With SSE, the send path needs a separate HTTP POST per message, doubling the round trips for the most frequent operation. WebSockets give us send and receive on the same connection.
Why Redis Pub/Sub over Kafka for cross-server delivery?
Kafka is durable but adds 50-100 ms latency. Redis Pub/Sub is sub-10 ms but loses messages if a subscriber is briefly disconnected. We accept that loss because we always write to Cassandra first; if Pub/Sub drops a delivery, the recipient pulls the message on reconnect, either from the Kafka offline queue or by polling the conversation's latest messages.
Why Cassandra over Postgres for messages?
Cassandra's per-partition write pattern matches per-conversation writes. Postgres at 1M writes/sec across 100B rows would need aggressive sharding and complex partition management. Cassandra was built for this exact workload (originally at Facebook for the Inbox Search). Trade-off: no joins (we denormalize), eventual consistency by default.
Why server-side IDs over client-side UUIDs?
Client clocks drift. UUIDs aren't ordered. Sorting messages by client_ts produces wrong results when the client's clock is wrong. Server-side Snowflake IDs give us total order plus a useful timestamp embedded in the ID.
Why per-user offline queue over per-conversation?
When Bob comes online he wants all of his pending messages at once, not one stream per conversation. A single Kafka topic partitioned by user_id gives him one stream to drain, regardless of how many conversations he is in. Per-conversation queues would mean N subscriptions for a user in N groups.
When to use group fan-out vs broadcast channel
WhatsApp groups cap at 256 members. Above that we should think of it as a broadcast (channel-style: WhatsApp Channels, Telegram channels). The architecture changes: a publisher writes once to a topic, a fan-out worker pool delivers to subscribers in batches. Trying to fan out a 100K-member message in real time over Pub/Sub overwhelms the message bus.
Real-World Examples
How real systems implement this in production
WhatsApp runs on Erlang/OTP for the chat servers, leveraging the actor model to hold millions of concurrent connections per server (the team publicly reported on the order of 2M per box). Messages persist in a sharded Mnesia/MySQL hybrid; cross-server delivery uses an internal pub/sub. At the time of the 2014 acquisition, a team of only ~50 employees was serving 450M users, largely thanks to Erlang's process model.
Trade-off: Erlang's per-actor isolation is the single biggest reason WhatsApp scales connections per server better than competitors. The trade-off is a smaller talent pool (most engineers don't know Erlang) and harder integration with non-Erlang services.
Telegram uses MTProto, a custom binary protocol over TCP with optional encryption. Servers are written in C++ and use a sharded data center model where each user has a 'home' DC. Cross-DC messages route through a global directory. Storage is custom (not Cassandra), optimized for fast inbox sync.
Trade-off: MTProto's custom binary framing reduces bandwidth vs WhatsApp's protobuf-over-WSS, but the custom protocol means Telegram clients can't reuse standard WebSocket libraries. Telegram trades portability for efficiency.
Signal stores essentially nothing on the server: messages are end-to-end encrypted and the server only routes opaque blobs from sender to recipient. Group state is also encrypted client-side. The server's only durable store is the encrypted message queue waiting for offline recipients.
Trade-off: Signal's privacy model means features like cross-device sync, message search, and history backup require client-side workarounds. Server simplicity comes at the cost of feature complexity.
Slack's chat is similar but optimized for workspaces (one team = one channel set). They use WebSockets via their RTM (Real-Time Messaging) gateway, with messages persisted to MySQL sharded by team_id. A single workspace with 100K users is one big shard, which has driven their ongoing migration to per-channel sharding for very large enterprises.
Trade-off: Sharding by team_id was simple early on but became a hotspot for huge workspaces. The lesson: pick a sharding key that scales with your largest entity, not your average one.
Common Interview Questions
Questions you might be asked about this topic
Client A sends a `send` frame over the WebSocket with a client_msg_id. Chat Server 1 generates a Snowflake server_msg_id, persists to Cassandra (partition by conversation), acks the sender. Server 1 looks up Bob in the Session Registry (Redis); finds Bob on Server 2. Server 1 publishes to Redis Pub/Sub channel `srv:2`. Server 2 receives the publish, sends `deliver` over Bob's WebSocket. Bob's client renders the bubble and sends back a `receipt` of `delivered`. Server 2 forwards the receipt to Server 1 via Pub/Sub; Server 1 sends `receipt_update` to Alice. End-to-end: ~50-200 ms.
When Server 1 crashes, all its WebSocket connections drop. Clients detect the drop via heartbeat timeout (~60s) and reconnect; the load balancer routes them to a different server. The new server registers their session. Any message in flight on Server 1 was already persisted to Cassandra (we write before acking); the sender's retry on the new connection finds the existing message via the idempotency key and skips re-persisting. The Session Registry entry on Server 1 expires by TTL (~120s) so cross-server lookups eventually stop targeting the dead server.
Read receipts are one write per (user, conversation) pair when a user opens a conversation; we don't need per-message receipts on the server. We update `last_read_msg_id` in conversation_members and emit a single `receipt_update` to the sender. The sender's client computes 'all messages with id <= last_read_msg are read'. This collapses a potentially huge fan-out (100 messages, 5 group members = 500 receipts) into one row update and one fan-out per reader.
Typing indicators are ephemeral and lossy by design. Use Redis SET with a 5-second TTL keyed by conversation_id. Throttle client emissions to one `typing_start` per 3 seconds. Fan out only to actively-open conversations (the recipient must be looking at this conversation). At a billion typers, only ~10M typing events per second matter (most users aren't typing); presence-style aggregation in the gateway batches these into per-second pushes. Never store typing state in a durable database.
Server-assigned Snowflake IDs are monotonically increasing within a single chat server. To get a global order across servers within a conversation, we route all writes for a conversation to a 'home' chat server (consistent hashing on conversation_id). This serializes writes at one server, giving total order. The Cassandra clustering key on `server_msg_id DESC` then returns history in reverse chronological order. Without this, two messages sent simultaneously to different chat servers would have IDs that don't reflect the actual send order.
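The read-receipt collapse described above (one `last_read_msg` watermark per user and conversation instead of per-message receipts) can be sketched as follows. A dict stands in for the conversation_members table, and `ReadWatermarks` is a hypothetical name.

```python
class ReadWatermarks:
    """One watermark per (conversation, user): everything at or below it is read."""

    def __init__(self):
        self._last_read = {}  # (conversation_id, user_id) -> last_read_msg

    def mark_read(self, conversation_id: str, user_id: str, up_to_msg_id: int) -> bool:
        """Advance the watermark; returns True if a receipt_update should be emitted."""
        key = (conversation_id, user_id)
        if up_to_msg_id <= self._last_read.get(key, 0):
            return False  # stale or duplicate receipt: nothing changes
        self._last_read[key] = up_to_msg_id
        return True

    def is_read_by(self, conversation_id: str, user_id: str, server_msg_id: int) -> bool:
        """The sender's client derives per-message read state from the watermark."""
        return server_msg_id <= self._last_read.get((conversation_id, user_id), 0)
```

Opening a conversation with 100 unread messages therefore costs one `mark_read` call and one fan-out, not 100.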
Interview Tips
How to discuss this topic effectively
Lead with 'this is a persistent-connection problem, not a request/response problem'. That immediately frames the design around WebSockets, session affinity, and offline queues, which is the right shape.
When asked about cross-server delivery, name the Session Registry explicitly. Most candidates handwave 'the server knows where Bob is'. Drawing the Redis lookup and the Pub/Sub publish step is what separates a senior answer.
Always commit to at-least-once delivery and explain idempotency. Saying 'exactly-once' is a red flag; the interviewer will probe and you will fail. Real systems are at-least-once with client-side dedup.
Treat presence and typing as separate problems with separate write rates. Conflating them with messages leads to designs that look right but fail under presence storms.
Cite WhatsApp's publicly reported figure of millions of connections per server (roughly 2M in its early Erlang write-ups). It signals you've read the engineering blogs and aren't pulling numbers from thin air.
Common Mistakes
Pitfalls to avoid in interviews
Picking HTTP long polling instead of WebSockets
Long polling means a new HTTP request for every poll, plus a new TCP connection and TLS handshake whenever the underlying connection is not kept alive. At 500M concurrent users that can add up to hundreds of millions of redundant requests and handshakes per minute. WebSockets reuse one persistent connection. Long polling is acceptable only as a fallback for clients behind WebSocket-blocking middleboxes.
Storing messages in Postgres without sharding strategy
100B messages/day means a single table grows by 100B rows/day. No Postgres instance handles that. Either shard by conversation_id with a strict access pattern (always conversation-scoped reads) or use Cassandra, which was built for time-series partitioned writes.
Claiming exactly-once delivery
Exactly-once over an unreliable network is impossible without distributed transactions. Real systems guarantee at-least-once delivery and rely on the receiver to deduplicate by message_id. The illusion of exactly-once is a property of the client, not the network.
Treating presence like a regular write
Presence changes far more often than messages and is acceptable to lose. Storing each toggle in Cassandra would melt the cluster. Use Redis with TTL-based expiry: an `online` user is one whose key was refreshed in the last 90 seconds; `offline` is just the absence of a key.
Forgetting offline delivery and relying on the recipient being online
Most messages are sent to recipients who aren't currently connected. The Session Registry returning empty must trigger a write to a per-user offline queue (Kafka topic partitioned by user_id), drained on reconnect. Without this, a message sent at 2 AM to a sleeping recipient is lost.
