System Design Article
Design a Notification Service
Difficulty: Medium
Design a multi-channel notification service that delivers 10B push, email, and SMS notifications per day across three independent provider networks (APNs, FCM, SendGrid, Twilio) with priority queues, per-user rate limits, and idempotent retries. The interview centerpiece is the fan-out from a single application event to multiple channels and providers, each with its own rate limits, failure modes, and delivery semantics. We cover priority queues for transactional vs marketing traffic, retry policies with exponential backoff, deduplication of duplicate triggers, user preference enforcement, and the device token lifecycle that quietly invalidates tens of millions of tokens per day.
Design a Notification Service
Design a multi-channel notification service that delivers 10B push, email, and SMS notifications per day across three independent provider networks (APNs, FCM, SendGrid, Twilio) with priority queues, per-user rate limits, and idempotent retries. The interview centerpiece is the fan-out from a single application event to multiple channels and providers, each with its own rate limits, failure modes, and delivery semantics. We cover priority queues for transactional vs marketing traffic, retry policies with exponential backoff, deduplication of duplicate triggers, user preference enforcement, and the device token lifecycle that quietly invalidates tens of millions of tokens per day.
946 views
29
Requirements
Functional Requirements
- Send notifications across multiple channels: push (iOS APNs, Android FCM), email (SendGrid, SES), SMS (Twilio, Vonage).
- Accept a single application event (
order_shipped,password_reset,friend_request) and deliver to all channels the user opted into. - Honor user preferences: per-channel opt-in, per-category opt-in (marketing vs transactional), quiet hours.
- Track delivery status: queued, sent, delivered, failed, bounced, opened (where the channel supports it).
- Support templated content:
welcome_email_v3template with variable substitution; localized per user language. - Schedule notifications: send at a future time (
send_at: 2026-04-26T20:00:00Z).
Out of Scope (state explicitly)
- In-app notifications (those are part of the app's own UI; this service handles only out-of-band delivery).
- The marketing campaign authoring tool (we accept events, not campaigns; campaigns are upstream).
- Recommendation logic deciding what to send; we deliver what we are told.
Non-Functional Requirements
- Scale: 10B notifications/day across 100M users. Push is ~70%, email ~25%, SMS ~5%.
- Latency: transactional p99 < 5 seconds end-to-end (event in -> provider accepted). Marketing can take minutes.
- Availability: 99.95% (lower than chat because the notification eventually delivering matters more than instant delivery).
- Reliability: at-least-once delivery; duplicate suppression via idempotency keys (no double-charge on a payment confirmation email).
- Provider-aware throttling: stay below APNs/FCM/SendGrid/Twilio rate limits without dropping traffic.
- Auditability: every notification's lifecycle queryable for 90 days (compliance, debugging).
Back-of-the-Envelope Estimation
Volume per Channel
---------- Volume per channel per day ----------
Total notifications/day: 10B
Push (iOS + Android): 7B (~80K/sec avg, ~250K/sec peak)
Email: 2.5B (~30K/sec avg, ~100K/sec peak)
SMS: 500M (~6K/sec avg, ~20K/sec peak)
Priority breakdown:
Transactional (P0): 1B (10%) - OTP, password reset, payment
User-triggered (P1): 4B (40%) - new message, friend request
Marketing (P2): 5B (50%) - newsletters, promotionsProvider Rate Limits
---------- Provider rate limits ----------
APNs (Apple): ~1000 notifications/sec per HTTP/2 connection;
scale by opening more connections
FCM (Firebase): ~600/sec per project; quota expandable
SendGrid: 300 emails/sec on standard tier; expandable
Twilio SMS: 1 SMS/sec per long code (US) or higher per short codeThe SMS limit is the tightest. We need many sender numbers (long codes) or short codes pre-registered with carriers, and per-number rate budgets.
Storage
---------- Notification record storage ----------
Per-record size (event, status, history): ~500 bytes
Notifications per day: 10B
Daily storage: 10B * 500 = 5 TB/day
90-day retention: 5 TB * 90 = 450 TB
3x replication: 1.4 PB totalDevice token storage:
---------- Device token storage ----------
Users: 100M
Devices per user (avg): 2.5
Tokens stored: 250M
Per token (256 char + meta): ~500 bytes
Total: ~125 GB (small; fits in a sharded SQL cluster)Bandwidth
Mostly internal. Outbound to providers is small (push payloads ~2 KB, email ~10 KB, SMS ~200 bytes). 80K push/sec * 2 KB = ~160 MB/s outbound to APNs/FCM combined.
High-Level Design
---------- High-level architecture ----------
+-------------+
| Caller | (any service: orders, social, billing, ...)
| Application |
+-------------+
|
v
+-------------------------+
| Notification API |
| POST /v1/notifications |
| - validates payload |
| - dedupes by key |
| - resolves user prefs |
+-------------------------+
|
v
+-------------------------+
| Event Bus (Kafka) |
| topic: notif.events |
+-------------------------+
|
v
+-------------------------+
| Fan-Out Service |
| - reads event |
| - looks up channels per |
| user preference |
| - emits N delivery jobs |
+-------------------------+
/ | \
v v v
+-------+ +-------+ +-------+
| Push | | Email | | SMS | one Kafka topic per channel,
| Queue | | Queue | | Queue | each with priority partitions
+-------+ +-------+ +-------+
| | |
v v v
+-------+ +-------+ +-------+
| Push | | Email | | SMS |
|Workers| |Workers| |Workers| pool sized per provider rate limit
+-------+ +-------+ +-------+
| | |
v v v
+-------+ +-------+ +-------+
| APNs | |SendGrid| |Twilio| actual provider HTTP APIs
| / FCM | | / SES | | etc. |
+-------+ +-------+ +-------+
|
v
+-------------------------+
| Status Tracker |
| - delivery callbacks |
| - status updates |
+-------------------------+
|
v
+-------------------------+
| Notification Store |
| (Cassandra) |
+-------------------------+API Design
A single send endpoint accepts a notification request from any internal service.
POST /api/v1/notifications
Idempotency-Key: order-shipped-12345-attempt-1
{
"user_id": "u_alice",
"category": "transactional",
"template_key": "order_shipped_v2",
"variables": {
"order_id": "O-12345",
"tracking_url": "https://..."
},
"channels": ["push", "email"], // optional; default = user prefs
"send_at": null, // null = send now
"ttl_seconds": 86400 // discard if not delivered in 24h
}
// Response
{
"notification_id": "n_xyz789",
"status": "queued",
"channels_targeted": ["push", "email"]
}Send Pipeline
---------- End-to-end send flow ----------
1. Caller POSTs to Notification API with Idempotency-Key.
2. API checks Redis for the idempotency key:
- If found, return the cached notification_id; STOP.
- If not, store key + notification_id with 24h TTL.
3. API validates: template exists, user exists, user opted in.
4. API enqueues to Kafka topic `notif.events` partitioned by user_id.
5. Fan-Out Service consumes the event:
- Loads user prefs and active devices.
- Filters out channels user opted out of.
- Filters out quiet hours (for non-transactional).
- For each remaining channel + device, emits a delivery job to that channel's queue.
6. Channel workers consume their queue, render the template, call the provider.
7. Provider responds with success/failure or a 202 + delivery webhook later.
8. Status Tracker updates notification record in Cassandra.
9. Caller can poll GET /v1/notifications/{id} or subscribe to a webhook.Detailed Design
The two interesting components are the fan-out + priority queue layer and the per-provider delivery worker with retries and token lifecycle.
Fan-Out Service and Priority Queues
One event becomes many jobs
A single order_shipped event for Alice, who has 3 devices and opted into push + email, generates 4 jobs (3 push tokens + 1 email). This explosion is the fan-out.
The Fan-Out Service is stateless. It reads the event, queries the user prefs cache (Redis: prefs:<user_id>), queries the device tokens cache (Redis: devices:<user_id>), and emits jobs to the relevant per-channel topic.
Priority queues per channel
Each channel has 3 logical Kafka topics (or 3 partitions per topic, by category):
---------- Channel queue layout ----------
notif.push.p0 <- transactional (OTP, password reset)
notif.push.p1 <- user-triggered (new message, friend request)
notif.push.p2 <- marketing (newsletter, promo)
notif.email.p0 / p1 / p2 (same pattern)
notif.sms.p0 / p1 / p2 (same pattern)Workers pull from the highest-priority topic first; they only consume from p2 when p0 and p1 are empty for some bounded time. This guarantees the OTP for Bob's bank login arrives before the 4 PM Black Friday newsletter.
A second design choice: separate worker pools per priority. P0 gets a dedicated pool (always-on, never starves); P1 and P2 share a larger pool. This isolates noisy P2 from blocking P0.
Why Kafka instead of RabbitMQ?
| Property | Kafka | RabbitMQ |
|---|---|---|
| Throughput | Millions/sec per cluster | ~50K/sec per node |
| Replay | Built in (offset rewind) | Hard (messages consumed = gone) |
| Priority queues | Via topic-per-priority | Native priority queues |
| Routing complexity | Simple (topic + partition) | Rich (exchanges, bindings) |
For 10B/day, throughput wins; RabbitMQ would need many clusters. Kafka also gives us replay, which is critical for recovering from a downstream provider outage (we can re-process the last N hours of events when the provider comes back).
Per-Provider Worker with Retries
Worker pool sizing
Each worker holds open HTTP/2 connections to its provider. APNs allows ~1000 push/sec per HTTP/2 connection; with 250K peak push/sec we need ~250 connections, distributed across worker processes. We size the pool to match the provider's rate limit, not our event rate.
---------- Worker pool sizing example (APNs) ----------
Peak push rate: 250K /sec
Per-connection capacity: 1000 /sec
Target connections: 250
Connections per worker: 20
Total workers: ~13 push workers (with headroom)Retry policy by failure type
Providers return many failure types. They are NOT all retried the same way.
| HTTP status | Meaning | Action |
|---|---|---|
| 200 / 202 | Accepted | Mark sent; await delivery callback if provider supports |
| 4xx (validation) | Bad payload (missing field, wrong format) | Do NOT retry; mark failed_permanent; alert |
| 410 Gone (APNs) | Token invalid (app uninstalled) | Do NOT retry; remove token from devices table |
| 429 Too Many Requests | Rate limited | Retry with exponential backoff; honor Retry-After header |
| 5xx | Provider issue | Retry with exponential backoff; max 5 attempts |
The critical lesson: don't retry 4xx. A bad payload will be bad on the next attempt too; retrying just wastes provider quota.
Exponential backoff with jitter
---------- Retry schedule ----------
Attempt 1: immediate
Attempt 2: 1s + jitter(0-500ms)
Attempt 3: 4s + jitter(0-2s)
Attempt 4: 16s + jitter(0-8s)
Attempt 5: 64s + jitter(0-32s)
Dead-letter after attempt 5Jitter prevents thundering-herd retries when a provider has a brief outage and 100K notifications retry simultaneously the moment the outage ends.
Dead-Letter Queue
Messages that exhaust retries land in a per-channel DLQ. We don't drop them silently; an operator dashboard surfaces DLQ counts, and a periodic job tries to recover (e.g., re-resolving the user's tokens in case the original ones were stale).
Device token lifecycle (the hidden complexity)
App uninstalls don't notify your service; you discover the token is dead by trying to send and getting 410 Gone (APNs) or NotRegistered (FCM). Real systems get millions of 410s per day.
The token cleanup loop:
---------- Token cleanup ----------
Worker sends push -> APNs returns 410 Gone
Worker emits a `token.invalid` event to Kafka
Device Service consumes:
DELETE FROM devices WHERE token = ?
Add to a 30-day blacklist (key: hash(token)) so re-registration
of the same token isn't immediately spammed if the user reinstalls
User registers a new token on next app open -> normal flowIdempotency
Upstream services may retry their POST to our API on a network blip. Without idempotency, a payment_succeeded notification fires twice and we send the receipt twice.
The Idempotency-Key header is hashed to a Redis key with 24h TTL. The first request stores the resulting notification_id; subsequent requests with the same key return that ID without enqueuing another event.
The key generation responsibility is on the caller (typical pattern: <event_type>-<entity_id>-<attempt_id>). Our API documents this requirement.
Data Model
Postgres (sharded by user_id): users, preferences, devices
CREATE TABLE users (
id BIGINT PRIMARY KEY,
locale VARCHAR(10),
timezone VARCHAR(40),
created_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE preferences (
user_id BIGINT NOT NULL,
channel VARCHAR(16) NOT NULL, -- 'push' | 'email' | 'sms'
category VARCHAR(32) NOT NULL, -- 'transactional' | 'marketing' | 'social'
enabled BOOLEAN NOT NULL,
quiet_hours_start TIME,
quiet_hours_end TIME,
PRIMARY KEY (user_id, channel, category)
);
CREATE TABLE devices (
user_id BIGINT NOT NULL,
device_id VARCHAR(64) NOT NULL,
platform VARCHAR(16) NOT NULL, -- 'ios' | 'android' | 'web'
push_token VARCHAR(512) NOT NULL,
locale VARCHAR(10),
last_seen TIMESTAMPTZ NOT NULL,
PRIMARY KEY (user_id, device_id)
);
CREATE INDEX idx_token ON devices (push_token); -- for cleanup lookupsCassandra: notification records (high-write, time-series)
-- Cassandra
CREATE TABLE notifications (
user_id bigint,
notification_id bigint, -- Snowflake (sortable)
template_key text,
category text,
status text, -- queued|sent|delivered|failed
channels_sent set<text>,
created_at timestamp,
PRIMARY KEY ((user_id), notification_id)
) WITH CLUSTERING ORDER BY (notification_id DESC);
CREATE TABLE delivery_attempts (
notification_id bigint,
attempt_no int,
channel text,
provider text,
status text,
error_code text,
timestamp timestamp,
PRIMARY KEY ((notification_id), attempt_no, channel)
);Queries hit one partition: "Alice's notifications in the last 30 days" is a single Cassandra partition scan. "Audit this notification's delivery history" is a single-partition range scan on delivery_attempts.
Redis: idempotency, prefs cache, device cache, rate limits
---------- Redis keys ----------
idempotent:<sha256-of-key> -> notification_id TTL 24h
prefs:<user_id> -> JSON of preferences TTL 5 min
devices:<user_id> -> JSON of active tokens TTL 5 min
ratelimit:user:<user_id> -> rolling counter TTL 60s
ratelimit:provider:<provider_id> -> token bucket no TTLTemplates
Templates live in object storage (S3) or a small versioned table. Cached in workers' memory keyed by (template_key, locale).
Scaling and Bottlenecks
Provider rate limits as the hard ceiling
We can scale our infrastructure infinitely; provider rate limits cannot. If FCM caps us at 10K/sec and we need 50K/sec, we need to negotiate quota with Google or partition across multiple FCM projects (each gets its own quota). Real systems run dozens of FCM/APNs project pairs to multiplex quota.
The midnight UTC marketing campaign
Marketing teams love sending newsletters at exactly 00:00 UTC. 50M emails enqueued in 1 second. Mitigations:
- Spread send_at times automatically by adding
+/- 30 minjitter at enqueue. - Throttle marketing topic consumers so they cannot starve transactional traffic.
- Pre-scale workers based on scheduled volume (we know what's coming via the schedule table).
User-level rate limits
A bug in an upstream service triggers 1000 notifications for a single user in a minute. Without protection, the user is spammed.
Per-user rate limit: max 10 notifications/minute, max 100/day, with carve-outs for transactional category. Implemented as a sliding-window counter in Redis; checked in the Fan-Out Service.
Webhook delivery callback storm
Providers POST delivery confirmations back to us at high rates (millions/sec from SendGrid alone during a blast). Solutions:
- A dedicated callback ingestion service that writes directly to Kafka (not the API).
- Async processing of callbacks; the user-facing API doesn't wait for
deliveredto marksent.
Multi-region
Notifications are typically regional: APNs has US/EU/JP edges, SendGrid has region-specific senders (for SPF/DKIM compliance). We deploy a notification stack per region; the Fan-Out Service routes based on the user's home region.
Trade-offs and Alternatives
Push-vs-pull model
Alternative: clients periodically poll GET /pending-notifications. Works for slow channels (email, SMS aren't real-time anyway) but breaks the value proposition of push: we want to wake the device. Push tokens + APNs/FCM is the only way to deliver to a backgrounded mobile app.
Single queue vs per-channel queues
A single Kafka topic is simpler but couples the failure modes: if SendGrid is down and we keep retrying, we slow down push delivery too. Per-channel queues isolate failures and let us tune worker pools per provider.
Synchronous vs async API
We could return 200 only after the provider accepts. Pros: caller knows immediately. Cons: caller waits for our worst provider (SMS via flaky Twilio backup) and our service availability is the product of all provider availabilities. Async (return 202 Queued, deliver later) is the standard answer.
Why Cassandra over a SQL audit log?
10B notifications/day with full audit history is 5 TB/day. SQL would need aggressive sharding and lose query flexibility. Cassandra's per-partition writes match the access pattern (per-user history) and the cluster can grow horizontally without re-sharding.
Idempotency window
A 24-hour idempotency TTL handles most retry scenarios. Some payment providers want longer (7 days). Trade-off: more Redis memory. We could externalize to Postgres for callers needing longer windows, at the cost of slower idempotency check (~5 ms vs ~0.5 ms).
Why not let each app handle its own notifications?
Most large companies do start that way. The pain points that drive consolidation: each app re-implements retries, each app integrates APNs/FCM/SendGrid/Twilio independently, each app hits provider rate limits separately, user preferences are duplicated across apps. A central notification service de-duplicates this effort and produces consistent UX (a single 'unsubscribe from all marketing' that actually works).
Real-World Examples
How real systems implement this in production
Slack sends ~1B notifications/day across push, email, and in-app. They run a service called `notification-service` that fans out from Slack events (mention, DM, channel reply) into per-channel jobs. Push goes through APNs/FCM with per-workspace token pools; email through SES with reputation management.
Trade-off: Slack famously over-notifies users by default, then provides extensive preference controls. The lesson: granular per-category preferences are essential at scale, but the default settings determine perceived noisiness more than the preference UI.
Airbnb's notification platform uses Kafka as the event bus and a Druid-like system for delivery analytics, with priority topics that route urgent host messages ahead of guest browsing nudges. (Aside: Uber open-sourced a similar internal queue called Cherami in 2017 with native per-message priority, since archived; it is a useful reference for the priority-queue design pattern.)
Trade-off: Airbnb keeps the queueing layer simple by using Kafka topics tiered by priority rather than building a custom priority queue. Trade-off: less in-flight reprioritization (you can't promote a queued message to urgent after it's enqueued), but no extra system to operate.
Twilio offers Notify as a fan-out service customers can use instead of building their own. Internally it abstracts APNs, FCM, SMS, and email behind one API; Twilio handles per-provider quota negotiation. It is essentially the architecture in this lesson, sold as a managed service.
Trade-off: Buying Twilio Notify is faster than building, but you trade per-message cost (a few cents) and lose deep integration with internal user preferences. Most large companies build their own once volume crosses ~100M notifications/day.
APNs itself is a distributed notification gateway: Apple receives ~50B notifications/day from app developers worldwide, queues them per device, and delivers when the device is online. Apple does not retry; if a device is offline, APNs holds the latest one (collapsing key) and discards older ones for that key.
Trade-off: APNs's collapsing-key behavior is a feature: if your app sends 100 'new email' notifications while the device is offline, APNs delivers only the latest, preventing notification spam on reconnect. The trade-off is that you can't rely on APNs to deliver every push; for guaranteed delivery, you need an in-app pull on reconnect.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
The service POSTs to Notification API with Idempotency-Key. API checks Redis for the key (miss), validates the template, enqueues a single event to `notif.events`. Fan-Out Service consumes, loads user prefs (push + email opted in), loads 3 active push tokens. Emits 4 jobs: 3 to `notif.push.p1`, 1 to `notif.email.p1`. Push workers pull, render, send to APNs/FCM; email worker pulls, renders, sends to SendGrid. Status updated in Cassandra after each provider response. The whole flow is sub-5 seconds end-to-end.
Separate priority topics per channel: `notif.push.p0` for transactional (OTPs, password resets), `notif.push.p1` for user-triggered, `notif.push.p2` for marketing. Dedicated worker pool consumes only p0; never starves. Marketing workers consume p2; can be throttled or paused without affecting OTPs. At the provider, OTPs use a separate APNs project/quota so even provider-side rate limits can't impact them.
We don't know it was uninstalled until we try to send. APNs returns 410 Gone or FCM returns NotRegistered. The push worker classifies this as a permanent failure, marks the notification as `failed_token_invalid`, and emits a `token.invalid` event. The Device Service consumes, deletes the row from the devices table, and adds a 30-day blacklist entry to prevent immediate re-registration spam. Result: no further sends to dead tokens, and the devices table self-cleans.
Two designs. Simpler: a `scheduled_notifications` table with index on `send_at`; a poller selects due rows every 30 seconds and enqueues them. Works to ~10K/min; falls behind at higher rates. Better: an in-memory time-wheel scheduler (or Kafka with a delay topic). Each scheduled notification gets a deferred Kafka message with a release timestamp; a scheduler service holds the message in memory until due, then forwards to the normal pipeline. At 10M/day scheduled, this scales without DB pressure.
Per-user rate limits in Redis (sliding window): max 10 non-transactional notifications/min/user, 100/day. Enforced in the Fan-Out Service before per-channel emission. Caller-level rate limits at the API gateway (e.g., 1000 RPS per service). Anomaly detection on category mix: if a single caller suddenly fires 10K marketing notifications when their baseline is 100, alert and rate-limit. Final defense: user can mark a notification as spam, which automatically suppresses similar (template_key, source_service) for them.
Interview Tips
How to discuss this topic effectively
Lead with the fan-out: one event explodes into N delivery jobs. That framing immediately reveals priority queues, per-provider workers, and per-user dedup.
Always separate transactional and marketing into different queues with different worker pools. If you say 'one queue for everything', the interviewer will ask how OTPs survive a marketing blast and you will be stuck.
Mention device token cleanup explicitly. Saying 'we get millions of 410 Gone responses per day and feed them back into the device table' signals you've operated this kind of system.
Cite real provider rate limits (APNs ~1000/conn/sec, Twilio 1 SMS/sec per long code). Anchoring to real numbers proves you've shipped this and aren't bluffing.
Highlight idempotency as a contract with the caller. The Idempotency-Key header is a standard interview signal that you understand at-least-once systems.
Common Mistakes
Pitfalls to avoid in interviews
Treating all notifications as the same priority
Without separate queues, a 50M-recipient marketing blast can starve a single OTP and your user can't log in. Use per-priority topics (or per-priority partitions) with isolated worker pools so transactional always wins.
Retrying 4xx errors the same way as 5xx
4xx errors mean the request itself is bad: the token is dead, the email address is malformed, the template variable is missing. Retrying wastes provider quota and never succeeds. Only retry 429 (rate limited) and 5xx (transient). 410 Gone specifically means delete the token immediately.
Forgetting that providers have rate limits and trying to send 100K/sec to one APNs connection
APNs caps at ~1000/sec per HTTP/2 connection. You need a worker pool with many connections (or many APNs projects) to scale beyond that. Always size workers off the provider's limits, not your event rate.
Doing template rendering on the API hot path
Templates can have hundreds of variables, localization, A/B variants. Rendering on the synchronous API call slows it down and complicates capacity planning. Render at the worker, just before sending; the API just enqueues the template_key + variables.
No idempotency key, leading to double-sends on caller retries
Networks flake; callers retry. Without an Idempotency-Key the same payment confirmation email goes twice. Require the header in the API contract; reject requests without it; cache the resulting notification_id by key for 24 hours minimum.
