Design a Notification Service

Difficulty: Medium

Design a multi-channel notification service that delivers 10B push, email, and SMS notifications per day across four independent provider networks (APNs, FCM, SendGrid, Twilio) with priority queues, per-user rate limits, and idempotent retries. The interview centerpiece is the fan-out from a single application event to multiple channels and providers, each with its own rate limits, failure modes, and delivery semantics. We cover priority queues for transactional vs marketing traffic, retry policies with exponential backoff, deduplication of repeated triggers, user preference enforcement, and the device token lifecycle that quietly invalidates tens of millions of tokens per day.

Requirements

Functional Requirements

Send notifications across multiple channels: push (iOS APNs, Android FCM), email (SendGrid, SES), SMS (Twilio, Vonage).
Accept a single application event (order_shipped, password_reset, friend_request) and deliver to all channels the user opted into.
Honor user preferences: per-channel opt-in, per-category opt-in (marketing vs transactional), quiet hours.
Track delivery status: queued, sent, delivered, failed, bounced, opened (where the channel supports it).
Support templated content: welcome_email_v3 template with variable substitution; localized per user language.
Schedule notifications: send at a future time (send_at: 2026-04-26T20:00:00Z).

Out of Scope (state explicitly)

In-app notifications (those are part of the app's own UI; this service handles only out-of-band delivery).
The marketing campaign authoring tool (we accept events, not campaigns; campaigns are upstream).
Recommendation logic deciding what to send; we deliver what we are told.

Non-Functional Requirements

Scale: 10B notifications/day across 100M users. Push is ~70%, email ~25%, SMS ~5%.
Latency: transactional p99 < 5 seconds end-to-end (event in -> provider accepted). Marketing can take minutes.
Availability: 99.95% (lower than chat because eventual delivery matters more than instant delivery).
Reliability: at-least-once delivery; duplicate suppression via idempotency keys (no duplicate payment confirmation email).
Provider-aware throttling: stay below APNs/FCM/SendGrid/Twilio rate limits without dropping traffic.
Auditability: every notification's lifecycle queryable for 90 days (compliance, debugging).

Back-of-the-Envelope Estimation

Volume per Channel

Text

---------- Volume per channel per day ----------
Total notifications/day:              10B
Push (iOS + Android):                 7B  (~80K/sec avg, ~250K/sec peak)
Email:                                2.5B (~30K/sec avg, ~100K/sec peak)
SMS:                                  500M (~6K/sec avg, ~20K/sec peak)

Priority breakdown:
  Transactional (P0):                 1B  (10%)  - OTP, password reset, payment
  User-triggered (P1):                4B  (40%)  - new message, friend request
  Marketing (P2):                     5B  (50%)  - newsletters, promotions
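
The per-second averages in the table follow directly from dividing daily volume by 86,400 seconds. A minimal sketch of that arithmetic, using only the daily volumes stated above (the variable and function names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes per channel, taken from the table above.
daily_volume = {
    "push": 7_000_000_000,
    "email": 2_500_000_000,
    "sms": 500_000_000,
}


def avg_per_second(per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


avg_rates = {channel: avg_per_second(v) for channel, v in daily_volume.items()}

for channel, rate in avg_rates.items():
    print(f"{channel}: ~{rate / 1000:.0f}K/sec average")
```

The peak figures in the table are roughly 3x to 3.5x these averages, which is a planning assumption about traffic shape rather than something the arithmetic produces.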

Provider Rate Limits

Text

---------- Provider rate limits ----------
APNs (Apple):           ~1000 notifications/sec per HTTP/2 connection;
                        scale by opening more connections
FCM (Firebase):         ~10K/sec per project (600K/min default); quota expandable
SendGrid:               300 emails/sec on standard tier; expandable
Twilio SMS:             1 SMS/sec per long code (US) or higher per short code

The SMS limit is the tightest. We need many sender numbers (long codes) or short codes pre-registered with carriers, and per-number rate budgets.

Storage

Text

---------- Notification record storage ----------
Per-record size (event, status, history):  ~500 bytes
Notifications per day:                       10B
Daily storage:                                10B * 500 = 5 TB/day
90-day retention:                             5 TB * 90 = 450 TB
3x replication:                               1.4 PB total

Device token storage:

Text

---------- Device token storage ----------
Users:                          100M
Devices per user (avg):         2.5
Tokens stored:                  250M
Per token (256 char + meta):    ~500 bytes
Total:                          ~125 GB (small; fits in a sharded SQL cluster)

Bandwidth

Traffic is mostly internal. Outbound to providers is small (push payloads ~2 KB, email ~10 KB, SMS ~200 bytes). 80K push/sec * 2 KB = ~160 MB/s outbound to APNs/FCM combined.

High-Level Design

Text

---------- High-level architecture ----------
  +-------------+
  |   Caller    |  (any service: orders, social, billing, ...)
  | Application |
  +-------------+
         |
         v
  +-------------------------+
  |   Notification API      |
  |  POST /v1/notifications |
  |  - validates payload    |
  |  - dedupes by key       |
  |  - resolves user prefs  |
  +-------------------------+
         |
         v
  +-------------------------+
  | Event Bus (Kafka)       |
  | topic: notif.events     |
  +-------------------------+
         |
         v
  +-------------------------+
  | Fan-Out Service         |
  | - reads event           |
  | - looks up channels per |
  |   user preference       |
  | - emits N delivery jobs |
  +-------------------------+
      /        |        \
     v         v         v
  +-------+ +-------+ +-------+
  | Push  | | Email | |  SMS  |    one Kafka topic per channel,
  | Queue | | Queue | | Queue |    each with priority partitions
  +-------+ +-------+ +-------+
     |         |         |
     v         v         v
  +-------+ +-------+ +-------+
  | Push  | | Email | |  SMS  |
  |Workers| |Workers| |Workers|    pool sized per provider rate limit
  +-------+ +-------+ +-------+
     |         |         |
     v         v         v
  +-------+ +-------+ +-------+
  | APNs  | |SendGrid| |Twilio|    actual provider HTTP APIs
  | / FCM | |  / SES | | etc. |
  +-------+ +-------+ +-------+
         |
         v
  +-------------------------+
  |  Status Tracker         |
  |  - delivery callbacks   |
  |  - status updates       |
  +-------------------------+
         |
         v
  +-------------------------+
  |  Notification Store     |
  |  (Cassandra)            |
  +-------------------------+

API Design

A single send endpoint accepts a notification request from any internal service.

Jsonc

POST /api/v1/notifications
Idempotency-Key: order-shipped-12345-attempt-1

{
    "user_id": "u_alice",
    "category": "transactional",
    "template_key": "order_shipped_v2",
    "variables": {
        "order_id": "O-12345",
        "tracking_url": "https://..."
    },
    "channels": ["push", "email"],          // optional; default = user prefs
    "send_at": null,                         // null = send now
    "ttl_seconds": 86400                     // discard if not delivered in 24h
}

// Response
{
    "notification_id": "n_xyz789",
    "status": "queued",
    "channels_targeted": ["push", "email"]
}

Send Pipeline

Text

---------- End-to-end send flow ----------
1. Caller POSTs to Notification API with Idempotency-Key.
2. API checks Redis for the idempotency key:
   - If found, return the cached notification_id; STOP.
   - If not, store key + notification_id with 24h TTL.
3. API validates: template exists, user exists, user opted in.
4. API enqueues to Kafka topic `notif.events` partitioned by user_id.
5. Fan-Out Service consumes the event:
   - Loads user prefs and active devices.
   - Filters out channels user opted out of.
   - Filters out quiet hours (for non-transactional).
   - For each remaining channel + device, emits a delivery job to that channel's queue.
6. Channel workers consume their queue, render the template, call the provider.
7. Provider responds with success/failure or a 202 + delivery webhook later.
8. Status Tracker updates notification record in Cassandra.
9. Caller can poll GET /v1/notifications/{id} or subscribe to a webhook.
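
The quiet-hours filter in step 5 has one subtle case: a window such as 22:00 to 07:00 wraps past midnight. A minimal sketch of the check, assuming `now` has already been converted to the user's local time and that transactional traffic bypasses quiet hours as stated above (function names are hypothetical):

```python
from datetime import time
from typing import Optional


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that wrap midnight."""
    if start == end:
        return False  # assumption: equal bounds mean "no quiet window"
    if start < end:
        return start <= now < end      # same-day window, e.g. 13:00-15:00
    return now >= start or now < end   # wrapping window, e.g. 22:00-07:00


def should_hold(category: str, now: time,
                start: Optional[time], end: Optional[time]) -> bool:
    """Decide whether the Fan-Out Service should defer this notification."""
    if category == "transactional":
        return False                   # transactional ignores quiet hours
    if start is None or end is None:
        return False                   # user has not configured quiet hours
    return in_quiet_hours(now, start, end)
```

What happens to a held notification (delay until the window ends, or drop it once its TTL expires) is a policy decision this sketch leaves open.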

Detailed Design

The two interesting components are the fan-out + priority queue layer and the per-provider delivery worker with retries and token lifecycle.

Fan-Out Service and Priority Queues

One event becomes many jobs

A single order_shipped event for Alice, who has 3 devices and opted into push + email, generates 4 jobs (3 push tokens + 1 email). This explosion is the fan-out.

The Fan-Out Service is stateless. It reads the event, queries the user prefs cache (Redis: prefs:<user_id>), queries the device tokens cache (Redis: devices:<user_id>), and emits jobs to the relevant per-channel topic.
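
The fan-out step can be sketched as a pure function from an event plus the user's preferences and devices to a list of delivery jobs. This is an illustration under assumed names (`DeliveryJob`, `fan_out`); the real service would read the preferences and tokens from the Redis caches and publish to Kafka rather than return a list.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class DeliveryJob:
    notification_id: str
    channel: str   # "push" | "email" | "sms"
    target: str    # push token, email address, or phone number


def fan_out(notification_id: str,
            opted_in_channels: Iterable[str],
            push_tokens: Iterable[str],
            email: Optional[str] = None,
            phone: Optional[str] = None) -> List[DeliveryJob]:
    """Expand one event into one job per (channel, target) pair."""
    channels = set(opted_in_channels)
    jobs: List[DeliveryJob] = []
    if "push" in channels:
        # One job per device token: 3 devices produce 3 push jobs.
        jobs.extend(DeliveryJob(notification_id, "push", t) for t in push_tokens)
    if "email" in channels and email:
        jobs.append(DeliveryJob(notification_id, "email", email))
    if "sms" in channels and phone:
        jobs.append(DeliveryJob(notification_id, "sms", phone))
    return jobs


# Alice: 3 devices, opted into push + email, so 4 jobs in total.
jobs = fan_out("n_xyz789", ["push", "email"], ["tok_a", "tok_b", "tok_c"],
               email="alice@example.com", phone="+15550100")
```

Because the function holds no state between calls, any instance can process any event, which is what lets the service scale by adding consumers.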

Priority queues per channel

Each channel has 3 logical Kafka topics (or 3 partitions per topic, by category):

Text

---------- Channel queue layout ----------
notif.push.p0    <- transactional (OTP, password reset)
notif.push.p1    <- user-triggered (new message, friend request)
notif.push.p2    <- marketing (newsletter, promo)

notif.email.p0 / p1 / p2  (same pattern)
notif.sms.p0 / p1 / p2    (same pattern)

Workers pull from the highest-priority topic first; they only consume from p2 when p0 and p1 are empty for some bounded time. This guarantees the OTP for Bob's bank login arrives before the 4 PM Black Friday newsletter.

A second design choice: separate worker pools per priority. P0 gets a dedicated pool (always-on, never starves); P1 and P2 share a larger pool. This isolates noisy P2 from blocking P0.
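
The strict-priority pull can be illustrated with in-memory queues standing in for the three topics. This sketch is an assumption-laden simplification: the class name is hypothetical, it omits the "empty for some bounded time" condition before draining p2, and a real worker would poll Kafka consumers instead of deques.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

PRIORITIES = ("p0", "p1", "p2")  # highest priority first


class PriorityPoller:
    """Returns the next job from the highest-priority non-empty queue."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {p: deque() for p in PRIORITIES}

    def enqueue(self, priority: str, job: str) -> None:
        self.queues[priority].append(job)

    def next_job(self) -> Optional[Tuple[str, str]]:
        for priority in PRIORITIES:
            queue = self.queues[priority]
            if queue:
                return priority, queue.popleft()  # FIFO within a priority
        return None  # all queues are empty
```

Even if the newsletter is enqueued first, an OTP enqueued later is returned first:

```python
poller = PriorityPoller()
poller.enqueue("p2", "newsletter")
poller.enqueue("p0", "otp")
first = poller.next_job()
```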

Why Kafka instead of RabbitMQ?

Property	Kafka	RabbitMQ
Throughput	Millions/sec per cluster	~50K/sec per node
Replay	Built in (offset rewind)	Hard (messages consumed = gone)
Priority queues	Via topic-per-priority	Native priority queues
Routing complexity	Simple (topic + partition)	Rich (exchanges, bindings)

For 10B/day, throughput wins; RabbitMQ would need many clusters. Kafka also gives us replay, which is critical for recovering from a downstream provider outage (we can re-process the last N hours of events when the provider comes back).

Per-Provider Worker with Retries

Worker pool sizing

Each worker holds open HTTP/2 connections to its provider. APNs allows ~1000 push/sec per HTTP/2 connection; with 250K peak push/sec we need ~250 connections, distributed across worker processes. We size the pool to match the provider's rate limit, not our event rate.

Text

---------- Worker pool sizing example (APNs) ----------
Peak push rate:                250K /sec
Per-connection capacity:       1000 /sec
Target connections:            250
Connections per worker:        20
Total workers:                 ~13 push workers (ceil(250 / 20); add headroom on top)
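
The sizing above is two ceiling divisions. A small sketch using the figures from the example (the function name is illustrative, and any headroom factor would be applied on top of the result):

```python
import math
from typing import Tuple


def pool_size(peak_per_sec: int, per_conn_per_sec: int,
              conns_per_worker: int) -> Tuple[int, int]:
    """Return (connections, workers) needed to sustain the peak rate."""
    connections = math.ceil(peak_per_sec / per_conn_per_sec)
    workers = math.ceil(connections / conns_per_worker)
    return connections, workers


# APNs example: 250K/sec peak, ~1000/sec per connection, 20 connections per worker.
connections, workers = pool_size(250_000, 1_000, 20)
```

Note that the inputs are the provider's per-connection limit and the peak rate, not the average event rate.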

Retry policy by failure type

Providers return many failure types. They are NOT all retried the same way.

HTTP status	Meaning	Action
200 / 202	Accepted	Mark `sent`; await delivery callback if provider supports
4xx (validation)	Bad payload (missing field, wrong format)	Do NOT retry; mark `failed_permanent`; alert
410 Gone (APNs)	Token invalid (app uninstalled)	Do NOT retry; remove token from devices table
429 Too Many Requests	Rate limited	Retry with exponential backoff; honor Retry-After header
5xx	Provider issue	Retry with exponential backoff; max 5 attempts

The critical lesson: don't retry 4xx errors other than 429. A bad payload will be bad on the next attempt too; retrying just wastes provider quota.
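
The table maps naturally onto a small classifier that each worker runs on the provider's response. A sketch under assumptions: the action names are hypothetical labels, any 2xx is treated as accepted, and unrecognized codes are treated as permanent failures.

```python
def classify(status: int) -> str:
    """Map a provider HTTP status to a worker action."""
    if 200 <= status < 300:
        return "sent"              # accepted; await delivery callback if supported
    if status == 410:
        return "remove_token"      # token invalid: never retry, delete the token
    if status == 429:
        return "retry"             # rate limited: back off, honor Retry-After
    if 400 <= status < 500:
        return "failed_permanent"  # bad payload: retrying cannot succeed
    if 500 <= status < 600:
        return "retry"             # provider issue: back off, max 5 attempts
    return "failed_permanent"      # assumption: unknown codes are not retried
```

The order of the checks matters: 410 and 429 must be tested before the generic 4xx branch, otherwise both would be marked as permanent failures.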

Exponential backoff with jitter

Text

---------- Retry schedule ----------
Attempt 1: immediate
Attempt 2: 1s + jitter(0-500ms)
Attempt 3: 4s + jitter(0-2s)
Attempt 4: 16s + jitter(0-8s)
Attempt 5: 64s + jitter(0-32s)
Dead-letter after attempt 5

Jitter prevents thundering-herd retries when a provider has a brief outage and 100K notifications retry simultaneously the moment the outage ends.
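
The schedule above is a base-4 exponential with jitter of up to half the base delay. A minimal sketch that reproduces it (the function name and the `rng` parameter are assumptions added for testability):

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5


def retry_delay(attempt: int, rng=random) -> Optional[float]:
    """Seconds to wait before `attempt` (1-based); None means dead-letter."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt > MAX_ATTEMPTS:
        return None                 # retries exhausted: route to the DLQ
    if attempt == 1:
        return 0.0                  # first attempt is immediate
    base = 4 ** (attempt - 2)       # 1s, 4s, 16s, 64s
    jitter = rng.uniform(0, base / 2)  # 0-500ms, 0-2s, 0-8s, 0-32s
    return base + jitter
```

For a 429 response, a worker would typically take the larger of this value and the provider's Retry-After header.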

Dead-Letter Queue

Messages that exhaust retries land in a per-channel DLQ. We don't drop them silently; an operator dashboard surfaces DLQ counts, and a periodic job tries to recover (e.g., re-resolving the user's tokens in case the original ones were stale).

Device token lifecycle (the hidden complexity)

App uninstalls don't notify your service; you discover the token is dead by trying to send and getting 410 Gone (APNs) or NotRegistered (FCM). Real systems get millions of 410s per day.

The token cleanup loop:

Text

---------- Token cleanup ----------
Worker sends push -> APNs returns 410 Gone
Worker emits a `token.invalid` event to Kafka
Device Service consumes:
  DELETE FROM devices WHERE token = ?
  Add to a 30-day blacklist (key: hash(token)) so re-registration
  of the same token isn't immediately spammed if the user reinstalls
User registers a new token on next app open -> normal flow
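
The Device Service side of this loop can be sketched with in-memory dictionaries standing in for the devices table and the blocklist. The class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
import hashlib
from typing import Dict

BLOCK_SECONDS = 30 * 24 * 60 * 60  # 30-day blocklist window


class DeviceRegistry:
    """In-memory stand-in for the devices table plus the token blocklist."""

    def __init__(self) -> None:
        self.tokens: Dict[str, str] = {}     # push_token -> user_id
        self.blocked: Dict[str, float] = {}  # sha256(token) -> expiry (epoch s)

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def register(self, user_id: str, token: str) -> None:
        self.tokens[token] = user_id

    def on_token_invalid(self, token: str, now: float) -> None:
        """Handle a `token.invalid` event: delete the token, then block its hash."""
        self.tokens.pop(token, None)  # equivalent of DELETE ... WHERE token = ?
        self.blocked[self._digest(token)] = now + BLOCK_SECONDS

    def is_blocked(self, token: str, now: float) -> bool:
        expiry = self.blocked.get(self._digest(token))
        return expiry is not None and now < expiry
```

Storing only the hash keeps raw tokens out of the blocklist, and using `pop` with a default makes the handler safe to run twice for the same event.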

Idempotency

Upstream services may retry their POST to our API on a network blip. Without idempotency, a payment_succeeded notification fires twice and we send the receipt twice.

The Idempotency-Key header is hashed to a Redis key with 24h TTL. The first request stores the resulting notification_id; subsequent requests with the same key return that ID without enqueuing another event.

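Key generation is the caller's responsibility (typical pattern: <event_type>-<entity_id>-<attempt_id>). The key must stay the same across network retries of one logical send, so <attempt_id> identifies the business-level attempt, not the HTTP retry count. Our API documents this requirement.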
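
One detail worth showing is that the check and the store must happen atomically; otherwise two concurrent retries can both miss and both enqueue. In Redis this is a single `SET key value NX EX 86400`. The sketch below imitates that behavior with an in-memory dictionary and a lock; the class and method names are hypothetical.

```python
import hashlib
import threading
from typing import Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self, ttl_seconds: int = 86_400) -> None:
        self.ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[str, float]] = {}  # digest -> (id, expiry)

    def claim(self, idempotency_key: str, new_id: str,
              now: float) -> Tuple[str, bool]:
        """Return (notification_id, is_new). Only the first caller gets is_new=True."""
        digest = hashlib.sha256(idempotency_key.encode("utf-8")).hexdigest()
        with self._lock:  # check-and-set as one atomic step
            existing = self._entries.get(digest)
            if existing is not None and existing[1] > now:
                return existing[0], False  # duplicate: reuse the original ID
            self._entries[digest] = (new_id, now + self.ttl)
            return new_id, True            # first sighting: caller should enqueue
```

The API enqueues to Kafka only when `is_new` is true; a duplicate request gets the original `notification_id` back and nothing else happens.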

Data Model

Postgres (sharded by user_id): users, preferences, devices

SQL

CREATE TABLE users (
    id          BIGINT PRIMARY KEY,
    locale      VARCHAR(10),
    timezone    VARCHAR(40),
    created_at  TIMESTAMPTZ NOT NULL
);

CREATE TABLE preferences (
    user_id     BIGINT NOT NULL,
    channel     VARCHAR(16) NOT NULL,        -- 'push' | 'email' | 'sms'
    category    VARCHAR(32) NOT NULL,        -- 'transactional' | 'marketing' | 'social'
    enabled     BOOLEAN NOT NULL,
    quiet_hours_start TIME,
    quiet_hours_end   TIME,
    PRIMARY KEY (user_id, channel, category)
);

CREATE TABLE devices (
    user_id     BIGINT NOT NULL,
    device_id   VARCHAR(64) NOT NULL,
    platform    VARCHAR(16) NOT NULL,        -- 'ios' | 'android' | 'web'
    push_token  VARCHAR(512) NOT NULL,
    locale      VARCHAR(10),
    last_seen   TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (user_id, device_id)
);
CREATE INDEX idx_token ON devices (push_token);    -- for cleanup lookups

Cassandra: notification records (high-write, time-series)

SQL

-- Cassandra
CREATE TABLE notifications (
    user_id          bigint,
    notification_id  bigint,                  -- Snowflake (sortable)
    template_key     text,
    category         text,
    status           text,                    -- queued|sent|delivered|failed
    channels_sent    set<text>,
    created_at       timestamp,
    PRIMARY KEY ((user_id), notification_id)
) WITH CLUSTERING ORDER BY (notification_id DESC);

CREATE TABLE delivery_attempts (
    notification_id  bigint,
    attempt_no       int,
    channel          text,
    provider         text,
    status           text,
    error_code       text,
    timestamp        timestamp,
    PRIMARY KEY ((notification_id), attempt_no, channel)
);

Queries hit one partition: "Alice's notifications in the last 30 days" is a single Cassandra partition scan. "Audit this notification's delivery history" is a single-partition range scan on delivery_attempts.

Redis: idempotency, prefs cache, device cache, rate limits

Text

---------- Redis keys ----------
idempotent:<sha256-of-key>         -> notification_id    TTL 24h
prefs:<user_id>                    -> JSON of preferences TTL 5 min
devices:<user_id>                  -> JSON of active tokens TTL 5 min
ratelimit:user:<user_id>           -> rolling counter      TTL 60s
ratelimit:provider:<provider_id>   -> token bucket         no TTL

Templates

Templates live in object storage (S3) or a small versioned table. Cached in workers' memory keyed by (template_key, locale).

Scaling and Bottlenecks

Provider rate limits as the hard ceiling

We can scale our own infrastructure horizontally, but we cannot scale a provider's rate limit. If FCM caps us at 10K/sec and we need 50K/sec, we need to negotiate quota with Google or partition across multiple FCM projects (each gets its own quota). Real systems run dozens of FCM/APNs project pairs to multiplex quota.
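
Staying under a provider's cap is usually done with a token bucket per provider (or per project or sender number). A minimal single-process sketch, with the clock passed in explicitly; the class name is hypothetical, and a production version would keep the bucket state in Redis so all workers share one budget.

```python
class TokenBucket:
    """Allows `rate_per_sec` sends on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over budget: leave the job queued and try again later
```

When `try_acquire` returns false the worker does not drop the job; it pauses consumption, which is how the system stays below the limit "without dropping traffic".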

The midnight UTC marketing campaign

Marketing teams love sending newsletters at exactly 00:00 UTC. 50M emails enqueued in 1 second. Mitigations:

Spread send_at times automatically by adding +/- 30 min jitter at enqueue.
Throttle marketing topic consumers so they cannot starve transactional traffic.
Pre-scale workers based on scheduled volume (we know what's coming via the schedule table).
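
The first mitigation is a one-line transformation at enqueue time. A sketch, assuming it is applied only to marketing traffic and that the function name and `rng` parameter are illustrative:

```python
import random
from datetime import datetime, timedelta, timezone


def jittered_send_at(send_at: datetime, max_jitter_minutes: int = 30,
                     rng=random) -> datetime:
    """Shift `send_at` by a uniform random offset within +/- max_jitter_minutes."""
    bound = max_jitter_minutes * 60
    offset_seconds = rng.uniform(-bound, bound)
    return send_at + timedelta(seconds=offset_seconds)


scheduled = datetime(2026, 4, 26, 0, 0, tzinfo=timezone.utc)
actual = jittered_send_at(scheduled)
```

With a uniform offset over a 60-minute window, a 50M-email blast flattens to roughly 14K enqueues per second on average instead of arriving in a single second.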

User-level rate limits

A bug in an upstream service triggers 1000 notifications for a single user in a minute. Without protection, the user is spammed.

Per-user rate limit: max 10 notifications/minute, max 100/day, with carve-outs for transactional category. Implemented as a sliding-window counter in Redis; checked in the Fan-Out Service.
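
A sliding-window limiter can be sketched in-process with a deque of timestamps per user. This is an illustration of the per-minute limit only: the class name is hypothetical, the daily cap is omitted, and the Redis version would keep the timestamps in a sorted set or an equivalent counter.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """At most `limit` non-transactional notifications per user per window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, category: str, now: float) -> bool:
        if category == "transactional":
            return True  # carve-out: OTPs and receipts are never throttled
        events = self._events[user_id]
        # Drop timestamps that have slid out of the window.
        while events and events[0] <= now - self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

Rejected notifications are not recorded in the window, so a burst of denied sends does not extend the time the user stays throttled.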

Webhook delivery callback storm

Providers POST delivery confirmations back to us at high rates (potentially hundreds of thousands per second during an email blast, since each message can produce several status events). Solutions:

A dedicated callback ingestion service that writes directly to Kafka (not the API).
Async processing of callbacks; the user-facing API doesn't wait for delivered to mark sent.

Multi-region

Notifications are typically regional: APNs has US/EU/JP edges, SendGrid has region-specific senders (for SPF/DKIM compliance). We deploy a notification stack per region; the Fan-Out Service routes based on the user's home region.

Trade-offs and Alternatives

Push-vs-pull model

Alternative: clients periodically poll GET /pending-notifications. Works for slow channels (email, SMS aren't real-time anyway) but breaks the value proposition of push: we want to wake the device. Push tokens + APNs/FCM is the only way to deliver to a backgrounded mobile app.

Single queue vs per-channel queues

A single Kafka topic is simpler but couples the failure modes: if SendGrid is down and we keep retrying, we slow down push delivery too. Per-channel queues isolate failures and let us tune worker pools per provider.

Synchronous vs async API

We could return 200 only after the provider accepts. Pros: caller knows immediately. Cons: caller waits for our worst provider (SMS via flaky Twilio backup) and our service availability is the product of all provider availabilities. Async (return 202 Queued, deliver later) is the standard answer.
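
The "product of availabilities" point is easy to quantify. The sketch below uses hypothetical provider availabilities purely for illustration; they are not measured figures for any real provider.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a call that needs every dependency to succeed."""
    return math.prod(availabilities)


# Hypothetical: a synchronous send that must wait on three providers,
# each assumed to be available 99.9% of the time.
combined = serial_availability([0.999, 0.999, 0.999])
print(f"{combined:.4%}")
```

Three dependencies at 99.9% each already pull a synchronous API below the 99.95% target, which is the quantitative argument for returning 202 and delivering asynchronously.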

Why Cassandra over a SQL audit log?

10B notifications/day with full audit history is 5 TB/day. SQL would need aggressive sharding and lose query flexibility. Cassandra's per-partition writes match the access pattern (per-user history) and the cluster can grow horizontally without re-sharding.

Idempotency window

A 24-hour idempotency TTL handles most retry scenarios. Some payment providers want longer (7 days). Trade-off: more Redis memory. We could externalize to Postgres for callers needing longer windows, at the cost of slower idempotency check (~5 ms vs ~0.5 ms).

Why not let each app handle its own notifications?

Most large companies do start that way. The pain points that drive consolidation: each app re-implements retries, each app integrates APNs/FCM/SendGrid/Twilio independently, each app hits provider rate limits separately, user preferences are duplicated across apps. A central notification service de-duplicates this effort and produces consistent UX (a single 'unsubscribe from all marketing' that actually works).

Real-World Examples

How real systems implement this in production

Slack notifications

Slack sends ~1B notifications/day across push, email, and in-app. They run a service called `notification-service` that fans out from Slack events (mention, DM, channel reply) into per-channel jobs. Push goes through APNs/FCM with per-workspace token pools; email through SES with reputation management.

Trade-off: Slack famously over-notifies users by default, then provides extensive preference controls. The lesson: granular per-category preferences are essential at scale, but the default settings determine perceived noisiness more than the preference UI.

Airbnb notifications

Airbnb's notification platform uses Kafka as the event bus and a Druid-like system for delivery analytics, with priority topics that route urgent host messages ahead of guest browsing nudges. (Aside: Uber open-sourced a similar internal queue called Cherami in 2017 with native per-message priority, since archived; it is a useful reference for the priority-queue design pattern.)

Trade-off: Airbnb keeps the queueing layer simple by using Kafka topics tiered by priority rather than building a custom priority queue. The cost is less in-flight reprioritization (you can't promote a queued message to urgent after it's enqueued); the benefit is no extra system to operate.

Twilio Notify

Twilio offers Notify as a fan-out service customers can use instead of building their own. Internally it abstracts APNs, FCM, SMS, and email behind one API; Twilio handles per-provider quota negotiation. It is essentially the architecture in this lesson, sold as a managed service.

Trade-off: Buying Twilio Notify is faster than building, but you pay a per-message cost (a few cents) and lose deep integration with internal user preferences. Most large companies build their own once volume crosses ~100M notifications/day.

Apple's APNs (the upstream)

APNs itself is a distributed notification gateway: Apple receives ~50B notifications/day from app developers worldwide, queues them per device, and delivers when the device is online. Apple does not retry; if a device is offline, APNs holds the latest one (collapsing key) and discards older ones for that key.

Trade-off: APNs's collapsing-key behavior is a feature: if your app sends 100 'new email' notifications while the device is offline, APNs delivers only the latest, preventing notification spam on reconnect. The trade-off is that you can't rely on APNs to deliver every push; for guaranteed delivery, you need an in-app pull on reconnect.

Quick Interview Phrases

Key terms to use in your answer

fan-out per channel and per device token

priority queues for transactional vs marketing

exponential backoff with jitter

idempotency key with TTL

device token lifecycle (410 Gone -> remove)

per-provider rate limiting

Common Interview Questions

Questions you might be asked about this topic

Walk me through what happens when a backend service fires `order_shipped` for a user with 3 devices.

The service POSTs to the Notification API with an Idempotency-Key. The API checks Redis for the key (miss), validates the template, and enqueues a single event to `notif.events`. The Fan-Out Service consumes it, loads user prefs (push + email opted in), and loads 3 active push tokens. It emits 4 jobs: 3 to `notif.push.p0` and 1 to `notif.email.p0`, because the API example marks order_shipped as transactional. Push workers pull, render, and send to APNs/FCM; the email worker pulls, renders, and sends to SendGrid. Status is updated in Cassandra after each provider response. The whole flow is sub-5 seconds end-to-end.

How do you make sure an OTP code is never delayed by a marketing blast?

How do you handle a device that uninstalled the app three months ago?

How would you design scheduled notifications (`send_at: tomorrow at 9 AM`)?

How do you stop a sender from being abused (a malicious caller spamming a user)?

Interview Tips

How to discuss this topic effectively

Lead with the fan-out: one event explodes into N delivery jobs. That framing immediately reveals priority queues, per-provider workers, and per-user dedup.

Always separate transactional and marketing into different queues with different worker pools. If you say 'one queue for everything', the interviewer will ask how OTPs survive a marketing blast and you will be stuck.

Mention device token cleanup explicitly. Saying 'we get millions of 410 Gone responses per day and feed them back into the device table' signals you've operated this kind of system.

Cite real provider rate limits (APNs ~1000/conn/sec, Twilio 1 SMS/sec per long code). Anchoring to real numbers proves you've shipped this and aren't bluffing.

Highlight idempotency as a contract with the caller. The Idempotency-Key header is a standard interview signal that you understand at-least-once systems.

Common Mistakes

Pitfalls to avoid in interviews

Treating all notifications as the same priority

Without separate queues, a 50M-recipient marketing blast can starve a single OTP and your user can't log in. Use per-priority topics (or per-priority partitions) with isolated worker pools so transactional always wins.

Retrying 4xx errors the same way as 5xx

Most 4xx errors mean the request itself is bad: the token is dead, the email address is malformed, the template variable is missing. Retrying wastes provider quota and never succeeds. Only retry 429 (rate limited) and 5xx (transient). 410 Gone specifically means delete the token immediately.

Forgetting that providers have rate limits and trying to send 100K/sec to one APNs connection

APNs caps at ~1000/sec per HTTP/2 connection. You need a worker pool with many connections (or many APNs projects) to scale beyond that. Always size workers off the provider's limits, not your event rate.

Doing template rendering on the API hot path

Templates can have hundreds of variables, localization, A/B variants. Rendering on the synchronous API call slows it down and complicates capacity planning. Render at the worker, just before sending; the API just enqueues the template_key + variables.

No idempotency key, leading to double-sends on caller retries

Networks flake; callers retry. Without an Idempotency-Key the same payment confirmation email goes twice. Require the header in the API contract; reject requests without it; cache the resulting notification_id by key for 24 hours minimum.