Building a Notification Service From Scratch

Delivery is the easy part. Preferences, dedup, throttling, and timezone-aware digests are where notification services succeed or generate complaints.

By @sofiacollins

March 18, 2026

Updated May 18, 2026

The first notification service I built took a week. The second one took a quarter. The difference is what this article is about.

The week-long version sent emails, push notifications, and SMS to a list of recipients. It worked. It also sent a user three duplicate emails when an upstream event was retried, woke up users at 3 AM in their local timezone for non-urgent updates, sent marketing emails to users who had unsubscribed last month, and ignored the in-app preferences page entirely. The quarter-long version did none of those things. The delta between the two is what people mean when they say "notifications are deceptively hard."

This article walks through the architecture of a notification service that handles the parts most teams underestimate: preferences, deduplication, rate limiting, channel routing, templating, and the operational machinery to keep all of that working. My stance: the delivery infrastructure (the SMTP integration, the FCM push, the Twilio call) is the easy part. Build it last, not first. Build the preferences and dedup layers first, because they determine whether your notification service is a feature or a complaint generator.

What a notification service actually does

The naive view: "send a message to a user." The actual job is more like:

Notification pipeline
  1. Receive an event ("user X commented on user Y's post")
  2. Decide who to notify (Y, plus anyone subscribed to that thread)
  3. Decide whether to notify them (preferences, mute settings, do-not-disturb)
  4. Decide which channels to use (email, push, in-app, SMS)
  5. Decide when to send (immediate, batched, scheduled, throttled)
  6. Render the content per channel (subject + HTML for email, title + body for push)
  7. Dispatch to delivery providers (SES, FCM, Twilio)
  8. Track the result (delivered, bounced, failed, clicked)
  9. Update aggregates (unread count, daily digest queue)

That is nine steps, and the delivery step (step 7) is one of them. Most teams build step 7 first and bolt the rest on later. That order is wrong; the dedup and preferences logic affects every other step, and adding it after the fact requires touching every code path.

The two failure modes that make users hate notifications

I keep these in mind for every design decision:

The first is noise: too many notifications, in the wrong channel, at the wrong time. Users mute the app or unsubscribe entirely. From the team's perspective, the notification system is "working" (messages are being delivered); from the user's perspective, the system is broken (they never read any of them).

The second is drops: a notification the user expected never arrived. They missed a meeting, missed a critical alert, missed a password-reset email. From the team's perspective, the system has failed silently; from the user's perspective, the company is unreliable.

A notification service must avoid both, and the avoidance strategies are different. Noise is solved by preferences, dedup, throttling, and digesting. Drops are solved by retries, idempotency, monitoring, and observability. A team that prioritizes one and ignores the other ends up with a system that is loud or lossy.

The architecture I would build now

The shape, drawn at the level of services and queues:

Notification service (logical components)
  Event ingestor      receives "something happened" events
  Recipient resolver  decides who to notify, applies preferences/mute
  Deduplicator        suppresses duplicate notifications within a window
  Throttler           caps per-user-per-channel-per-window send rate
  Renderer            produces channel-specific content from a template
  Dispatcher          sends via SES, FCM, Twilio, etc.
  Receipt tracker     records delivery, bounce, click events
  Preferences API     CRUD for user notification preferences
  Audit log           every notification's lifecycle, queryable

Each of those can be a module in a monolith or its own service; the boundaries are the same either way. The important thing is that the pipeline runs in this order and that each stage is idempotent.

Steps 1-2: ingest and resolve recipients

The event ingestor accepts events from upstream services. The shape I use:

{
    "event_id": "evt_2026_04_30_abc123",
    "event_type": "comment.created",
    "actor_id": "user_42",
    "context": {
        "post_id": "post_99",
        "comment_id": "comment_777",
        "thread_subscribers": ["user_8", "user_42"]
    },
    "created_at": "2026-04-30T10:00:00Z"
}

The event_id is the key insight. Upstream services include a stable, unique ID with every event. Downstream stages use that ID for deduplication. If the upstream service retries the event publish, the same event_id arrives twice, and the deduplicator drops the second one before any delivery happens.

The recipient resolver maps the event to a list of (user_id, channel_set) tuples. For a comment event, that list is the post author plus any thread subscribers, minus the comment author themselves (don't notify someone of their own action), minus anyone who has muted this thread.
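A minimal sketch of that resolution rule, for illustration only. The `CommentEvent` class, the `post_author_id` field, and the `muted` set are assumptions (the event shape above does not carry the post author, so a real resolver would look it up), and channel selection is left to the next stage.

```python
from dataclasses import dataclass, field


@dataclass
class CommentEvent:
    event_id: str
    actor_id: str
    post_author_id: str  # hypothetical: looked up from post_id by the resolver
    thread_subscribers: list[str] = field(default_factory=list)


def resolve_recipients(event: CommentEvent, muted: set[str]) -> list[str]:
    """Post author plus subscribers, minus the actor, minus anyone who muted."""
    candidates = [event.post_author_id, *event.thread_subscribers]
    seen: set[str] = set()
    recipients = []
    for user_id in candidates:
        if user_id == event.actor_id or user_id in muted or user_id in seen:
            continue
        seen.add(user_id)
        recipients.append(user_id)
    return recipients


event = CommentEvent(
    event_id="evt_2026_04_30_abc123",
    actor_id="user_42",
    post_author_id="user_7",
    thread_subscribers=["user_8", "user_42", "user_9"],
)
print(resolve_recipients(event, muted={"user_9"}))  # ['user_7', 'user_8']
```

The `seen` set matters when the post author is also a thread subscriber: without it the same user would be resolved twice and rely on later dedup to clean up.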

A subtler aspect of the recipient resolver: it is the natural place to apply batch coalescing across actors. Three friends comment on the same post in five minutes. Three separate events arrive. The naive resolver dispatches three notifications. The smart resolver detects "three events for the same recipient, same notification_type, same context (post_id) within a short window" and produces one notification with merged actors: "Alice, Bob, and Carol commented on your post." This is what most consumer apps do and it is how a notification feed feels well-curated rather than spammy. The merge logic is small but high-leverage; users notice the difference.
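The merge step can be sketched as below. This assumes the pending notifications have already been buffered for the short window; the dictionary keys (`recipient_id`, `actor_name`, and so on) and the "and N others" wording for larger groups are illustrative choices, not something the pipeline above prescribes.

```python
from collections import defaultdict


def merged_actor_text(actors: list[str], action: str) -> str:
    """Render 'Alice, Bob, and Carol commented on your post' style text."""
    names = list(dict.fromkeys(actors))  # de-duplicate, keep arrival order
    if not names:
        raise ValueError("at least one actor is required")
    if len(names) == 1:
        who = names[0]
    elif len(names) == 2:
        who = f"{names[0]} and {names[1]}"
    elif len(names) == 3:
        who = f"{names[0]}, {names[1]}, and {names[2]}"
    else:
        who = f"{names[0]}, {names[1]}, and {len(names) - 2} others"
    return f"{who} {action}"


def coalesce(pending: list[dict]) -> dict[tuple, str]:
    """Group buffered notifications by (recipient, type, post) and merge actors."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for item in pending:
        key = (item["recipient_id"], item["notification_type"], item["post_id"])
        groups[key].append(item["actor_name"])
    return {
        key: merged_actor_text(actors, "commented on your post")
        for key, actors in groups.items()
    }


pending = [
    {"recipient_id": "user_7", "notification_type": "comment.created",
     "post_id": "post_99", "actor_name": name}
    for name in ("Alice", "Bob", "Carol")
]
for text in coalesce(pending).values():
    print(text)  # Alice, Bob, and Carol commented on your post
```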

Steps 3-4: preferences and channel selection

Preferences are the part teams underestimate. The minimum schema:

CREATE TABLE notification_preferences (
    user_id         TEXT NOT NULL,
    notification_type   TEXT NOT NULL,  -- 'comment.reply', 'mention', 'digest', etc.
    channel         TEXT NOT NULL,  -- 'email', 'push', 'sms', 'in_app'
    enabled         BOOLEAN NOT NULL,
    quiet_hours_start   TIME,
    quiet_hours_end     TIME,
    timezone        TEXT,
    PRIMARY KEY (user_id, notification_type, channel)
);

The cardinality is users * types * channels. For ten thousand users, twenty notification types, and four channels, that is 800,000 rows. Indexed appropriately, that is fine.

Three rules I would write into the spec:

  1. Default preferences are explicit, not implicit. Every notification type has a documented default (on or off) per channel. New users get the defaults; the defaults are reviewed when new types are added.
  2. Quiet hours are timezone-aware. A user's 10 PM is not your server's 10 PM. Store the timezone with the user; convert at decision time.
  3. One "unsubscribe everything" toggle that actually works. Legal requirement in many jurisdictions. Test it. Verify that unsubscribed users get nothing across all channels.
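Rule 2 is where most bugs hide, because quiet hours usually wrap past midnight. A small sketch of the check, assuming the caller passes the current UTC time and the user's tzinfo; the function name and the fixed-offset stand-in in the example are illustrative. In practice you would build the tzinfo from the stored timezone name (for example with `zoneinfo.ZoneInfo`) so daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo, start: time, end: time) -> bool:
    """True if the user's local wall-clock time falls inside [start, end)."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        # Same-day window, e.g. 13:00 to 15:00.
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


now_utc = datetime(2026, 4, 30, 14, 30, tzinfo=timezone.utc)
utc_plus_9 = timezone(timedelta(hours=9))  # stand-in for a stored IANA zone

# 14:30 UTC is 23:30 at UTC+9, inside a 22:00-07:00 quiet window.
print(in_quiet_hours(now_utc, utc_plus_9, time(22, 0), time(7, 0)))    # True
print(in_quiet_hours(now_utc, timezone.utc, time(22, 0), time(7, 0)))  # False
```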

The channel selection logic, given preferences:

Channel selection rules
  - If the user has disabled this notification_type for all channels: skip
  - If the user is in quiet hours: skip non-urgent channels (push, email)
                                   allow urgent channels (sms, in-app)
  - If the channel address is missing/invalid (no email, no FCM token): skip that channel
  - Otherwise: dispatch to each enabled channel

The "urgent" classification is per-notification-type. A login-from-new-device alert is urgent and ignores quiet hours; a daily digest is not and respects them.
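Put together, the selection rules might look like the sketch below. `ChannelPref`, `select_channels`, and the idea of treating the user id as the "address" for in-app delivery are assumptions made for the example; `urgent` is expected to come from the notification type, as described above.

```python
from dataclasses import dataclass
from typing import Optional

# Channels held back during quiet hours, per the rules above.
QUIET_SUPPRESSED = {"push", "email"}


@dataclass
class ChannelPref:
    channel: str            # 'email' | 'push' | 'sms' | 'in_app'
    enabled: bool
    address: Optional[str]  # email address, phone number, push token, or user id


def select_channels(prefs: list[ChannelPref], urgent: bool, quiet_now: bool) -> list[str]:
    selected = []
    for pref in prefs:
        if not pref.enabled:
            continue  # user turned this type off for this channel
        if not pref.address:
            continue  # nothing to deliver to
        if quiet_now and not urgent and pref.channel in QUIET_SUPPRESSED:
            continue  # respect quiet hours for non-urgent types
        selected.append(pref.channel)
    return selected


prefs = [
    ChannelPref("email", True, "y@example.com"),
    ChannelPref("push", True, None),          # no push token registered
    ChannelPref("sms", False, "+15550100"),   # disabled by the user
    ChannelPref("in_app", True, "user_7"),
]
print(select_channels(prefs, urgent=False, quiet_now=True))   # ['in_app']
print(select_channels(prefs, urgent=True, quiet_now=True))    # ['email', 'in_app']
print(select_channels(prefs, urgent=False, quiet_now=False))  # ['email', 'in_app']
```

An empty result is the "skip" case from the first rule: every channel is disabled, unreachable, or held by quiet hours.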

Step 5: deduplication

Two distinct kinds of dedup:

Event-level dedup: the same upstream event was published twice. The deduplicator stores the event_id for some window (24 hours is typical) and drops repeats. This catches at-least-once delivery from upstream queues.

Notification-level dedup: the user did the same action twice in quick succession (commented twice, edited and re-published). The deduplicator stores a key like (user_id, notification_type, context_hash) for a shorter window (5-10 minutes) and collapses repeats into one notification. This is what prevents "three emails in two minutes" rage-quits.

The implementation is a Redis key written with SET ... NX EX (one key per event_id, with a TTL) or a Postgres table with a TTL cleanup job. Both work; pick the one that matches your other infrastructure.

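import redis

client = redis.Redis()  # assumes a reachable Redis; connection settings omitted

def is_duplicate(event_id: str) -> bool:
    # SET ... NX EX: write the key only if it does not exist, expire in 24h.
    # redis-py returns None when NX blocks the write, i.e. the event was already seen.
    # Note the side effect: the first call for an event_id also records it.
    return client.set(f"event_dedup:{event_id}", "1", nx=True, ex=86400) is None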

The SET ... NX EX pattern is atomic and is the standard way to do this kind of dedup in Redis.
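The notification-level key can be built the same way. The sketch below shows one possible `context_hash` and uses an in-memory stand-in with the same "set if absent, with expiry" behavior so it runs without Redis. `dedup_key`, `InMemoryDedup`, and the key prefix are hypothetical names; in production the store would be the Redis or Postgres option above, since a per-process dictionary is not shared across workers.

```python
import hashlib
import json
import time


def dedup_key(user_id: str, notification_type: str, context: dict) -> str:
    """Stable key for (user_id, notification_type, context_hash)."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    context_hash = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return f"notif_dedup:{user_id}:{notification_type}:{context_hash}"


class InMemoryDedup:
    """Single-process stand-in for SET key 1 NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._expires: dict[str, float] = {}
        self._clock = clock

    def seen_recently(self, key: str, ttl_seconds: float) -> bool:
        now = self._clock()
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return True  # still inside the window: collapse this one
        self._expires[key] = now + ttl_seconds
        return False


now = [0.0]  # fake clock so the example is deterministic
dedup = InMemoryDedup(clock=lambda: now[0])
key = dedup_key("user_7", "comment.reply", {"post_id": "post_99"})

print(dedup.seen_recently(key, ttl_seconds=600))  # False (first time)
print(dedup.seen_recently(key, ttl_seconds=600))  # True  (repeat within 10 min)
now[0] = 601.0
print(dedup.seen_recently(key, ttl_seconds=600))  # False (window expired)
```

Serializing the context with sorted keys keeps the hash stable regardless of the order in which upstream services emit fields.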

Step 5, continued: throttling

Even with perfect dedup, a user can legitimately generate too many notifications. Someone replies to ten of your comments in one minute; that is ten valid comment.reply events. Sending ten notifications is correct from a logical standpoint and broken from a user-experience standpoint.

The throttle is a per-user-per-channel-per-type rate limit. Token bucket works well here: refill rate of one notification per minute, capacity of three, applied per user per channel per type. Once the bucket is empty (three notifications in quick succession), the rest get coalesced into a digest; in the ten-reply example above, that is three immediate notifications and one roll-up ("7 more replies to your comments").

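class NotificationThrottle:
    def admit(self, user_id, channel, notification_type, count=1):
        # One bucket per (user, channel, type).
        bucket = f"throttle:{user_id}:{channel}:{notification_type}"
        # token_bucket_admit is a helper (not shown) that runs the token bucket
        # logic against Redis. refill is in tokens per second: 1/60 is one per minute.
        return token_bucket_admit(bucket, refill=1/60, capacity=3, cost=count)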

If the throttle rejects, the notification is queued for the next digest run instead of being dropped. The user gets a roll-up after the throttle window expires.
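For readers who have not written one, here is a minimal in-memory token bucket with the same parameters. It is a sketch of the technique, not the Redis-backed helper referenced above: the `TokenBucket` class and the injectable `clock` are assumptions, and a multi-process deployment would need the state in shared storage with an atomic update.

```python
import time


class TokenBucket:
    def __init__(self, refill_per_second: float, capacity: float, clock=time.monotonic):
        self.refill = refill_per_second
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.clock = clock
        self.updated = clock()

    def admit(self, cost: float = 1) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(refill_per_second=1 / 60, capacity=3, clock=lambda: now[0])

# Ten replies arrive in the same instant: three go out, seven wait for the digest.
decisions = [bucket.admit() for _ in range(10)]
print(decisions.count(True), decisions.count(False))  # 3 7

now[0] = 90.0  # ninety seconds later, one and a half tokens have refilled
print(bucket.admit(), bucket.admit())  # True False
```

The rejected notifications are the ones that would be queued for the roll-up rather than dropped.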

Steps 6-7: rendering and dispatch

Rendering is templating with channel-specific concerns. Email needs a subject, HTML body, plain-text fallback, and reply-to address. Push needs a title and short body (under ~150 characters on most platforms). SMS needs a single short string with no formatting. The same notification produces different content per channel.

I have used a template-per-channel approach: comment_reply.email.html, comment_reply.push.json, comment_reply.sms.txt. Each template renders against the same context object. Adding a new channel is adding new templates, not new logic.
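A toy version of the template-per-channel idea, using the standard library's `string.Template` and in-memory strings in place of template files. The `TEMPLATES` layout, the `render` function, and the context field names are illustrative assumptions. Note that `string.Template` does no HTML escaping, so a real HTML email template would need an engine that does.

```python
from string import Template

# One entry per (notification type, channel); each channel has its own parts.
TEMPLATES = {
    "comment_reply": {
        "email": {
            "subject": Template("$actor replied to your comment"),
            "text": Template("$actor wrote: $excerpt\nView it here: $url"),
        },
        "push": {
            "title": Template("New reply"),
            "body": Template("$actor: $excerpt"),
        },
        "sms": {
            "text": Template("$actor replied to your comment: $url"),
        },
    }
}


def render(notification_type: str, channel: str, context: dict) -> dict[str, str]:
    """Render every part of one channel's template against a shared context."""
    parts = TEMPLATES[notification_type][channel]
    return {name: template.substitute(context) for name, template in parts.items()}


context = {"actor": "Alice", "excerpt": "Nice post", "url": "https://example.com/p/99"}
print(render("comment_reply", "push", context))
# {'title': 'New reply', 'body': 'Alice: Nice post'}
print(render("comment_reply", "sms", context)["text"])
# Alice replied to your comment: https://example.com/p/99
```

`substitute` raises `KeyError` when the context is missing a field, which is usually what you want: a template rendered with a hole should fail before dispatch, not after.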

Dispatch is the wrapper around the actual provider call. The dispatcher's job:

  1. Pick the provider (SES, SendGrid, Twilio, FCM, APNs).
  2. Make the call with the rendered content.
  3. Record the receipt (provider message ID, success/failure, error code).
  4. On retryable failure, requeue with backoff.
  5. On non-retryable failure (invalid email, expired push token), mark the channel as broken on the user's account.

The dispatcher is the layer that knows about provider-specific quirks. SES rejects email addresses that fail strict validation; FCM tokens expire; Twilio has per-country restrictions. The rest of the pipeline should not know about any of that.
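The dispatcher's decision logic can be sketched independently of any provider. Everything here is hypothetical scaffolding: `RetryableError` and `PermanentError` stand for whatever a provider adapter raises after translating provider-specific error codes, `send` is any callable that returns the provider's message ID, and the backoff constants are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class RetryableError(Exception):
    """Provider timeout, throttling, 5xx: worth trying again."""


class PermanentError(Exception):
    """Invalid address, expired push token: retrying cannot help."""


@dataclass
class Receipt:
    status: str  # 'accepted' | 'retry' | 'channel_broken'
    external_message_id: Optional[str] = None
    retry_in_seconds: Optional[float] = None
    error: Optional[str] = None


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff: 2s, 4s, 8s, ... capped at five minutes."""
    return min(cap, base * 2 ** attempt)


def dispatch(send: Callable[[dict], str], payload: dict, attempt: int = 0) -> Receipt:
    try:
        message_id = send(payload)
    except RetryableError as exc:
        return Receipt("retry", retry_in_seconds=backoff_seconds(attempt), error=str(exc))
    except PermanentError as exc:
        return Receipt("channel_broken", error=str(exc))
    return Receipt("accepted", external_message_id=message_id)


def fake_ok(payload: dict) -> str:
    return "msg_123"


def fake_timeout(payload: dict) -> str:
    raise RetryableError("provider timeout")


print(dispatch(fake_ok, {"to": "y@example.com"}).status)                   # accepted
print(dispatch(fake_timeout, {"to": "y@example.com"}, attempt=2).retry_in_seconds)  # 8.0
```

The caller is responsible for acting on the receipt: requeueing on 'retry', and flagging the channel on the user's account on 'channel_broken'.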

Step 8: receipt tracking and bounce handling

The dispatch call returns "the provider accepted the message", not "the user got it". Acceptance is the easy half; actual delivery is what providers report later, asynchronously, via webhooks. Without wiring those webhooks back into your data, "did this email reach the user" is unanswerable, and the unsubscribe and re-engagement flows downstream have no source of truth.

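Most email and SMS providers report what happened after acceptance asynchronously: SendGrid and Postmark post webhooks, SES publishes events through SNS or a configured event destination, and Twilio calls a status callback URL. The event set varies by provider and channel: delivered, bounced, complained, opened, and clicked are typical for email, while push providers such as FCM expose much less per-message delivery data. The handler parses the event, looks up the dispatch row by the external message ID the provider returned at send time, and writes a receipt: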

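CREATE TABLE notification_receipts (
    external_message_id TEXT NOT NULL,     -- the provider's ID returned at dispatch
    notification_id     TEXT NOT NULL,     -- joins back to the dispatch row
    event_type          TEXT NOT NULL,     -- 'delivered' | 'bounced' | 'opened' | 'clicked' | 'complained'
    bounce_kind         TEXT,              -- 'hard' | 'soft' | null
    received_at         TIMESTAMP NOT NULL,
    -- One message produces several events (delivered, then clicked), so the key is
    -- the pair; a redelivered webhook for the same event can be ignored on conflict.
    PRIMARY KEY (external_message_id, event_type)
);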

The bounce taxonomy matters. A hard bounce (mailbox does not exist, recipient blocked) means the address is permanently undeliverable; mark the user's email as invalid, stop sending to it, and surface a "please update your email" prompt next login. A soft bounce (mailbox full, temporary DNS failure) means try again; retry up to three or four times with backoff before treating it as hard. Conflating the two is what produces bounce loops in the first place: every retry to a hard-bounced address damages your domain's sender reputation, and a few thousand of those will get your IP rate-limited or blacklisted across providers.
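The hard/soft distinction reduces to a small state update per bounce event. A sketch, assuming a per-address `EmailState` record and a retry budget of three (the low end of the range above); both names and the returned action strings are illustrative.

```python
from dataclasses import dataclass

MAX_SOFT_RETRIES = 3  # after this many retries, a soft bounce is treated as hard


@dataclass
class EmailState:
    valid: bool = True
    soft_bounces: int = 0


def apply_bounce(state: EmailState, bounce_kind: str) -> str:
    """Update the address state and return the next action for the dispatcher."""
    if bounce_kind == "hard":
        state.valid = False
        return "stop_sending"
    if bounce_kind == "soft":
        state.soft_bounces += 1
        if state.soft_bounces > MAX_SOFT_RETRIES:
            state.valid = False  # retry budget exhausted: treat as hard
            return "stop_sending"
        return "retry_with_backoff"
    raise ValueError(f"unknown bounce kind: {bounce_kind!r}")


state = EmailState()
print([apply_bounce(state, "soft") for _ in range(4)])
# ['retry_with_backoff', 'retry_with_backoff', 'retry_with_backoff', 'stop_sending']
print(state.valid)  # False
```

Once `valid` is false, the channel selection stage should treat the email address as missing, which is what breaks the bounce loop.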

The receipts table also feeds the click-through dashboards described in the observability section later in this article, and gives compliance a clean audit trail when someone asks "prove this user actually received the GDPR notice we sent."

Step 9: aggregates and digests

The aggregate I find essential is unread count. For most apps, the badge on the icon is the unread notification count. Updating that aggregate atomically with the dispatch is an interesting design problem.

The pattern I have used: write the notification to an inbox table at dispatch time, with read_at null. The unread count is a query against that table. When the user opens the inbox, mark as read. The count is always correct because it is derived, not maintained separately.
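The derived-count pattern is easy to see with an in-memory SQLite database. The `inbox` table layout and the function names below are assumptions for the example; the point is only that the badge number is a query, never a separately maintained counter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE inbox (
        notification_id TEXT PRIMARY KEY,
        user_id         TEXT NOT NULL,
        body            TEXT NOT NULL,
        created_at      TEXT NOT NULL,
        read_at         TEXT            -- NULL means unread
    )
    """
)


def add_to_inbox(notification_id: str, user_id: str, body: str, created_at: str) -> None:
    # INSERT OR IGNORE keeps the write idempotent if dispatch is retried.
    conn.execute(
        "INSERT OR IGNORE INTO inbox VALUES (?, ?, ?, ?, NULL)",
        (notification_id, user_id, body, created_at),
    )


def unread_count(user_id: str) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM inbox WHERE user_id = ? AND read_at IS NULL",
        (user_id,),
    ).fetchone()
    return row[0]


def mark_all_read(user_id: str, read_at: str) -> None:
    conn.execute(
        "UPDATE inbox SET read_at = ? WHERE user_id = ? AND read_at IS NULL",
        (read_at, user_id),
    )


add_to_inbox("n1", "user_7", "Alice commented on your post", "2026-04-30T10:00:00Z")
add_to_inbox("n2", "user_7", "Bob commented on your post", "2026-04-30T10:01:00Z")
add_to_inbox("n2", "user_7", "Bob commented on your post", "2026-04-30T10:01:00Z")  # retry
print(unread_count("user_7"))  # 2
mark_all_read("user_7", "2026-04-30T11:00:00Z")
print(unread_count("user_7"))  # 0
```

At larger scale an index on (user_id, read_at) keeps this query cheap; whether the derived count stays fast enough is something to measure rather than assume.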

The digest is the other aggregate: a daily or weekly summary of notifications the user did not engage with directly. The digest job runs at a scheduled time, queries the user's unread notifications since the last digest, renders them as a single email or in-app summary, and sends. This is what catches up users who do not check the app daily.

Five failure modes I have hit in production

  1. Preference inconsistency. The user updates their preferences in the app; the notification service reads stale preferences from a cached copy and sends an unwanted email. The fix is to invalidate the cache on every preference change, or read preferences directly from the source of truth at decision time. Cache invalidation is a more reliable answer than "hope the cache TTL is short enough."
  2. Provider outage. SES is down; emails queue up. The fix is to have a retry queue with exponential backoff and a dead-letter queue for messages that exceed the retry budget. Monitor both.
  3. Wrong-timezone digest. The digest job runs at server-time midnight, so users receive it at arbitrary local times. The fix is to schedule the digest per user timezone: run the job every hour (or more often) around the clock and send to the users whose local time has just reached the digest hour.
  4. Bounce loops. A bounced email keeps getting resent because the bounce is not being processed. SES, SendGrid, and most providers send webhooks for bounces; you must subscribe and update the user's email status on bounce. Otherwise you keep hitting the same dead address.
  5. Notification storms during incidents. Something goes wrong upstream; a million events get republished as a backlog drains; users get a hundred notifications each. The fix is a circuit breaker on the throttler: if a user is being notified more than N times per minute for any reason, halt their dispatch until a human intervenes.
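The circuit breaker from item 5 differs from the throttle in one respect: once tripped, it stays tripped until someone resets it. A minimal single-process sketch, where the `StormBreaker` class, its sliding one-minute window, and the `reset` hook are all illustrative assumptions:

```python
import time
from collections import defaultdict, deque


class StormBreaker:
    """Trips per user when sends exceed a per-minute cap; needs a manual reset."""

    def __init__(self, max_per_minute: int, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.sent = defaultdict(deque)  # user_id -> timestamps of recent sends
        self.tripped = set()

    def allow(self, user_id: str) -> bool:
        if user_id in self.tripped:
            return False  # stays open until a human calls reset()
        now = self.clock()
        window = self.sent[user_id]
        while window and now - window[0] >= 60:
            window.popleft()  # forget sends older than one minute
        if len(window) >= self.max_per_minute:
            self.tripped.add(user_id)
            return False
        window.append(now)
        return True

    def reset(self, user_id: str) -> None:
        self.tripped.discard(user_id)
        self.sent[user_id].clear()


now = [0.0]  # fake clock so the example is deterministic
breaker = StormBreaker(max_per_minute=5, clock=lambda: now[0])

print([breaker.allow("user_7") for _ in range(8)])
# [True, True, True, True, True, False, False, False]
now[0] = 3600.0
print(breaker.allow("user_7"))  # False: time alone does not close the breaker
breaker.reset("user_7")
print(breaker.allow("user_7"))  # True
```

Tripping should also raise an alert; a breaker that halts dispatch silently just converts a noise problem into a drop problem.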

A topic worth its own paragraph: observability for notifications. Three dashboards I would build day one:

Required dashboards
  - per-channel delivery rate (last 1h, last 24h)
  - per-channel error rate, broken down by error code (bounce, rate-limited, invalid token)
  - per-notification-type send count and click-through rate

The first tells you whether the system is delivering at all. The second tells you which provider or which user-population is having problems. The third tells you which notification types are actually useful (low click-through means users do not care, and you should consider muting that type by default). Teams that skip these dashboards do not know their notification service is broken until users complain. Teams that have them know within minutes.

What I would build first, what I would build last

Build first:

  1. The event schema, with event_id as the dedup primary key.
  2. The preferences API and preference enforcement at the recipient resolver.
  3. The dedup and throttle layers.
  4. The audit log.

Build last:

  1. Provider integrations (SES, FCM, Twilio).
  2. Templates and rendering.
  3. Aggregates (unread count, digest).

This order is the reverse of how most teams build it, and that is the source of most of my advice. Building the dispatch first means you are sending notifications without preferences, without dedup, without throttling. The first user complaint comes within a week. Building the controls first means the day you hook up the first provider, the system already respects everything that matters.

What I tell teams scoping a notification service

Three sentences that compress everything above:

  1. Notifications are not a delivery problem; they are a preferences-and-dedup problem.
  2. The bottom of the stack is the easy part; build it last.
  3. Every team I have worked with that skipped this advice ended up rewriting the service within a year.

I want to be specific about one more thing teams underestimate: legal and regulatory requirements. Email is covered by CAN-SPAM in the US, which requires a working opt-out mechanism and honest sender information, and by GDPR and related rules in the EU, which require consent for marketing and an easy way to withdraw it. On top of the law, large mailbox providers such as Gmail and Yahoo expect one-click unsubscribe on bulk mail. SMS has stricter rules: per-country opt-in requirements, per-country sender ID rules, and restrictions on the time of day messages can be sent. Push notifications have Apple and Google policy rules about what can and cannot be sent without explicit consent. Building these into the system day one is much cheaper than retrofitting them after a compliance audit. The unsubscribe toggle, the consent record (when did the user opt in to SMS, what was the consent text), and the audit log together provide most of the evidence a compliance review needs. Skip them and the same review will require a code freeze to add them under deadline.

A note on choosing providers

Four categories of choice every team makes:

Provider choices, with rough reasoning
  Email          SES (cheap, AWS-native), SendGrid (better deliverability), Postmark (best transactional)
  Push           FCM (cross-platform, free), OneSignal (multi-channel)
  SMS            Twilio (broad country coverage), MessageBird (Europe-strong), Plivo (cheapest at scale)
  Multi-channel  Courier, Knock, MagicBell (full notification platforms)

The make-or-buy decision for the platform itself: if your team is small (under twenty engineers) and notifications are not a core differentiator, evaluate buying. Courier and Knock both offer roughly the architecture I described above as a managed service. The cost (a few thousand dollars a month) is less than the cost of one engineer's quarter, and the buyers I have talked to are happy with the trade. If notifications are core to your product (a chat app, a financial alerts product, a real-time monitoring tool), build. The customization you need will exceed what a vendor offers. For everything in between, build the dedup and preferences layer in-house, and use providers for the actual delivery.

A small operational note: most providers send delivery webhooks (delivered, bounced, complained, clicked). Subscribe to those webhooks and feed the events back into your audit log. Without that signal, you do not know whether your notifications are actually reaching users. With it, you can spot a deliverability problem the moment it starts.

Build the controls before the dispatch

A well-built notification service is invisible. Users get the notifications they want, in the channels they want, at the times they want, exactly once. They never wonder how it works because it just works. A poorly built one is a top complaint source for the whole product. The difference between the two is not delivery infrastructure (both have that); it is the discipline to model preferences, dedup, throttling, and audit before the first message is sent. The teams that get notifications right treat it as an architecture problem, not an integration problem. The teams that get it wrong treat it as "hook up SES, ship."
