Community Article

Event-Driven Architecture and the Three Failure Modes

Lost messages, out-of-order delivery, duplicate processing. EDA buys decoupling and replay; the price is three failure modes you must operate.

event-driven

message-queue

kafka

distributed-systems

system-design

By @kavyanovak

February 18, 2026

Updated May 18, 2026

A team I joined had moved to event-driven architecture two years before I arrived. The migration story they told me was a victory: monolith decomposed into seven services, communicating via a Kafka cluster, latency down, throughput up. The story they did not tell me was the on-call rotation, which had a roughly weekly incident with one of three causes. A message was lost. A message arrived out of order. A message was processed twice. After three months I had a name for each of those failure modes and a runbook for each, and I realized that what they had built was not a monolith-killer but a different architecture with a different operational cost profile that nobody had budgeted for.

This article is the version of "adopting EDA" I would write for the team I joined. My stance: event-driven architecture is the right answer for some problems and the wrong answer for many others. The wrongness almost always shows up in the same three failure modes, and a team that cannot name them, design for them, and operate them probably should not adopt the architecture yet.

What event-driven architecture actually is

Event-driven architecture is a style where services communicate by publishing and subscribing to events on a shared message bus, instead of calling each other directly via RPC or HTTP. A user signs up, the auth service emits a UserSignedUp event, the email service consumes it and sends a welcome email, the analytics service consumes it and increments a signup counter, the billing service consumes it and creates an empty subscription record.

The advertised wins are real:

Decoupling. The auth service does not know who consumes its events. Adding a new consumer (the analytics team wants signup events too) requires no changes to auth.
Throughput. A message bus can handle huge volumes; partitioning lets you scale consumers horizontally.
Replay. Most modern brokers (Kafka, Pulsar) keep events on disk for days. A new consumer can replay history; a buggy consumer can be reset and re-process.
Resilience. The producer does not block on consumer health. A downed consumer eventually catches up when it comes back.
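
To make the decoupling and replay claims concrete, here is a minimal in-memory sketch. The InMemoryBus class and its subscribe/publish methods are hypothetical stand-ins for a real broker, not any library's API; the point is only that the publisher never references its consumers and that a late consumer can catch up from retained history.

```python
from collections import defaultdict
from typing import Callable


class InMemoryBus:
    """Toy stand-in for a broker: an append-only log per topic plus subscribers."""

    def __init__(self) -> None:
        self.log = defaultdict(list)
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None], replay: bool = False) -> None:
        if replay:
            # A late consumer can catch up from retained history.
            for event in self.log[topic]:
                handler(event)
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer only appends; it has no idea who is listening.
        self.log[topic].append(event)
        for handler in self.subscribers[topic]:
            handler(event)


bus = InMemoryBus()
emails_sent = []
signup_count = {"n": 0}


def send_welcome_email(event: dict) -> None:
    emails_sent.append(event["email"])


def count_signup(event: dict) -> None:
    signup_count["n"] += 1


bus.subscribe("UserSignedUp", send_welcome_email)
bus.subscribe("UserSignedUp", count_signup)

# The "auth service" publishes once; both consumers react.
bus.publish("UserSignedUp", {"user_id": 42, "email": "a@example.com"})

# Billing is added later and replays history; the publishing code is untouched.
subscriptions = []
bus.subscribe("UserSignedUp", lambda e: subscriptions.append(e["user_id"]), replay=True)
```

A real broker adds durability, partitions, and consumer offsets, which is exactly where the failure modes below come from.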

None of these wins is free. They come with three failure modes that direct RPC does not have. I will name each, show how it manifests, and end with the design choices that contain it.

Failure mode 1: lost messages

The first failure mode is the message that was supposed to be published but was not, or was published but never reached the consumer. In an RPC world, the caller knows when the call failed; in an EDA world, the producer publishes and moves on. A bug between the producer and the broker, or between the broker and the consumer, can drop a message and nobody notices for hours.

The two patterns I have seen drop messages most often:

Loss patterns
  1. Producer crashes after the database write but before the broker publish
  2. Broker accepts the message but the consumer offset is committed before processing finishes

The first is the dual-write problem. The auth service writes the user row to Postgres and then publishes UserSignedUp to Kafka. If the service crashes between those two operations, the user exists but the event was never published. The welcome email is never sent.

The fix for the first pattern is the outbox pattern: write the event into an outbox table in the same database transaction as the user row, then have a separate process drain the outbox into the broker. The transaction makes the database write and the outbox write atomic; the drainer publishes from the outbox at-least-once. Combined with idempotent consumers (more on that under failure mode 3), this gives you effectively-once processing: delivery is still at-least-once, but each event's side effect is applied only once.

A practical addendum on observability for the dual-write problem. Even with the outbox pattern, you want to monitor outbox lag (number of unpublished rows) and outbox age (how long the oldest unpublished row has been sitting). The outbox is the closest thing to a queue inside your database; treating it like one means alerting on it like one. I have seen outbox tables grow to hundreds of thousands of rows because the publisher process silently died and nobody noticed for two days. The fix is a metric and an alert; the metric is count(*) where published_at is null and the alert fires if that number exceeds a threshold appropriate to your traffic.
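
The two metrics can be computed with a single query. A minimal sketch, assuming an outbox table with created_at and published_at columns stored as epoch seconds; the column types, the thresholds, and the outbox_health helper are illustrative assumptions, and SQLite stands in for Postgres so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY, event_type TEXT, payload TEXT,"
    " created_at REAL NOT NULL, published_at REAL)"
)


def outbox_health(conn, now):
    """Return (lag, age_seconds): unpublished row count and age of the oldest one."""
    lag, oldest = conn.execute(
        "SELECT count(*), min(created_at) FROM outbox WHERE published_at IS NULL"
    ).fetchone()
    age = 0.0 if oldest is None else now - oldest
    return lag, age


now = 1_000_000.0  # fixed clock so the example is deterministic
rows = [
    ("UserSignedUp", "1", now - 600, now - 590),  # already published
    ("UserSignedUp", "2", now - 300, None),       # stuck for five minutes
    ("UserSignedUp", "3", now - 10, None),
]
conn.executemany(
    "INSERT INTO outbox (event_type, payload, created_at, published_at) VALUES (?, ?, ?, ?)",
    rows,
)

lag, age = outbox_health(conn, now)

# Hypothetical thresholds; pick values appropriate to your traffic.
MAX_LAG, MAX_AGE_SECONDS = 1000, 120
should_alert = lag > MAX_LAG or age > MAX_AGE_SECONDS
```

Alerting on age as well as count catches the low-traffic case where a dead publisher leaves only a handful of rows behind.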

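BEGIN;
-- Both inserts commit or roll back together, so the event cannot be lost
-- between the user write and the publish.
INSERT INTO users (id, email) VALUES ($1, $2);
-- The payload here is just the user ID as text; a fuller payload would usually be JSON.
INSERT INTO outbox (event_type, payload, created_at)
    VALUES ('UserSignedUp', $1::text, now());
COMMIT;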

A separate worker process, often running as part of the same service or as a sidecar, reads the outbox in order, publishes to Kafka, and marks each row as published. If it crashes mid-publish, on restart it sees unpublished rows and retries. Use the outbox row's primary key as the event ID that consumers dedupe on, so a duplicate publish is harmless if the consumer is idempotent.
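
The write path and the drainer can be sketched end to end. This is an illustration under assumptions, not the original team's code: SQLite stands in for Postgres, fake_publish stands in for a Kafka producer call, and the sign_up and drain_outbox helpers are hypothetical names.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published_at TEXT
    );
""")


def sign_up(conn, user_id, email):
    # "with conn" wraps both inserts in one transaction: both commit or neither does.
    with conn:
        conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("UserSignedUp", json.dumps({"user_id": user_id})),
        )


def drain_outbox(conn, publish):
    """Publish unpublished rows in id order; mark a row only after publish returns."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        # A crash after publish but before the UPDATE re-publishes this row on
        # restart, which is why the row id travels with the event as its ID.
        publish(event_id=row_id, event_type=event_type, payload=payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )


published = []


def fake_publish(event_id, event_type, payload):
    # Stand-in for a broker publish call.
    published.append((event_id, event_type))


sign_up(conn, 1, "a@example.com")
sign_up(conn, 2, "b@example.com")
drain_outbox(conn, fake_publish)
drain_outbox(conn, fake_publish)  # nothing left, so nothing is re-published
remaining = conn.execute(
    "SELECT count(*) FROM outbox WHERE published_at IS NULL"
).fetchone()[0]
```

Marking the row after the publish, never before, is what makes the drainer at-least-once rather than at-most-once.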

The fix for the second pattern is to commit offsets after processing, not before. Many Kafka client libraries enable auto-commit by default, which commits offsets periodically rather than when your processing has finished. If an offset is committed before its message's side effect is durable and the consumer then crashes, the message is lost. Switch to manual commit, commit only after the side effect is durable, and accept that a crash before commit will replay the message (which is fine if your consumer is idempotent).
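
The difference between the two commit orderings is easy to see in a small simulation. This is deliberately not Kafka client code: the log is a Python list, the "commit" is a local variable, and run_consumer, make_handler, and Crash are hypothetical names used only for illustration.

```python
class Crash(Exception):
    """Simulated consumer crash."""


def run_consumer(log, start_offset, handle, commit_first):
    """Consume from start_offset; return the committed offset, even after a crash."""
    committed = start_offset
    try:
        for offset in range(start_offset, len(log)):
            if commit_first:
                committed = offset + 1  # offset moves before the work is done
                handle(log[offset])
            else:
                handle(log[offset])
                committed = offset + 1  # offset moves only after the work is done
    except Crash:
        pass
    return committed


def make_handler(processed, crash_on):
    state = {"crashed": False}

    def handle(message):
        # Crash exactly once, the first time crash_on is seen.
        if message == crash_on and not state["crashed"]:
            state["crashed"] = True
            raise Crash()
        processed.append(message)

    return handle


log = ["m0", "m1", "m2"]


def simulate(commit_first):
    processed = []
    handle = make_handler(processed, crash_on="m1")
    offset = run_consumer(log, 0, handle, commit_first)  # crashes on m1
    run_consumer(log, offset, handle, commit_first)      # restart from committed offset
    return processed


lost = simulate(commit_first=True)   # m1 is skipped after the restart
safe = simulate(commit_first=False)  # m1 is replayed after the restart
```

In this simulation the crash happens before the side effect, so the replay produces no duplicate. A crash after the side effect but before the commit would, which is the subject of failure mode 3.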

If you skip the outbox pattern, you do not have an event-driven architecture. You have a hopeful architecture that mostly works.

Failure mode 2: out-of-order delivery

The second failure mode is the event that arrives after a later event. The user updates their email twice in quick succession; the second update event arrives at the consumer before the first; the consumer applies them in the wrong order, and the user's email reverts to the older value.

This happens because most message brokers preserve order only within a partition. Kafka guarantees order within a partition, not across partitions. If the producer round-robins messages across partitions, two related messages can land on different partitions and arrive in arbitrary order.

The standard fix is to partition by a stable key. For user-related events, partition by user ID. All UserUpdatedEmail events for user 42 land on the same partition, are read by the same consumer, and arrive in order.

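# Bad: no key, so related events can be spread across partitions
producer.send("users", value=event_payload)

# Good: partition by user ID, all events for one user arrive in order
# (assuming a kafka-python-style producer: the key must be bytes
# unless a key_serializer is configured)
producer.send("users", key=str(user_id).encode("utf-8"), value=event_payload)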
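
Why a key gives ordering comes down to a stable hash. A minimal sketch of the idea, with assumptions stated plainly: CRC32 is used here only because it is in the standard library (Kafka's default partitioner for keyed messages uses murmur2), and NUM_PARTITIONS and partition_for are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 6  # hypothetical partition count


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash, so equal keys always collide."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [(42, "email=a"), (7, "email=x"), (42, "email=b"), (42, "email=c")]

# Each partition is an ordered list, like a partition's log.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for user_id, change in events:
    partitions[partition_for(str(user_id))].append((user_id, change))

# Every event for user 42 sits on one partition, in publish order.
user_42_partition = partition_for("42")
user_42_events = [c for uid, c in partitions[user_42_partition] if uid == 42]
```

Python's built-in hash() is avoided on purpose: it is randomized per process for strings, so it would not be stable across producers.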

The cost of partitioning by user ID: hot users (those generating many events) put more load on one partition than the average. If user 42 is responsible for 5% of events, the partition that owns user 42 carries that 5% on top of its normal share. Most workloads are more uniform than that, but I have seen one customer's API integration generate enough events to saturate a single partition while others were near-idle. The mitigation there is a finer partitioning key (user ID + event type), at the cost of weaker ordering guarantees (different event types for the same user can arrive out of order).

There is a deeper issue: reordering can happen within a single producer too, if a failed publish is retried after a later message has already been written. Kafka producers handle this with enable.idempotence=true, which preserves in-partition ordering even with retries (it is the default in the Java client since Kafka 3.0). Make sure it is on in whichever client you use. It has a tiny throughput cost and prevents a real bug.
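
As a configuration fragment, the relevant producer settings look roughly like this. The property names are standard Kafka producer configuration keys; the dict-of-properties style matches clients such as confluent-kafka, and the broker address is a placeholder assumption. Check your own client's documentation for how it spells and passes these.

```python
# Producer settings for safe retries. The dict is only data here; pass it to
# your client's producer constructor in whatever form that client expects.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "enable.idempotence": True,             # broker discards duplicate retries, preserves order
    "acks": "all",                          # required when idempotence is enabled
    "max.in.flight.requests.per.connection": 5,  # idempotence allows at most 5
}
```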

Failure mode 3: duplicate processing

The third failure mode is the consumer that processes the same message twice. This is the most common of the three because at-least-once delivery is the default. The producer publishes, the consumer reads, the consumer crashes mid-processing before committing its offset, the consumer restarts, the broker redelivers the message, and the consumer processes it again.

The only fix is idempotent consumers. If the consumer's side effect is to send an email, the second processing sends a second email. If the side effect is to increment a counter, the counter is now wrong. The consumer must detect the duplicate and skip the side effect.

Three patterns for this, in order of strength:

Idempotency patterns
  1. Idempotent operation (PUT, DELETE)            naturally idempotent
  2. Deduplication table keyed by event ID         consumer-side dedupe
  3. State-based check ("did we already do X?")     business-logic dedupe

The first is the cleanest: structure the side effect so reprocessing is harmless. "Set the user's email to X" is idempotent regardless of how many times you do it. "Increment the counter" is not, but "set the counter to N" is.
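
A tiny sketch of the difference, with the same event delivered twice. The event shape and the total_signups field are hypothetical; the point is that an absolute assignment survives redelivery and a relative update does not.

```python
def apply_increment(state, event):
    # Relative update: every delivery changes the result.
    state["signups"] = state.get("signups", 0) + 1


def apply_set(state, event):
    # Absolute update: replaying the same event lands on the same value.
    state["signups"] = event["total_signups"]


event = {"event_id": "e-1", "total_signups": 10}
deliveries = [event, event]  # at-least-once delivery: the broker redelivers

incremented = {"signups": 9}
assigned = {"signups": 9}
for delivered in deliveries:
    apply_increment(incremented, delivered)
    apply_set(assigned, delivered)
```

After both deliveries the incremented counter has drifted to 11 while the assigned one is still the correct 10. The catch is that the producer must know the absolute value to put in the event.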

The second is the most general. The consumer keeps a table of (event_id, processed_at) and skips events whose ID is already in the table. This works even when the side effect cannot be made idempotent at the operation level. The table grows; you need a TTL or partition strategy to keep it bounded.
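
A minimal sketch of a dedupe table, assuming the side effect lives in the same database so both writes can share one transaction. SQLite stands in for the real database, and the table names, handle_once, and prune are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (
        event_id TEXT PRIMARY KEY,
        processed_at TEXT NOT NULL
    );
    CREATE TABLE signup_counter (id INTEGER PRIMARY KEY CHECK (id = 1), n INTEGER NOT NULL);
    INSERT INTO signup_counter (id, n) VALUES (1, 0);
""")


def handle_once(conn, event):
    """Apply the side effect unless this event_id was already recorded."""
    try:
        with conn:  # dedupe row and side effect commit together or not at all
            conn.execute(
                "INSERT INTO processed_events (event_id, processed_at)"
                " VALUES (?, datetime('now'))",
                (event["event_id"],),
            )
            conn.execute("UPDATE signup_counter SET n = n + 1 WHERE id = 1")
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: the event was seen before, so skip it.
        return False


def prune(conn, keep_days):
    """Bound the table by deleting dedupe rows older than keep_days."""
    with conn:
        conn.execute(
            "DELETE FROM processed_events WHERE processed_at < datetime('now', ?)",
            (f"-{keep_days} days",),
        )


event = {"event_id": "evt-1"}
first = handle_once(conn, event)
second = handle_once(conn, event)  # redelivery of the same event
counter = conn.execute("SELECT n FROM signup_counter WHERE id = 1").fetchone()[0]
prune(conn, keep_days=7)
kept = conn.execute("SELECT count(*) FROM processed_events").fetchone()[0]
```

The retention window has to be longer than the longest replay you expect, otherwise a replayed event outlives its own dedupe row.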

The third is for business-level checks. Before sending the welcome email, query the database: "has user 42 received a welcome email?" If yes, skip. This works without a dedupe table but requires the business logic to expose a queryable state.
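
The state-based check can be as small as a flag on the user record. A sketch with an in-memory dict standing in for the database; the welcome_email_sent field and the handler name are assumptions for illustration.

```python
# In-memory stand-in for the users table.
users = {42: {"email": "a@example.com", "welcome_email_sent": False}}
sent_emails = []


def on_user_signed_up(event):
    user = users[event["user_id"]]
    if user["welcome_email_sent"]:
        return  # business state says the work is already done
    sent_emails.append(user["email"])  # stand-in for the real send
    user["welcome_email_sent"] = True


event = {"user_id": 42}
on_user_signed_up(event)
on_user_signed_up(event)  # duplicate delivery is a no-op
```

Note the gap between the send and the flag update: a crash there still produces one duplicate on replay, so this pattern narrows the window rather than closing it.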

I have used pattern 2 most often. The dedupe table is a small operational cost (one extra index, one extra query per event) but it works for any side effect and it is simple to reason about.

A deeper subtlety on dedupe tables: the dedupe write and the side-effect write should ideally happen in the same transaction. If they do not, you can have a crash between the side effect and the dedupe row, and the next replay will repeat the side effect. For consumers writing to the same database where the dedupe table lives, this is straightforward: one transaction. For consumers whose side effect is an external API call (sending an email, hitting a webhook), full transactionality is impossible, and you fall back to either accepting that some duplicates may slip through during failures, or pushing dedupe responsibility to the external system if it supports idempotency keys.
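
When the external system supports idempotency keys, the consumer's job is to derive the key from the event rather than from the attempt. A sketch against a fake provider; FakeEmailProvider, its send signature, and the key format are all hypothetical and do not correspond to any real email API.

```python
class FakeEmailProvider:
    """Hypothetical external API that honors an idempotency key."""

    def __init__(self):
        self.sent = []
        self._seen_keys = set()

    def send(self, to, template, idempotency_key):
        if idempotency_key in self._seen_keys:
            return "duplicate_ignored"
        self._seen_keys.add(idempotency_key)
        self.sent.append((to, template))
        return "sent"


def handle_signup(provider, event):
    # Derive the key from the event ID so every retry of this event reuses it.
    key = f"welcome-email:{event['event_id']}"
    return provider.send(event["email"], "welcome", idempotency_key=key)


provider = FakeEmailProvider()
event = {"event_id": "evt-1", "email": "a@example.com"}
first = handle_signup(provider, event)
second = handle_signup(provider, event)  # replay after a crash
```

A key generated per attempt (a fresh UUID on each call, say) would defeat the purpose, because the replay would look like a new request.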

A fourth failure mode I should call out

Beyond the three above, there is a meta-failure: schema drift. The producer changes the event payload (renames a field, changes a type, or adds a field that a strict parser rejects), and the consumer breaks because it does not know about the new shape. This is not delivery-related, but it is specific to event-driven systems in the same way: the producer and consumer are decoupled in code but coupled in data shape, and that coupling is invisible until the schema changes.

The mitigations are the usual ones: schema registry (Confluent's, AWS Glue's), backward-compatible changes only (add optional fields, never remove or rename), and contract tests that fail in CI when the producer breaks an existing consumer's expected schema. Most teams underspend on this until they have shipped a payload-shape bug to production.
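
A contract test for the "add optional fields, never remove or rename" rule can be sketched in a few lines. This is a toy check over plain dicts, not how a schema registry evaluates Avro or Protobuf compatibility; the schema format and the breaking_changes helper are assumptions for illustration.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that violate the add-optional-only rule.

    Schemas are plain dicts: field name -> {"type": str, "required": bool}.
    """
    problems = []
    for name, spec in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed or renamed field: {name}")
        elif new_schema[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new_schema.items():
        if name not in old_schema and spec["required"]:
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "user_id": {"type": "int", "required": True},
    "email": {"type": "string", "required": True},
}
# Compatible: only adds an optional field.
v2_ok = {**v1, "referrer": {"type": "string", "required": False}}
# Breaking: renames email to email_address.
v2_bad = {
    "user_id": {"type": "int", "required": True},
    "email_address": {"type": "string", "required": True},
}

ok_problems = breaking_changes(v1, v2_ok)
bad_problems = breaking_changes(v1, v2_bad)
```

Run in the producer's CI against the last published schema, a non-empty result fails the build before the change reaches a consumer.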

When EDA is the right choice

I want to be specific about when this architecture earns its operational cost.

EDA is right when the producer and consumers are owned by different teams with different release cadences. The team owning the auth service should not have to coordinate releases with five downstream consumers. EDA gives them a stable contract (the event schema) and lets each team ship independently.

EDA is right when the work is asynchronous by nature. Sending a welcome email does not need to block the signup response. Updating an analytics counter does not need to block the user. The user-facing path returns immediately; the side effects happen out of band.

EDA is right when the same event has many consumers. A UserSignedUp event consumed by email, analytics, and billing services is much cheaper to fan out via a broker than to model as three RPC calls from auth.

EDA is right when you genuinely need replay. Audit, analytics rebuild, and recovering from a buggy consumer all benefit from the broker keeping events for days.

When EDA is the wrong choice

EDA is wrong when the workflow is request-response. If the user is waiting for the result, you do not want a fire-and-forget event. RPC is the right primitive there.

EDA is wrong when the producer needs to know the consumer succeeded. The decoupling that makes EDA scalable also removes the producer's visibility into consumer outcomes. If you need that visibility, you are recreating RPC over a broker, badly.

EDA is wrong when the team has not built the operational maturity. Lost messages, out-of-order delivery, and duplicate processing are not edge cases; they are guaranteed to happen. A team without an outbox table, idempotent consumers, monitoring on consumer lag, alerts on dead-letter queue depth, and runbooks for replay should not adopt EDA. The architecture will hide bugs until they explode.

EDA is wrong when the workflow is two services that always run together. If service A always synchronously triggers service B, an event between them is just slow RPC. Use RPC.

A pattern I would steal from teams that do this well

The teams I have seen run EDA without weekly incidents share a few habits:

Operational habits
  - every event has a stable, versioned schema in a registry
  - every event carries an event_id (UUID) and a timestamp
  - every consumer is idempotent by default (dedupe table or natural idempotence)
  - producers use the outbox pattern; nobody publishes from app code without a transactional write
  - dead-letter queues are monitored and have an SLA for clearing
  - consumer lag is alertable
  - replay is rehearsed (you have actually replayed events in production within the last quarter)
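
The first two habits usually end up as a shared envelope type. A minimal sketch, assuming JSON on the wire; the EventEnvelope class and its field names (schema_version, occurred_at) are hypothetical choices, not a standard.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventEnvelope:
    event_type: str
    schema_version: int
    payload: dict
    # Generated once at creation, so retries and replays carry the same ID.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


event = EventEnvelope("UserSignedUp", 1, {"user_id": 42})
other = EventEnvelope("UserSignedUp", 1, {"user_id": 43})
decoded = json.loads(event.to_json())
```

The important design choice is that the ID is assigned where the event is created, not where it is published, so a drainer retry cannot mint a second identity for the same event.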

That is a lot of operational machinery. Building it costs time. The teams that skip the machinery are the teams that have weekly incidents. The teams that build it have boring weeks.

What I tell engineers considering EDA

Two questions, asked in order:

1. Is the workflow asynchronous, fan-out, or owned by separate teams? If yes, EDA is on the table. If no, RPC is probably better.
2. Does your team have (or have a plan to build) the outbox pattern, idempotent consumers, schema registry, dead-letter queues, and replay runbooks? If yes, adopt EDA. If no, build those first or pick a different architecture.

The second question is the one teams skip. EDA is sold as a way to decouple services; it does that, but in exchange it requires a different set of operational disciplines than RPC. If you can budget for those disciplines, you get the wins. If you cannot, you get the failure modes without the wins.

What I would do differently next time

If I were starting a new project today and asynchrony was the right primitive, I would still reach for EDA. I would not skip the broker; I would skip the temptation to skip the operational layer. Specifically: I would build the outbox pattern on day one (not "we'll add it later"), I would write the dedupe table into the consumer template (not "we'll add it when we hit a duplicate"), and I would rehearse a replay scenario in staging before the first incident, not after.

The other thing I would change: I would write less code that uses events synchronously. The team I joined had a pattern where service A would publish an event and service B would do the work and publish a result event, and service A would block until it saw the result. That is RPC over a broker, and it has all the latency overhead of a broker without the decoupling benefit. If A needs B's result, A should call B directly. Events are for fan-out and asynchrony, not for request-response with extra steps.

Where I have landed

EDA is a real architectural style with real wins and a real operational tax. The wins are decoupling, throughput, and replay. The tax is the three failure modes (lost messages, out-of-order, duplicates) plus schema drift. A team that names all four, designs for all four, and operates all four can extract the wins. A team that adopts EDA because it sounds modern, without doing that work, ends up with a distributed monolith that fails in surprising ways. The architecture is not the problem; the operational discipline is the problem. If your team is willing to fund that discipline, EDA is a great choice. If not, please stay with RPC for another year and revisit when you have the maturity to handle what the broker brings with it.
