System Design Article

Event-Driven Architecture & Pub/Sub

Difficulty: Medium

Event-driven architecture (EDA) is a style where services communicate by emitting and reacting to immutable events instead of calling each other directly. This lesson covers the publish/subscribe pattern, the difference between event notification and event-carried state transfer, the role of an event bus, and how EDA reshapes coupling, scalability, and consistency. We compare it with request/response, walk through real implementations on Kafka, Kinesis, EventBridge, and SNS, and end with the operational pitfalls (event versioning, ordering, schema drift, observability) that bite teams who adopt EDA without preparation.


What is Event-Driven Architecture?

Event-driven architecture (EDA) is a style where services communicate by emitting events that describe something that has already happened, and by reacting to events they care about. Producers do not know who consumes their events; consumers do not know who produced them.

An event is an immutable record of a fact: 'Order 42 was created at 10:00:00.123'. It is past tense; the work has already been done. This is the key difference from a command (a request to do something) or a query (a request to learn something).

Text

---------- EDA at a glance ----------
  [OrderService] --emits--> 'OrderCreated' --> [EVENT BUS]
                                                  |
                +---------------------------------+--------------------+
                v                v                v                    v
        [InventoryService]  [BillingService]  [Notification]    [Analytics]

No service calls another. The event bus delivers the event to every interested subscriber.
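To make the fan-out concrete, here is a minimal in-memory sketch of a bus. The `EventBus` class and its `subscribe`/`publish` methods are hypothetical names invented for this example; a real bus such as Kafka or SNS adds durability, retries, and network delivery that this sketch omits.

```javascript
// Minimal in-memory event bus: maps an event type to its subscriber callbacks.
class EventBus {
    constructor() {
        this.subscribers = new Map(); // eventType -> array of handlers
    }

    subscribe(eventType, handler) {
        const handlers = this.subscribers.get(eventType) ?? [];
        handlers.push(handler);
        this.subscribers.set(eventType, handlers);
    }

    publish(event) {
        // Shallow-freeze a copy so no subscriber can mutate the shared fact.
        const fact = Object.freeze({ ...event });
        for (const handler of this.subscribers.get(event.type) ?? []) {
            handler(fact);
        }
    }
}

const bus = new EventBus();
const received = [];
bus.subscribe('OrderCreated', (e) => received.push(`inventory:${e.orderId}`));
bus.subscribe('OrderCreated', (e) => received.push(`billing:${e.orderId}`));

// The producer publishes once; it never references the subscribers.
bus.publish({ type: 'OrderCreated', orderId: 'o-42' });
console.log(received); // [ 'inventory:o-42', 'billing:o-42' ]
```

Adding a third subscriber is one more `subscribe` call; the publishing code does not change.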

EDA vs Request/Response

In request/response, the producer knows the consumer and waits for a reply.

Text

---------- Request/response ----------
  ServiceA --(call)--> ServiceB
            <--(reply)--

In EDA, the producer emits and forgets.

Text

---------- Event-driven ----------
  ServiceA --(emit OrderCreated)--> [BUS] --> ServiceB
                                          --> ServiceC
                                          --> ServiceD

Aspect	Request/Response	Event-Driven
Coupling	Tight (producer knows consumer)	Loose (producer knows nothing about consumers)
Consistency	Synchronous, immediate	Asynchronous, eventually consistent
Latency	Sum of all hops	Producer is fast; consumers process later
Failure mode	Cascade (downstream failure breaks caller)	Isolated (consumer failure does not affect producer)
Adding consumers	Requires producer change	Subscribe to existing events
Best for	Synchronous workflows, queries, user-facing	Async workflows, fan-out, integration

Most real systems mix both. User-facing APIs are request/response; back-end orchestration and integration are event-driven.

Three Patterns Inside EDA

EDA is a family. Picking the right variant matters.

1. Event Notification (the simplest)

The event carries only an identifier and minimal context. Consumers fetch full data from the producer if they need more.

Jsonc

// Event Notification: small, fetch-on-demand
{
    "type": "OrderCreated",
    "orderId": "o-42",
    "timestamp": "2026-04-26T10:00:00.123Z"
}

Pros: small events, no schema duplication. Cons: every consumer hits the producer's API to get details, recreating coupling.

2. Event-Carried State Transfer

The event carries all the state a consumer would need.

Jsonc

// Event-Carried State Transfer: full payload
{
    "type": "OrderCreated",
    "orderId": "o-42",
    "customerId": "c-7",
    "items": [
        { "sku": "abc", "qty": 2, "price": 19.99 },
        { "sku": "xyz", "qty": 1, "price": 49.99 }
    ],
    "totalAmount": 89.97,
    "currency": "USD",
    "timestamp": "2026-04-26T10:00:00.123Z"
}

Pros: consumers can build their own materialized view without calling the producer; true loose coupling. Cons: larger events, schema versioning becomes critical, data duplicated across consumers.
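The difference between the first two patterns shows up in the consumer code. The sketch below is illustrative only: `orderApi`, `handleNotification`, and `handleStateTransfer` are hypothetical names, and the "API" is an in-memory object standing in for a network call to the producer.

```javascript
// Hypothetical producer-side API that a notification consumer must call back into.
const orderApi = {
    calls: 0,
    orders: { 'o-42': { orderId: 'o-42', customerId: 'c-7', totalAmount: 89.97 } },
    getOrder(orderId) {
        this.calls += 1; // count callbacks to make the coupling visible
        return this.orders[orderId];
    },
};

// Event notification: the event has only an ID, so the consumer fetches the rest.
function handleNotification(event, view) {
    const order = orderApi.getOrder(event.orderId);
    view.set(order.orderId, order.totalAmount);
}

// Event-carried state transfer: everything needed is already in the event.
function handleStateTransfer(event, view) {
    view.set(event.orderId, event.totalAmount);
}

const viewA = new Map();
const viewB = new Map();
handleNotification({ type: 'OrderCreated', orderId: 'o-42' }, viewA);
handleStateTransfer({ type: 'OrderCreated', orderId: 'o-42', totalAmount: 89.97 }, viewB);

console.log(viewA.get('o-42'), viewB.get('o-42'), orderApi.calls); // 89.97 89.97 1
```

Both consumers end up with the same local view, but only the notification consumer depends on the producer being reachable at processing time.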

This is the pattern most modern EDA systems use.

3. Event Sourcing

The entire state of the system is derived from a log of events. To know the current balance of an account, replay every Credit/Debit event for that account. The event log is the source of truth; the application state is a derived view.

Event sourcing is a more advanced pattern (covered in detail in the Advanced track lesson 'Event Sourcing & CQRS') and is not the same thing as EDA: it describes how a service stores its own state, not how services communicate. Use event sourcing when you need a complete audit trail or time-travel queries; otherwise, plain event-carried state transfer is simpler.
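The replay idea can be sketched in a few lines. The event shapes (`Credit`, `Debit`, `amountCents`) and the `balanceOf` helper are assumptions made for this example; amounts are integer cents to avoid floating-point drift.

```javascript
// Hypothetical append-only event log covering two accounts.
const eventLog = [
    { type: 'Credit', accountId: 'a-1', amountCents: 10000 },
    { type: 'Debit', accountId: 'a-1', amountCents: 2550 },
    { type: 'Credit', accountId: 'a-2', amountCents: 500 },
    { type: 'Debit', accountId: 'a-1', amountCents: 1000 },
];

// Current state is a fold over the log: nothing stores the balance directly.
function balanceOf(accountId, events) {
    return events
        .filter((e) => e.accountId === accountId)
        .reduce((balance, e) => {
            if (e.type === 'Credit') return balance + e.amountCents;
            if (e.type === 'Debit') return balance - e.amountCents;
            return balance; // ignore event types this view does not care about
        }, 0);
}

console.log(balanceOf('a-1', eventLog)); // 6450

// Time travel: replay only the first two events to see the earlier balance.
console.log(balanceOf('a-1', eventLog.slice(0, 2))); // 7450
```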

The Event Bus

The event bus is the infrastructure that routes events from producers to subscribers. The choice of bus shapes the system's properties.

Bus	Type	Best for
Apache Kafka	Log-based	High-throughput event streams, replay, multiple consumer groups
AWS Kinesis	Log-based	AWS-native streaming; same shape as Kafka
Apache Pulsar	Log + queue hybrid	Multi-tenant messaging with both pub/sub and queue semantics
AWS SNS	Topic-based	Fan-out to many subscribers (HTTP, Lambda, SQS); fire-and-forget
AWS EventBridge	Event router with rules	Schema registry, content-based routing, AWS service events
Google Cloud Pub/Sub	Topic-based	GCP-native pub/sub with at-least-once delivery
RabbitMQ topic exchanges	Broker-based	Pub/sub for moderate scale with rich routing
NATS / NATS JetStream	Lightweight pub/sub	Low-latency, edge, or IoT scenarios

For most modern systems, Kafka (or its managed equivalent) is the default for high-throughput EDA, and EventBridge or SNS is the default for AWS-native lower-volume integration.
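Content-based routing, as listed for EventBridge in the table, means the bus inspects each event and delivers it only to targets whose rule matches. The sketch below is a simplified matcher written for this lesson, not AWS's implementation: `matches`, `route`, the rule list, and the target names are all hypothetical. It supports only exact-value lists and nested objects.

```javascript
// A pattern maps field names to either a list of accepted values
// or a nested pattern for a nested object.
function matches(pattern, event) {
    return Object.entries(pattern).every(([key, expected]) => {
        const actual = event?.[key];
        if (Array.isArray(expected)) return expected.includes(actual);
        return typeof actual === 'object' && actual !== null && matches(expected, actual);
    });
}

// Hypothetical routing rules: each one pairs a pattern with a target name.
const rules = [
    { target: 'billing-queue', pattern: { type: ['OrderCreated'] } },
    { target: 'fraud-check', pattern: { type: ['OrderCreated'], detail: { currency: ['USD', 'EUR'] } } },
    { target: 'shipping-queue', pattern: { type: ['OrderShipped'] } },
];

function route(event) {
    return rules.filter((r) => matches(r.pattern, event)).map((r) => r.target);
}

console.log(route({ type: 'OrderCreated', detail: { currency: 'USD' } }));
// [ 'billing-queue', 'fraud-check' ]
console.log(route({ type: 'OrderCreated', detail: { currency: 'JPY' } }));
// [ 'billing-queue' ]
```

The producer emits one event either way; which targets receive it is decided entirely by rules held in the bus.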

A Concrete Example: E-Commerce Checkout

Request/response version:

Text

---------- Request/response checkout ----------
  Client --POST /checkout--> [Order Service]
                                  |
                                  |--call--> [Inventory] reserve
                                  |--call--> [Payment] charge
                                  |--call--> [Shipping] create label
                                  |--call--> [Email] send confirmation
                                  v
                                returns 200 only after all succeed

Problems: payment service down -> entire checkout fails. Email service slow -> checkout slow. Adding a loyalty service requires changing the order service.

Event-driven version:

Text

---------- Event-driven checkout ----------
  Client --POST /checkout--> [Order Service]
                                  |
                                  |--write order to DB--
                                  |--emit 'OrderCreated' to Kafka--
                                  v
                              returns 202 (Accepted)

  [Kafka topic: orders.created]
          |
          +--> [Inventory] reserves stock; emits 'InventoryReserved'
          +--> [Payment] charges card; emits 'PaymentCompleted'
          +--> [Email] sends confirmation
          +--> [Loyalty] awards points (added later, no order-service change)
          +--> [Analytics] increments funnel metric

Properties:

Order service responds in ~10 ms (just write + emit). Customer sees 'Processing' immediately.
Each downstream service runs at its own pace and can fail independently.
New consumers (loyalty, fraud check, recommendation) plug into the existing event without any change to the order service.
The trade-off: the customer might see 'Processing' for a few seconds before all side effects complete.

Schema Design and Versioning

The second-most-important rule of EDA after 'events are immutable past-tense facts': events are a contract. Once a producer emits an event with a schema, every consumer depends on that schema. Breaking changes break consumers silently.

Versioning strategies

1. Additive changes are safe (add new optional fields). Old consumers ignore unknown fields.

2. Breaking changes require a new event type or a version field. Common patterns:

New event name: OrderCreated.v2. Both versions emitted in parallel during migration.
Version field: { "type": "OrderCreated", "schemaVersion": 2, ... }. Consumers handle both.

3. Schema registry: Confluent Schema Registry, AWS Glue Schema Registry, or EventBridge Schema Registry stores all event schemas centrally and enforces compatibility rules (forward-compatible, backward-compatible, full).

Jsonc

// Avro schema in registry, compatibility = BACKWARD
{
    "type": "record",
    "name": "OrderCreated",
    "fields": [
        { "name": "orderId", "type": "string" },
        { "name": "customerId", "type": "string" },
        { "name": "totalAmount", "type": "double" },
        { "name": "currency", "type": "string", "default": "USD" } // added later; the default lets readers decode older records that lack it
    ]
}
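For the version-field strategy described above, a consumer can normalize every supported version into one internal shape before handling it. This is a sketch under assumptions: the `normalizeOrderCreated` function is hypothetical, and the v2 layout (a nested `total` object) is invented purely to illustrate a breaking change.

```javascript
// Normalize any supported schema version to the shape the handler expects.
function normalizeOrderCreated(event) {
    switch (event.schemaVersion ?? 1) { // events without a version field are treated as v1
        case 1:
            return {
                orderId: event.orderId,
                amount: event.totalAmount,
                currency: event.currency ?? 'USD',
            };
        case 2: // hypothetical v2: totalAmount/currency moved into a nested object
            return {
                orderId: event.orderId,
                amount: event.total.amount,
                currency: event.total.currency,
            };
        default:
            // Fail loudly rather than guess at an unknown layout.
            throw new Error(`Unsupported OrderCreated schemaVersion: ${event.schemaVersion}`);
    }
}

const v1 = { type: 'OrderCreated', schemaVersion: 1, orderId: 'o-42', totalAmount: 89.97 };
const v2 = { type: 'OrderCreated', schemaVersion: 2, orderId: 'o-43', total: { amount: 10, currency: 'EUR' } };

console.log(normalizeOrderCreated(v1)); // { orderId: 'o-42', amount: 89.97, currency: 'USD' }
console.log(normalizeOrderCreated(v2)); // { orderId: 'o-43', amount: 10, currency: 'EUR' }
```

Keeping the version switch in one function means the rest of the consumer only ever sees the normalized shape.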

Naming conventions

Use past tense and namespace by domain: orders.OrderCreated, orders.OrderShipped, users.EmailUpdated. This makes the topic taxonomy self-explanatory.

Pseudocode: Producer and Consumer

JavaScript

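// Producer (pseudocode: `db` and `kafka` stand in for your database and Kafka client)
async function createOrder(req) {
    const order = await db.orders.insert(req.body);

    // Note: the DB write and the publish are not atomic. If the publish
    // fails after the insert commits, the event is lost unless it is retried.
    await kafka.publish('orders.OrderCreated', {
        type: 'OrderCreated',
        schemaVersion: 1,
        orderId: order.id,
        customerId: order.customerId,
        items: order.items, // the inventory consumer below needs the line items
        totalAmount: order.totalAmount,
        currency: order.currency,
        timestamp: new Date().toISOString(),
        // Fall back to the order ID so the key is never undefined.
        idempotencyKey: req.headers['x-idempotency-key'] ?? order.id,
    });

    return { status: 'accepted', orderId: order.id };
}

// Consumer
async function onOrderCreated(event) {
    // Fast path: skip events that were already handled.
    if (await db.processedEvents.find({ id: event.idempotencyKey })) return;

    // Record the key and do the work in one transaction. Assuming a unique
    // index on processedEvents.id, a concurrent duplicate fails and rolls back.
    await db.transaction(async (tx) => {
        await tx.processedEvents.insert({ id: event.idempotencyKey });
        await tx.inventory.reserve({ orderId: event.orderId, items: event.items });
    });

    await kafka.publish('orders.InventoryReserved', { orderId: event.orderId });
}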

Operational Pitfalls

1. Event ordering

Kafka gives per-partition ordering only. If your consumer logic depends on processing events in causal order (e.g., 'OrderCreated' before 'OrderUpdated'), partition by the entity key (e.g., orderId) so all events for that entity land on one partition.
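Partitioning by entity key works because the partition is a deterministic function of the key. The sketch below uses a simple string hash chosen for readability; it is not Kafka's actual partitioner, and `hashKey` and `partitionFor` are hypothetical helpers.

```javascript
// Simple deterministic string hash (illustrative; NOT Kafka's partitioner).
function hashKey(key) {
    let hash = 0;
    for (const ch of key) {
        hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit int
    }
    return hash;
}

function partitionFor(key, numPartitions) {
    return hashKey(key) % numPartitions;
}

const events = [
    { type: 'OrderCreated', orderId: 'o-42' },
    { type: 'OrderCreated', orderId: 'o-7' },
    { type: 'OrderUpdated', orderId: 'o-42' },
];

// Append each event to the partition chosen by its entity key.
const partitions = Array.from({ length: 4 }, () => []);
for (const e of events) {
    partitions[partitionFor(e.orderId, 4)].push(e);
}

// Both o-42 events land on the same partition, in publish order.
const o42 = partitions[partitionFor('o-42', 4)].filter((e) => e.orderId === 'o-42');
console.log(o42.map((e) => e.type)); // [ 'OrderCreated', 'OrderUpdated' ]
```

Events for different orders may interleave across partitions, which is fine as long as no consumer logic depends on cross-entity ordering.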

2. Idempotency is non-negotiable

Consumers will see duplicate events: producer retries, broker redelivery, consumer rebalances. Every consumer must include an idempotency check (dedupe table indexed by event ID).
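A runnable in-memory sketch of the dedupe check follows. The `Set` stands in for a dedupe table with a unique index, and the `eventId` field and handler name are assumptions for this example; in a real consumer the dedupe write and the business write would share one database transaction.

```javascript
const processedEventIds = new Set(); // stands in for a dedupe table
const reservations = [];             // stands in for the business side effect

// Returns true if the event was processed, false if it was a duplicate.
function onOrderCreated(event) {
    if (processedEventIds.has(event.eventId)) return false; // duplicate: skip
    processedEventIds.add(event.eventId);
    reservations.push(event.orderId);
    return true;
}

// Simulate at-least-once delivery: the same event arrives three times.
const event = { eventId: 'evt-1', type: 'OrderCreated', orderId: 'o-42' };
const results = [onOrderCreated(event), onOrderCreated(event), onOrderCreated(event)];

console.log(results, reservations); // [ true, false, false ] [ 'o-42' ]
```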

3. Schema drift kills consumers

A producer team adds a required field; downstream consumers parsing strictly fail to deserialize. Use schema registries with compatibility enforcement to prevent this from making it past CI.
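The kind of check a registry runs in CI can be approximated as follows. This is a deliberately simplified rule set written for illustration, not the algorithm any particular schema registry uses: `findBreakingChanges` is a hypothetical helper, and fields are modeled as `{ name, type, default? }` objects loosely following the Avro example earlier.

```javascript
// Returns a list of human-readable violations; an empty list means the change passes.
function findBreakingChanges(oldFields, newFields) {
    const problems = [];
    const newByName = new Map(newFields.map((f) => [f.name, f]));
    const oldNames = new Set(oldFields.map((f) => f.name));

    // Existing fields must survive with the same type.
    for (const oldField of oldFields) {
        const next = newByName.get(oldField.name);
        if (!next) problems.push(`removed field: ${oldField.name}`);
        else if (next.type !== oldField.type) problems.push(`changed type: ${oldField.name}`);
    }
    // Added fields must carry a default so older data can still be decoded.
    for (const field of newFields) {
        if (!oldNames.has(field.name) && !('default' in field)) {
            problems.push(`new field without default: ${field.name}`);
        }
    }
    return problems;
}

const v1Fields = [
    { name: 'orderId', type: 'string' },
    { name: 'totalAmount', type: 'double' },
];
const safeChange = [...v1Fields, { name: 'currency', type: 'string', default: 'USD' }];
const unsafeChange = [
    { name: 'orderId', type: 'string' },
    { name: 'totalAmount', type: 'string' },
    { name: 'region', type: 'string' },
];

console.log(findBreakingChanges(v1Fields, safeChange)); // []
console.log(findBreakingChanges(v1Fields, unsafeChange));
// [ 'changed type: totalAmount', 'new field without default: region' ]
```

Failing the build when the returned list is non-empty is what keeps a breaking change from reaching consumers.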

4. Observability is harder

A single user action (place order) now spans 5+ services and dozens of event handlers. Without correlation IDs and distributed tracing (OpenTelemetry, Jaeger, Datadog APM), debugging a failed checkout means stitching together logs from every service by hand.
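Correlation-ID propagation comes down to one rule: every event a handler emits inherits the correlation ID of the event that triggered it. The sketch below shows that rule in isolation; `deriveEvent`, `newId`, and the `causationId` field are hypothetical names, and a real system would use UUIDs or the IDs issued by its tracing library.

```javascript
let nextId = 0;
// Hypothetical ID generator, sequential only to keep the example deterministic.
function newId(prefix) {
    nextId += 1;
    return `${prefix}-${nextId}`;
}

// Build an outgoing event that inherits the correlation ID of its cause.
function deriveEvent(type, payload, causedBy) {
    return {
        eventId: newId('evt'),
        type,
        // The first event in a flow starts a new correlation ID; later ones copy it.
        correlationId: causedBy ? causedBy.correlationId : newId('corr'),
        causationId: causedBy ? causedBy.eventId : null,
        ...payload,
    };
}

const orderCreated = deriveEvent('OrderCreated', { orderId: 'o-42' });
const inventoryReserved = deriveEvent('InventoryReserved', { orderId: 'o-42' }, orderCreated);
const paymentCompleted = deriveEvent('PaymentCompleted', { orderId: 'o-42' }, inventoryReserved);

console.log(paymentCompleted.correlationId === orderCreated.correlationId); // true
console.log(paymentCompleted.causationId === inventoryReserved.eventId);    // true
```

Searching logs by the shared correlation ID returns every event in the flow; the causation IDs let you rebuild which event triggered which.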

5. The event firehose problem

Kafka makes it easy to publish events. Teams over-publish - every state change becomes an event - and consumers drown. Be deliberate about event granularity; not every internal state change deserves a topic.

6. Eventual consistency in the UI

Users expect immediate feedback. After a 'create order' event, the order list page may not yet show the new order. Either fetch from the producer (defeats the purpose) or design the UI for eventual consistency (optimistic update + reconcile).
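The optimistic-update-plus-reconcile approach can be sketched on the client side as follows. `OrderListView` and its methods are hypothetical names for this example; the server responses are passed in as plain arrays rather than fetched.

```javascript
// Client-side order list: optimistic entries appear immediately and are
// replaced by the server's version once the read model catches up.
class OrderListView {
    constructor() {
        this.orders = new Map(); // orderId -> { orderId, status, pending }
    }

    // Called right after the API returns 202 Accepted.
    addOptimistic(orderId) {
        this.orders.set(orderId, { orderId, status: 'Processing', pending: true });
    }

    // Server state wins for orders it knows about; optimistic entries
    // the server has not caught up on are kept as-is.
    reconcile(serverOrders) {
        for (const order of serverOrders) {
            this.orders.set(order.orderId, { ...order, pending: false });
        }
    }

    render() {
        return [...this.orders.values()].map(
            (o) => `${o.orderId}:${o.status}${o.pending ? '*' : ''}`
        );
    }
}

const view = new OrderListView();
view.addOptimistic('o-42');
view.reconcile([]); // read model has not caught up yet
console.log(view.render()); // [ 'o-42:Processing*' ]

view.reconcile([{ orderId: 'o-42', status: 'Confirmed' }]);
console.log(view.render()); // [ 'o-42:Confirmed' ]
```

A production version would also need a timeout or failure path for optimistic entries the server never confirms, which this sketch leaves out.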

When NOT to Use EDA

EDA is a hammer; not every problem is a nail.

Synchronous user-facing queries (load page, fetch profile). Use request/response with caching.
Strong consistency requirements (transfer money, take a lock). Use direct calls or 2PC.
Tight latency budgets within a single workflow (sub-50ms end-to-end). EDA's queue hops add latency.
A single small team running a simple system. The decoupling benefit is small; the operational cost is real.

A good rule: use EDA for integration between bounded contexts, request/response within a bounded context.

Decision Matrix

Scenario	Pattern	Bus
Multi-service e-commerce checkout	EDA, event-carried state	Kafka or Kinesis
Microservices integration in AWS	EDA with rules	EventBridge
IoT telemetry ingestion	EDA, high throughput	Kafka, Kinesis, NATS
Notification fan-out (email/SMS/push)	EDA with topics	SNS or Kafka
Audit log / time-travel queries	Event sourcing	Kafka with infinite retention
Loose-coupling between teams	EDA, event-carried state	Kafka with schema registry
Sync user-facing API	Request/response, NOT EDA	n/a
Money transfer, locks, transactions	Direct synchronous calls	n/a

How to Talk About This in an Interview

Start with coupling. 'I would use event-driven architecture so that adding new consumers does not require changing the producer.'
Pick the variant. 'Event-carried state transfer means each consumer can build its own view without calling back to the producer, which is what I want for true decoupling.'
Name the bus and justify. 'Kafka for high-throughput streams, EventBridge for AWS-native cross-service integration with rule-based routing.'
Always mention idempotency and ordering. 'Consumers will see duplicate events, so I will dedupe by event ID. Per-key ordering will be preserved by partitioning Kafka topics by entity ID.'
Bring up schema versioning. 'A schema registry with backward-compatibility enforcement prevents producer changes from breaking consumers silently.'
Acknowledge the trade-offs. 'EDA shifts complexity from synchronous coordination to async observability. We need correlation IDs and distributed tracing to debug end-to-end flows.'

Quick Review

Events are immutable, past-tense facts. Producers emit; consumers react.
Three flavors: event notification (lookup needed), event-carried state transfer (default), event sourcing (advanced).
Pub/sub bus decouples producers from consumers. Adding a consumer is a deploy, not a producer change.
Pick Kafka for throughput, EventBridge for AWS-native rule routing, SNS for simple fan-out.
Idempotency, ordering, and schema versioning are mandatory.
Use EDA between bounded contexts; use direct calls within them.
Observability (correlation IDs, distributed tracing) is what makes EDA debuggable at scale.

Real-World Examples

How real systems implement this in production

Uber's event-driven trip lifecycle

Uber's trip flow is heavily event-driven. A trip request emits an event consumed by dispatch (find driver), pricing (compute fare), notifications (driver+rider updates), analytics (funnel metrics), and many other services. Each service is independent; new consumers (loyalty, surge analysis, fraud detection) plug into existing events without changing the dispatch path. They use Kafka as the central event backbone with hundreds of topics.

Trade-off: Event-driven decoupling lets dozens of teams ship independently against the same event stream. The cost is observability: tracing one trip across all services requires excellent correlation-ID propagation and a full distributed-tracing stack.

Shopify webhooks + EventBridge

Shopify uses an internal Kafka backbone for high-volume internal events and emits a curated subset as webhooks (HTTP callbacks) to merchants. AWS EventBridge enables similar patterns: services publish events, EventBridge routes them by content (event pattern matching) to Lambda, SQS, Step Functions, or HTTP targets, with a schema registry enforcing structure.

Trade-off: EventBridge with rule-based routing is a productivity multiplier in AWS-native systems, but its default throughput quotas top out around 10K events/sec in the largest regions (lower elsewhere, and adjustable on request). For sustained higher throughput, MSK (managed Kafka) or Kinesis is the right answer; EventBridge stays for cross-service integration.

Netflix Keystone data pipeline

Netflix captures all user-facing events (playback start, pause, error) into Kafka, then routes them via Flink to dozens of downstream consumers: real-time recommendations, A/B test analytics, alerting, ML feature pipelines. The Keystone bus carries trillions of events per day and serves as the integration backbone for the entire viewing experience.

Trade-off: A central Kafka backbone is a single source of truth for events but requires heavy investment in operations: schema management (Avro + registry), backpressure handling, multi-datacenter replication. The win is that any new analytics or ML team can subscribe to existing events without coordination.

Stripe event API

Stripe's external API is event-driven from the merchant's perspective: webhooks like `payment_intent.succeeded` and `invoice.paid` are sent to merchant endpoints. Internally, Stripe uses queues and event buses to fan out the same event to webhooks, audit logs, fraud analysis, accounting reports, and merchant dashboards.

Trade-off: Exposing events to external customers turns event schemas into a versioned public API forever. Stripe versions its API carefully and avoids breaking event schemas within a version; the discipline this requires is a model for any team adopting EDA. Once your event is on the wire, it is part of your contract.

Quick Interview Phrases

Key terms to use in your answer

event-carried state transfer

publish-subscribe

loose coupling via events

schema registry

idempotent consumer

eventual consistency

Common Interview Questions

Questions you might be asked about this topic

How does event-driven architecture differ from request/response, and when would you pick each?

Request/response: producer knows the consumer, calls it directly, waits for a reply. Sync, immediate consistency, tight coupling. Best for user-facing queries, transactions, low-latency workflows. EDA: producer emits an event to a bus; consumers react independently. Async, eventually consistent, loose coupling. Best for cross-service integration, fan-out, decoupling teams. Most real systems mix both: request/response within a service or bounded context; EDA between them.

How would you design event flow for an e-commerce checkout?

What is the difference between event notification and event-carried state transfer?

How do you version event schemas without breaking consumers?

What are the operational pitfalls of EDA at scale?

Interview Tips

How to discuss this topic effectively

Lead with coupling. The whole point of EDA is that producers and consumers do not know about each other; saying that out loud is the answer interviewers want to hear.

Pick the EDA variant deliberately. Event-carried state transfer is the modern default; saying 'we emit just an ID and let consumers fetch' is a yellow flag because it recreates coupling.

Name the bus and the schema strategy together. 'Kafka with Confluent Schema Registry enforcing backward compatibility' is a much stronger answer than 'we use Kafka'.

Always acknowledge the eventual-consistency window in the user UI. Senior interviewers probe for whether you have thought about the customer experience after the producer responds 202 Accepted.

Mention what does NOT belong in EDA: synchronous user queries, strong-consistency operations (transfers, locks), tight latency budgets. Showing you know the boundary is a senior-level move.

Common Mistakes

Pitfalls to avoid in interviews

Using event notifications (just an ID) and forcing consumers to fetch from the producer

This recreates coupling: every consumer now depends on the producer's API. Use event-carried state transfer - put enough data in the event for consumers to act without callbacks. Yes, it makes events bigger; the decoupling is the whole point.

Forgetting that consumers see duplicates

Every queue and every event bus delivers at-least-once in practice. Consumer code must be idempotent: store an event-ID dedupe record in the same transaction as the work, so a crash cannot leave one without the other. Without this, retries and rebalances cause double-charges, double-emails, double-everything.

Treating schema changes as backend-internal

Once an event is published, it is a public contract. Adding a required field, renaming a field, or changing a type silently breaks every consumer. Use a schema registry with compatibility rules; treat schema changes like API changes.

Publishing too many events ('event firehose')

Not every internal state change deserves an event. Publish events that are meaningful to other bounded contexts (OrderCreated, OrderShipped) - not every micro-state-change (OrderRowLocked, CartFieldHover). Over-publishing creates an unmanageable taxonomy and overloaded consumers.

Ignoring observability until something breaks

Async event flows are nearly impossible to debug without correlation IDs and distributed tracing. Pass a trace ID with every event, log it on every consumer, integrate with OpenTelemetry from day one. Adding tracing after the system has 50 event types is brutal.
