System Design Article
Event-Driven Architecture & Pub/Sub
Difficulty: Medium
Event-driven architecture (EDA) is a style where services communicate by emitting and reacting to immutable events instead of calling each other directly. This lesson covers the publish/subscribe pattern, the difference between event notification and event-carried state transfer, the role of an event bus, and how EDA reshapes coupling, scalability, and consistency. We compare it with request/response, walk through real implementations on Kafka, Kinesis, EventBridge, and SNS, and end with the operational pitfalls (event versioning, ordering, schema drift, observability) that bite teams who adopt EDA without preparation.
What is Event-Driven Architecture?
Event-driven architecture (EDA) is a style where services communicate by emitting events that describe something that has already happened, and by reacting to events they care about. Producers do not know who consumes their events; consumers do not know who produced them.
An event is an immutable record of a fact: 'Order 42 was created at 10:00:00.123'. It is past tense; the work has already been done. This is the key difference from a command (a request to do something) or a query (a request to learn something).
---------- EDA at a glance ----------
[OrderService] --emits--> 'OrderCreated' --> [EVENT BUS]
                                                  |
         +-------------------+-----------------+--+-----------+
         v                   v                 v              v
[InventoryService]   [BillingService]   [Notification]   [Analytics]
No service calls another. The event bus delivers the event to every interested subscriber.
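The fan-out in the diagram can be sketched with a tiny in-memory bus. This is only an illustration of the decoupling idea: the `EventBus` class and the handler names are hypothetical, and a real bus (Kafka, SNS, and so on) adds durability, retries, and network delivery.

```javascript
// Minimal in-memory event bus (illustrative only, not a real broker).
class EventBus {
  constructor() {
    this.subscribers = new Map(); // event type -> array of handlers
  }

  subscribe(type, handler) {
    const handlers = this.subscribers.get(type) ?? [];
    handlers.push(handler);
    this.subscribers.set(type, handlers);
  }

  publish(event) {
    // Shallow-freeze a copy so subscribers cannot mutate the shared fact.
    const fact = Object.freeze({ ...event });
    for (const handler of this.subscribers.get(fact.type) ?? []) {
      handler(fact);
    }
  }
}

const bus = new EventBus();
const received = [];

// Each subscriber registers independently; the producer never references them.
bus.subscribe('OrderCreated', (e) => received.push(`inventory:${e.orderId}`));
bus.subscribe('OrderCreated', (e) => received.push(`billing:${e.orderId}`));
bus.subscribe('OrderCreated', (e) => received.push(`analytics:${e.orderId}`));

// The producer only knows the bus and the event it emits.
bus.publish({ type: 'OrderCreated', orderId: 'o-42' });

console.log(received); // [ 'inventory:o-42', 'billing:o-42', 'analytics:o-42' ]
```

Adding a fourth subscriber is one more `subscribe` call; the `publish` call site does not change.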
EDA vs Request/Response
In request/response, the producer knows the consumer and waits for a reply.
---------- Request/response ----------
ServiceA --(call)--> ServiceB
         <--(reply)--
In EDA, the producer emits and forgets.
---------- Event-driven ----------
ServiceA --(emit OrderCreated)--> [BUS] --> ServiceB
                                        --> ServiceC
                                        --> ServiceD
| Aspect | Request/Response | Event-Driven |
|---|---|---|
| Coupling | Tight (producer knows consumer) | Loose (producer knows nothing about consumers) |
| Consistency | Synchronous, immediate | Asynchronous, eventually consistent |
| Latency | Sum of all hops | Producer is fast; consumers process later |
| Failure mode | Cascade (downstream failure breaks caller) | Isolated (consumer failure does not affect producer) |
| Adding consumers | Requires producer change | Subscribe to existing events |
| Best for | Synchronous workflows, queries, user-facing | Async workflows, fan-out, integration |
Most real systems mix both. User-facing APIs are request/response; back-end orchestration and integration are event-driven.
Three Patterns Inside EDA
EDA is a family. Picking the right variant matters.
1. Event Notification (the simplest)
The event carries only an identifier and minimal context. Consumers fetch full data from the producer if they need more.
// Event Notification: small, fetch-on-demand
{
  "type": "OrderCreated",
  "orderId": "o-42",
  "timestamp": "2026-04-26T10:00:00.123Z"
}
Pros: small events, no schema duplication. Cons: every consumer hits the producer's API to get details, recreating coupling.
2. Event-Carried State Transfer
The event carries all the state a consumer would need.
// Event-Carried State Transfer: full payload
{
  "type": "OrderCreated",
  "orderId": "o-42",
  "customerId": "c-7",
  "items": [
    { "sku": "abc", "qty": 2, "price": 19.99 },
    { "sku": "xyz", "qty": 1, "price": 49.99 }
  ],
  "totalAmount": 89.97,
  "currency": "USD",
  "timestamp": "2026-04-26T10:00:00.123Z"
}
Pros: consumers can build their own materialized view without calling the producer; true loose coupling. Cons: larger events, schema versioning becomes critical, data duplicated across consumers.
This is the pattern most modern EDA systems use.
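As a rough sketch of what "build their own materialized view" can mean, the consumer below folds each OrderCreated payload into a per-customer summary without ever calling the order service. The view shape (`orderCount`, `totalCents`) and the function name are assumptions made for this example.

```javascript
// Hypothetical consumer-side view: customerId -> running order summary.
const ordersByCustomer = new Map();

function applyOrderCreated(event) {
  const summary = ordersByCustomer.get(event.customerId) ?? { orderCount: 0, totalCents: 0 };
  summary.orderCount += 1;
  // Work in integer cents to avoid floating-point drift when summing money.
  summary.totalCents += Math.round(event.totalAmount * 100);
  ordersByCustomer.set(event.customerId, summary);
}

// Everything the view needs is already inside the event payload.
applyOrderCreated({ type: 'OrderCreated', orderId: 'o-42', customerId: 'c-7', totalAmount: 89.97, currency: 'USD' });
applyOrderCreated({ type: 'OrderCreated', orderId: 'o-43', customerId: 'c-7', totalAmount: 10.03, currency: 'USD' });

console.log(ordersByCustomer.get('c-7')); // { orderCount: 2, totalCents: 10000 }
```

With plain event notification, `applyOrderCreated` would first have to fetch the order from the producer's API, which is exactly the coupling this pattern avoids.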
3. Event Sourcing
The entire state of the system is derived from a log of events. To know the current balance of an account, replay every Credit/Debit event for that account. The event log is the source of truth; the application state is a derived view.
This is a more advanced pattern (covered in detail in the Advanced track lesson 'Event Sourcing & CQRS') and not the same thing as EDA. Use event sourcing when you need a perfect audit trail or time-travel queries; otherwise, plain event-carried state transfer is simpler.
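The "replay every Credit/Debit event" idea can be sketched as a fold over the log. The event names (`Credited`, `Debited`) and the in-memory array are assumptions for illustration; a real event store would persist the log and usually add snapshots.

```javascript
// Hypothetical account event log; amounts are integer cents.
const accountEvents = [
  { type: 'Credited', accountId: 'a-1', amountCents: 10000 },
  { type: 'Debited', accountId: 'a-1', amountCents: 2500 },
  { type: 'Credited', accountId: 'a-2', amountCents: 700 },
  { type: 'Debited', accountId: 'a-1', amountCents: 1500 },
];

// Current state is a fold (reduce) over the events for one account.
function balanceOf(accountId, events) {
  return events
    .filter((e) => e.accountId === accountId)
    .reduce((balance, e) => {
      if (e.type === 'Credited') return balance + e.amountCents;
      if (e.type === 'Debited') return balance - e.amountCents;
      return balance; // ignore event types this view does not understand
    }, 0);
}

console.log(balanceOf('a-1', accountEvents)); // 6000

// "Time travel": replay only the first two events of the log.
console.log(balanceOf('a-1', accountEvents.slice(0, 2))); // 7500
```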
The Event Bus
The event bus is the infrastructure that routes events from producers to subscribers. The choice of bus shapes the system's properties.
| Bus | Type | Best for |
|---|---|---|
| Apache Kafka | Log-based | High-throughput event streams, replay, multiple consumer groups |
| AWS Kinesis | Log-based | AWS-native streaming; a sharded log with replay, conceptually similar to Kafka partitions |
| Apache Pulsar | Log + queue hybrid | Multi-tenant messaging with both pub/sub and queue semantics |
| AWS SNS | Topic-based | Fan-out to many subscribers (HTTP, Lambda, SQS); fire-and-forget |
| AWS EventBridge | Event router with rules | Schema registry, content-based routing, AWS service events |
| Google Cloud Pub/Sub | Topic-based | GCP-native pub/sub with at-least-once delivery |
| RabbitMQ topic exchanges | Broker-based | Pub/sub for moderate scale with rich routing |
| NATS / NATS JetStream | Lightweight pub/sub | Low-latency, edge, or IoT scenarios |
For most modern systems, Kafka (or its managed equivalent) is the default for high-throughput EDA, and EventBridge or SNS is the default for AWS-native lower-volume integration.
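To make "content-based routing" concrete, here is a simplified sketch of rule matching. It borrows the general idea of EventBridge event patterns (leaf values are lists of accepted values, nested objects match recursively) but is not the full EventBridge syntax; the rule names and event fields are hypothetical.

```javascript
// Returns true when every key in the pattern matches the event.
// Leaf values in the pattern are arrays of accepted values; nested
// objects are matched recursively. Simplified subset, for illustration.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event?.[key];
    if (Array.isArray(expected)) return expected.includes(actual);
    return typeof actual === 'object' && actual !== null && matchesPattern(expected, actual);
  });
}

// Hypothetical rules: rule name -> pattern.
const rules = {
  highValueUsdOrders: { type: ['OrderCreated'], detail: { currency: ['USD'], tier: ['gold', 'platinum'] } },
  allOrderEvents: { type: ['OrderCreated', 'OrderShipped'] },
};

// The router returns the names of all rules whose pattern matches.
function route(event) {
  return Object.keys(rules).filter((name) => matchesPattern(rules[name], event));
}

console.log(route({ type: 'OrderCreated', detail: { currency: 'USD', tier: 'gold' } }));
// [ 'highValueUsdOrders', 'allOrderEvents' ]
console.log(route({ type: 'OrderShipped', detail: { currency: 'EUR', tier: 'basic' } }));
// [ 'allOrderEvents' ]
```

A topic-based bus such as SNS or Kafka routes by topic name instead, so this kind of filtering would live in the subscription configuration or in the consumer.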
A Concrete Example: E-Commerce Checkout
Request/response version:
---------- Request/response checkout ----------
Client --POST /checkout--> [Order Service]
                                  |
                                  |--call--> [Inventory] reserve
                                  |--call--> [Payment] charge
                                  |--call--> [Shipping] create label
                                  |--call--> [Email] send confirmation
                                  v
                           returns 200 only after all succeed
Problems: payment service down -> entire checkout fails. Email service slow -> checkout slow. Adding a loyalty service requires changing the order service.
Event-driven version:
---------- Event-driven checkout ----------
Client --POST /checkout--> [Order Service]
                                  |
                                  |--write order to DB--
                                  |--emit 'OrderCreated' to Kafka--
                                  v
                           returns 202 (Accepted)
[Kafka topic: orders.created]
        |
        +--> [Inventory] reserves stock; emits 'InventoryReserved'
        +--> [Payment] charges card; emits 'PaymentCompleted'
        +--> [Email] sends confirmation
        +--> [Loyalty] awards points (added later, no order-service change)
        +--> [Analytics] increments funnel metric
Properties:
- Order service responds in ~10 ms (just write + emit). Customer sees 'Processing' immediately.
- Each downstream service runs at its own pace and can fail independently.
- New consumers (loyalty, fraud check, recommendation) plug into the existing event without any change to the order service.
- The trade-off: the customer might see 'Processing' for a few seconds before all side effects complete.
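The "fail independently" property above can be sketched as follows. The consumer names and the retry list are assumptions for this example; in a real deployment each consumer would be a separate process with its own offsets or queue, and the broker would handle redelivery.

```javascript
// Hypothetical consumers for 'OrderCreated'; the email one is failing.
const consumers = {
  inventory: (e) => `reserved stock for ${e.orderId}`,
  email: () => { throw new Error('SMTP timeout'); },
  loyalty: (e) => `awarded points for ${e.orderId}`,
};

// Deliver to each consumer independently; a failure is recorded for
// retry instead of propagating to the producer or to other consumers.
function deliver(event) {
  const results = { succeeded: [], toRetry: [] };
  for (const [name, handler] of Object.entries(consumers)) {
    try {
      handler(event);
      results.succeeded.push(name);
    } catch (err) {
      results.toRetry.push({ consumer: name, reason: err.message });
    }
  }
  return results;
}

const outcome = deliver({ type: 'OrderCreated', orderId: 'o-42' });
console.log(outcome.succeeded); // [ 'inventory', 'loyalty' ]
console.log(outcome.toRetry);   // [ { consumer: 'email', reason: 'SMTP timeout' } ]
```

In the request/response version the same email failure would surface as a failed checkout.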
Schema Design and Versioning
The second-most-important rule of EDA after 'events are immutable past-tense facts': events are a contract. Once a producer emits an event with a schema, every consumer depends on that schema. Breaking changes break consumers silently.
Versioning strategies
1. Additive changes are safe (add new optional fields). Old consumers ignore unknown fields.
2. Breaking changes require a new event type or a version field. Common patterns:
- New event name: OrderCreated.v2. Both versions emitted in parallel during migration.
- Version field: { "type": "OrderCreated", "schemaVersion": 2, ... }. Consumers handle both.
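3. Schema registry: Confluent Schema Registry or AWS Glue Schema Registry stores event schemas centrally and enforces compatibility rules (backward, forward, full) when a new schema version is registered. EventBridge Schema Registry also stores and versions schemas, but it is mainly a discovery and code-generation aid and does not enforce compatibility.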
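The version-field strategy from item 2 is often implemented with an "upcaster" that converts old versions to the newest shape before business logic runs. The sketch below assumes a hypothetical breaking change: v1 carries `totalAmount` and `currency` at the top level, while v2 nests them under `total` as integer cents. Neither shape comes from a real system.

```javascript
// Hypothetical v1 shape: { totalAmount: 89.97, currency: 'USD' }
// Hypothetical v2 shape: { total: { amountCents: 8997, currency: 'USD' } }
function upcastOrderCreated(event) {
  const version = event.schemaVersion ?? 1; // treat a missing version as v1
  if (version === 2) return event;
  if (version === 1) {
    const { totalAmount, currency, ...rest } = event;
    return {
      ...rest,
      schemaVersion: 2,
      total: { amountCents: Math.round(totalAmount * 100), currency },
    };
  }
  // Fail loudly on versions this consumer was never updated for.
  throw new Error(`Unsupported OrderCreated schemaVersion: ${version}`);
}

// Business logic only ever sees the v2 shape.
function handleOrderCreated(event) {
  const order = upcastOrderCreated(event);
  return `${order.orderId}: ${order.total.amountCents} ${order.total.currency}`;
}

console.log(handleOrderCreated({ type: 'OrderCreated', schemaVersion: 1, orderId: 'o-42', totalAmount: 89.97, currency: 'USD' }));
// o-42: 8997 USD
console.log(handleOrderCreated({ type: 'OrderCreated', schemaVersion: 2, orderId: 'o-43', total: { amountCents: 500, currency: 'EUR' } }));
// o-43: 500 EUR
```

Keeping the branching in one function means the rest of the consumer does not need to know that v1 ever existed.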
// Avro schema in registry, compatibility = BACKWARD
// "currency" has a default, so records written by old producers (without the field) still deserialize
{
  "type": "record",
  "name": "OrderCreated",
  "fields": [
    { "name": "orderId", "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "totalAmount", "type": "double" },
    { "name": "currency", "type": "string", "default": "USD" }
  ]
}
Naming conventions
Use past tense and namespace by domain: orders.OrderCreated, orders.OrderShipped, users.EmailUpdated. This makes the topic taxonomy self-explanatory.
Pseudocode: Producer and Consumer
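// Producer: persist the order, then publish the fact that it was created.
// Note: the DB write and the publish are not atomic here; if publish fails
// after the insert, the event is lost unless the caller retries.
async function createOrder(req) {
  const order = await db.orders.insert(req.body);
  await kafka.publish('orders.OrderCreated', {
    type: 'OrderCreated',
    schemaVersion: 1,
    orderId: order.id,
    customerId: order.customerId,
    items: order.items, // the inventory consumer below reads event.items
    totalAmount: order.totalAmount,
    currency: order.currency,
    timestamp: new Date().toISOString(),
    idempotencyKey: req.headers['x-idempotency-key'],
  });
  return { status: 'accepted', orderId: order.id };
}
// Consumer: reserve inventory at most once per event.
async function onOrderCreated(event) {
  // Fast path: skip events that were already processed.
  if (await db.processedEvents.find({ id: event.idempotencyKey })) return;
  // Record the dedupe key and do the work in one transaction, so a crash
  // cannot leave the key stored without the reservation (or vice versa).
  await db.transaction(async (tx) => {
    await tx.processedEvents.insert({ id: event.idempotencyKey });
    await tx.inventory.reserve({ orderId: event.orderId, items: event.items });
  });
  await kafka.publish('orders.InventoryReserved', { orderId: event.orderId });
}
Operational Pitfalls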
1. Event ordering
Kafka gives per-partition ordering only. If your consumer logic depends on processing events in causal order (e.g., 'OrderCreated' before 'OrderUpdated'), partition by the entity key (e.g., orderId) so all events for that entity land on one partition.
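Partitioning by entity key boils down to a deterministic hash of the key modulo the partition count. The sketch below uses FNV-1a purely as a stand-in; Kafka's default partitioner uses a different hash (murmur2), so the actual partition numbers will differ. The partition count of 6 is arbitrary.

```javascript
// Simple deterministic string hash (FNV-1a, 32-bit). This stands in for
// the hash a real client library would apply to the message key.
function fnv1a(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

function partitionFor(key, numPartitions) {
  return fnv1a(key) % numPartitions;
}

const events = [
  { type: 'OrderCreated', orderId: 'o-42' },
  { type: 'OrderCreated', orderId: 'o-43' },
  { type: 'OrderUpdated', orderId: 'o-42' },
  { type: 'OrderShipped', orderId: 'o-42' },
];

// Every event for o-42 maps to the same partition, so a single consumer
// sees them in the order they were produced.
for (const e of events) {
  console.log(e.type, e.orderId, '-> partition', partitionFor(e.orderId, 6));
}
```

Note that changing the number of partitions changes the mapping, which is one reason partition counts are usually chosen generously up front.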
2. Idempotency is non-negotiable
Consumers will see duplicate events: producer retries, broker redelivery, consumer rebalances. Every consumer must include an idempotency check (dedupe table indexed by event ID).
3. Schema drift kills consumers
A producer team adds a required field; downstream consumers parsing strictly fail to deserialize. Use schema registries with compatibility enforcement to prevent this from making it past CI.
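A CI gate for this can be as small as a function that compares the old and new field lists. The sketch below is a deliberately simplified backward-compatibility check written for this example: it flags new fields without defaults and any type change. Real registries apply fuller rules (Avro, for instance, permits some type promotions), so treat it as an illustration of the idea and not a replacement for a registry.

```javascript
// Fields are described as { name, type, default? }, loosely following
// the Avro record layout. Returns a list of human-readable violations.
function backwardCompatibilityErrors(oldFields, newFields) {
  const oldByName = new Map(oldFields.map((f) => [f.name, f]));
  const errors = [];
  for (const field of newFields) {
    const previous = oldByName.get(field.name);
    if (!previous) {
      // A new field is only safe if old data can be read with a default.
      if (!('default' in field)) errors.push(`new field "${field.name}" has no default`);
    } else if (previous.type !== field.type) {
      errors.push(`field "${field.name}" changed type ${previous.type} -> ${field.type}`);
    }
  }
  return errors;
}

const v1 = [
  { name: 'orderId', type: 'string' },
  { name: 'totalAmount', type: 'double' },
];
const v2Safe = [...v1, { name: 'currency', type: 'string', default: 'USD' }];
const v2Breaking = [...v1, { name: 'region', type: 'string' }];

console.log(backwardCompatibilityErrors(v1, v2Safe));     // []
console.log(backwardCompatibilityErrors(v1, v2Breaking)); // [ 'new field "region" has no default' ]
```

A CI job would fail the build whenever the returned list is non-empty.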
4. Observability is harder
A single user action (place order) now spans 5+ services and dozens of event handlers. Without correlation IDs and distributed tracing (OpenTelemetry, Jaeger, Datadog APM), debugging a failed checkout means manually stitching together logs from every service, which quickly becomes impractical.
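One common way to propagate correlation IDs is to stamp every event with the ID of the flow it belongs to and the ID of the event that caused it. The field names (`correlationId`, `causationId`), the ID generator, and the `ReceiptEmailed` event are assumptions for this sketch; tracing standards such as OpenTelemetry define their own context format.

```javascript
let nextId = 1;
// Hypothetical ID generator; a real system would use UUIDs or trace IDs.
function newEventId() {
  return `evt-${nextId++}`;
}

// Build an event caused by `parent` (or a root event when parent is null).
// correlationId stays constant across the whole flow; causationId points
// at the event that directly triggered this one.
function createEvent(type, payload, parent = null) {
  const eventId = newEventId();
  return {
    ...payload,
    eventId,
    type,
    correlationId: parent ? parent.correlationId : eventId,
    causationId: parent ? parent.eventId : null,
  };
}

const orderCreated = createEvent('OrderCreated', { orderId: 'o-42' });
const paymentCompleted = createEvent('PaymentCompleted', { orderId: 'o-42' }, orderCreated);
const receiptEmailed = createEvent('ReceiptEmailed', { orderId: 'o-42' }, paymentCompleted);

for (const e of [orderCreated, paymentCompleted, receiptEmailed]) {
  console.log(`[${e.correlationId}] ${e.type} (caused by ${e.causationId})`);
}
// [evt-1] OrderCreated (caused by null)
// [evt-1] PaymentCompleted (caused by evt-1)
// [evt-1] ReceiptEmailed (caused by evt-2)
```

Searching logs for one `correlationId` then returns the whole flow, and following `causationId` links reconstructs the order in which handlers fired.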
5. The event firehose problem
Kafka makes it easy to publish events. Teams over-publish - every state change becomes an event - and consumers drown. Be deliberate about event granularity; not every internal state change deserves a topic.
6. Eventual consistency in the UI
Users expect immediate feedback. After a 'create order' event, the order list page may not yet show the new order. Either fetch from the producer (defeats the purpose) or design the UI for eventual consistency (optimistic update + reconcile).
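The "optimistic update + reconcile" approach might look like the following on the client. The list shape, the `pending` flag, and the polling steps are assumptions for this sketch; a real UI would also handle rejected orders and timeouts.

```javascript
// Show the new order immediately, marked as pending.
function addOptimistic(list, order) {
  return [...list, { ...order, status: 'Processing', pending: true }];
}

// Merge the server's view: confirmed orders replace pending ones with the
// same id; pending orders the server has not seen yet stay visible.
function reconcile(list, serverOrders) {
  const confirmedIds = new Set(serverOrders.map((o) => o.orderId));
  const stillPending = list.filter((o) => o.pending && !confirmedIds.has(o.orderId));
  return [...serverOrders.map((o) => ({ ...o, pending: false })), ...stillPending];
}

let orderList = [];
orderList = addOptimistic(orderList, { orderId: 'o-42' });

// First poll: the read model has not caught up yet, so o-42 stays pending.
orderList = reconcile(orderList, []);
console.log(orderList); // [ { orderId: 'o-42', status: 'Processing', pending: true } ]

// Later poll: the order has propagated and replaces the optimistic entry.
orderList = reconcile(orderList, [{ orderId: 'o-42', status: 'Confirmed' }]);
console.log(orderList); // [ { orderId: 'o-42', status: 'Confirmed', pending: false } ]
```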
When NOT to Use EDA
EDA is a hammer; not every problem is a nail.
- Synchronous user-facing queries (load page, fetch profile). Use request/response with caching.
- Strong consistency requirements (transfer money, take a lock). Use direct calls or 2PC.
- Tight latency budgets within a single workflow (sub-50ms end-to-end). EDA's queue hops add latency.
- A small or single team running a simple system. The decoupling benefit is small; the operational cost is real.
A good rule: use EDA for integration between bounded contexts, request/response within a bounded context.
Decision Matrix
| Scenario | Pattern | Bus |
|---|---|---|
| Multi-service e-commerce checkout | EDA, event-carried state | Kafka or Kinesis |
| Microservices integration in AWS | EDA with rules | EventBridge |
| IoT telemetry ingestion | EDA, high throughput | Kafka, Kinesis, NATS |
| Notification fan-out (email/SMS/push) | EDA with topics | SNS or Kafka |
| Audit log / time-travel queries | Event sourcing | Kafka with infinite retention |
| Loose-coupling between teams | EDA, event-carried state | Kafka with schema registry |
| Sync user-facing API | Request/response, NOT EDA | n/a |
| Money transfer, locks, transactions | Direct synchronous calls | n/a |
How to Talk About This in an Interview
- Start with coupling. 'I would use event-driven architecture so that adding new consumers does not require changing the producer.'
- Pick the variant. 'Event-carried state transfer means each consumer can build its own view without calling back to the producer, which is what I want for true decoupling.'
- Name the bus and justify. 'Kafka for high-throughput streams, EventBridge for AWS-native cross-service integration with rule-based routing.'
- Always mention idempotency and ordering. 'Consumers will see duplicate events, so I will dedupe by event ID. Per-key ordering will be preserved by partitioning Kafka topics by entity ID.'
- Bring up schema versioning. 'A schema registry with backward-compatibility enforcement prevents producer changes from breaking consumers silently.'
- Acknowledge the trade-offs. 'EDA shifts complexity from synchronous coordination to async observability. We need correlation IDs and distributed tracing to debug end-to-end flows.'
Quick Review
- Events are immutable, past-tense facts. Producers emit; consumers react.
- Three flavors: event notification (lookup needed), event-carried state transfer (default), event sourcing (advanced).
- Pub/sub bus decouples producers from consumers. Adding a consumer is a deploy, not a producer change.
- Pick Kafka for throughput, EventBridge for AWS-native rule routing, SNS for simple fan-out.
- Idempotency, ordering, and schema versioning are mandatory.
- Use EDA between bounded contexts; use direct calls within them.
- Observability (correlation IDs, distributed tracing) is what makes EDA debuggable at scale.
Real-World Examples
How real systems implement this in production
Uber's trip flow is heavily event-driven. A trip request emits an event consumed by dispatch (find driver), pricing (compute fare), notifications (driver+rider updates), analytics (funnel metrics), and many other services. Each service is independent; new consumers (loyalty, surge analysis, fraud detection) plug into existing events without changing the dispatch path. They use Kafka as the central event backbone with hundreds of topics.
Trade-off: Event-driven decoupling lets dozens of teams ship independently against the same event stream. The cost is observability: tracing one trip across all services requires excellent correlation-ID propagation and a full distributed-tracing stack.
Shopify uses an internal Kafka backbone for high-volume internal events and emits a curated subset as webhooks (HTTP callbacks) to merchants. AWS EventBridge enables similar patterns: services publish events, EventBridge routes them by content (event pattern matching) to Lambda, SQS, Step Functions, or HTTP targets, with a schema registry documenting event structure.
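Trade-off: EventBridge with rule-based routing is a productivity multiplier in AWS-native systems, but publishing is throttled by a per-account, per-region quota (on the order of 10K requests/sec by default in the largest regions, lower in others, and adjustable on request). For sustained higher throughput, MSK (managed Kafka) or Kinesis is usually the better fit; EventBridge stays for cross-service integration.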
Netflix captures all user-facing events (playback start, pause, error) into Kafka, then routes them via Flink to dozens of downstream consumers: real-time recommendations, A/B test analytics, alerting, ML feature pipelines. The Keystone bus carries trillions of events per day and serves as the integration backbone for the entire viewing experience.
Trade-off: A central Kafka backbone is a single source of truth for events but requires heavy investment in operations: schema management (Avro + registry), backpressure handling, multi-datacenter replication. The win is that any new analytics or ML team can subscribe to existing events without coordination.
Stripe's external API is event-driven from the merchant's perspective: webhooks like `payment_intent.succeeded` and `invoice.paid` are sent to merchant endpoints. Internally, Stripe uses queues and event buses to fan out the same event to webhooks, audit logs, fraud analysis, accounting reports, and merchant dashboards.
Trade-off: Exposing events to external customers turns event schemas into a versioned public API forever. Stripe carefully versions their API and never breaks event schemas; the discipline this requires is a model for any team adopting EDA. Once your event is on the wire, it is part of your contract.
Common Interview Questions
Questions you might be asked about this topic
Request/response: producer knows the consumer, calls it directly, waits for a reply. Sync, immediate consistency, tight coupling. Best for user-facing queries, transactions, low-latency workflows. EDA: producer emits an event to a bus; consumers react independently. Async, eventually consistent, loose coupling. Best for cross-service integration, fan-out, decoupling teams. Most real systems mix both: request/response within a service or bounded context; EDA between them.
Order service writes the order and emits OrderCreated to Kafka. Independent consumers handle: Inventory reserves stock; Payment charges the card; Shipping creates a label; Notification sends email; Loyalty awards points; Analytics tracks the funnel. Each consumer is idempotent (dedupe by orderId or event ID). The order topic is partitioned by orderId so all events for one order are processed in causal order. Producer responds 202 Accepted to the customer immediately; UI polls or subscribes for status updates. Acknowledge the eventual-consistency window.
Event notification carries only an identifier (e.g., {type: OrderCreated, orderId: 42}); consumers must call back to the producer to get details. This recreates coupling: consumers now depend on the producer's API. Event-carried state transfer puts the full payload in the event; consumers can act without callbacks, building their own materialized views. Event-carried is the modern default because it preserves the loose-coupling benefit of EDA. Trade-offs: bigger events, schema versioning matters more, data is duplicated across consumers (which is usually acceptable).
Use a schema registry that enforces compatibility (for example Confluent Schema Registry or AWS Glue Schema Registry) with a chosen mode: backward, forward, or full. Additive changes (new optional fields with defaults) are safe. Breaking changes require either a new event type (OrderCreated.v2) emitted in parallel during migration, or a version field consumers branch on. Reject schema changes in CI if they violate the chosen compatibility mode. Document the compatibility policy and the deprecation window for old versions.
Five biggest: (1) Ordering - Kafka gives per-partition order, so partition by entity key. (2) Idempotency - duplicates are guaranteed; consumers must dedupe by event ID. (3) Schema drift - producer changes break consumers; use schema registry with compatibility rules. (4) Observability - end-to-end debugging requires correlation IDs, distributed tracing, and event-flow visualization. (5) Event firehose - over-publishing creates an unmanageable taxonomy and overloaded consumers; be deliberate about which state changes deserve events. Add: eventual consistency in user-facing flows requires UX design (optimistic UI, reconciliation).
Interview Tips
How to discuss this topic effectively
Lead with coupling. The whole point of EDA is that producers and consumers do not know about each other; saying that out loud is the answer interviewers want to hear.
Pick the EDA variant deliberately. Event-carried state transfer is the modern default; saying 'we emit just an ID and let consumers fetch' is a yellow flag because it recreates coupling.
Name the bus and the schema strategy together. 'Kafka with Confluent Schema Registry enforcing backward compatibility' is a much stronger answer than 'we use Kafka'.
Always acknowledge the eventual-consistency window in the user UI. Senior interviewers probe for whether you have thought about the customer experience after the producer responds 202 Accepted.
Mention what does NOT belong in EDA: synchronous user queries, strong-consistency operations (transfers, locks), tight latency budgets. Showing you know the boundary is a senior-level move.
Common Mistakes
Pitfalls to avoid in interviews
Using event notifications (just an ID) and forcing consumers to fetch from the producer
This recreates coupling: every consumer now depends on the producer's API. Use event-carried state transfer - put enough data in the event for consumers to act without callbacks. Yes, it makes events bigger; the decoupling is the whole point.
Forgetting that consumers see duplicates
Most queues and event buses deliver at-least-once in practice. Consumer code must be idempotent: record the event ID in a dedupe table in the same transaction as the work, and skip events whose ID is already recorded. Without this, retries and rebalances cause double-charges, double-emails, double-everything.
Treating schema changes as backend-internal
Once an event is published, it is a public contract. Adding a required field, renaming a field, or changing a type silently breaks every consumer. Use a schema registry with compatibility rules; treat schema changes like API changes.
Publishing too many events ('event firehose')
Not every internal state change deserves an event. Publish events that are meaningful to other bounded contexts (OrderCreated, OrderShipped) - not every micro-state-change (OrderRowLocked, CartFieldHover). Over-publishing creates an unmanageable taxonomy and overloaded consumers.
Ignoring observability until something breaks
Async event flows are nearly impossible to debug without correlation IDs and distributed tracing. Pass a trace ID with every event, log it on every consumer, integrate with OpenTelemetry from day one. Adding tracing after the system has 50 event types is brutal.
