System Design Article
Message Queues (Kafka, RabbitMQ, SQS)
Difficulty: Medium
Message queues let one service hand work to another without waiting, smoothing traffic spikes, decoupling services, and surviving downstream outages. This lesson covers the two queue families (broker-based like RabbitMQ and SQS vs log-based like Kafka), the delivery semantics (at-most-once, at-least-once, exactly-once), the operational essentials (DLQs, consumer groups, backpressure, ordering), and the trade-offs that decide between Kafka, RabbitMQ, and SQS for any given workload. By the end you can pick a queue and defend the choice with the per-property reasoning interviewers reward.
What is a Message Queue?
A message queue is a buffer between a producer (the service that creates work) and a consumer (the service that does it). The producer hands a message to the queue and moves on; the consumer pulls from the queue at its own pace.
---------- Producer / queue / consumer ----------
[Producer A] --message--> [ QUEUE / BROKER ] --message--> [Consumer X]
[Producer B] --message-->                    --message--> [Consumer Y]

What this buys you:
- Decoupling: producer does not know about consumers; either side can change without breaking the other.
- Smoothing: short-term spikes in producer traffic are absorbed by the queue instead of crashing the consumer.
- Fault tolerance: if the consumer is down, messages accumulate; when it comes back, it processes the backlog.
- Async work: producers do not block waiting for slow downstream operations.
- Fan-out: one message can be delivered to many consumers (pub/sub).
The trade-off is latency. A queue adds at least one network hop plus queuing time to every request. When a user is waiting on the result (synchronous request/response), a queue is usually the wrong tool; for most background and cross-service work, it is a sensible default.
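A minimal in-memory sketch of the producer / queue / consumer hand-off described above. The `SimpleQueue` class and its method names are invented for illustration; a real broker adds persistence, networking, and acknowledgements.

```javascript
// Hypothetical in-memory stand-in for a broker; not a real client library.
class SimpleQueue {
  constructor() {
    this.messages = [];
  }
  // Producer side: hand off the message and return immediately.
  publish(message) {
    this.messages.push(message);
  }
  // Consumer side: pull the oldest message, or undefined if the queue is empty.
  pull() {
    return this.messages.shift();
  }
  get depth() {
    return this.messages.length;
  }
}

const queue = new SimpleQueue();

// A burst of producer traffic is absorbed by the queue.
for (let i = 1; i <= 5; i++) {
  queue.publish({ id: i, task: 'resize-image' });
}
const depthAfterBurst = queue.depth; // 5 messages waiting

// The consumer drains at its own pace, here two messages in this pass.
const handled = [];
for (let n = 0; n < 2; n++) {
  handled.push(queue.pull().id);
}
console.log(depthAfterBurst, handled, queue.depth);
```

The producer loop never waits on the consumer; the remaining depth is the backlog the consumer works through later.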
Two Families: Broker-Based vs Log-Based
Message systems split into two architectural patterns. The choice shapes nearly every other decision.
Broker-Based Queues (RabbitMQ, SQS, Azure Service Bus)
The broker holds messages in queues and deletes them once a consumer acknowledges receipt. Each message is delivered to one consumer (point-to-point) or fanned out via exchanges/topics.
---------- Broker-based: messages deleted on ack ----------
Producer --> [Queue: msg1, msg2, msg3] --> Consumer
                      |
                      | Consumer acks msg1
                      v
             [Queue: msg2, msg3]

Properties:
- Messages are transient: once acked, they are gone.
- Each message is delivered to one consumer in a queue (fan-out via separate queues per consumer).
- Per-message metadata: priority, TTL, routing key.
- Lower throughput per broker (~10K-100K msg/sec) but flexible routing.
Best for: task queues (image processing, email sending), RPC over messaging, complex routing rules.
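The delete-on-ack behaviour can be sketched with a small in-memory model. `BrokerQueue`, `receive`, `ack`, and `nack` are hypothetical names chosen for this example, not the API of RabbitMQ or SQS.

```javascript
// Hypothetical in-memory broker queue; names are illustrative only.
class BrokerQueue {
  constructor() {
    this.ready = [];          // messages waiting for delivery
    this.unacked = new Map(); // delivered but not yet acknowledged
  }
  publish(message) {
    this.ready.push(message);
  }
  // Deliver the oldest ready message and hold it until ack or nack.
  receive() {
    const message = this.ready.shift();
    if (message) this.unacked.set(message.id, message);
    return message;
  }
  // Ack: the broker deletes the message for good.
  ack(id) {
    this.unacked.delete(id);
  }
  // Nack (or consumer crash): the message goes back to the front of the queue.
  nack(id) {
    const message = this.unacked.get(id);
    if (message) {
      this.unacked.delete(id);
      this.ready.unshift(message);
    }
  }
}

const tasks = new BrokerQueue();
tasks.publish({ id: 'msg1' });
tasks.publish({ id: 'msg2' });

const first = tasks.receive();   // msg1 is now in flight
tasks.nack(first.id);            // processing failed, so msg1 is requeued
const retried = tasks.receive(); // msg1 is delivered again
tasks.ack(retried.id);           // success: msg1 is gone
console.log(tasks.ready.map((m) => m.id)); // only msg2 remains
```

Once `ack` runs there is no way to read `msg1` again, which is the key difference from the log-based family.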
Log-Based Queues (Kafka, Apache Pulsar, AWS Kinesis)
The broker stores messages in an append-only log. Consumers read at their own offset and the log is retained for a configurable time (hours to forever). Many consumers can read the same log independently.
---------- Log-based: retained, offset-based reads ----------
Producer --append--> [Log: [m1][m2][m3][m4][m5][m6]...]
                                    ^           ^
                               consumer A  consumer B
                                offset=2    offset=5

Properties:
- Messages are persistent for the retention window (Kafka defaults to 7 days; 30-90 days is a common setting).
- Each consumer (group) tracks its own offset; replay is trivial.
- Single-partition messages are strictly ordered.
- Very high throughput (~1M msg/sec per broker on commodity hardware).
- Less flexible per-message routing (partitioning by key, not arbitrary rules).
Best for: event sourcing, log aggregation, stream processing, change data capture, audit trails.
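A rough sketch of the retained-log model, assuming a single in-memory partition. `MessageLog`, `poll`, and `seek` are illustrative names; real systems persist the log and store offsets durably.

```javascript
// Hypothetical in-memory log; real systems persist and partition this.
class MessageLog {
  constructor() {
    this.entries = [];        // append-only: reading never removes entries
    this.offsets = new Map(); // consumer group name -> next offset to read
  }
  append(message) {
    this.entries.push(message);
    return this.entries.length - 1; // offset of the new entry
  }
  // Read up to `max` entries for a group and advance that group's offset.
  poll(group, max = 10) {
    const start = this.offsets.get(group) ?? 0;
    const batch = this.entries.slice(start, start + max);
    this.offsets.set(group, start + batch.length);
    return batch;
  }
  // Replay: move a group's offset to any position in the log.
  seek(group, offset) {
    this.offsets.set(group, offset);
  }
}

const log = new MessageLog();
['m1', 'm2', 'm3', 'm4'].forEach((m) => log.append(m));

const billingBatch = log.poll('billing', 2);     // first two entries
const analyticsBatch = log.poll('analytics', 4); // all four, independent of billing
log.seek('billing', 0);                          // rewind billing to the start
const replayed = log.poll('billing', 4);         // all four again
console.log(billingBatch, analyticsBatch, replayed);
```

Each group only moves its own offset, so one group replaying history has no effect on the others.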
Quick Comparison
| Property | RabbitMQ / SQS (broker) | Kafka (log) |
|---|---|---|
| Storage model | Queue per consumer; delete on ack | Append-only log per topic |
| Throughput per node | ~10K-100K msg/sec | ~1M msg/sec |
| Retention | Until consumed | Hours to forever (config) |
| Replay | No (or cumbersome) | Yes (just rewind offset) |
| Per-message TTL | RabbitMQ: yes; SQS: queue-level retention only | No (topic-level retention only) |
| Routing flexibility | Rich (exchanges, headers, topics) | Partition by key |
| Ordering | Per-queue best-effort | Per-partition strict |
| Fan-out | Multiple queues bound to one exchange | Multiple consumer groups on one topic |
| Use cases | Task queues, RPC, complex routing | Event streaming, logs, CDC, analytics |
Delivery Semantics
Delivery semantics are the single most important concept in message queuing. Every queue gives you one of three guarantees; know which one you are getting.
At-Most-Once
The message is delivered zero or one times. Loss is possible; duplicates are not.
How it happens: the producer sends the message and does not wait for an ack, and the broker treats the message as done the moment it is handed to the consumer (auto-ack, or an ack sent before processing). Any failure in transit or during processing means the message is lost.
Use when: messages are non-critical (metrics, telemetry samples) or will be replaced by later messages (cache invalidations).
At-Least-Once
The message is delivered one or more times. Duplicates are possible; loss is not (assuming the broker survived).
How it happens: producer retries until it gets an ack; consumer must ack only after successfully processing. If the consumer crashes after processing but before acking, the broker redelivers.
Use when: most production systems. Pair with idempotent consumers so duplicates are harmless.
This is the default for Kafka, SQS standard, RabbitMQ with manual ack.
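The difference between at-most-once and at-least-once comes down to whether the ack happens before or after the work. The simulation below is a hypothetical in-memory model (one message, one consumer that crashes on its first attempt), not a real client API.

```javascript
// Simulates one message and a consumer that crashes on its first attempt.
function deliver(ackBeforeProcessing) {
  let acked = false;
  let attempts = 0;       // how many times the broker delivered the message
  let timesProcessed = 0; // how many times the side effect completed

  // The broker redelivers until the message has been acked.
  while (!acked) {
    attempts++;
    const crashes = attempts === 1; // the first delivery hits a consumer crash
    if (ackBeforeProcessing) {
      acked = true;            // at-most-once: ack first...
      if (crashes) continue;   // ...so a crash here loses the message
      timesProcessed++;
    } else {
      timesProcessed++;        // at-least-once: do the work first...
      if (crashes) continue;   // ...a crash before the ack causes a redelivery
      acked = true;
    }
  }
  return { attempts, timesProcessed };
}

const atMostOnce = deliver(true);   // delivered once, never processed: lost
const atLeastOnce = deliver(false); // delivered twice, processed twice: duplicate
console.log(atMostOnce, atLeastOnce);
```

The duplicate in the second case is why at-least-once is paired with idempotent consumers.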
Exactly-Once
The message is delivered exactly one time. No loss, no duplicates.
How it happens: requires the broker to participate in a transaction with the consumer's downstream effects. Kafka EOS (Exactly-Once Semantics) achieves this between Kafka topics by combining producer idempotency, transactional writes, and consumer offset commits in one atomic unit. SQS FIFO approximates it on the publish side: sends that share a deduplication ID within a 5-minute window are accepted only once.
Caveat: 'exactly-once' is exactly-once within the system that supports it (e.g., Kafka topic to Kafka topic). Anytime the consumer writes to an external system (database, third-party API), exactly-once requires the consumer to be idempotent on that side as well. There is no exactly-once delivery to an arbitrary external system without idempotency at the destination.
Use when: financial transactions where duplicates are unacceptable AND idempotency is hard.
How to Achieve At-Least-Once + Idempotent Consumer (the practical default)
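async function processOrderEvent(message) {
  const orderId = message.payload.orderId;
  const messageId = message.id;

  // Idempotency check: a redelivered duplicate is acked and skipped,
  // otherwise the broker would keep redelivering it.
  const seen = await db.processedMessages.find({ id: messageId });
  if (seen) {
    await message.ack();
    return;
  }

  // Record the message and apply the change in one transaction, so a crash
  // leaves either both writes or neither. Assuming a unique constraint on
  // processedMessages.id, a concurrent duplicate fails here and rolls back.
  await db.transaction(async (tx) => {
    await tx.processedMessages.insert({ id: messageId, processedAt: Date.now() });
    await tx.orders.update({ id: orderId, status: 'paid' });
  });

  // Ack only after the commit; a crash before this line causes a redelivery.
  await message.ack();
}

This pattern (dedupe table + transactional update + ack-after-commit) is the production-tested way to get effectively-exactly-once outcomes on top of at-least-once delivery.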
Ordering Guarantees
RabbitMQ / SQS Standard
- Best-effort FIFO. Reordering can happen during retries, redelivery, or competing consumers.
- For strict ordering, use a single consumer per queue (limits throughput) or SQS FIFO queues (with throughput limits).
Kafka
- Strict ordering within a partition. Messages with the same partition key always land on the same partition and are consumed in order.
- No ordering across partitions. To preserve order for a logical stream (e.g., all events for a user), partition by that key (`partitionKey = userId`).
- Trade-off: ordering is traded against parallelism. More partitions mean more throughput, but order holds only within each partition.
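Key-based partitioning boils down to "hash the key, take it modulo the partition count". The sketch below uses an FNV-1a-style string hash purely for illustration; Kafka's default partitioner uses a different hash (murmur2), and `hashKey` and `partitionFor` are names made up for this example.

```javascript
// Illustrative 32-bit string hash (FNV-1a style). Not Kafka's actual hash.
function hashKey(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Same key in, same partition out, as long as the partition count is unchanged.
function partitionFor(key, partitionCount) {
  return hashKey(key) % partitionCount;
}

const partitionCount = 4;
const events = [
  { userId: 'user-17', type: 'cart_add' },
  { userId: 'user-42', type: 'login' },
  { userId: 'user-17', type: 'checkout' },
];
// Every event for the same userId maps to the same partition.
const placed = events.map((e) => ({
  ...e,
  partition: partitionFor(e.userId, partitionCount),
}));
console.log(placed);
```

Because the result depends on `partitionCount`, adding partitions later can move a key to a different partition, which is one reason partition counts are planned up front.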
SQS FIFO
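- Strict FIFO within a 'message group'. By default a FIFO queue is limited to roughly 300 API calls per second per action (about 3K msg/sec with batches of 10); high-throughput mode raises that ceiling.
- Use a fine-grained group key (per-user, per-order) so groups are independent and consumers can process different groups in parallel.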
Consumer Groups and Partitioning
Kafka's flagship pattern: a topic is split into N partitions, and a consumer group has up to N consumers, each owning some partitions.
---------- Kafka consumer group ----------
Topic 'orders' (4 partitions: P0, P1, P2, P3)
Consumer group 'billing':
  consumer 1 -> [P0, P1]
  consumer 2 -> [P2, P3]
Consumer group 'analytics' (independent of billing):
  consumer A -> [P0]
  consumer B -> [P1]
  consumer C -> [P2]
  consumer D -> [P3]

Properties:
- Each partition is consumed by exactly one consumer in a group.
- Adding consumers up to N gives linear throughput scaling; beyond N, extra consumers are idle.
- Different consumer groups read independently - the same message is delivered once to each group.
- Rebalances happen when consumers join or leave; partitions are reassigned.
This is the secret to Kafka's throughput. Each partition is a sequential write/read on disk; parallelism comes from many partitions, not from contention on a single queue.
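To make the assignment rules concrete, here is a simplified range-style assignment that hands each consumer a contiguous block of partitions. Kafka's real assignors (range, round-robin, sticky, cooperative) are more involved; `assignPartitions` is a hypothetical helper.

```javascript
// Simplified range-style assignment: contiguous blocks, earlier consumers
// get one extra partition when the split is uneven.
function assignPartitions(partitions, consumers) {
  const assignment = new Map();
  const base = Math.floor(partitions.length / consumers.length);
  const extra = partitions.length % consumers.length;
  let next = 0;
  consumers.forEach((consumer, i) => {
    const count = base + (i < extra ? 1 : 0);
    assignment.set(consumer, partitions.slice(next, next + count));
    next += count;
  });
  return assignment;
}

const partitions = ['P0', 'P1', 'P2', 'P3'];

// Two consumers split the four partitions evenly.
const two = assignPartitions(partitions, ['c1', 'c2']);

// A third consumer joins: a rebalance hands out the partitions again.
const three = assignPartitions(partitions, ['c1', 'c2', 'c3']);

// Five consumers for four partitions: the fifth owns nothing and sits idle.
const five = assignPartitions(partitions, ['c1', 'c2', 'c3', 'c4', 'c5']);

console.log(Object.fromEntries(two), Object.fromEntries(three), Object.fromEntries(five));
```

Every partition has exactly one owner within the group in all three cases, which is what preserves per-partition ordering.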
Dead Letter Queues (DLQs)
A message that fails repeatedly (poison message, malformed payload, bug in consumer) cannot just spin forever. The standard pattern: after N retries, move the message to a dead letter queue.
---------- DLQ flow ----------
Main queue -> consumer
                 |
                 | retry up to 3 times
                 v
                DLQ  <- on Nth failure, move here
(operator inspects, fixes the bug, replays manually)

Every production queue should have a DLQ. SQS, RabbitMQ, and Kafka (via consumer-side libraries) all support this. Monitor DLQ depth as a critical alert.
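The retry-then-park flow can be sketched as a consumer loop with an attempt counter. The queue here is a plain array and `drain` and `MAX_ATTEMPTS` are made-up names; managed brokers usually track the delivery count for you, and real consumers add a backoff delay between attempts.

```javascript
// Hypothetical consumer loop with a retry cap and a dead letter queue.
const MAX_ATTEMPTS = 3;

function drain(mainQueue, deadLetterQueue, handler) {
  const processed = [];
  while (mainQueue.length > 0) {
    const message = mainQueue.shift();
    try {
      handler(message.body);
      processed.push(message.id);
    } catch (err) {
      message.attempts = (message.attempts ?? 0) + 1;
      if (message.attempts >= MAX_ATTEMPTS) {
        // Give up: park the message with the last error for an operator.
        deadLetterQueue.push({ ...message, lastError: err.message });
      } else {
        // Requeue at the back so other messages are not blocked.
        mainQueue.push(message);
      }
    }
  }
  return processed;
}

const mainQueue = [
  { id: 'a', body: '{"orderId": 1}' },
  { id: 'b', body: 'not-json' }, // poison message: JSON.parse will throw
  { id: 'c', body: '{"orderId": 2}' },
];
const deadLetterQueue = [];
const processed = drain(mainQueue, deadLetterQueue, (body) => JSON.parse(body));
console.log(processed, deadLetterQueue);
```

The healthy messages `a` and `c` complete while `b` is retried three times and then moved aside, so the poison message never blocks the queue.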
Backpressure
What if the producer pushes faster than the consumer can drain? Without bounds, the queue grows unboundedly until the broker runs out of disk.
Mitigations:
- Producer-side rate limiting: producers respect a max-publish-rate or check the queue depth before publishing.
- Bounded queues: configure a max queue depth so the broker rejects or drops messages when full (RabbitMQ's max-length limit with an overflow policy such as reject-publish; SQS caps in-flight messages rather than total depth).
- Consumer scaling: auto-scale consumers based on queue depth (a Kubernetes Horizontal Pod Autoscaler driven by SQS queue depth is a standard pattern).
- Drop policies: oldest-message-drop or random-drop for non-critical streams (telemetry).
Kafka does not exert backpressure on producers by default; it keeps appending to disk, and retention eventually deletes old segments whether or not slow consumers have read them. This makes it great for ingest, but you must monitor disk usage and consumer lag explicitly.
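Two of the mitigations above, bounded queues and drop policies, can be illustrated with a small buffer that either refuses new messages or discards the oldest one when full. `BoundedQueue` and its `overflow` option are hypothetical, loosely modelled on the idea of broker overflow policies.

```javascript
// Hypothetical bounded buffer with two overflow policies.
class BoundedQueue {
  constructor(maxDepth, overflow = 'reject') {
    this.maxDepth = maxDepth;
    this.overflow = overflow; // 'reject' or 'drop-oldest'
    this.items = [];
    this.dropped = 0;
  }
  // Returns true if accepted, false if the producer should back off.
  offer(item) {
    if (this.items.length >= this.maxDepth) {
      if (this.overflow === 'reject') return false;
      this.items.shift(); // drop-oldest: make room by discarding the head
      this.dropped++;
    }
    this.items.push(item);
    return true;
  }
}

// Reject policy: the fourth publish is refused, signalling backpressure.
const strict = new BoundedQueue(3, 'reject');
const strictResults = [1, 2, 3, 4].map((n) => strict.offer(n));

// Drop-oldest policy: the newest data wins, suitable for telemetry.
const lossy = new BoundedQueue(3, 'drop-oldest');
[1, 2, 3, 4].forEach((n) => lossy.offer(n));

console.log(strictResults, lossy.items, lossy.dropped);
```

With the reject policy the producer learns about the overload and can slow down or retry; with drop-oldest the loss is silent, so the `dropped` counter should feed a metric.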
How to Pick: Decision Matrix
| Workload | Best fit | Why |
|---|---|---|
| High-throughput event stream (>100K msg/sec) | Kafka | Designed for log-scale throughput; per-partition ordering. |
| Task queue (image resize, email send) | RabbitMQ, SQS | Per-message TTL, priorities, easy DLQ; ack model fits worker pattern. |
| Pub/sub fan-out across many subscribers | Kafka, RabbitMQ topic exchange | Multiple consumer groups read independently. |
| RPC over messaging | RabbitMQ | Reply-to and correlation-id are first-class. |
| Event sourcing / audit log | Kafka | Long retention, replay, append-only fits perfectly. |
| Cross-region replication of events | Kafka MirrorMaker, AWS MSK | Built for high-throughput multi-region streaming. |
| AWS-native serverless workflow | SQS + Lambda | Native triggers, no infra to manage. |
| Strict per-message ordering at modest scale | SQS FIFO | First-class FIFO with dedup; capped throughput. |
| Change Data Capture pipeline | Kafka + Debezium | The standard CDC stack. |
Operational Essentials
- Monitor consumer lag (Kafka) or queue depth (SQS, RabbitMQ). Lag growing faster than it shrinks is the first sign of trouble.
- Always have a DLQ. Poison messages will happen.
- Set retention deliberately. Too short and you lose replay capability; too long and you pay for storage forever.
- Partition keys matter. The wrong key destroys ordering or creates hot partitions.
- Test consumer crashes. Kill the consumer mid-message; verify duplicates are tolerated and no messages are lost.
- Rate-limit producers. Especially in serverless; one runaway Lambda can flood a queue in seconds.
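Consumer lag, the first item in the list above, is simple arithmetic: the latest offset in each partition minus the group's committed offset. The offsets and the alert threshold below are made-up sample numbers, and `consumerLag` is a hypothetical helper rather than a Kafka API.

```javascript
// Lag per partition = latest offset in the log minus the committed offset.
function consumerLag(endOffsets, committedOffsets) {
  const perPartition = {};
  let total = 0;
  for (const [partition, end] of Object.entries(endOffsets)) {
    const committed = committedOffsets[partition] ?? 0;
    perPartition[partition] = Math.max(0, end - committed);
    total += perPartition[partition];
  }
  return { perPartition, total };
}

const lag = consumerLag(
  { P0: 1200, P1: 950, P2: 4000 }, // latest offsets (sample data)
  { P0: 1200, P1: 900, P2: 1500 }, // committed offsets (sample data)
);

const LAG_ALERT_THRESHOLD = 1000; // hypothetical threshold
const lagging = Object.keys(lag.perPartition).filter(
  (p) => lag.perPartition[p] > LAG_ALERT_THRESHOLD,
);
console.log(lag, lagging);
```

Looking at lag per partition, not just the total, helps spot a single hot or stuck partition (P2 in this sample).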
How to Talk About This in an Interview
- Lead with the family. 'For an event-streaming workload I would use Kafka; for a task queue I would use RabbitMQ or SQS. They are different categories.'
- State the delivery semantic. 'At-least-once with idempotent consumers is the production default. Exactly-once is possible within Kafka but cannot be guaranteed if the consumer writes to an external system without its own idempotency.'
- Mention partitioning by a meaningful key. 'I would partition by user_id so all events for one user land on one partition and are processed in order.'
- Bring up DLQs unprompted. 'Every queue needs a DLQ; messages that fail N times move there for manual inspection.'
- Acknowledge backpressure. 'If consumers cannot keep up, queue depth grows; we monitor it and auto-scale consumers, or rate-limit at the producer.'
- Name the operational risk. 'The biggest risk in async systems is silent failure - messages dropped on the floor with no alert. Monitoring lag and DLQ depth is non-negotiable.'
Quick Review
- Two families: broker-based (RabbitMQ, SQS - delete on ack) and log-based (Kafka - retained log).
- Three delivery semantics: at-most-once, at-least-once, exactly-once. At-least-once + idempotent consumer is the practical default.
- Kafka partitions give per-key ordering and parallelism; pick partition keys carefully.
- Consumer groups let multiple consumers split a topic's partitions; multiple groups read independently.
- DLQs catch poison messages; monitor depth.
- Pick by use case: high-throughput streaming -> Kafka; task queue -> RabbitMQ/SQS; AWS-native -> SQS+Lambda.
Real-World Examples
How real systems implement this in production
Kafka was built at LinkedIn to handle the firehose of activity events (page views, profile updates, messaging events). Today LinkedIn runs Kafka at over 7 trillion messages per day across thousands of brokers. Confluent commercialized Kafka and runs it as a managed service for many enterprises.
Trade-off: Kafka's log-based design gives extreme throughput and replay flexibility, but the operational burden is significant: ZooKeeper (or KRaft) cluster, broker tuning, partition rebalancing, retention sizing. The win is at scale; for small workloads, simpler options (fully managed SQS, or a small RabbitMQ deployment) are far cheaper to operate.
Stripe uses RabbitMQ extensively for webhook delivery to merchants. When a payment event happens, the webhook service publishes to a RabbitMQ queue; consumers fetch the merchant's webhook URL, retry on failure (with exponential backoff), and move to a DLQ after N retries. The flexible per-message TTL and priority queues fit this workload better than a log-based system.
Trade-off: RabbitMQ's flexible routing and per-message metadata are perfect for task-queue and RPC patterns but cap throughput well below Kafka. Stripe's choice reflects that webhooks are inherently bounded by external HTTP latency, so Kafka's throughput would be wasted.
Netflix's Keystone pipeline ingests trillions of events per day (playback events, UI interactions, error logs) into Kafka, then processes them with Flink for real-time analytics and ML feature pipelines. Kafka is the backbone because of its throughput and the ability for many independent consumers (analytics, monitoring, ML training) to read the same stream.
Trade-off: A single high-throughput log accessible to many independent consumer groups is exactly Kafka's sweet spot. The cost is operational: dozens of brokers, terabytes of retention, careful partition planning. For a smaller pipeline, Kinesis or managed Kafka would have been cheaper to run.
Amazon's order pipeline famously uses SQS for asynchronous processing: order placement publishes to a queue; downstream workers handle inventory, payment, shipping, and fulfillment in parallel. SQS's at-least-once delivery and visibility timeout are well-suited to retry-heavy workflows where each step is idempotent.
Trade-off: SQS is fully managed, offers virtually unlimited throughput on standard queues, and 'just works' for AWS-native architectures. The cost is less flexibility than Kafka or RabbitMQ for advanced patterns (exactly-once, complex routing, replay). For the order pipeline pattern, it is the right answer.
Common Interview Questions
Questions you might be asked about this topic
Kafka for high-throughput event streams (>100K msg/sec), event sourcing, log aggregation, anything needing replay or long retention. RabbitMQ for task queues with rich routing, per-message TTL, priorities, RPC patterns - typical worker pool with diverse jobs. SQS for AWS-native serverless workflows with no infra to run; SQS Standard for high-throughput, SQS FIFO for strict ordering at moderate scale. Consider operational cost: SQS is fully managed; Kafka has the highest operational burden; RabbitMQ is in between. Pick by workload first, then by ops capacity.
At-least-once means the broker may deliver the same message multiple times (typically due to consumer crash before ack, or broker retries). The fix is on the consumer side: make processing idempotent. Standard pattern: maintain a `processed_messages` table indexed by message ID. On each delivery: check if seen; if yes, skip; if no, do the work + insert the dedupe record in one transaction; ack. This gives effectively-exactly-once outcomes on top of at-least-once delivery. Cheaper than 'true' exactly-once and works regardless of broker.
Kafka guarantees strict ordering within a partition, not across partitions. Producers pick a partition by hashing a partition key; messages with the same key always land on the same partition and are consumed in order. To preserve order for a logical entity (user, order, account), partition by that entity's ID. Trade-off: parallelism scales by partition count, but each partition is consumed by one consumer, so throughput per logical key is bounded by single-consumer speed. Hot keys (mega-users) can saturate one partition while others sit idle.
Standard pattern: configure a maximum delivery count (e.g., 3-5). After that many failures, move the message to a Dead Letter Queue (DLQ). The main queue continues processing other messages; the DLQ holds the poison messages for manual inspection. Operators look at DLQ depth as a critical alert. Once the bug is fixed, messages can be replayed from the DLQ back to the main queue. Without this pattern, a single poison message either spins forever or blocks the queue (head-of-line blocking).
Producers often need to write to their database AND publish a message atomically (e.g., create order + emit OrderCreated event). There is no transaction across DB and message broker, so naive code drops events on partial failure. Outbox pattern: the producer writes the row AND an outbox event in one local transaction. A separate worker (polling the outbox table, or using CDC such as Debezium) publishes the events to the broker. This guarantees at-least-once event publishing without distributed transactions. Combined with idempotent consumers, it gives effectively-exactly-once outcomes.
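A minimal sketch of the outbox pattern described above, using an in-memory object in place of a database. The `db` shape, `createOrder`, and `relayOutbox` are all hypothetical; in a real system the two writes in `createOrder` would share one local database transaction.

```javascript
// Hypothetical in-memory "database" with a business table and an outbox table.
const db = { orders: [], outbox: [] };

function createOrder(order) {
  // Both rows are written together (one local transaction in a real database).
  db.orders.push(order);
  db.outbox.push({
    id: `evt-${order.id}`,
    type: 'OrderCreated',
    payload: order,
    published: false,
  });
}

// Relay: publish unpublished outbox rows, marking each only after success.
// A crash between publish and mark causes a re-publish: at-least-once.
function relayOutbox(publish) {
  for (const event of db.outbox) {
    if (event.published) continue;
    try {
      publish(event);
      event.published = true;
    } catch (err) {
      break; // broker unavailable: stop and retry on the next run
    }
  }
}

createOrder({ id: 1, total: 40 });
createOrder({ id: 2, total: 15 });

const sent = [];
let brokerUp = false;
const publish = (event) => {
  if (!brokerUp) throw new Error('broker unavailable');
  sent.push(event.id);
};

relayOutbox(publish); // broker down: nothing sent, nothing lost
brokerUp = true;
relayOutbox(publish); // both events go out on the next run
console.log(sent);
```

The order rows exist even while the broker is down, and the events follow as soon as the relay can reach it.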
Interview Tips
How to discuss this topic effectively
Pick the queue family before the brand. 'Log-based for streaming, broker-based for task queues' is more sophisticated than 'we use Kafka because it is fast'.
Always state the delivery semantic explicitly. At-least-once with idempotent consumers is the production-grade answer for almost every real workload.
Mention partition keys when discussing Kafka. The wrong key destroys ordering or creates hot partitions; the right key (often user_id or entity_id) gives both ordering and even load.
Bring up DLQs unprompted. Every production queue has one; talking about it shows you have operated, not just designed, async systems.
Acknowledge that exactly-once is bounded. Kafka EOS works topic-to-topic; once you write to an external DB or API, the consumer must be idempotent there too. Senior interviewers love to probe this.
Common Mistakes
Pitfalls to avoid in interviews
Treating Kafka and RabbitMQ as interchangeable
They solve different problems. Kafka is a log: high throughput, per-partition ordering, retained for replay. RabbitMQ is a broker: rich routing, per-message TTL, deleted on ack. Picking Kafka for a task queue means paying for log retention you do not need; picking RabbitMQ for an event stream means hitting throughput limits.
Promising exactly-once without idempotent consumers
True exactly-once requires the entire path - producer, broker, consumer, downstream effects - to be transactional. Kafka EOS handles topic-to-topic. Anytime your consumer writes to an external database or API, you need idempotency keys or a dedupe table at the destination, no matter what the broker promises.
Forgetting to set up a DLQ
A poison message (malformed payload, bug in consumer) without a DLQ either spins forever (consuming resources) or stops the entire queue (head-of-line blocking). Every production queue needs a DLQ with a depth alert.
Picking partition key by random ID instead of meaningful key
Partitioning by random UUID gives even load but no ordering. If your business logic needs per-entity ordering (events for a user, updates to an order), partition by that entity ID. Hot keys can be mitigated with key-suffix sharding for the few mega-entities.
Ignoring consumer lag monitoring
Consumer lag (the gap between the latest message and the consumer's offset) is the first sign of any consumer problem - bug, slow downstream, runaway producer. Without lag monitoring you only learn about the issue when the disk fills or downstream queries time out.
