System Design Article

Message Queues (Kafka, RabbitMQ, SQS)

Difficulty: Medium

Message queues let one service hand work to another without waiting, smoothing traffic spikes, decoupling services, and surviving downstream outages. This lesson covers the two queue families (broker-based like RabbitMQ and SQS vs log-based like Kafka), the delivery semantics (at-most-once, at-least-once, exactly-once), the operational essentials (DLQs, consumer groups, backpressure, ordering), and the trade-offs that decide between Kafka, RabbitMQ, and SQS for any given workload. By the end you can pick a queue and defend the choice with the per-property reasoning interviewers reward.

Message Queues (Kafka, RabbitMQ, SQS)

System Design

Medium

message-queue

kafka

rabbitmq

sqs

async-processing

pub-sub

distributed-systems

system-design

intermediate

free

932 views

What is a Message Queue?

A message queue is a buffer between a producer (the service that creates work) and a consumer (the service that does it). The producer hands a message to the queue and moves on; the consumer pulls from the queue at its own pace.

Text

---------- Producer / queue / consumer ----------
  [Producer A] --message--> [ QUEUE / BROKER ] --message--> [Consumer X]
  [Producer B] --message-->                   --message--> [Consumer Y]

What this buys you:

Decoupling: producer does not know about consumers; either side can change without breaking the other.
Smoothing: short-term spikes in producer traffic are absorbed by the queue instead of crashing the consumer.
Fault tolerance: if the consumer is down, messages accumulate; when it comes back, it processes the backlog.
Async work: producers do not block waiting for slow downstream operations.
Fan-out: one message can be delivered to many consumers (pub/sub).

The trade: latency. A queue adds at least one network hop and queuing time to the request. For workflows where the user is waiting (synchronous request/response), a queue is wrong; for everything else, it is the default.

Two Families: Broker-Based vs Log-Based

Message systems split into two architectural patterns. The choice shapes nearly every other decision.

Broker-Based Queues (RabbitMQ, SQS, Azure Service Bus)

The broker holds messages in queues and deletes them once a consumer acknowledges receipt. Each message is delivered to one consumer (point-to-point) or fanned out via exchanges/topics.

Text

---------- Broker-based: messages deleted on ack ----------
  Producer --> [Queue: msg1, msg2, msg3] --> Consumer
              |
              | Consumer acks msg1
              v
              [Queue: msg2, msg3]

Properties:

Messages are transient: once acked, they are gone.
Each message is delivered to one consumer in a queue (fan-out via separate queues per consumer).
Per-message metadata: priority, TTL, routing key.
Lower throughput per broker (~10K-100K msg/sec) but flexible routing.

Best for: task queues (image processing, email sending), RPC over messaging, complex routing rules.

Log-Based Queues (Kafka, Apache Pulsar, AWS Kinesis)

The broker stores messages in an append-only log. Consumers read at their own offset and the log is retained for a configurable time (hours to forever). Many consumers can read the same log independently.

Text

---------- Log-based: retained, offset-based reads ----------
  Producer --append--> [Log: [m1][m2][m3][m4][m5][m6]...]
                          ^                ^
                  consumer A           consumer B
                  offset=2             offset=5

Properties:

Messages are persistent for the retention window (default 7 days, often 30-90).
Each consumer (group) tracks its own offset; replay is trivial.
Single-partition messages are strictly ordered.
Very high throughput (~1M msg/sec per broker on commodity hardware).
Less flexible per-message routing (partitioning by key, not arbitrary rules).

Best for: event sourcing, log aggregation, stream processing, change data capture, audit trails.

Quick Comparison

Property	RabbitMQ / SQS (broker)	Kafka (log)
Storage model	Queue per consumer; delete on ack	Append-only log per topic
Throughput per node	~10K-100K msg/sec	~1M msg/sec
Retention	Until consumed	Hours to forever (config)
Replay	No (or cumbersome)	Yes (just rewind offset)
Per-message TTL	Yes	No (only topic-level)
Routing flexibility	Rich (exchanges, headers, topics)	Partition by key
Ordering	Per-queue best-effort	Per-partition strict
Fan-out	Multiple queues bound to one exchange	Multiple consumer groups on one topic
Use cases	Task queues, RPC, complex routing	Event streaming, logs, CDC, analytics

Delivery Semantics

The single most important concept in message queuing. Every queue gives you one of three guarantees - know which.

At-Most-Once

The message is delivered zero or one times. Loss is possible; duplicates are not.

How it happens: producer sends the message and does not wait for an ack; consumer processes the message and does not ack it back. Any failure in transit or processing means the message is lost.

Use when: messages are non-critical (metrics, telemetry samples) or will be replaced by later messages (cache invalidations).

At-Least-Once

The message is delivered one or more times. Duplicates are possible; loss is not (assuming the broker survived).

How it happens: producer retries until it gets an ack; consumer must ack only after successfully processing. If the consumer crashes after processing but before acking, the broker redelivers.

Use when: most production systems. Pair with idempotent consumers so duplicates are harmless.

This is the default for Kafka, SQS standard, RabbitMQ with manual ack.

Exactly-Once

The message is delivered exactly one time. No loss, no duplicates.

How it happens: requires the broker to participate in a transaction with the consumer's downstream effects. Kafka EOS (Exactly-Once Semantics) achieves this between Kafka topics by combining producer idempotency, transactional writes, and consumer offset commits in one atomic unit. SQS FIFO with deduplication ID achieves it within a 5-minute window.

Caveat: 'exactly-once' is exactly-once within the system that supports it (e.g., Kafka topic to Kafka topic). Anytime the consumer writes to an external system (database, third-party API), exactly-once requires the consumer to be idempotent on that side as well. There is no exactly-once delivery to an arbitrary external system without idempotency at the destination.

Use when: financial transactions where duplicates are unacceptable AND idempotency is hard.

How to Achieve At-Least-Once + Idempotent Consumer (the practical default)

JavaScript

Python

async function processOrderEvent(message) {
    const orderId = message.payload.orderId;
    const messageId = message.id;

    // Idempotency check
    const seen = await db.processedMessages.find({ id: messageId });
    if (seen) return;

    await db.transaction(async (tx) => {
        await tx.processedMessages.insert({ id: messageId, processedAt: Date.now() });
        await tx.orders.update({ id: orderId, status: 'paid' });
    });

    await message.ack();
}

This pattern - dedupe table + transactional update + ack-after-commit - is the production-tested way to handle at-least-once with effective exactly-once outcomes.

Ordering Guarantees

RabbitMQ / SQS Standard

Best-effort FIFO. Reordering can happen during retries, redelivery, or competing consumers.
For strict ordering, use a single consumer per queue (limits throughput) or SQS FIFO queues (with throughput limits).

Kafka

Strict ordering within a partition. Messages with the same partition key always land on the same partition and are consumed in order.
No ordering across partitions. To preserve order for a logical stream (e.g., all events for a user), partition by that key (partitionKey = userId).
Trade-off: ordering scales by parallelism. More partitions = more throughput but order only within partition.

SQS FIFO

Strict FIFO within a 'message group'. Throughput per group is limited to 300 msg/sec (3K with batching).
Use a fine-grained group key (per-user, per-order) so groups are independent and total throughput scales.

Consumer Groups and Partitioning

Kafka's flagship pattern: a topic is split into N partitions, and a consumer group has up to N consumers, each owning some partitions.

Text

---------- Kafka consumer group ----------
  Topic 'orders' (4 partitions: P0, P1, P2, P3)

  Consumer group 'billing':
    consumer 1 -> [P0, P1]
    consumer 2 -> [P2, P3]

  Consumer group 'analytics' (independent of billing):
    consumer A -> [P0]
    consumer B -> [P1]
    consumer C -> [P2]
    consumer D -> [P3]

Properties:

Each partition is consumed by exactly one consumer in a group.
Adding consumers up to N gives linear throughput scaling; beyond N, extra consumers are idle.
Different consumer groups read independently - the same message is delivered once to each group.
Rebalances happen when consumers join or leave; partitions are reassigned.

This is the secret to Kafka's throughput. Each partition is a sequential write/read on disk; parallelism comes from many partitions, not from contention on a single queue.

Dead Letter Queues (DLQs)

A message that fails repeatedly (poison message, malformed payload, bug in consumer) cannot just spin forever. The standard pattern: after N retries, move the message to a dead letter queue.

Text

---------- DLQ flow ----------
  Main queue -> consumer
     |
     | retry up to 3 times
     v
  DLQ <- on Nth failure, move here
  (operator inspects, fixes the bug, replays manually)

Every production queue should have a DLQ. SQS, RabbitMQ, and Kafka (via consumer-side libraries) all support this. Monitor DLQ depth as a critical alert.

Backpressure

What if the producer pushes faster than the consumer can drain? Without bounds, the queue grows unboundedly until the broker runs out of disk.

Mitigations:

Producer-side rate limiting: producers respect a max-publish-rate or check the queue depth before publishing.
Bounded queues: configure max queue depth; the broker rejects new messages when full (RabbitMQ overflow strategies, SQS in-flight limits).
Consumer scaling: auto-scale consumers based on queue depth (HPA on SQS queue depth is a standard pattern).
Drop policies: oldest-message-drop or random-drop for non-critical streams (telemetry).

Kafka does not exert backpressure on producers by default; it just keeps writing to disk until the partition fills. This makes it great for ingest but you must monitor disk usage and consumer lag explicitly.

How to Pick: Decision Matrix

Workload	Best fit	Why
High-throughput event stream (>100K msg/sec)	Kafka	Designed for log-scale throughput; per-partition ordering.
Task queue (image resize, email send)	RabbitMQ, SQS	Per-message TTL, priorities, easy DLQ; ack model fits worker pattern.
Pub/sub fan-out across many subscribers	Kafka, RabbitMQ topic exchange	Multiple consumer groups read independently.
RPC over messaging	RabbitMQ	Reply-to and correlation-id are first-class.
Event sourcing / audit log	Kafka	Long retention, replay, append-only fits perfectly.
Cross-region replication of events	Kafka MirrorMaker, AWS MSK	Built for high-throughput multi-region streaming.
AWS-native serverless workflow	SQS + Lambda	Native triggers, no infra to manage.
Strict per-message ordering at modest scale	SQS FIFO	First-class FIFO with dedup; capped throughput.
Change Data Capture pipeline	Kafka + Debezium	The standard CDC stack.

Operational Essentials

Monitor consumer lag (Kafka) or queue depth (SQS, RabbitMQ). Lag growing faster than it shrinks is the first sign of trouble.
Always have a DLQ. Poison messages will happen.
Set retention deliberately. Too short and you lose replay capability; too long and you pay for storage forever.
Partition keys matter. The wrong key destroys ordering or creates hot partitions.
Test consumer crashes. Kill the consumer mid-message; verify duplicates are tolerated and no messages are lost.
Rate-limit producers. Especially in serverless; one runaway Lambda can flood a queue in seconds.

How to Talk About This in an Interview

Lead with the family. 'For an event-streaming workload I would use Kafka; for a task queue I would use RabbitMQ or SQS. They are different categories.'
State the delivery semantic. 'At-least-once with idempotent consumers is the production default. Exactly-once is possible within Kafka but cannot be guaranteed if the consumer writes to an external system without its own idempotency.'
Mention partitioning by a meaningful key. 'I would partition by user_id so all events for one user land on one partition and are processed in order.'
Bring up DLQs unprompted. 'Every queue needs a DLQ; messages that fail N times move there for manual inspection.'
Acknowledge backpressure. 'If consumers cannot keep up, queue depth grows; we monitor it and auto-scale consumers, or rate-limit at the producer.'
Name the operational risk. 'The biggest risk in async systems is silent failure - messages dropped on the floor with no alert. Monitoring lag and DLQ depth is non-negotiable.'

Quick Review

Two families: broker-based (RabbitMQ, SQS - delete on ack) and log-based (Kafka - retained log).
Three delivery semantics: at-most-once, at-least-once, exactly-once. At-least-once + idempotent consumer is the practical default.
Kafka partitions give per-key ordering and parallelism; pick partition keys carefully.
Consumer groups let multiple consumers split a topic's partitions; multiple groups read independently.
DLQs catch poison messages; monitor depth.
Pick by use case: high-throughput streaming -> Kafka; task queue -> RabbitMQ/SQS; AWS-native -> SQS+Lambda.

Real-World Examples

How real systems implement this in production

LinkedIn / Confluent Kafka

Kafka was built at LinkedIn to handle the firehose of activity events (page views, profile updates, messaging events). Today LinkedIn runs Kafka at over 7 trillion messages per day across thousands of brokers. Confluent commercialized Kafka and runs it as a managed service for many enterprises.

Trade-off: Kafka's log-based design gives extreme throughput and replay flexibility, but the operational burden is significant: ZooKeeper (or KRaft) cluster, broker tuning, partition rebalancing, retention sizing. The win is at scale; for small workloads, managed alternatives (SQS, RabbitMQ) are far cheaper to operate.

Stripe webhook fan-out (RabbitMQ)

Stripe uses RabbitMQ extensively for webhook delivery to merchants. When a payment event happens, the webhook service publishes to a RabbitMQ queue; consumers fetch the merchant's webhook URL, retry on failure (with exponential backoff), and move to a DLQ after N retries. The flexible per-message TTL and priority queues fit this workload better than a log-based system.

Trade-off: RabbitMQ's flexible routing and per-message metadata are perfect for task-queue and RPC patterns but cap throughput well below Kafka. Stripe's choice reflects that webhooks are inherently bounded by external HTTP latency, so Kafka's throughput would be wasted.

Netflix Keystone (Kafka + Flink)

Netflix's Keystone pipeline ingests trillions of events per day (playback events, UI interactions, error logs) into Kafka, then processes them with Flink for real-time analytics and ML feature pipelines. Kafka is the backbone because of its throughput and the ability for many independent consumers (analytics, monitoring, ML training) to read the same stream.

Trade-off: A single high-throughput log accessible to many independent consumer groups is exactly Kafka's sweet spot. The cost is operational: dozens of brokers, terabytes of retention, careful partition planning. For a smaller pipeline, Kinesis or managed Kafka would have been cheaper to run.

AWS SQS at Amazon retail

Amazon's order pipeline famously uses SQS for asynchronous processing: order placement publishes to a queue; downstream workers handle inventory, payment, shipping, and fulfillment in parallel. SQS's at-least-once delivery and visibility timeout are well-suited to retry-heavy workflows where each step is idempotent.

Trade-off: SQS is fully managed, virtually unlimited throughput on standard queues, and 'just works' for AWS-native architectures. The cost: less flexibility than Kafka or RabbitMQ for advanced patterns (exactly-once, complex routing, replay). For the order pipeline pattern, it is the right answer.

Quick Interview Phrases

Key terms to use in your answer

at-least-once with idempotent consumer

consumer group

partitioning by key

dead letter queue

consumer lag

exactly-once semantics

Common Interview Questions

Questions you might be asked about this topic

When would you pick Kafka vs RabbitMQ vs SQS for a new project?

Kafka for high-throughput event streams (>100K msg/sec), event sourcing, log aggregation, anything needing replay or long retention. RabbitMQ for task queues with rich routing, per-message TTL, priorities, RPC patterns - typical worker pool with diverse jobs. SQS for AWS-native serverless workflows with no infra to run; SQS Standard for high-throughput, SQS FIFO for strict ordering at moderate scale. Consider operational cost: SQS is fully managed; Kafka has the highest operational burden; RabbitMQ is in between. Pick by workload first, then by ops capacity.

Explain at-least-once delivery and how to make it work in practice.

How does Kafka guarantee ordering?

How do you handle a poison message that crashes your consumer every time?

What is the outbox pattern and how does it relate to message queues?

Interview Tips

How to discuss this topic effectively

Pick the queue family before the brand. 'Log-based for streaming, broker-based for task queues' is more sophisticated than 'we use Kafka because it is fast'.

Always state the delivery semantic explicitly. At-least-once with idempotent consumers is the production-grade answer for almost every real workload.

Mention partition keys when discussing Kafka. The wrong key destroys ordering or creates hot partitions; the right key (often user_id or entity_id) gives both ordering and even load.

Bring up DLQs unprompted. Every production queue has one; talking about it shows you have operated, not just designed, async systems.

Acknowledge that exactly-once is bounded. Kafka EOS works topic-to-topic; once you write to an external DB or API, the consumer must be idempotent there too. Senior interviewers love to probe this.

Common Mistakes

Pitfalls to avoid in interviews

Treating Kafka and RabbitMQ as interchangeable

They solve different problems. Kafka is a log: high throughput, per-partition ordering, retained for replay. RabbitMQ is a broker: rich routing, per-message TTL, deleted on ack. Picking Kafka for a task queue means paying for log retention you do not need; picking RabbitMQ for an event stream means hitting throughput limits.

Promising exactly-once without idempotent consumers

True exactly-once requires the entire path - producer, broker, consumer, downstream effects - to be transactional. Kafka EOS handles topic-to-topic. Anytime your consumer writes to an external database or API, you need idempotency keys or a dedupe table at the destination, no matter what the broker promises.

Forgetting to set up a DLQ

A poison message (malformed payload, bug in consumer) without a DLQ either spins forever (consuming resources) or stops the entire queue (head-of-line blocking). Every production queue needs a DLQ with a depth alert.

Picking partition key by random ID instead of meaningful key

Partitioning by random UUID gives even load but no ordering. If your business logic needs per-entity ordering (events for a user, updates to an order), partition by that entity ID. Hot keys can be mitigated with key-suffix sharding for the few mega-entities.

Ignoring consumer lag monitoring

Consumer lag (the gap between the latest message and the consumer's offset) is the first sign of any consumer problem - bug, slow downstream, runaway producer. Without lag monitoring you only learn about the issue when the disk fills or downstream queries time out.

Back to System Design