System Design Article
Design a Rate Limiter
Difficulty: Medium
Design a distributed rate limiter that protects an API platform from abuse and uneven load while staying fast and accurate at 1B requests per day. The interview centerpiece is choosing among the five canonical algorithms (fixed window, sliding window log, sliding window counter, token bucket, leaky bucket) and explaining how to make the chosen one atomic across a Redis cluster. We cover where to place the limiter (edge, gateway, in-process), per-IP vs per-user vs per-API-key keys, returning 429 with Retry-After, the hot key problem, and fail-open vs fail-closed under cache outages.
Design a Rate Limiter
Design a distributed rate limiter that protects an API platform from abuse and uneven load while staying fast and accurate at 1B requests per day. The interview centerpiece is choosing among the five canonical algorithms (fixed window, sliding window log, sliding window counter, token bucket, leaky bucket) and explaining how to make the chosen one atomic across a Redis cluster. We cover where to place the limiter (edge, gateway, in-process), per-IP vs per-user vs per-API-key keys, returning 429 with Retry-After, the hot key problem, and fail-open vs fail-closed under cache outages.
737 views
21
Requirements
Functional Requirements
- Reject excess requests: when a client exceeds the configured limit, return HTTP 429 with a
Retry-Afterheader. - Multiple key dimensions: limit per IP, per authenticated user, per API key, and per endpoint. The same request often hits several limits in sequence (a 100 req/min IP limit AND a 10 req/sec endpoint limit).
- Configurable rules: ops team should set rules per route without redeploying. For example,
POST /paymentsallows 5 req/sec per user;GET /searchallows 100 req/sec per user. - Distributed: the same client is rate-limited consistently no matter which API server handles the request.
- Inform the client: response headers expose remaining quota (
X-RateLimit-Remaining,X-RateLimit-Reset) so well-behaved clients can self-throttle.
Out of Scope (state explicitly)
- Quota billing or paid overages (separate metering system).
- DDoS protection at L3/L4 (handled by upstream WAF / scrubbing center).
- Bot detection beyond simple counts (separate ML service).
Non-Functional Requirements
- Low latency overhead: the limiter adds < 5 ms p99 to a request. If it adds 50 ms, the cure is worse than the disease.
- High accuracy: error within ~1% of the configured limit. Off by 50% would let abusers through or lock out legitimate users.
- Highly available: 99.99%. If the rate-limit store dies, the API must keep serving (fail-open) rather than block all traffic.
- Scale: 1B requests/day across 50M users; ~12K req/sec average, ~50K req/sec peak.
- Cheap to operate: the rate limiter cannot cost more than a small fraction of API hosting.
Back-of-the-Envelope Estimation
Traffic
---------- Traffic estimation ----------
Daily requests: 1B
Average req/sec: ~12K
Peak req/sec (3x avg): ~36K
Unique users / day: 50M
Active API keys: 500K
Unique IPs / day: ~80M (NAT, mobile carriers inflate this)Memory for the Counter Store
With sliding window counter, each (key, window) pair stores a single integer.
---------- Counter memory ----------
Unique active keys per minute window: ~10M (users + IPs + API-keys + endpoints)
Per key entry: 64 bytes (key string + counter + TTL + Redis overhead)
Total per window: ~640 MB
Keep last 2 windows for the sliding calc: ~1.3 GB
Fits in a single Redis node easily; we shard for HA, not capacity.Network and Throughput
---------- Redis throughput ----------
50K req/sec peak * 1 round trip per request: 50K Redis ops/sec
A single Redis instance handles ~100K ops/sec on commodity hardware.
We still shard across 3-6 nodes for fault tolerance, not for throughput.Hot Key Risk
---------- Hot key estimation ----------
A single popular API key (Stripe webhooks, GitHub Actions org) can drive 5K req/sec.
That's 10% of total Redis throughput from one key, which lands on a single shard.
Mitigation: split hot keys across multiple counters; reconcile in aggregate.High-Level Design
---------- High-level architecture ----------
+-----------+
| Client |
+-----------+
|
v
+-------------------+
| CDN / Edge | (L7; cheap pre-check on IP)
+-------------------+
|
v
+-------------------+
| API Gateway | (rate-limit middleware)
+-------------------+
| \
| \---> +--------------------+
| | Redis Cluster | (counters, atomic Lua)
| +--------------------+
| ^
| |
| +--------------------+
| | Rules Service | (per-route limit config)
v +--------------------+
+-------------------+
| Backend Services |
+-------------------+Request Flow
- CDN edge does a coarse per-IP pre-check (reject obvious floods cheaply).
- API gateway middleware computes the candidate keys:
ip:1.2.3.4,user:42,apikey:sk_live_xxx:GET:/v1/charges. - For each key, run an atomic Lua script on the Redis shard owning that key. The script returns
(allowed, remaining, reset_ms). - If any key denies, respond
429 Too Many Requestswith headers and the strictestRetry-After. - Otherwise forward to the backend; on response add
X-RateLimit-*headers from the recorded state.
API Surface
// Internal RPC the gateway calls per key
POST /ratelimit/check
{
"key": "user:42:GET:/v1/charges",
"limit": 100,
"window_ms": 60000,
"weight": 1
}
// Response
{
"allowed": true,
"remaining": 73,
"reset_ms": 41200,
"retry_after_s": 0
}// 429 response to client
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714142400
{
"error": "rate_limit_exceeded",
"message": "Too many requests for user:42 on /v1/charges. Retry in 12s."
}Detailed Design
The two interesting components are the algorithm choice and the atomic Redis Lua script.
The Five Canonical Algorithms
| Algorithm | Memory per key | Burst handling | Accuracy | Notes |
|---|---|---|---|---|
| Fixed Window Counter | 1 int | Allows 2x burst at boundaries | Low | Simple but unsafe; clients learn to abuse the boundary |
| Sliding Window Log | O(N) where N = requests in window | Exact | Highest | Memory blows up for chatty clients |
| Sliding Window Counter | 2 ints | Smooth | ~1% error | The pragmatic default |
| Token Bucket | 2 floats (tokens, last_refill) | Allows controlled bursts | High | Best when bursts are desired (clients pre-buying capacity) |
| Leaky Bucket | 1 queue | Smoothed output, no burst | High | Best when downstream needs uniform input rate |
Fixed Window Counter
Count requests per fixed clock window (e.g., per calendar minute). Reset at the boundary.
---------- The boundary problem ----------
Limit: 10 req/min
Window [12:00:00, 12:01:00): user sends 10 requests at 12:00:59
Window [12:01:00, 12:02:00): user sends 10 requests at 12:01:00
=> 20 requests in 2 seconds, 2x the limit, all allowed.Good enough only when a 2x boundary burst is acceptable.
Sliding Window Counter (recommended default)
Keep counters for the current window AND the previous window. Estimate the count over the rolling window by weighting the previous window's count by the fraction that overlaps.
---------- Sliding window counter math ----------
Now: 12:00:42
Window size: 60 s
Previous window: [11:59:00, 12:00:00) had 80 requests
Current window: [12:00:00, 12:01:00) has 10 requests so far
Overlap fraction: 18/60 = 0.30 of previous window is still in the rolling 60s
Estimated count: 80 * 0.30 + 10 = 34
Allow if 34 < limit.Memory cost: 2 integers per key. Accuracy error: bounded by traffic uniformity within the previous window. Empirically within ~1% on real traffic.
Token Bucket
Each key has a bucket of capacity tokens. Tokens refill at refill_rate per second. Each request consumes one token; if the bucket is empty, reject.
---------- Token bucket state ----------
tokens float (current count)
last_refill_ms int64 (ms since epoch of last refill calc)
On check (now_ms, weight):
elapsed = now_ms - last_refill_ms
tokens = min(capacity, tokens + elapsed * refill_rate / 1000)
if tokens >= weight:
tokens -= weight
last_refill_ms = now_ms
return ALLOW
else:
return DENY (retry_after = (weight - tokens) / refill_rate)Best when clients should be able to burst (e.g., a batch import that submits 50 requests then waits 10 minutes is fine).
Leaky Bucket
FIFO queue of fixed size; processed at a constant leak_rate. Requests join the queue if there's space, otherwise are dropped. Output rate is constant regardless of input bursts.
Useful when the downstream service needs a smoothed input (e.g., calling a third-party API with a hard rate limit). Less common in pure user-facing rate limiting.
Sliding Window Log
Keep a sorted set of timestamps for each key. On each check, drop entries older than window and check the count.
---------- Sliding window log cost ----------
Client sends 1000 req/min. Sorted set holds 1000 timestamps per key.
At 10M active keys, log mode would need ~10B timestamps in memory.
Avoid for high-volume APIs.Great for audit-level accuracy but memory is prohibitive at scale.
Atomic Redis Lua Script for Sliding Window Counter
The naive client-side approach (GET counter, compare, INCR) has a race condition: two requests can both see 9 < 10 and both INCR to 10 and 11, allowing one over the limit. Solve with an atomic Lua script that runs server-side as a single Redis operation.
---------- Lua script (rate_limit.lua) ----------
KEYS[1] = current bucket key e.g. user:42:1714142400 (epoch minute)
KEYS[2] = previous bucket key e.g. user:42:1714142340
ARGV[1] = limit (e.g. 100)
ARGV[2] = window_ms (60000)
ARGV[3] = elapsed_in_current_window_ms (e.g. 18000)
ARGV[4] = weight (1)
ARGV[5] = ttl_seconds (120)
local cur = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local frac = 1.0 - (tonumber(ARGV[3]) / tonumber(ARGV[2]))
local estimated = math.floor(prev * frac + cur)
if estimated + tonumber(ARGV[4]) > tonumber(ARGV[1]) then
return {0, math.floor(estimated), tonumber(ARGV[2]) - tonumber(ARGV[3])}
end
redis.call('INCRBY', KEYS[1], tonumber(ARGV[4]))
redis.call('EXPIRE', KEYS[1], tonumber(ARGV[5]))
return {1, math.floor(estimated + tonumber(ARGV[4])), tonumber(ARGV[2]) - tonumber(ARGV[3])}The script runs on the Redis shard owning that key (the key is the shard hash input), so it stays atomic for that key even in a clustered deployment.
Calling the Limiter from the Gateway
import Redis from 'ioredis';
const redis = new Redis.Cluster([{ host: 'redis', port: 6379 }]);
const SCRIPT_SHA = await redis.script('LOAD', RATE_LIMIT_LUA);
async function checkLimit(key, limit, windowMs, weight = 1) {
const nowMs = Date.now();
const window = Math.floor(nowMs / windowMs);
const elapsed = nowMs - window * windowMs;
const curKey = `${key}:${window}`;
const prevKey = `${key}:${window - 1}`;
const ttl = Math.ceil(windowMs / 1000) * 2;
const [allowed, used, resetMs] = await redis.evalsha(
SCRIPT_SHA, 2, curKey, prevKey, limit, windowMs, elapsed, weight, ttl
);
return { allowed: allowed === 1, used, resetMs };
}Where to Run the Limiter
| Placement | Pros | Cons | Use when |
|---|---|---|---|
| CDN / Edge (Cloudflare, Fastly) | Cheap, blocks before reaching origin | Coarse keys (IP only); hard to share state across edges | Coarse abuse / DDoS-adjacent floods |
| API Gateway (Kong, Envoy, custom) | Sees auth context; can compose multiple limits | Adds 1-3 ms RPC to Redis | Default for production APIs |
| In-process middleware on each backend | No extra hop | Local state diverges across replicas | Single-instance services or last-resort guardrail |
In practice we layer them: a coarse edge limit catches obvious floods; the gateway enforces precise per-user/per-endpoint limits; backends keep a small in-process safety net.
Hot Key Mitigation
When one API key drives 5K req/sec on the same Redis shard:
- Sub-bucket sharding: split
user:42intouser:42:s0,user:42:s1, ...,user:42:s7. Hash request to a sub-bucket. Sum sub-bucket counts when checking. Spreads load across 8 shards at the cost of slight overshoot at the boundary. - Local pre-check: the gateway holds an in-memory token bucket per key with the same limit; if the local bucket allows, decrement local AND only consult Redis every 100 ms. Reduces Redis load 100x for hot keys at the cost of slight overshoot during the bucket's local interval.
- Adaptive limits: on detecting a hot key, raise its limit slightly while alerting ops; the tail of legitimate traffic gets through while alarms fire.
Data Model
Redis Keys (counter store)
---------- Redis key structure ----------
ratelimit:{user:42}:GET:/v1/charges:1714142400 -> integer counter, TTL 120 s
ratelimit:{ip:1.2.3.4}:1714142400 -> integer counter, TTL 120 s
ratelimit:{apikey:sk_live_xxx}:1714142400 -> integer counter, TTL 120 sThe {user:42} hash tag forces all windows for the same key onto the same Redis shard, which is required for the atomic Lua script.
Postgres (rules table)
CREATE TABLE rate_limit_rules (
id BIGSERIAL PRIMARY KEY,
scope VARCHAR(32) NOT NULL, -- 'user', 'ip', 'apikey'
route VARCHAR(255), -- nullable: applies to all routes if null
method VARCHAR(8),
limit_count INT NOT NULL,
window_ms INT NOT NULL,
weight_field VARCHAR(64), -- request size, etc.
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_rules_lookup ON rate_limit_rules (scope, route, method) WHERE is_active;Rules live in Postgres for durability; the gateway caches the rule set in memory and reloads every 30 seconds.
Audit Log (Kafka topic)
Every rate-limit decision (especially denials) is published to a Kafka topic for analytics, abuse detection, and later replay.
---------- Kafka audit event ----------
{
"ts": 1714142400123,
"key": "user:42:GET:/v1/charges",
"limit": 100,
"used": 101,
"allowed": false,
"client_ip": "1.2.3.4",
"user_agent": "...",
"trace_id": "abc123"
}Scaling and Bottlenecks
From 50K req/sec to 500K req/sec
- Shard Redis by key hash. Each key is pinned to one shard via the hash tag, so the atomic script still works.
- Local pre-check for hot keys: amplifies effective Redis throughput 100x for the keys that need it.
- Pipeline multiple key checks: if a request triggers 4 limit checks, send them as one Redis pipeline call rather than 4 round trips. Cuts gateway-to-Redis latency 4x.
When Redis is unavailable
The API must keep serving. Two policies:
- Fail-open (default for most products): on Redis timeout, allow the request and log a warning. Risk: brief abuse window during the outage.
- Fail-closed: deny the request. Use only for highly sensitive endpoints (login, payment) where abuse risk dominates downtime risk.
A gateway can have per-route fail-mode: search endpoints fail-open, payment endpoints fail-closed.
Clock Skew
The Lua script reads now_ms from the gateway and the script's window is computed gateway-side. If gateway clocks drift more than a few seconds, the same window number is used inconsistently. Mitigations: NTP-sync all gateways; use Redis's TIME command to get a single authoritative clock per shard.
Multi-Region Limits
A user from Europe shouldn't get a separate quota from a user from the US. Two options:
- Global limit, regional Redis with replication: each region has a Redis cluster; updates replicate cross-region asynchronously. Tolerates ~1-2% over-limit during replication lag, very low latency per request.
- Local-then-global: local Redis for the fast path; a slower aggregator reconciles totals every minute and pushes adjusted limits.
Most production rate limiters live with eventual consistency; exact global limits are rarely worth the latency cost.
Cost Profile
A 6-node Redis cluster on r6g.large instances handles 500K req/sec with headroom. At spot pricing that's ~$300/month. The gateway middleware adds CPU but no separate hosting cost. Total cost is a small fraction of the API tier itself.
Trade-offs and Alternatives
Sliding Window Counter vs Token Bucket
They solve different problems. Sliding window enforces 'no more than N requests per minute, smoothly distributed.' Token bucket enforces 'no more than N requests per minute, but bursts up to capacity are fine.' Pick token bucket when clients legitimately burst (batch jobs, webhooks fanning out). Pick sliding window when fairness across the window matters more.
Exact (Sliding Window Log) vs Approximate (Sliding Window Counter)
The log is exact but costs O(requests in window) memory per key. At 10M active keys with 100 req/min average, that's 1B entries in memory: not viable. The counter is ~99% accurate at O(1) memory: the right trade for everything except audit-grade billing meters.
Distributed Redis vs In-Process Only
In-process is faster (no network hop) but each replica enforces independently, so a 100 req/min limit becomes a 100*N req/min limit with N replicas. Use in-process only for single-instance services or as a defense-in-depth backup behind a real distributed limiter.
Fail-Open vs Fail-Closed
Fail-open keeps the API running during a Redis outage, accepting brief abuse. Fail-closed protects against abuse, accepting downtime. Make the default fail-open for read endpoints and fail-closed for write/payment endpoints. Configure per route.
Why Lua Script and Not Redis MULTI/EXEC?
MULTI/EXEC is atomic across multiple commands but doesn't allow conditional logic ('only INCR if count < limit'). Lua scripts give you that conditional atomicity in one round trip, which is exactly what rate limiting needs.
Why Not Use a Specialized Service Like Envoy's Global Rate Limit?
Envoy's gRPC rate-limit service is excellent and removes the need to build your own. The trade-off is extensibility: a custom Lua-on-Redis solution is easy to extend with custom logic (e.g., 'free users get half the quota during peak hours'). At small scale, just use Envoy. At scale with custom rules, build the gateway middleware.
Why Not Just Use NGINX limit_req_zone?
NGINX has built-in per-IP rate limiting using a leaky bucket. It works for one NGINX instance; coordinating across N instances is awkward (each enforces independently). Good for edge IP throttling; insufficient for per-user quotas across a fleet of gateways.
Real-World Examples
How real systems implement this in production
Stripe enforces rate limits per API key with separate buckets for read and write operations (currently 100 read req/sec, 100 write req/sec by default). They expose remaining quota in `X-RateLimit-*` headers and use 429 with Retry-After. Their docs encourage exponential backoff and idempotency keys so retried requests are safe.
Trade-off: Stripe deliberately separates read and write buckets so a runaway data export doesn't starve checkout calls. The lesson: one limiter per resource class beats a single global limit when traffic patterns differ across endpoints.
GitHub allows 5,000 req/hour for authenticated users and 60 req/hour for anonymous IPs, with a separate Search API at 30 req/min. They implement primary rate limits (predictable, per-user) and secondary rate limits (heuristic, abuse detection) that fire for sustained high concurrency.
Trade-off: GitHub's two-tier system balances predictability with abuse control. The lesson: a fixed primary quota is great for normal users, but you also need a heuristic layer for behavior that looks like abuse but technically respects the quota (e.g., 5,000 sequential requests at 1.4 per second).
Cloudflare offers per-IP and per-host rate limits enforced at their edge POPs. State is replicated across data centers within seconds, so a flood from one IP is throttled globally within ~10 s. They use a sliding-window counter under the hood and let users define rules in a DSL.
Trade-off: Cloudflare accepts ~10 s of cross-POP replication lag in exchange for very low per-request latency at the edge. The lesson: at edge scale, eventual consistency for rate-limit state is the right call; perfect synchronization is too expensive.
AWS API Gateway exposes per-API-key throttling via usage plans (a steady-state rate plus a burst capacity), backed by a token bucket. Limits are enforced before the request reaches your Lambda or backend. Customers also get a per-account limit on top of per-key limits.
Trade-off: AWS's token-bucket model is great for legitimate bursty workloads (data syncs, webhook fanouts) but gives clients a 'savings account' of unused capacity that can drain quickly. The lesson: token bucket is the right algorithm when bursting is a feature, not a bug; sliding window counter is better when smoothness matters more than burst tolerance.
Quick Interview Phrases
Key terms to use in your answer
Common Interview Questions
Questions you might be asked about this topic
Client sends GET /v1/charges. CDN edge does a coarse per-IP pre-check (throws out obvious flood). Request reaches API gateway; middleware computes candidate keys (user:42, ip:1.2.3.4, apikey:sk_xxx, plus the per-route variant). For each key, gateway calls EVALSHA on Redis with the sliding-window-counter Lua script (atomic on the shard owning that key). If any returns allowed=0, gateway returns 429 with Retry-After and the strictest reset time. Otherwise gateway forwards to backend; on the response it injects X-RateLimit-* headers from the recorded state. Asynchronously, the gateway publishes a Kafka audit event (key, allowed/denied, used/limit) for analytics and abuse detection.
The log is exact: it stores every request timestamp, drops old ones, and counts. But memory is O(requests in window) per key; at 10M active keys averaging 100 req/min, that's ~1B timestamps in memory, far too expensive. The sliding window counter stores 2 integers per key (current and previous window) and computes a weighted estimate. Memory is O(1) per key, accuracy is ~99% on real-world traffic (the only error comes from non-uniform distribution within the previous window). For everything except billing-grade audit, the counter is the right trade-off.
That key lives on one Redis shard, so 5K ops/sec hit one node. Three mitigations layered: (1) sub-bucket sharding splits user:42 into 8 sub-keys (user:42:s0..s7); each request hashes to a sub-bucket and the gateway sums them on check; this spreads load 8x with a small overshoot at boundaries. (2) Local pre-check: each gateway holds an in-memory token bucket for that key with the configured limit; if local allows, gateway only consults Redis every 100 ms; cuts Redis load ~100x for hot keys with brief overshoot. (3) Adaptive raise: detect-and-alert flow that quietly raises the limit a bit while ops investigates; preserves legitimate users while alarms fire.
Two policies, configurable per route. Fail-open: gateway treats Redis errors as 'allow' and logs a warning; the API stays up at the cost of accepting brief abuse during the outage; this is the default for read-heavy endpoints (search, browse). Fail-closed: gateway treats Redis errors as 'deny' and returns 503 (not 429); used for sensitive write endpoints (login, payment) where abuse risk dominates downtime risk. We monitor Redis health independently and alert on either failure mode triggering. Long-term mitigation: Redis Cluster with replicas and automatic failover so the outage is measured in seconds, not minutes.
Layer all three. Edge (Cloudflare/Fastly): coarse per-IP throttling that catches floods before they reach our infrastructure; cheap and very fast but can only key on what the edge sees (IP, headers). Gateway (Kong/Envoy/our own): the precise enforcement layer with auth context; sees user IDs and API keys, enforces per-user/per-endpoint limits with the Lua-on-Redis script; default location for production rules. Backend in-process: a small last-resort guardrail per service (e.g., circuit-breaker style) so a misconfigured gateway can't take down the database tier. Each layer protects against a different failure mode.
Interview Tips
How to discuss this topic effectively
Lead with the algorithm comparison table. Saying 'I default to sliding window counter because it gives ~1% accuracy at O(1) memory; token bucket if bursts are desired' frames the conversation in the right vocabulary.
Always mention the boundary problem of fixed window. It's the cheapest demonstration that you understand why naive counters fail.
Bring up the atomic Lua script unprompted. Two-step GET/INCR has a race condition that lets clients squeak past the limit; an interviewer expects you to see that.
Discuss the hot key problem before being asked. A single popular API key on one Redis shard is the realistic bottleneck; sub-bucket sharding or local pre-check is the answer.
Articulate fail-open vs fail-closed and tie it to endpoint sensitivity. Saying 'reads fail-open, writes/payments fail-closed' shows you've thought about real availability trade-offs.
Common Mistakes
Pitfalls to avoid in interviews
Using a fixed window counter without acknowledging the boundary burst
Fixed window allows up to 2x the limit at the boundary (10 requests at 12:00:59 plus 10 at 12:01:00). Either upgrade to sliding window counter or explicitly call out the boundary burst as acceptable for your use case.
Implementing the check as GET then conditional INCR from the application
Two requests can both see count=9, both pass the check, and both INCR to 10 and 11, allowing one over the limit. Wrap the read-decide-write in a Redis Lua script so it executes atomically on the shard.
Forgetting to pin all keys for one user to one Redis shard
If the current-window key and previous-window key for the same user land on different shards, the Lua script can't see both atomically. Use a Redis hash tag like `{user:42}` so all related keys hash to the same slot.
Failing closed by default when Redis is unreachable
A Redis outage shouldn't take down the API. Default to fail-open with structured logging and alarms; switch to fail-closed only for high-risk endpoints (login, payments) where abuse risk dominates.
Designing one limiter per service replica without coordination
If 10 backend replicas each enforce a 100 req/min limit independently, the effective limit is 1000 req/min. Centralize the counters in Redis (or another shared store) and have all replicas consult it.
