System Design Article

Design a Rate Limiter

Difficulty: Medium

Design a distributed rate limiter that protects an API platform from abuse and uneven load while staying fast and accurate at 1B requests per day. The interview centerpiece is choosing among the five canonical algorithms (fixed window, sliding window log, sliding window counter, token bucket, leaky bucket) and explaining how to make the chosen one atomic across a Redis cluster. We cover where to place the limiter (edge, gateway, in-process), per-IP vs per-user vs per-API-key keys, returning 429 with Retry-After, the hot key problem, and fail-open vs fail-closed under cache outages.

Design a Rate Limiter

System Design

Medium

design-rate-limiter

case-study

ecommerce-marketplace

rate-limiter

token-bucket

leaky-bucket

sliding-window

fixed-window

lua-script

throttling

redis

api-gateway

system-design

intermediate

free

737 views

Requirements

Functional Requirements

Reject excess requests: when a client exceeds the configured limit, return HTTP 429 with a Retry-After header.
Multiple key dimensions: limit per IP, per authenticated user, per API key, and per endpoint. The same request often hits several limits in sequence (a 100 req/min IP limit AND a 10 req/sec endpoint limit).
Configurable rules: ops team should set rules per route without redeploying. For example, POST /payments allows 5 req/sec per user; GET /search allows 100 req/sec per user.
Distributed: the same client is rate-limited consistently no matter which API server handles the request.
Inform the client: response headers expose remaining quota (X-RateLimit-Remaining, X-RateLimit-Reset) so well-behaved clients can self-throttle.

Out of Scope (state explicitly)

Quota billing or paid overages (separate metering system).
DDoS protection at L3/L4 (handled by upstream WAF / scrubbing center).
Bot detection beyond simple counts (separate ML service).

Non-Functional Requirements

Low latency overhead: the limiter adds < 5 ms p99 to a request. If it adds 50 ms, the cure is worse than the disease.
High accuracy: error within ~1% of the configured limit. Off by 50% would let abusers through or lock out legitimate users.
Highly available: 99.99%. If the rate-limit store dies, the API must keep serving (fail-open) rather than block all traffic.
Scale: 1B requests/day across 50M users; ~12K req/sec average, ~50K req/sec peak.
Cheap to operate: the rate limiter cannot cost more than a small fraction of API hosting.

Back-of-the-Envelope Estimation

Traffic

Text

---------- Traffic estimation ----------
Daily requests:              1B
Average req/sec:             ~12K
Peak req/sec (3x avg):       ~36K
Unique users / day:          50M
Active API keys:             500K
Unique IPs / day:            ~80M (NAT, mobile carriers inflate this)

Memory for the Counter Store

With sliding window counter, each (key, window) pair stores a single integer.

Text

---------- Counter memory ----------
Unique active keys per minute window: ~10M (users + IPs + API-keys + endpoints)
Per key entry: 64 bytes (key string + counter + TTL + Redis overhead)
Total per window: ~640 MB
Keep last 2 windows for the sliding calc: ~1.3 GB
Fits in a single Redis node easily; we shard for HA, not capacity.

Network and Throughput

Text

---------- Redis throughput ----------
50K req/sec peak * 1 round trip per request: 50K Redis ops/sec
A single Redis instance handles ~100K ops/sec on commodity hardware.
We still shard across 3-6 nodes for fault tolerance, not for throughput.

Hot Key Risk

Text

---------- Hot key estimation ----------
A single popular API key (Stripe webhooks, GitHub Actions org) can drive 5K req/sec.
That's 10% of total Redis throughput from one key, which lands on a single shard.
Mitigation: split hot keys across multiple counters; reconcile in aggregate.

High-Level Design

Text

---------- High-level architecture ----------
   +-----------+
   |  Client   |
   +-----------+
        |
        v
   +-------------------+
   |  CDN / Edge       |  (L7; cheap pre-check on IP)
   +-------------------+
        |
        v
   +-------------------+
   |  API Gateway      |  (rate-limit middleware)
   +-------------------+
        |               \
        |                \---> +--------------------+
        |                      |  Redis Cluster     |  (counters, atomic Lua)
        |                      +--------------------+
        |                              ^
        |                              |
        |                      +--------------------+
        |                      |  Rules Service     |  (per-route limit config)
        v                      +--------------------+
   +-------------------+
   |  Backend Services |
   +-------------------+

Request Flow

CDN edge does a coarse per-IP pre-check (reject obvious floods cheaply).
API gateway middleware computes the candidate keys: ip:1.2.3.4, user:42, apikey:sk_live_xxx:GET:/v1/charges.
For each key, run an atomic Lua script on the Redis shard owning that key. The script returns (allowed, remaining, reset_ms).
If any key denies, respond 429 Too Many Requests with headers and the strictest Retry-After.
Otherwise forward to the backend; on response add X-RateLimit-* headers from the recorded state.

API Surface

Jsonc

// Internal RPC the gateway calls per key
POST /ratelimit/check
{
    "key": "user:42:GET:/v1/charges",
    "limit": 100,
    "window_ms": 60000,
    "weight": 1
}

// Response
{
    "allowed": true,
    "remaining": 73,
    "reset_ms": 41200,
    "retry_after_s": 0
}

Jsonc

// 429 response to client
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714142400

{
    "error": "rate_limit_exceeded",
    "message": "Too many requests for user:42 on /v1/charges. Retry in 12s."
}

Detailed Design

The two interesting components are the algorithm choice and the atomic Redis Lua script.

The Five Canonical Algorithms

Algorithm	Memory per key	Burst handling	Accuracy	Notes
Fixed Window Counter	1 int	Allows 2x burst at boundaries	Low	Simple but unsafe; clients learn to abuse the boundary
Sliding Window Log	O(N) where N = requests in window	Exact	Highest	Memory blows up for chatty clients
Sliding Window Counter	2 ints	Smooth	~1% error	The pragmatic default
Token Bucket	2 floats (tokens, last_refill)	Allows controlled bursts	High	Best when bursts are desired (clients pre-buying capacity)
Leaky Bucket	1 queue	Smoothed output, no burst	High	Best when downstream needs uniform input rate

Fixed Window Counter

Count requests per fixed clock window (e.g., per calendar minute). Reset at the boundary.

Text

---------- The boundary problem ----------
Limit: 10 req/min
Window [12:00:00, 12:01:00): user sends 10 requests at 12:00:59
Window [12:01:00, 12:02:00): user sends 10 requests at 12:01:00
=> 20 requests in 2 seconds, 2x the limit, all allowed.

Good enough only when a 2x boundary burst is acceptable.

Sliding Window Counter (recommended default)

Keep counters for the current window AND the previous window. Estimate the count over the rolling window by weighting the previous window's count by the fraction that overlaps.

Text

---------- Sliding window counter math ----------
Now:               12:00:42
Window size:       60 s
Previous window:   [11:59:00, 12:00:00) had 80 requests
Current window:    [12:00:00, 12:01:00) has 10 requests so far
Overlap fraction:  18/60 = 0.30 of previous window is still in the rolling 60s
Estimated count:   80 * 0.30 + 10 = 34
Allow if 34 < limit.

Memory cost: 2 integers per key. Accuracy error: bounded by traffic uniformity within the previous window. Empirically within ~1% on real traffic.

Token Bucket

Each key has a bucket of capacity tokens. Tokens refill at refill_rate per second. Each request consumes one token; if the bucket is empty, reject.

Text

---------- Token bucket state ----------
tokens          float    (current count)
last_refill_ms  int64    (ms since epoch of last refill calc)

On check (now_ms, weight):
    elapsed = now_ms - last_refill_ms
    tokens = min(capacity, tokens + elapsed * refill_rate / 1000)
    if tokens >= weight:
        tokens -= weight
        last_refill_ms = now_ms
        return ALLOW
    else:
        return DENY (retry_after = (weight - tokens) / refill_rate)

Best when clients should be able to burst (e.g., a batch import that submits 50 requests then waits 10 minutes is fine).

Leaky Bucket

FIFO queue of fixed size; processed at a constant leak_rate. Requests join the queue if there's space, otherwise are dropped. Output rate is constant regardless of input bursts.

Useful when the downstream service needs a smoothed input (e.g., calling a third-party API with a hard rate limit). Less common in pure user-facing rate limiting.

Sliding Window Log

Keep a sorted set of timestamps for each key. On each check, drop entries older than window and check the count.

Text

---------- Sliding window log cost ----------
Client sends 1000 req/min. Sorted set holds 1000 timestamps per key.
At 10M active keys, log mode would need ~10B timestamps in memory.
Avoid for high-volume APIs.

Great for audit-level accuracy but memory is prohibitive at scale.

Atomic Redis Lua Script for Sliding Window Counter

The naive client-side approach (GET counter, compare, INCR) has a race condition: two requests can both see 9 < 10 and both INCR to 10 and 11, allowing one over the limit. Solve with an atomic Lua script that runs server-side as a single Redis operation.

Text

---------- Lua script (rate_limit.lua) ----------
KEYS[1] = current bucket key  e.g. user:42:1714142400 (epoch minute)
KEYS[2] = previous bucket key e.g. user:42:1714142340
ARGV[1] = limit               (e.g. 100)
ARGV[2] = window_ms           (60000)
ARGV[3] = elapsed_in_current_window_ms (e.g. 18000)
ARGV[4] = weight              (1)
ARGV[5] = ttl_seconds         (120)

local cur  = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local frac = 1.0 - (tonumber(ARGV[3]) / tonumber(ARGV[2]))
local estimated = math.floor(prev * frac + cur)
if estimated + tonumber(ARGV[4]) > tonumber(ARGV[1]) then
    return {0, math.floor(estimated), tonumber(ARGV[2]) - tonumber(ARGV[3])}
end
redis.call('INCRBY', KEYS[1], tonumber(ARGV[4]))
redis.call('EXPIRE', KEYS[1], tonumber(ARGV[5]))
return {1, math.floor(estimated + tonumber(ARGV[4])), tonumber(ARGV[2]) - tonumber(ARGV[3])}

The script runs on the Redis shard owning that key (the key is the shard hash input), so it stays atomic for that key even in a clustered deployment.

Calling the Limiter from the Gateway

JavaScript

Python

import Redis from 'ioredis';
const redis = new Redis.Cluster([{ host: 'redis', port: 6379 }]);
const SCRIPT_SHA = await redis.script('LOAD', RATE_LIMIT_LUA);

async function checkLimit(key, limit, windowMs, weight = 1) {
    const nowMs = Date.now();
    const window = Math.floor(nowMs / windowMs);
    const elapsed = nowMs - window * windowMs;
    const curKey = `${key}:${window}`;
    const prevKey = `${key}:${window - 1}`;
    const ttl = Math.ceil(windowMs / 1000) * 2;
    const [allowed, used, resetMs] = await redis.evalsha(
        SCRIPT_SHA, 2, curKey, prevKey, limit, windowMs, elapsed, weight, ttl
    );
    return { allowed: allowed === 1, used, resetMs };
}

Where to Run the Limiter

Placement	Pros	Cons	Use when
CDN / Edge (Cloudflare, Fastly)	Cheap, blocks before reaching origin	Coarse keys (IP only); hard to share state across edges	Coarse abuse / DDoS-adjacent floods
API Gateway (Kong, Envoy, custom)	Sees auth context; can compose multiple limits	Adds 1-3 ms RPC to Redis	Default for production APIs
In-process middleware on each backend	No extra hop	Local state diverges across replicas	Single-instance services or last-resort guardrail

In practice we layer them: a coarse edge limit catches obvious floods; the gateway enforces precise per-user/per-endpoint limits; backends keep a small in-process safety net.

Hot Key Mitigation

When one API key drives 5K req/sec on the same Redis shard:

Sub-bucket sharding: split user:42 into user:42:s0, user:42:s1, ..., user:42:s7. Hash request to a sub-bucket. Sum sub-bucket counts when checking. Spreads load across 8 shards at the cost of slight overshoot at the boundary.
Local pre-check: the gateway holds an in-memory token bucket per key with the same limit; if the local bucket allows, decrement local AND only consult Redis every 100 ms. Reduces Redis load 100x for hot keys at the cost of slight overshoot during the bucket's local interval.
Adaptive limits: on detecting a hot key, raise its limit slightly while alerting ops; the tail of legitimate traffic gets through while alarms fire.

Data Model

Redis Keys (counter store)

Text

---------- Redis key structure ----------
ratelimit:{user:42}:GET:/v1/charges:1714142400  -> integer counter, TTL 120 s
ratelimit:{ip:1.2.3.4}:1714142400               -> integer counter, TTL 120 s
ratelimit:{apikey:sk_live_xxx}:1714142400       -> integer counter, TTL 120 s

The {user:42} hash tag forces all windows for the same key onto the same Redis shard, which is required for the atomic Lua script.

Postgres (rules table)

SQL

CREATE TABLE rate_limit_rules (
    id           BIGSERIAL PRIMARY KEY,
    scope        VARCHAR(32) NOT NULL,        -- 'user', 'ip', 'apikey'
    route        VARCHAR(255),                -- nullable: applies to all routes if null
    method       VARCHAR(8),
    limit_count  INT NOT NULL,
    window_ms    INT NOT NULL,
    weight_field VARCHAR(64),                 -- request size, etc.
    is_active    BOOLEAN DEFAULT TRUE,
    created_at   TIMESTAMPTZ NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_rules_lookup ON rate_limit_rules (scope, route, method) WHERE is_active;

Rules live in Postgres for durability; the gateway caches the rule set in memory and reloads every 30 seconds.

Audit Log (Kafka topic)

Every rate-limit decision (especially denials) is published to a Kafka topic for analytics, abuse detection, and later replay.

Text

---------- Kafka audit event ----------
{
  "ts": 1714142400123,
  "key": "user:42:GET:/v1/charges",
  "limit": 100,
  "used": 101,
  "allowed": false,
  "client_ip": "1.2.3.4",
  "user_agent": "...",
  "trace_id": "abc123"
}

Scaling and Bottlenecks

From 50K req/sec to 500K req/sec

Shard Redis by key hash. Each key is pinned to one shard via the hash tag, so the atomic script still works.
Local pre-check for hot keys: amplifies effective Redis throughput 100x for the keys that need it.
Pipeline multiple key checks: if a request triggers 4 limit checks, send them as one Redis pipeline call rather than 4 round trips. Cuts gateway-to-Redis latency 4x.

When Redis is unavailable

The API must keep serving. Two policies:

Fail-open (default for most products): on Redis timeout, allow the request and log a warning. Risk: brief abuse window during the outage.
Fail-closed: deny the request. Use only for highly sensitive endpoints (login, payment) where abuse risk dominates downtime risk.

A gateway can have per-route fail-mode: search endpoints fail-open, payment endpoints fail-closed.

Clock Skew

The Lua script reads now_ms from the gateway and the script's window is computed gateway-side. If gateway clocks drift more than a few seconds, the same window number is used inconsistently. Mitigations: NTP-sync all gateways; use Redis's TIME command to get a single authoritative clock per shard.

Multi-Region Limits

A user from Europe shouldn't get a separate quota from a user from the US. Two options:

Global limit, regional Redis with replication: each region has a Redis cluster; updates replicate cross-region asynchronously. Tolerates ~1-2% over-limit during replication lag, very low latency per request.
Local-then-global: local Redis for the fast path; a slower aggregator reconciles totals every minute and pushes adjusted limits.

Most production rate limiters live with eventual consistency; exact global limits are rarely worth the latency cost.

Cost Profile

A 6-node Redis cluster on r6g.large instances handles 500K req/sec with headroom. At spot pricing that's ~$300/month. The gateway middleware adds CPU but no separate hosting cost. Total cost is a small fraction of the API tier itself.

Trade-offs and Alternatives

Sliding Window Counter vs Token Bucket

They solve different problems. Sliding window enforces 'no more than N requests per minute, smoothly distributed.' Token bucket enforces 'no more than N requests per minute, but bursts up to capacity are fine.' Pick token bucket when clients legitimately burst (batch jobs, webhooks fanning out). Pick sliding window when fairness across the window matters more.

Exact (Sliding Window Log) vs Approximate (Sliding Window Counter)

The log is exact but costs O(requests in window) memory per key. At 10M active keys with 100 req/min average, that's 1B entries in memory: not viable. The counter is ~99% accurate at O(1) memory: the right trade for everything except audit-grade billing meters.

Distributed Redis vs In-Process Only

In-process is faster (no network hop) but each replica enforces independently, so a 100 req/min limit becomes a 100*N req/min limit with N replicas. Use in-process only for single-instance services or as a defense-in-depth backup behind a real distributed limiter.

Fail-Open vs Fail-Closed

Fail-open keeps the API running during a Redis outage, accepting brief abuse. Fail-closed protects against abuse, accepting downtime. Make the default fail-open for read endpoints and fail-closed for write/payment endpoints. Configure per route.

Why Lua Script and Not Redis MULTI/EXEC?

MULTI/EXEC is atomic across multiple commands but doesn't allow conditional logic ('only INCR if count < limit'). Lua scripts give you that conditional atomicity in one round trip, which is exactly what rate limiting needs.

Why Not Use a Specialized Service Like Envoy's Global Rate Limit?

Envoy's gRPC rate-limit service is excellent and removes the need to build your own. The trade-off is extensibility: a custom Lua-on-Redis solution is easy to extend with custom logic (e.g., 'free users get half the quota during peak hours'). At small scale, just use Envoy. At scale with custom rules, build the gateway middleware.

Why Not Just Use NGINX `limit_req_zone`?

NGINX has built-in per-IP rate limiting using a leaky bucket. It works for one NGINX instance; coordinating across N instances is awkward (each enforces independently). Good for edge IP throttling; insufficient for per-user quotas across a fleet of gateways.

Real-World Examples

How real systems implement this in production

Stripe API

Stripe enforces rate limits per API key with separate buckets for read and write operations (currently 100 read req/sec, 100 write req/sec by default). They expose remaining quota in `X-RateLimit-*` headers and use 429 with Retry-After. Their docs encourage exponential backoff and idempotency keys so retried requests are safe.

Trade-off: Stripe deliberately separates read and write buckets so a runaway data export doesn't starve checkout calls. The lesson: one limiter per resource class beats a single global limit when traffic patterns differ across endpoints.

GitHub REST API

GitHub allows 5,000 req/hour for authenticated users and 60 req/hour for anonymous IPs, with a separate Search API at 30 req/min. They implement primary rate limits (predictable, per-user) and secondary rate limits (heuristic, abuse detection) that fire for sustained high concurrency.

Trade-off: GitHub's two-tier system balances predictability with abuse control. The lesson: a fixed primary quota is great for normal users, but you also need a heuristic layer for behavior that looks like abuse but technically respects the quota (e.g., 5,000 sequential requests at 1.4 per second).

Cloudflare Edge Rate Limiting

Cloudflare offers per-IP and per-host rate limits enforced at their edge POPs. State is replicated across data centers within seconds, so a flood from one IP is throttled globally within ~10 s. They use a sliding-window counter under the hood and let users define rules in a DSL.

Trade-off: Cloudflare accepts ~10 s of cross-POP replication lag in exchange for very low per-request latency at the edge. The lesson: at edge scale, eventual consistency for rate-limit state is the right call; perfect synchronization is too expensive.

AWS API Gateway with Usage Plans

AWS API Gateway exposes per-API-key throttling via usage plans (a steady-state rate plus a burst capacity), backed by a token bucket. Limits are enforced before the request reaches your Lambda or backend. Customers also get a per-account limit on top of per-key limits.

Trade-off: AWS's token-bucket model is great for legitimate bursty workloads (data syncs, webhook fanouts) but gives clients a 'savings account' of unused capacity that can drain quickly. The lesson: token bucket is the right algorithm when bursting is a feature, not a bug; sliding window counter is better when smoothness matters more than burst tolerance.

Quick Interview Phrases

Key terms to use in your answer

sliding window counter with atomic Lua script

fail-open for reads, fail-closed for writes

hot key sub-bucket sharding

edge for IP throttling, gateway for per-user limits

Retry-After header on 429 responses

local pre-check amplifies Redis throughput

Common Interview Questions

Questions you might be asked about this topic

Walk me through the lifecycle of a request hitting your rate limiter.

Client sends GET /v1/charges. CDN edge does a coarse per-IP pre-check (throws out obvious flood). Request reaches API gateway; middleware computes candidate keys (user:42, ip:1.2.3.4, apikey:sk_xxx, plus the per-route variant). For each key, gateway calls EVALSHA on Redis with the sliding-window-counter Lua script (atomic on the shard owning that key). If any returns allowed=0, gateway returns 429 with Retry-After and the strictest reset time. Otherwise gateway forwards to backend; on the response it injects X-RateLimit-* headers from the recorded state. Asynchronously, the gateway publishes a Kafka audit event (key, allowed/denied, used/limit) for analytics and abuse detection.

Why sliding window counter and not sliding window log?

How does your design handle a single API key driving 5K req/sec?

What happens if Redis goes down?

Where would you place the limiter: edge, gateway, or backend?

Interview Tips

How to discuss this topic effectively

Lead with the algorithm comparison table. Saying 'I default to sliding window counter because it gives ~1% accuracy at O(1) memory; token bucket if bursts are desired' frames the conversation in the right vocabulary.

Always mention the boundary problem of fixed window. It's the cheapest demonstration that you understand why naive counters fail.

Bring up the atomic Lua script unprompted. Two-step GET/INCR has a race condition that lets clients squeak past the limit; an interviewer expects you to see that.

Discuss the hot key problem before being asked. A single popular API key on one Redis shard is the realistic bottleneck; sub-bucket sharding or local pre-check is the answer.

Articulate fail-open vs fail-closed and tie it to endpoint sensitivity. Saying 'reads fail-open, writes/payments fail-closed' shows you've thought about real availability trade-offs.

Common Mistakes

Pitfalls to avoid in interviews

Using a fixed window counter without acknowledging the boundary burst

Fixed window allows up to 2x the limit at the boundary (10 requests at 12:00:59 plus 10 at 12:01:00). Either upgrade to sliding window counter or explicitly call out the boundary burst as acceptable for your use case.

Implementing the check as GET then conditional INCR from the application

Two requests can both see count=9, both pass the check, and both INCR to 10 and 11, allowing one over the limit. Wrap the read-decide-write in a Redis Lua script so it executes atomically on the shard.

Forgetting to pin all keys for one user to one Redis shard

If the current-window key and previous-window key for the same user land on different shards, the Lua script can't see both atomically. Use a Redis hash tag like `{user:42}` so all related keys hash to the same slot.

Failing closed by default when Redis is unreachable

A Redis outage shouldn't take down the API. Default to fail-open with structured logging and alarms; switch to fail-closed only for high-risk endpoints (login, payments) where abuse risk dominates.

Designing one limiter per service replica without coordination

If 10 backend replicas each enforce a 100 req/min limit independently, the effective limit is 1000 req/min. Centralize the counters in Redis (or another shared store) and have all replicas consult it.

Back to System Design

Design a Rate Limiter

Requirements

Functional Requirements

Out of Scope (state explicitly)

Non-Functional Requirements

Back-of-the-Envelope Estimation

Traffic

Memory for the Counter Store

Network and Throughput

Hot Key Risk

High-Level Design

Request Flow

API Surface

Detailed Design

The Five Canonical Algorithms

Fixed Window Counter

Sliding Window Counter (recommended default)

Token Bucket

Leaky Bucket

Sliding Window Log

Atomic Redis Lua Script for Sliding Window Counter

Calling the Limiter from the Gateway

Where to Run the Limiter

Hot Key Mitigation

Data Model

Redis Keys (counter store)

Postgres (rules table)

Audit Log (Kafka topic)

Scaling and Bottlenecks

From 50K req/sec to 500K req/sec

When Redis is unavailable

Clock Skew

Multi-Region Limits

Cost Profile

Trade-offs and Alternatives

Sliding Window Counter vs Token Bucket

Exact (Sliding Window Log) vs Approximate (Sliding Window Counter)

Distributed Redis vs In-Process Only

Fail-Open vs Fail-Closed

Why Lua Script and Not Redis MULTI/EXEC?

Why Not Use a Specialized Service Like Envoy's Global Rate Limit?

Why Not Just Use NGINX limit_req_zone?

Real-World Examples

Quick Interview Phrases

Common Interview Questions

Interview Tips

Common Mistakes

Why Not Just Use NGINX `limit_req_zone`?