Community Article

Rate Limiting: Token Bucket vs Sliding Window

Token bucket is the right default. Sliding window log is correct but expensive. Fixed window is the algorithm I would not ship.

Rate Limiting: Token Bucket vs Sliding Window

Token bucket is the right default. Sliding window log is correct but expensive. Fixed window is the algorithm I would not ship.

rate-limiting

token-bucket

sliding-window

api-design

system-design

By @adityadesai

February 11, 2026

Updated May 18, 2026

198 views

4.2 (12)

The first question I ask when a team says they want to add rate limiting is not "how many requests per second?" It is "what burst do you want to allow?" Those questions feel similar but they are not, and the difference is what makes a rate limiter work or fail in production.

A rate limit of 100 requests per second can mean five different things depending on the algorithm. Can a client send 100 requests in one millisecond and then nothing for 999 milliseconds? Can they send 50 in the first second and 150 in the second? Can they send a steady 1.6 every 16ms? The same nominal limit produces different behavior, and the burst behavior is what users actually feel.

This article walks through the four algorithms that show up in interviews and production code (fixed window, sliding window log, sliding window counter, token bucket), names what each one does badly, and ends with the one I reach for first.

The four algorithms, briefly

Algorithm comparison
  fixed window      O(1) check, low memory, but allows 2x burst at boundary
  sliding window log    O(N) check, high memory, perfectly fair
  sliding window counter   O(1) check, low memory, approximate fairness
  token bucket      O(1) check, low memory, explicit burst control

The story underneath the table: fixed window is too coarse, sliding window log is too expensive, sliding window counter is a pragmatic approximation, and token bucket explicitly models burst as a first-class concept. Most production systems should default to token bucket.

Fixed window: the one to avoid

Fixed window is the simplest. Pick a window length (say, one minute), count requests within the window, reject if the count exceeds the limit. At the start of each window, reset the count.

class FixedWindow:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window_start -> count

    def allow(self, client_id: str, now: float) -> bool:
        bucket = (client_id, int(now // self.window))
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= self.limit

The bug is the boundary. If the limit is 100 requests per minute, a client can send 100 requests at second 59 of one minute and another 100 at second 0 of the next minute. That is 200 requests in two seconds, with the limit policy saying "100 per minute."

The standard fix attempt is to shorten the window (100 per minute -> 10 per six seconds). That helps but does not eliminate the boundary effect; it just makes the window smaller. The real fix is a different algorithm.

I have inherited fixed-window limiters in two production systems. Both got bug reports about "the limit doesn't actually work" within months of deployment. Both ended up replaced.

Sliding window log: correct but expensive

Sliding window log keeps a list of timestamps, one per request, within the last window. On each request, drop expired timestamps, count the rest, allow if the count is under the limit.

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # client_id -> list[timestamp]

    def allow(self, client_id: str, now: float) -> bool:
        log = self.logs.setdefault(client_id, [])
        cutoff = now - self.window
        while log and log[0] <= cutoff:
            log.pop(0)
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True

This is exact. Every request gets the right answer. The cost is memory and CPU: the log can grow as large as the limit, and the pop(0) from a Python list is O(N) (use a deque to make it O(1)). For a per-client limit of 100, you store 100 timestamps per client. For a million clients, that is 100 million timestamps. At eight bytes per timestamp that is 800 MB just for the log structure.

When I would actually use this: low-volume APIs where fairness matters more than scale (a per-tenant fairness gate where the per-tenant limit is small), or per-IP rate limits where the limit is small enough that the per-client log stays cheap.

When I would not: high-volume APIs with high per-client limits.

Sliding window counter: the pragmatic compromise

Sliding window counter tries to get most of the fairness of the log with most of the efficiency of the fixed window. The trick: weight the count from the previous window by how much of the previous window is still inside the rolling window.

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (client_id, window_start) -> count

    def allow(self, client_id: str, now: float) -> bool:
        current_start = int(now // self.window) * self.window
        previous_start = current_start - self.window
        elapsed_in_current = now - current_start
        prev_weight = max(0.0, 1.0 - elapsed_in_current / self.window)

        current = self.counts.get((client_id, current_start), 0)
        previous = self.counts.get((client_id, previous_start), 0)

        weighted = current + previous * prev_weight
        if weighted >= self.limit:
            return False

        self.counts[(client_id, current_start)] = current + 1
        return True

The intuition: ten seconds into a sixty-second window, the rolling window stretches across the last 50 seconds of the previous window plus 10 seconds of the current window. The count is the current window's count plus 50/60 of the previous window's count.

This is what most production rate limiters in CDNs and API gateways implement. It is approximate (the previous window's requests are assumed uniform, which they often are not), but the approximation is good enough that the boundary issue mostly goes away. Memory is two counters per client, regardless of the limit value.

I have used this in CDN layers and at the API gateway. It is the right balance for high-volume traffic where memory matters and exact fairness is not the goal.

Token bucket: the one I default to

Token bucket models the rate limit as a bucket of tokens. Each request costs one token. The bucket refills at a steady rate. The bucket has a maximum capacity, which controls the maximum burst.

class TokenBucket:
    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate = refill_rate  # tokens per second
        self.capacity = capacity        # max tokens
        self.state = {}                 # client_id -> (tokens, last_refill)

    def allow(self, client_id: str, now: float) -> bool:
        tokens, last_refill = self.state.get(client_id, (self.capacity, now))
        elapsed = now - last_refill
        tokens = min(self.capacity, tokens + elapsed * self.refill_rate)
        if tokens < 1:
            self.state[client_id] = (tokens, now)
            return False
        self.state[client_id] = (tokens - 1, now)
        return True

What I like:

Burst is explicit. The capacity is a separate parameter from the rate. If you want "10 requests per second sustained, allow up to 50-request bursts", you set refill_rate=10, capacity=50. There is no algorithm-level confusion about what the limit means.
Memory is constant per client. Two floats: the current token count and the last refill time. For a million clients, sixteen million bytes. Fine.
The check is O(1). No log to scan. No window arithmetic.
It composes cleanly. A request can cost more than one token (a heavy operation might cost 5 tokens). Different clients can have different bucket sizes (premium tier gets a bigger bucket). The algorithm is the same.

A subtle thing about token bucket I did not get right my first time: the bucket starts full, not empty. A new client's first request should succeed even if their refill rate is one per second. If the bucket starts at zero and refills at one per second, the first request would have to wait one second to get a token. That is hostile UX. Initialize each new bucket at full capacity so first-time use feels instant. The bucket only goes below capacity if a client is actually pushing its limit.

The Stripe-and-friends pattern uses token bucket at the API gateway, with per-API-key buckets, with the bucket state stored in Redis for cross-instance sharing. The Lua script that does the atomic check-and-decrement against a Redis hash is a standard implementation pattern; I have seen variants of it in five or six different production codebases.

Storing token bucket state in Redis

The single-instance code above is fine until you have more than one instance of your API. Two instances will each have their own bucket, doubling the effective rate. The fix is to push state into Redis (or another shared store) and use an atomic operation.

-- KEYS[1]: bucket key
-- ARGV[1]: now (seconds, float)
-- ARGV[2]: refill_rate (tokens/sec)
-- ARGV[3]: capacity
-- ARGV[4]: cost (usually 1)
local data = redis.call('HMGET', KEYS[1], 'tokens', 'last')
local tokens = tonumber(data[1]) or tonumber(ARGV[3])
local last = tonumber(data[2]) or tonumber(ARGV[1])
local elapsed = tonumber(ARGV[1]) - last
tokens = math.min(tonumber(ARGV[3]), tokens + elapsed * tonumber(ARGV[2]))
local cost = tonumber(ARGV[4])
if tokens < cost then
    redis.call('HSET', KEYS[1], 'tokens', tokens, 'last', ARGV[1])
    redis.call('EXPIRE', KEYS[1], 3600)
    return 0
end
redis.call('HSET', KEYS[1], 'tokens', tokens - cost, 'last', ARGV[1])
redis.call('EXPIRE', KEYS[1], 3600)
return 1

The Lua script runs atomically inside Redis. Multiple application instances can call it concurrently and the bucket math stays consistent. The EXPIRE is a slow-leak prevention: if a client goes silent, Redis reclaims the bucket key after an hour.

A small operational note about this Lua script: it works because Redis runs scripts atomically with respect to other commands on the same instance. With Redis Cluster, the bucket key needs to live on a single node; if you naively rotate clients across cluster nodes, the atomicity breaks. The standard fix is a hash tag in the key ({client-prefix}:bucket:42) that pins all of one client's keys to the same node. If you are running standalone Redis, this is a non-issue. If you are running cluster, plan for it.

Two more practical notes. First, the EXPIRE value should be larger than capacity / refill_rate, the time it takes a fully-drained bucket to fill back up. If EXPIRE is shorter, you reclaim the key while the client is still being throttled, and the next request gets a fresh full bucket (effectively forgiving the rate limit). One hour is generous and almost always safe. Second, the script returns 0 or 1 for reject/accept; in production I extend this to also return remaining tokens and seconds-until-next-token, which the API gateway uses to fill in X-RateLimit-Remaining and Retry-After response headers. Clients with good HTTP libraries respect those headers and self-throttle, which is the cheapest form of rate limiting there is.

A note on Redis as the rate-limit backend: it is fast (sub-millisecond round trips), it survives a restart with reasonable durability (with appendonly enabled), and most languages have decent client libraries. Network latency is the main cost; for ultra-low-latency paths, you might want a per-instance bucket plus a slower correction loop, but for almost all APIs the Redis round trip is cheaper than the actual handler.

Leaky bucket: the cousin worth knowing about

Leaky bucket is sometimes presented as a separate algorithm. Mechanically, it is token bucket with a different metaphor: requests are water dripping into a bucket, the bucket leaks at a steady rate, the bucket has a capacity. If the bucket overflows, requests are rejected.

The math is the same as token bucket; the conceptual difference is whether you think of "adding requests" or "removing tokens". I treat them as one algorithm with two narrations. Some frameworks (notably the Linux kernel's traffic shaper) use leaky bucket as a queueing model where excess requests are delayed rather than rejected. That is a useful variant, but it is a queueing decision layered on top of the same bucket math.

What "100 requests per second" actually buys you, by algorithm

Same nominal limit, different burst behavior:

Limit: 100 requests / second
  Fixed window     boundary burst: up to 200 in 2 seconds
  Sliding window log   exact: never more than 100 in any rolling second
  Sliding window counter   approximate: usually within ~10% of exact
  Token bucket (capacity=100)   sustained 100/s, with up to 100-request bursts
  Token bucket (capacity=10)    sustained 100/s, with up to 10-request bursts

The token bucket rows expose the burst dimension explicitly. If you tell me you want 100 requests per second, I have to ask: what burst is acceptable? The other algorithms hide that question behind their implementation details.

Where rate limiting goes wrong, organizationally

Three failures I have watched:

The team picks a number based on what feels reasonable rather than measuring. "100 requests per second per API key" sounds fine until a real customer's normal usage pattern hits 200 and they get throttled. Always measure first; the limit should be calibrated against observed usage plus a margin, not against gut feel.
The team ships rate limiting without a way for clients to know they have been limited. The HTTP standard says: respond with 429 Too Many Requests and include a Retry-After header. If you do not include Retry-After, the client retries immediately and the same request gets rejected again. Always include it.
The team treats rate limiting as a security feature. It is not, primarily. It protects backends from accidental overload and prevents one customer's traffic from drowning another's. It does not prevent a determined attacker; an attacker just rotates through many keys or IPs. If you are trying to stop a DDoS, rate limiting is one layer of defense, not the only one.

What I default to and why

For most APIs I would default to token bucket with a capacity that is roughly twice the sustained rate (so legitimate small bursts pass and large bursts get throttled). I would store state in Redis if the API has more than one instance. I would respond 429 with Retry-After. I would publish the limits in the API documentation so clients can implement client-side throttling as well; surprising people with a hidden limit is a bad UX. I would skip fixed window entirely, use sliding window counter only at the CDN layer where memory pressure matters, and use sliding window log only when fairness over a small limit really matters and the storage cost is acceptable. The algorithm choice is not the hard part; the calibration of the limit value, the burst capacity, and the response shape is what makes the rate limiter a useful feature instead of a customer-experience fail.

Back to Articles