Rate Limiting on the Edge with a Redis Token Bucket

Token bucket as a single Redis Lua script, evaluated atomically, deployed near the edge. The implementation, the failure modes, and what I would actually ship today.

rate-limiting

token-bucket

redis

api-design

reliability

By @antonmorgan

March 17, 2026

Updated May 20, 2026

A rate limiter is one of those components that looks like a hundred-line job until you actually ship one. The naive version (a hashmap of counters with a setInterval reset) lasts about as long as it takes for the first deployment to roll across two regions. Two hours later each region is counting only its own share of the traffic, your customers are being throttled at something other than the budget you promised (up to double it if every region enforces the full limit, half of it if you split the limit per region and their traffic lands in one), and someone is asking why staging behaves differently from production.

I have shipped rate limiters in three companies, each at a different scale, and the pattern that has held up best is a Redis-backed token bucket evaluated as a single Lua script, deployed as close to the edge as the rest of the request path allows. This article is the implementation, not the algorithm comparison. I am assuming you already know that token bucket allows bursts, sliding window log is more accurate but expensive, and fixed window has the boundary problem. The question I want to answer is: how do you actually build the Redis-backed version, and where do the production failure modes live?

Why one Lua script and not five Redis commands

The naive client-side implementation is something like:

// DON'T do this in production
const tokens = await redis.get(`rl:${key}`);
const lastRefill = await redis.get(`rl:${key}:lastRefill`);
// ... compute new tokens
await redis.set(`rl:${key}`, newTokens);
await redis.set(`rl:${key}:lastRefill`, now);

Four round trips as written, five once you add an expiry. Worse, the read-then-write is not atomic: two concurrent requests for the same key can both read tokens=10, both compute tokens=9, and both write tokens=9. Two requests were allowed but only one token was spent. You just leaked a token, and at scale you leak many.
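The lost update can be reproduced without Redis at all. The following is a minimal sketch that uses a Map as a stand-in for the server and writes the unlucky interleaving out by hand; the key name rl:demo is made up for the illustration.

```javascript
// In-memory stand-in for Redis; the interleaving is written out by hand.
const store = new Map([["rl:demo", 10]]);

// Both requests read before either one writes.
const readA = store.get("rl:demo"); // 10
const readB = store.get("rl:demo"); // 10

// Each request spends one token based on its own stale read.
store.set("rl:demo", readA - 1);
store.set("rl:demo", readB - 1);

const allowedRequests = 2;
const tokensSpent = 10 - store.get("rl:demo");
console.log({ allowedRequests, tokensSpent }); // { allowedRequests: 2, tokensSpent: 1 }
```

Two requests went through and the bucket only shrank by one, which is exactly the gap an atomic server-side update closes.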

The fix is to evaluate the entire bucket update as a single Redis command. Redis supports this via Lua scripting (EVAL / EVALSHA). The script runs server-side, atomically, in a single network round trip.

A token bucket Lua script that has worked for me:

-- KEYS[1] = bucket key (e.g. "rl:user:123")
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill_rate (tokens per second)
-- ARGV[3] = now (ms since epoch, from caller)
-- ARGV[4] = cost (tokens this request consumes)

local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(data[1])
local last_refill = tonumber(data[2])

if tokens == nil then
    tokens = capacity
    last_refill = now
end

-- Refill based on elapsed time
local elapsed_ms = math.max(0, now - last_refill)
local refilled = elapsed_ms * refill_rate / 1000
tokens = math.min(capacity, tokens + refilled)
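-- Never move the timestamp backwards: a caller with a slow clock would
-- otherwise let the next caller refill the same interval twice.
last_refill = math.max(last_refill, now)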

local allowed = 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
end

-- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', last_refill)
redis.call('PEXPIRE', KEYS[1], 60000)  -- 1 minute idle TTL

return { allowed, tokens }

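The script returns { allowed (0|1), tokens_remaining }. Redis converts Lua numbers in a reply to integers, so a fractional tokens_remaining arrives truncated; return tostring(tokens) instead if the caller needs the fraction. The caller decides what to do with allowed = 0 (return 429, queue the request, drop it silently, etc).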
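For unit tests it can help to have the same refill-and-spend arithmetic available outside Redis. The following is a sketch of a pure JavaScript model, not a replacement for the script; the function name takeTokens and the shape of its arguments are assumptions made for the illustration. As a defensive choice, this sketch never lets lastRefill move backwards when now regresses.

```javascript
// Pure JavaScript model of the bucket arithmetic, for unit tests only.
// `state` is null for a key that does not exist yet.
function takeTokens(state, { capacity, refillRate, now, cost }) {
  let tokens = state ? state.tokens : capacity;
  let lastRefill = state ? state.lastRefill : now;

  // Refill for the elapsed time, never for a negative interval.
  const elapsedMs = Math.max(0, now - lastRefill);
  tokens = Math.min(capacity, tokens + (elapsedMs * refillRate) / 1000);
  lastRefill = Math.max(lastRefill, now);

  const allowed = tokens >= cost;
  if (allowed) tokens -= cost;
  return { allowed, state: { tokens, lastRefill } };
}

// Empty bucket, 5 tokens/second, 400 ms later: 2 tokens refilled, 1 spent.
const params = { capacity: 10, refillRate: 5, cost: 1 };
const result = takeTokens({ tokens: 0, lastRefill: 1000 }, { ...params, now: 1400 });
console.log(result.allowed, result.state.tokens); // true 1
```

The worked numbers are the useful part: 400 ms at 5 tokens per second refills 400 * 5 / 1000 = 2 tokens, and spending one leaves 1.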

Three details worth calling out:

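The caller passes now in. Reading the clock inside the script via redis.call('TIME') would give every caller the same clock, but on older Redis versions it breaks replication: scripts were replicated verbatim, so a script that read non-deterministic data and then wrote could not be replayed reliably on a replica. Since Redis 5, scripts replicate their effects by default (Redis 3.2 and 4 need redis.replicate_commands() first), so TIME is safe there. Passing now from the caller keeps the script deterministic on any version, at the cost of trusting the callers' clocks.
TTL on the key. Without the PEXPIRE, you accumulate one hash per key that was ever rate limited, forever. The TTL ensures idle keys expire. Pick a TTL longer than the time your bucket takes to refill from empty to full, that is capacity / refill_rate seconds; an expired key then restarts at full capacity, which is where it would have been anyway (in this example, 1 minute is enough for typical buckets; tune for yours).
The script is loaded once via SCRIPT LOAD, then called via EVALSHA. This avoids resending the script body on every call. Most client libraries handle this caching for you, typically by retrying with EVAL or reloading the script when the server answers NOSCRIPT because the cached SHA is missing.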
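If your client does not do the SHA caching for you, the load-then-EVALSHA flow can be sketched by hand. This assumes an ioredis-style client exposing script("LOAD", source) and evalsha(sha, numKeys, ...keysAndArgs), and an error whose message starts with NOSCRIPT when the server no longer knows the SHA; createBucketCaller is a hypothetical name, and you should check your own client's method names before copying.

```javascript
// Assumes an ioredis-style client: script("LOAD", src) and evalsha(sha, numKeys, ...),
// plus errors whose message starts with "NOSCRIPT" when the SHA is unknown.
function createBucketCaller(client, scriptSource) {
  let sha = null;

  return async function callBucket(key, capacity, refillRate, cost) {
    // One key, then ARGV in the order the script expects.
    const args = [1, key, capacity, refillRate, Date.now(), cost];
    if (sha === null) sha = await client.script("LOAD", scriptSource);
    try {
      return await client.evalsha(sha, ...args);
    } catch (err) {
      if (!String(err.message).startsWith("NOSCRIPT")) throw err;
      // The server lost its script cache (restart, SCRIPT FLUSH): reload once and retry.
      sha = await client.script("LOAD", scriptSource);
      return client.evalsha(sha, ...args);
    }
  };
}
```

The retry happens at most once per call, so a server that keeps rejecting the script surfaces as an error instead of a loop.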

What "the edge" actually means here

"Edge" is a marketing word with at least three engineering meanings. For rate limiting I find it useful to be specific.

Three layers a rate limiter can sit at
  1. CDN edge / edge worker  (Cloudflare Workers, Vercel Edge, Fastly Compute)
       - Closest to the client, lowest latency
       - State: Workers KV, Durable Objects, regional Redis
  2. API gateway              (Kong, Tyk, AWS API Gateway, Envoy filters)
       - In your infra, before your services
       - State: shared Redis cluster
  3. Application middleware   (in your Node/Go/Python service)
       - In-process or with shared Redis
       - State: process-local cache, Redis, or both

The latency win of running at layer 1 is real. A 429 returned 20ms from the user is much less expensive than one returned 200ms after you have already touched five services. The cost is that layer 1 sees less context: it might know the user id from a cookie, but it usually does not know the resource id, which limits how granular your rate limit can be.

My pragmatic default: a coarse rate limit at the edge (e.g., 1000 requests per minute per IP, 100 per minute per user) plus a per-resource rate limit at the application layer (e.g., 10 mutations per minute per document). The edge layer absorbs the obvious abuse; the app layer enforces the business rules.
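Limits phrased as "N requests per window" have to be translated into the capacity and refill_rate arguments the Lua script takes. A small sketch of that conversion follows; toBucketParams, the limit names, and the burst value of 3 are all assumptions made for illustration.

```javascript
// Converts "N requests per window" into token bucket arguments.
// `burst` defaults to the full window allowance; that default is an assumption.
function toBucketParams({ requests, windowSeconds, burst = requests }) {
  return { capacity: burst, refillRate: requests / windowSeconds };
}

// Hypothetical limit table for the two layers.
const limits = {
  edgePerIp: toBucketParams({ requests: 1000, windowSeconds: 60 }),
  appPerDocument: toBucketParams({ requests: 10, windowSeconds: 60, burst: 3 }),
};
console.log(limits.edgePerIp.refillRate.toFixed(2)); // 16.67
```

One thing to keep in mind when choosing burst: a bucket that starts full with capacity equal to the window allowance lets a client spend close to twice that allowance in its first window (the initial tokens plus the refill), so a smaller burst is the safer choice where the limit is meant as a cap.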

Single-region Redis vs multi-region: the consistency knob

For a single-region deployment, a single Redis cluster (with replicas for HA, but reads going to the primary so the bucket state is consistent) is fine. The Lua script runs on the primary, the data is correct, the latency is low.

Multi-region is harder. Three approaches I have used:

Replicate the bucket state across regions, with eventual consistency. Each region has its own Redis. Writes happen locally; replication is async (CRDT-based replication such as Redis Enterprise's Active-Active, or a custom replication log). This is the lowest-latency option but the bucket can over-spend during replication lag. Acceptable for "soft" rate limits where the goal is fairness, not a precise cap.

Centralize the bucket in one region, accept the cross-region round trip. Each edge PoP talks to a single regional Redis. Latency is dominated by the round trip to that region. For a US-East user with a US-East Redis, this is roughly 1ms; for an APAC user, it is roughly 150ms. Accurate, but the latency hit is real.

Sticky-region rate limits. Hash the rate-limit key (user id, IP) to a region, and route those keys to that region's Redis. Each region holds a slice of the keyspace. Cross-region traffic only happens when a user sends to a region that does not own their key, which is rare if you route by user. This is what I have ended up with twice; it gets you the latency of local Redis with the consistency of single-region.

Sticky-region rate limit
  hash(user_id) % regions == 0  -> route rate limit eval to us-east-1
  hash(user_id) % regions == 1  -> route to eu-west-1
  hash(user_id) % regions == 2  -> route to ap-northeast-1

The trade-off is that a user gets pinned to a region for rate-limit purposes, which can hurt latency for users far from their pinned region. In practice the rate-limit eval is a small fraction of total request latency, so the impact is bounded.
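The key-to-region mapping in the diagram can be sketched in a few lines. The region names come from the diagram; the choice of FNV-1a as the hash and the function names are assumptions of this sketch, and any stable hash that every edge node computes identically would do.

```javascript
// Region names are the ones from the diagram; the hash choice (FNV-1a) is an assumption.
const REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"];

// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Returns the region whose Redis owns the bucket for this key.
function ownerRegion(rateLimitKey, regions = REGIONS) {
  return regions[fnv1a(rateLimitKey) % regions.length];
}

console.log(ownerRegion("user:123"));
```

Plain modulo hashing reassigns most keys when a region is added or removed. If the region list is expected to change, a consistent-hashing scheme is probably worth the extra code.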

What to return when you reject

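A rejected request should return 429 Too Many Requests with informative headers. The status code is defined in RFC 6585 and Retry-After in RFC 9110; the RateLimit-* fields come from an IETF draft that is still being revised, so check the field names against the draft revision you target. A typical response: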

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30

{
    "error": "rate_limit_exceeded",
    "message": "Too many requests. Retry in 30 seconds.",
    "retryAfter": 30
}

The Retry-After header tells well-behaved clients (and SDKs) how long to wait before retrying. Many HTTP clients honor it automatically. The RateLimit-* headers (the IETF draft is gradually being adopted) tell clients the current state without forcing them to wait for a 429.

I include the limit headers on every response, not just 429s. Clients that watch them can self-throttle; clients that ignore them get the 429 when they cross the line. Both behaviors are valid; both are supported.
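Deriving the header values from bucket state is a small calculation. In the sketch below, rateLimitHeaders is a hypothetical helper, and RateLimit-Reset mirrors Retry-After as in the example response, so it is 0 on responses that were allowed. Whether that or "seconds until the bucket is full" is the better reading for a token bucket is a judgment call.

```javascript
// Builds response headers from bucket state. Header names follow the example response.
// refillRate is in tokens per second; tokens may be fractional.
function rateLimitHeaders({ capacity, refillRate, tokens, cost = 1 }) {
  const missing = Math.max(0, cost - tokens);
  // Seconds until the bucket holds enough tokens for one request of this cost.
  const retryAfter = Math.ceil(missing / refillRate);
  const headers = {
    "RateLimit-Limit": String(capacity),
    "RateLimit-Remaining": String(Math.floor(tokens)),
    "RateLimit-Reset": String(retryAfter),
  };
  // Retry-After only makes sense when the caller actually has to wait.
  if (retryAfter > 0) headers["Retry-After"] = String(retryAfter);
  return headers;
}

console.log(rateLimitHeaders({ capacity: 100, refillRate: 2, tokens: 0.5 }));
```

Rounding up matters here: telling a client to retry a fraction of a second too early just produces a second 429.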

The precision-vs-latency knob

Token bucket via Redis Lua gives you accurate-enough rate limiting at low millisecond latency per check. If you need higher precision (say, you are rate-limiting a paid API where every excess request costs you money), there are tighter algorithms (sliding window log, with one Redis sorted set per key). The cost is memory: a sliding window log stores the timestamp of every request in the window.

For most use cases, token bucket is the right answer. The bucket allows brief bursts (which most clients send), the long-run rate is bounded by the refill rate, and the memory per key is constant (two numbers: the token count and the last refill timestamp). I have only reached for sliding window log when the rate limit was a hard SLA promise to a paying customer, and even then mostly to verify token bucket was inside the SLA.
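To make the memory difference concrete, here is an in-memory sketch of a sliding window log for a single key. It is not the Redis sorted-set version, only the same idea in plain JavaScript; createSlidingWindowLog is a made-up name, and the sketch assumes now never decreases between calls.

```javascript
// In-memory sliding window log for one key: one stored timestamp per accepted request.
// Assumes `now` (ms) is non-decreasing across calls.
function createSlidingWindowLog(limit, windowMs) {
  const timestamps = [];
  return function tryAccept(now) {
    // Drop entries that have fallen out of the window.
    while (timestamps.length > 0 && timestamps[0] <= now - windowMs) timestamps.shift();
    if (timestamps.length >= limit) return false;
    timestamps.push(now);
    return true;
  };
}

const tryAccept = createSlidingWindowLog(2, 1000);
console.log(tryAccept(0), tryAccept(10), tryAccept(20), tryAccept(1001)); // true true false true
```

The array can hold up to limit entries per key, so a 1000-per-minute limit costs up to 1000 timestamps per active key, against two numbers for the token bucket.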

Failure modes I have hit

A few real production stories.

Redis unavailable. Your rate limiter's data layer is down. Three options: fail open (allow the request), fail closed (reject all requests), or use a local fallback (in-process counter that is wildly inaccurate but better than nothing). I have used fail open by default with a bypass-counter to detect the abuse case if it persists. Fail closed is the wrong default: a Redis blip becomes a customer-facing outage.
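The fail-open behaviour with a bypass counter can be sketched as a wrapper around whatever function performs the Redis check. Everything named here is an assumption: check stands in for the Redis call, onBypass is a hypothetical metrics hook, and the 50ms default timeout is only a placeholder to tune.

```javascript
// `check` is any async function that resolves to { allowed, tokens }; it is a placeholder
// for the Redis call. `onBypass` is a hypothetical metrics hook.
function failOpen(check, { timeoutMs = 50, onBypass = () => {} } = {}) {
  return async function guardedCheck(...args) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error("rate limit check timed out")), timeoutMs);
    });
    try {
      return await Promise.race([check(...args), timeout]);
    } catch (err) {
      onBypass(err); // count bypasses so a sustained outage is visible
      return { allowed: true, tokens: null, bypassed: true };
    } finally {
      clearTimeout(timer);
    }
  };
}
```

The timeout is part of failing open: a Redis that accepts connections but answers slowly is as much an outage for the request path as one that refuses them.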

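Clock skew across edge nodes. If you pass now from the edge node, and edge nodes have skewed clocks (especially across regions), the bucket state can briefly go to weird values. I have seen tokens go negative briefly because two adjacent calls used now values 200ms apart in different directions. The fix is to clamp inside the script: keep the elapsed time at zero or above and keep tokens within [0, capacity]. The script above already does this, since math.max(0, ...) keeps the elapsed time non-negative and math.min(capacity, ...) caps the refill. It is also worth making sure last_refill never moves backwards; otherwise a node with a slow clock rewinds it, and the next call from a node with a correct clock refills the same interval twice.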

Burst at deployment. A new deployment empties in-process caches; every request misses the local cache and hits Redis. If your rate limiter has a per-instance fallback, this is fine. If it does not, Redis sees a sudden 2-3x increase in evaluations. Provision Redis for 2-3x your steady-state QPS, not 1x.

Hot keys. A single user (or a single API endpoint) accounts for 30% of all rate-limit evaluations. Redis primary serves all writes for that key, and you cannot horizontally shard within a key. The fix is to shard by request id within the key (cluster-mode Redis with multiple slots per logical key) or to use a higher-throughput algorithm specifically for that key. I have used the second approach: identify the top 5 hottest keys, give them their own bucket implementation backed by an in-memory CRDT.
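The first option, splitting one logical bucket into several sub-buckets, can be sketched as a key-selection step in front of the Lua call. The key suffix format and the name pickShard are assumptions of this sketch.

```javascript
// Splits one logical bucket into `shards` sub-buckets, each with a share of the budget.
// The ":shard:N" key suffix is an assumption of this sketch.
function pickShard(key, { capacity, refillRate, shards }, random = Math.random) {
  const index = Math.floor(random() * shards);
  return {
    key: `${key}:shard:${index}`,
    capacity: capacity / shards,
    refillRate: refillRate / shards,
  };
}

// Each request evaluates the bucket script against one randomly chosen sub-bucket.
const target = pickShard("rl:endpoint:search", { capacity: 100, refillRate: 20, shards: 4 });
console.log(target.key, target.capacity, target.refillRate);
```

Because the keys carry no hash tag, Redis Cluster hashes the whole key string, so the sub-buckets usually land in different slots. The price is precision: a request can be rejected by an empty sub-bucket while its siblings still hold tokens, so the limit becomes approximate.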

What I would build today

Rate limiter I would build today
  Layer 1 (edge):
    Cloudflare Workers + Workers KV / Durable Objects
    Coarse limits: per-IP, per-user, per-API-key
    Logic: token bucket via Durable Object (transactional, single-instance per key)
  Layer 2 (application):
    Node/Go service with shared Redis (cluster mode in production)
    Fine limits: per-resource, per-action
    Logic: token bucket via Lua script (the one above)
    Fail mode: fail open with alerts after 30s of Redis unavailability
  Observability:
    Prometheus counter for accepts, rejects, by limit name
    Trace span tagged with rate_limit_check.allowed and tokens_remaining
    Slack alert if reject rate > 1% sustained for 5 minutes

Pretty boring, in the good way. Each layer does one thing. The edge handles the high-volume coarse stuff. The app layer handles the business-logic-aware stuff. Redis is shared and replicated. The Lua script is the only place the bucket math lives.

The detail that took me longest to internalize

The detail I want to emphasize, because I think it is the one that catches most teams: a rate limiter has to be cheaper than the thing it is protecting. If your rate limit check costs 50ms, and the API it protects costs 30ms, you have a problem. The check should be 1-2ms in the worst case. That is why Redis Lua works (single round trip, server-side execution) and why "check the database for usage stats" does not (multi-table joins, contention with real workload).

Rate limiting in 2026 is mostly a solved problem if you accept the constraints: token bucket as the algorithm, Redis Lua as the implementation, regional sharding as the multi-region story, IETF headers as the wire format. Where teams still trip up is on the edges of those constraints (multi-region, hot keys, fallback when Redis is down) and on the integration (which layer of the stack runs the limiter, and what context is available there). The honest implementation answer for most teams is: copy the Lua script in this article into your codebase, deploy it behind your existing Redis, and spend the saved engineering time on the harder problems that are unique to your product.
