Token Bucket Rate Limiting for APIs: Design, Math, and Production Patterns
A practical guide to token bucket rate limiting for APIs: concepts, math, parameters, production implementations, and tuning tips.
Why APIs Need Rate Limiting
Modern APIs must remain fast, fair, and resilient under unpredictable traffic. Rate limiting is the control valve that protects upstream services, budgets capacity across tenants, and absorbs bursts without cascading failures. Among the common approaches—fixed window, sliding window, leaky bucket—the token bucket algorithm stands out for allowing short bursts while still enforcing a long‑term average rate.
This article explains how token bucket rate limiting works, the math behind it, practical parameter choices, production‑ready implementations (in‑memory and Redis), and how to communicate limits to clients.
The Token Bucket, In One Picture
- Imagine a bucket that holds tokens. The bucket has a maximum capacity B (tokens).
- Tokens drip into the bucket at a steady rate R (tokens per second).
- Each request costs C tokens (typically 1). If the bucket has ≥ C tokens, the request is allowed and tokens are removed. If not, the request is rejected or queued.
- When the bucket is full, excess tokens overflow and are lost. This caps burst size to B.
Result: Clients may burst up to B requests at once, but over time they are limited to R requests per second on average.
Token Bucket vs. Other Algorithms
- Fixed window: Simple but suffers edge effects at window boundaries (bursts at window edges). Token bucket provides smoother enforcement and controlled bursts.
- Sliding window: More accurate than fixed windows but typically heavier to compute (needs request logs or approximations). Token bucket is O(1) and state‑light.
- Leaky bucket (queue): Smooths output to a constant rate but does not allow bursts; token bucket does.
- If you need a predictable long‑term average plus short bursts: use token bucket.
- If you need strict smoothing with no bursts: prefer leaky bucket.
- If you need precise counts per rolling window: consider sliding window or GCRA.
The Math You Actually Need
Parameters:
- B: bucket capacity (max tokens). Controls worst‑case burst you will allow.
- R: refill rate (tokens/sec). Controls sustained throughput.
- C: cost per request (tokens). Often 1, but variable costs let you price expensive endpoints higher.
Refill:
- Let t_now be current time, t_last the last time the bucket was updated.
- New tokens = R × (t_now − t_last).
- tokens = min(B, tokens + New tokens).
Decision:
- If tokens ≥ C: allow, tokens -= C, t_last = t_now.
- Else: reject (HTTP 429) or optionally delay until enough tokens accrue.
Tip: Use monotonic clocks to avoid time jumps. Store tokens as floating‑point numbers or fixed‑point integers to avoid drift.
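A quick numeric check of the refill and decision steps (values chosen purely for illustration):

```python
# Example: R = 6 tokens/sec, B = 10, C = 1.
# The bucket was last updated 0.5 s ago with 2.0 tokens left.
R, B, C = 6.0, 10.0, 1.0
tokens = 2.0
elapsed = 0.5

tokens = min(B, tokens + R * elapsed)  # refill: 2.0 + 3.0 = 5.0
allowed = tokens >= C                  # 5.0 >= 1.0 -> allow
if allowed:
    tokens -= C                        # 4.0 tokens remain
```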
Choosing Good Parameters
- Start with your SLOs: “Sustain 100 rps, allow up to 1,000 burst.” Then set R = 100, B = 1,000.
- Convert human‑readable quotas: “6,000/min with 600 burst” → R = 100/sec, B = 600.
- For high fan‑out or expensive endpoints, set C > 1 or run multiple buckets per key to isolate risk.
- Keep B reasonably sized; huge bursts can still overload downstream, even if average is capped.
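The quota conversion above is plain unit arithmetic; a tiny helper (hypothetical, for illustration only) makes the mapping explicit:

```python
def quota_to_bucket(per_minute: float, burst: float) -> tuple[float, float]:
    """Map a human quota ("N requests/min, burst M") to (R tokens/sec, B capacity)."""
    return per_minute / 60.0, burst

# "6,000/min with 600 burst" -> R = 100/sec, B = 600
quota_to_bucket(6000, 600)
```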
Single‑Process Implementation (Lazy Refill)
Lazy refill updates tokens only when a request arrives—no background timer needed.
```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, cost: float = 1.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.cost = cost
        self.tokens = capacity  # start full to allow initial burst
        self.last = time.monotonic()

    def allow(self, now: float | None = None) -> bool:
        if now is None:
            now = time.monotonic()
        # Refill
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= self.cost:
            self.tokens -= self.cost
            return True
        return False

# Example usage
bucket = TokenBucket(rate_per_sec=100, capacity=600)
if bucket.allow():
    # process request
    pass
else:
    # return 429 with Retry-After
    pass
```
Notes:
- Use time.monotonic() to avoid wall‑clock jumps.
- For variable‑cost endpoints, pass cost as a parameter to allow().
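A variable‑cost variant can be sketched as a standalone function operating on a small state dict (a hypothetical shape, separate from the class above):

```python
import time

def allow(bucket: dict, rate: float, capacity: float, cost: float = 1.0) -> bool:
    """Lazy-refill check where each call may charge a different token cost."""
    now = time.monotonic()
    elapsed = max(0.0, now - bucket["last"])
    bucket["tokens"] = min(capacity, bucket["tokens"] + elapsed * rate)
    bucket["last"] = now
    if bucket["tokens"] >= cost:
        bucket["tokens"] -= cost
        return True
    return False

# An expensive endpoint might charge 5 tokens per call
state = {"tokens": 600.0, "last": time.monotonic()}
allow(state, rate=100.0, capacity=600.0, cost=5.0)
```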
Distributed Rate Limiting With Redis (Atomic Lua)
In multi‑instance deployments, you need a shared store for per‑key limits (user ID, API key, IP). Redis with a Lua script provides atomic read‑modify‑write.
Core state per key:
- tokens (float or scaled integer)
- last_refill_ts (seconds)
Atomic script idea:
- Read tokens and last_refill_ts.
- Compute refill based on now.
- If tokens ≥ cost: decrement and allow; else deny.
- Write back tokens and last_refill_ts; set TTL to clean up idle keys.
```lua
-- KEYS[1] = key
-- ARGV[1] = now (seconds as float)
-- ARGV[2] = rate_per_sec
-- ARGV[3] = capacity
-- ARGV[4] = cost
-- ARGV[5] = ttl_seconds
local key = KEYS[1]
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local ttl = tonumber(ARGV[5])

local data = redis.call('HMGET', key, 'tokens', 'last')
local tokens = tonumber(data[1])
local last = tonumber(data[2])

if tokens == nil then
  tokens = capacity
  last = now
else
  local elapsed = math.max(0, now - last)
  tokens = math.min(capacity, tokens + elapsed * rate)
  last = now
end

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last', last)
redis.call('EXPIRE', key, ttl)
return {allowed, tokens}
```
On the application side, call EVALSHA with the user’s key. If allowed == 1, proceed; else respond 429.
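On the Python side, one way to wire this up is with redis-py (a sketch; the `sha` comes from loading the script above via SCRIPT LOAD, and the client and key names are illustrative):

```python
import time

def check_rate_limit(client, sha: str, key: str, rate: float,
                     capacity: float, cost: float = 1.0, ttl: int = 60) -> bool:
    """Invoke the bucket script atomically; True means the request may proceed."""
    allowed, _tokens = client.evalsha(sha, 1, key, time.time(), rate,
                                      capacity, cost, ttl)
    return allowed == 1

# e.g. with redis-py: r = redis.Redis(); sha = r.script_load(lua_source)
# if not check_rate_limit(r, sha, f"rl:{api_key}", 100.0, 600.0):
#     ... respond 429 ...
```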
Tips:
- Scale floats to integers (e.g., multiply by 1000) if you need deterministic precision.
- Use a short TTL (a few multiples of B/R, the time it takes an empty bucket to refill) so idle keys are evicted naturally.
- Co‑locate the limiter with the gateway to reduce latency; consider connection pooling.
Handling Bursts, Queues, and Backpressure
- Pure token bucket rejects when empty; users retry later. That’s ideal for public APIs where queuing untrusted traffic is risky.
- Internal services may optionally queue briefly: compute the delay until enough tokens accrue: delay = max(0, (C − tokens)/R). Use bounded queues and shed load when full.
- Protect downstreams: cap B to what the slowest dependency can absorb.
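The accrual delay from the formula above can be computed directly (a small helper, assuming the same R and C as before):

```python
def wait_time(tokens: float, cost: float, rate: float) -> float:
    """Seconds until `cost` tokens will have accrued; 0 if already available."""
    return max(0.0, (cost - tokens) / rate)

wait_time(2.0, 5.0, 10.0)  # with 2 tokens on hand, a 5-token request waits 0.3 s
```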
Keys, Scopes, and Fairness
- Key design: per API key, per user, per tenant, per IP, or combinations (e.g., tenant:user). Choose the dimension that matches your fairness policy.
- Hierarchical quotas: enforce per‑user and per‑tenant buckets; both must allow.
- Priority traffic: maintain higher R/B for premium plans or apply separate buckets per priority class.
- Expensive endpoints: higher C to discourage hot paths from starving others.
Communicating Limits to Clients
Use clear, consistent headers and standard status codes.
- Status: 429 Too Many Requests.
- Retry-After: seconds until next attempt, or an HTTP date.
- X-RateLimit-Limit: max allowed in the window or the configured capacity/rate descriptor.
- X-RateLimit-Remaining: tokens remaining (rounded down).
- X-RateLimit-Reset: the time (epoch seconds or a delta) at which the bucket is expected to be full again, or at which at least one token becomes available.
Example error body:
```json
{
  "error": "rate_limited",
  "message": "Quota exceeded. Retry later.",
  "retry_after": 2.4
}
```
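One way to derive the header values from bucket state (header names as above; the Retry-After math assumes a cost of 1, and the rounding choices are illustrative):

```python
import math

def rate_limit_headers(tokens: float, capacity: float, rate: float) -> dict:
    """Illustrative rate-limit header values computed from current bucket state."""
    retry_after = max(0.0, (1.0 - tokens) / rate)  # seconds until one token accrues
    return {
        "X-RateLimit-Limit": str(int(capacity)),
        "X-RateLimit-Remaining": str(math.floor(tokens)),  # rounded down
        "Retry-After": str(math.ceil(retry_after)),
    }
```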
Monitoring and Tuning
Track these metrics per key group and globally:
- Allowed/denied counts; 429 rate
- Tokens_remaining distribution (p50/p95)
- Estimated wait time (if queuing)
- Latency and error budgets of dependent services
- Hot keys (keys consuming disproportionate tokens)
Tuning playbook:
- If 429s spike: increase R or B carefully, or lower C on non‑critical endpoints.
- If downstreams overload: lower B first (limit bursts), then R if needed.
- For noisy neighbors: move to hierarchical quotas or per‑endpoint costs.
Dealing With Distributed Systems Realities
- Clock skew: use a single time source (e.g., Redis time or NTP‑disciplined clocks). Slight skew is usually tolerable due to min(B, …) clamp.
- Partial failures: fail‑closed or fail‑open? Public APIs typically fail‑closed to protect the system. Internal services may fail‑open for resiliency with circuit breakers.
- Sharding: hash keys across Redis clusters; keep all state for a specific key on one shard to avoid cross‑node races.
- Caching: for very hot keys, local shadow buckets can reduce round‑trips; reconcile with the shared store to avoid drift (use small local capacities).
Testing Your Limiter
- Unit tests: refill math, boundary conditions, float precision, cost > 1.
- Property tests: sustained rate never exceeds R over long horizons; bursts never exceed B.
- Load tests: step, spike, and soak tests. Verify downstream latency and error budgets.
- Chaos tests: inject clock skew, Redis latency, packet loss.
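The sustained-rate property can be exercised with a simulated clock (a sketch using a minimal closure-based bucket, not the production code):

```python
def make_bucket(rate: float, capacity: float):
    """Minimal lazy-refill bucket with an injectable clock for deterministic tests."""
    state = {"tokens": capacity, "last": 0.0}
    def allow(now: float, cost: float = 1.0) -> bool:
        state["tokens"] = min(capacity,
                              state["tokens"] + (now - state["last"]) * rate)
        state["last"] = now
        if state["tokens"] >= cost:
            state["tokens"] -= cost
            return True
        return False
    return allow

# Hammer the bucket at 1,000 attempts/sec for 10 simulated seconds.
rate, capacity, horizon = 100.0, 600.0, 10.0
allow = make_bucket(rate, capacity)
allowed = sum(allow(i / 1000.0) for i in range(int(horizon * 1000)))
assert allowed <= capacity + rate * horizon  # never exceeds burst + refill budget
```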
Common Pitfalls
- Using wall time instead of monotonic time → negative elapsed after NTP adjustments.
- Letting tokens go negative → unbounded debt and long recovery.
- Huge bucket sizes that pass burst stress to downstream databases.
- Missing TTLs in Redis → unbounded key growth.
- Returning vague 429s without Retry-After → inefficient client backoff.
Quick Design Checklist
- Define fairness keys and quotas per plan.
- Choose R, B, and C per endpoint or bundle.
- Implement atomic state updates (Lua, transactions) and set TTLs.
- Add clear 429 responses and rate‑limit headers.
- Instrument metrics; build dashboards and alerts.
- Load test with real traffic patterns; tune B before R.
Conclusion
The token bucket algorithm offers an elegant balance: strict average‑rate enforcement with controlled bursts. With careful parameter selection, atomic distributed storage, clear client signaling, and robust monitoring, it scales from a single process to a global API gateway while protecting performance and fairness.