How to Add Rate Limiting to a FastAPI App: Token Bucket vs. Sliding Window

Rate limiting is the seatbelt you install before the crash. Without it, a single misbehaving client — a runaway retry loop, a scraper, someone brute-forcing your login, or just one customer's batch job — can saturate your database, exhaust a metered downstream API, run up a cloud bill, and take the service down for everyone else. With it, that same flood hits a wall, gets a polite “slow down,” and your other users never notice. This is a practical, vendor-neutral guide to adding rate limiting to a FastAPI app: the two algorithms worth knowing, where to enforce them, what to key on, and the mistake that quietly defeats the whole thing the moment you run a second process.

Why Rate-Limit at All

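Four reasons, and most apps have all four: abuse (scrapers and runaway retry loops), brute-force attempts against login, runaway cost from metered downstream APIs and cloud bills, and the noisy neighbor whose batch job starves everyone else.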

The Two Algorithms That Matter

You'll see a half-dozen rate-limiting algorithms in the wild, but two cover almost every real need, and they make opposite trade-offs about bursts.

Token bucket is the burst-friendly one. Picture a bucket that holds up to capacity tokens and refills at a steady refill_rate per second. Each request spends one token; an empty bucket means rejection. Because a full bucket can be spent all at once, it permits short bursts up to the bucket size while still enforcing the average rate over time:

import time
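from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: int          # max tokens = the biggest burst you allow
    refill_rate: float     # tokens added per second = the steady-state rate
    tokens: float = field(init=False)    # set in __post_init__
    updated: float = field(init=False)   # time of the last refill

    def __post_init__(self) -> None:
        # Start full, stamped with the current monotonic time. The reference
        # point of time.monotonic() is undefined, so 0.0 is not a safe default.
        self.tokens = float(self.capacity)
        self.updated = time.monotonic()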

    def allow(self) -> bool:
        now = time.monotonic()
        # Add tokens for the time elapsed, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 10-request burst, refilling at 5 requests/second sustained
bucket = TokenBucket(capacity=10, refill_rate=5)
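
To watch the burst-then-throttle behavior without waiting on a real clock, here is a minimal sketch of the same idea with an injectable clock. The `ClockedTokenBucket` name, the `clock` parameter, the `FakeClock` helper, and the `retry_after` method are illustrative additions, not part of the class above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ClockedTokenBucket:
    capacity: int
    refill_rate: float
    clock: Callable[[], float] = time.monotonic   # hypothetical hook so tests can fake time
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = float(self.capacity)        # start with a full bucket
        self.updated = self.clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def retry_after(self) -> float:
        """Seconds until one whole token is available (0.0 if one already is)."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.refill_rate)


class FakeClock:
    """Manually advanced clock for the demo below."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = ClockedTokenBucket(capacity=10, refill_rate=5, clock=clock)
burst = [bucket.allow() for _ in range(12)]        # first 10 pass, last 2 are rejected
wait = bucket.retry_after()                        # time until the next token arrives
clock.now += 1.0                                   # one second later, 5 tokens are back
after_pause = [bucket.allow() for _ in range(6)]   # 5 pass, the 6th is rejected
```

A value like `retry_after()` is one way to fill in a Retry-After header with something better than a constant.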

Sliding window is the strict one. Instead of allowing bursts, it counts how many requests happened in the trailing window — say the last 60 seconds — and rejects anything over the cap. The simplest faithful version keeps a timestamp log and trims anything older than the window:

import time
from collections import deque

class SlidingWindow:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

Why not the naive fixed window, a counter that resets every 60 seconds on the clock? Because it allows a double-rate burst across the boundary: a client can fire the full limit at 0:59 and the full limit again at 1:00, landing 2× your cap in two seconds. The sliding window exists specifically to close that gap. (In production the timestamp-log version is usually swapped for a sliding-window counter in Redis, an approximation of the same idea that needs only O(1) memory per key. More on Redis below.)
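
The boundary burst is easy to reproduce with simulated timestamps. The sketch below assumes a hypothetical `FixedWindow` class and a `SlidingLog` variant of the class above that takes the current time as an argument; it fires the full limit at 0:59 and again at 1:00 against a 100-per-minute cap.

```python
from collections import deque


class FixedWindow:
    """Naive counter that resets at every multiple of window_seconds."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window)
        if window_index != self.current_window:
            # A new clock window started: forget everything that came before
            self.current_window = window_index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingLog:
    """Timestamp log, same logic as SlidingWindow but with an explicit clock."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()

    def allow(self, now: float) -> bool:
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False


def accepted(limiter, timestamps) -> int:
    """Count how many of the simulated requests get through."""
    return sum(limiter.allow(t) for t in timestamps)


# 100 requests at t=59s and 100 more at t=60s
burst = [59.0] * 100 + [60.0] * 100
fixed_accepted = accepted(FixedWindow(100, 60), burst)    # both halves pass: 200
sliding_accepted = accepted(SlidingLog(100, 60), burst)   # second half is rejected: 100
```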

Which one to pick

Use token bucket when natural, bursty traffic is legitimate and you care about the average rate — a dashboard that fires ten calls on load, then idles. Use sliding window when the cap is a hard promise per period — “100 requests per minute, full stop,” the shape most public API plans are sold in. When unsure, token bucket is the more forgiving default for first-party clients.

Where to Enforce It in FastAPI

FastAPI gives you two clean insertion points, and they're good at different jobs. A dependency is per-route and granular — perfect for putting a tight limit on exactly the login endpoint. Middleware is global and runs on every request — perfect for a baseline ceiling across the whole app. A dependency reads naturally:

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()

def rate_limit(request: Request) -> None:
    key = client_key(request)              # per-IP or per-user — see below
    if not limiter_for(key).allow():
        raise HTTPException(
            status_code=429,
            detail="Too Many Requests",
            headers={"Retry-After": "1"},
        )

@app.post("/login", dependencies=[Depends(rate_limit)])
async def login():  # request parameters elided
    ...
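
The dependency above calls two helpers it does not define, `client_key` and `limiter_for`. One possible shape for them is sketched below, under a few assumptions: `request.state.user_id` is a hypothetical attribute set by your own auth layer, `LimiterRegistry` is an illustrative name, and the demo uses stand-in objects instead of a real FastAPI `Request`.

```python
from types import SimpleNamespace   # only needed for the stand-in requests below


def client_key(request) -> str:
    """Prefer a stable user id; fall back to the peer address."""
    user_id = getattr(request.state, "user_id", None)   # hypothetical: set by your auth layer
    if user_id is not None:
        return f"user:{user_id}"
    host = request.client.host if request.client else "unknown"
    return f"ip:{host}"


class LimiterRegistry:
    """Maps a key to its own limiter, creating one on first sight."""

    def __init__(self, factory, max_keys: int = 10_000):
        self._factory = factory          # zero-argument callable that builds a limiter
        self._limiters = {}              # dicts preserve insertion order
        self._max_keys = max_keys

    def __call__(self, key: str):
        limiter = self._limiters.pop(key, None)
        if limiter is None:
            limiter = self._factory()
            if len(self._limiters) >= self._max_keys:
                # Evict the least recently used key so memory stays bounded.
                # Trade-off: that client's count starts over if it returns.
                oldest = next(iter(self._limiters))
                del self._limiters[oldest]
        self._limiters[key] = limiter    # re-insert as most recently used
        return limiter


# In the app the factory would build a real limiter, for example
#   limiter_for = LimiterRegistry(lambda: TokenBucket(capacity=10, refill_rate=5))
# The demo uses dict as a placeholder factory and fake request objects.
limiter_for = LimiterRegistry(factory=dict, max_keys=2)
anonymous = SimpleNamespace(state=SimpleNamespace(), client=SimpleNamespace(host="203.0.113.7"))
signed_in = SimpleNamespace(state=SimpleNamespace(user_id=42), client=SimpleNamespace(host="203.0.113.7"))
keys = (client_key(anonymous), client_key(signed_in))
```

Without the size bound, a registry like this grows by one entry for every distinct key it ever sees, which is its own small denial-of-service vector.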

Many apps run both: a permissive global middleware floor that catches obvious abuse everywhere, plus a strict dependency on the sensitive handful of endpoints. Keep your health-check and readiness routes exempt — nothing is more embarrassing than rate-limiting your own load balancer into marking the service unhealthy.
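
A global ceiling with exempt routes can be sketched as a pure ASGI middleware, which needs no imports beyond the standard library. The class name, the `allow` callable (key in, bool out), and the `/healthz` and `/readyz` paths are assumptions for illustration; substitute whatever your limiter and load balancer actually use.

```python
import json


class RateLimitMiddleware:
    """Pure ASGI middleware: one limit for every route except the exempt ones."""

    def __init__(self, app, allow, exempt_paths=("/healthz", "/readyz")):
        self.app = app
        self.allow = allow                          # hypothetical hook: key -> bool
        self.exempt_paths = frozenset(exempt_paths)

    async def __call__(self, scope, receive, send):
        # Pass through non-HTTP traffic (lifespan, websockets) and exempt routes
        if scope["type"] != "http" or scope["path"] in self.exempt_paths:
            await self.app(scope, receive, send)
            return

        client = scope.get("client")                # (host, port) tuple, or None
        key = f"ip:{client[0]}" if client else "ip:unknown"
        if self.allow(key):
            await self.app(scope, receive, send)
            return

        # Over the limit: answer directly without calling the wrapped app
        body = json.dumps({"detail": "Too Many Requests"}).encode()
        await send({
            "type": "http.response.start",
            "status": 429,
            "headers": [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode()),
                (b"retry-after", b"1"),
            ],
        })
        await send({"type": "http.response.body", "body": body})
```

With FastAPI this would be registered as `app.add_middleware(RateLimitMiddleware, allow=my_allow)`, where `my_allow` is your own function. Note that the key here is the raw peer address, which behind a proxy is the proxy itself.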

What to Key On: Per-IP vs Per-User

A limiter is only as good as the key it counts against. The three common choices:

Don't Trust X-Forwarded-For Blindly

Behind a proxy or load balancer, request.client.host is the proxy's IP, so you reach for the X-Forwarded-For header instead. But a client can send that header too. If you trust it raw, an attacker sends a different fake IP on every request, each one gets a fresh counter, and your per-IP limit never triggers. Only read the value your own trusted proxy appends (configure the trusted-hop count, or use Uvicorn's ProxyHeadersMiddleware via --proxy-headers and --forwarded-allow-ips), never the client-supplied portion.
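
To make "only read what your own proxy appended" concrete, here is a minimal sketch that counts hops from the right-hand end of the header. The function name and the `trusted_proxies` parameter are illustrative, and the sketch assumes every trusted proxy appends the address it received the connection from and cannot be bypassed.

```python
from typing import Optional


def trusted_client_ip(peer_ip: str, forwarded_for: Optional[str], trusted_proxies: int) -> str:
    """Pick the client address written by the outermost trusted proxy.

    peer_ip is the socket peer (request.client.host); trusted_proxies is how
    many proxies you operate between the internet and this app.
    """
    if trusted_proxies <= 0 or not forwarded_for:
        return peer_ip
    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    if len(hops) < trusted_proxies:
        # Fewer entries than proxies means the chain is not what we expect
        return peer_ip
    # Entries to the left of this position were supplied by the client
    return hops[-trusted_proxies]


# The client sent a spoofed "1.2.3.4"; our single proxy appended the real address
one_proxy = trusted_client_ip("10.0.0.5", "1.2.3.4, 203.0.113.7", trusted_proxies=1)
# Two proxies: the outer one appended the client, the inner one appended the outer
two_proxies = trusted_client_ip("10.0.0.6", "1.2.3.4, 203.0.113.7, 10.0.0.5", trusted_proxies=2)
# No proxy in front: ignore the header entirely
direct = trusted_client_ip("203.0.113.7", "1.2.3.4", trusted_proxies=0)
```

Counting from the right is the point: the left-hand entries are whatever the client chose to send.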

Going Distributed: Why In-Memory Breaks

The classes above store state in a process. That's fine in development with one worker — and quietly wrong in production. The moment you run multiple Uvicorn workers or more than one instance behind a load balancer, each process keeps its own bucket, so a “100/minute” limit silently becomes “100/minute per worker.” Four workers, four instances, and your real limit is 16× what you intended.

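The fix is a shared store every instance talks to, almost always Redis, with the counter updated atomically so two simultaneous requests can't both read “99” and both proceed. The simplest shared limiter is a fixed-window counter built from two commands. It is easy to reason about, but it inherits the boundary burst described earlier, so treat it as a starting point rather than a true sliding window:

# Per-key counter with a TTL. INCR is atomic and returns the new value;
# the first request in a window also sets the expiry. INCR and EXPIRE are
# two separate commands, so a crash between them leaves a key with no TTL.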
async def allow(redis, key: str, limit: int, window_seconds: int) -> bool:
    count = await redis.incr(key)
    if count == 1:
        await redis.expire(key, window_seconds)
    return count <= limit
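
The INCR-and-EXPIRE counter above starts a fresh count whenever its key expires. A sliding-window counter smooths that edge by blending the previous window's count into the current one, weighted by how much of the previous window still overlaps the trailing period. The sketch below shows the arithmetic in plain Python with the time passed in explicitly; the class name is illustrative, and in Redis the two counts would typically live in two keys per client, which is an assumption about layout rather than a fixed recipe.

```python
class SlidingWindowCounter:
    """Approximate sliding window using two counters instead of a timestamp log."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.index = 0        # which fixed window `current` belongs to
        self.current = 0      # hits in the current fixed window
        self.previous = 0     # hits in the window immediately before it

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.index:
            # Roll forward. The old count only matters if it is the adjacent window.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        # Fraction of the current fixed window that has already elapsed
        elapsed = (now % self.window) / self.window
        # Assume the previous window's hits were spread evenly across it
        estimate = self.previous * (1 - elapsed) + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
at_0_59 = sum(limiter.allow(59.0) for _ in range(100))   # the full limit passes
at_1_00 = sum(limiter.allow(60.0) for _ in range(100))   # boundary burst is rejected
at_1_30 = sum(limiter.allow(90.0) for _ in range(100))   # half the old window has aged out
```

The even-spread assumption makes this an estimate, not an exact count, which is the price of constant memory per key.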

For a token bucket in Redis, fold the read-refill-write into a single Lua script (or an optimistic WATCH/MULTI/EXEC transaction that retries on conflict) so the whole check is atomic. Doing it in three round-trips reintroduces the race you came to Redis to avoid. Plenty of maintained libraries package exactly this; the value of writing it once yourself is understanding what they're doing when the limit misbehaves.
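
As a rough illustration of the single-script approach, the sketch below stores `tokens` and `updated` in a Redis hash and does the refill and spend inside one Lua script. Everything here is an assumption made for the example: the hash field names, the TTL choice, the `allow_token` function, and the decision to pass the timestamp from the app server (which requires reasonably synchronized clocks across instances). The `redis` argument is assumed to be an asyncio Redis client exposing `eval(script, numkeys, *keys_and_args)`.

```python
import time

TOKEN_BUCKET_LUA = """
local capacity    = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now         = tonumber(ARGV[3])

local state   = redis.call('HMGET', KEYS[1], 'tokens', 'updated')
local tokens  = tonumber(state[1])
local updated = tonumber(state[2])
if tokens == nil or updated == nil then
  -- First sight of this key: start with a full bucket
  tokens = capacity
  updated = now
end

-- Refill for the elapsed time, never exceeding capacity or going backwards
tokens = math.min(capacity, tokens + math.max(0, now - updated) * refill_rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'updated', now)
-- An idle bucket refills completely, so the key can expire after that long
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / refill_rate) * 2)
return allowed
"""


async def allow_token(redis, key: str, capacity: int, refill_rate: float) -> bool:
    """Run the whole read-refill-write as one atomic script call."""
    allowed = await redis.eval(
        TOKEN_BUCKET_LUA, 1, key, capacity, refill_rate, time.time()
    )
    return allowed == 1
```

Because Redis runs a script to completion before serving another command, two simultaneous requests cannot both spend the last token.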

Return a Proper 429

When you reject, do it like a good API citizen. The status code is 429 Too Many Requests, and the response should tell the client when to come back so well-behaved callers can back off instead of hammering:

from fastapi import Response

def too_many_requests(retry_after: int, limit: int, remaining: int, reset: int) -> Response:
    return Response(
        status_code=429,
        content='{"detail": "Too Many Requests"}',
        media_type="application/json",
        headers={
            "Retry-After": str(retry_after),         # seconds until a request will succeed
            "RateLimit-Limit": str(limit),
            "RateLimit-Remaining": str(remaining),
            "RateLimit-Reset": str(reset),           # seconds until the window resets
        },
    )

Sending the RateLimit-* headers on successful responses too is a kindness: clients can read their remaining budget and self-throttle before they ever hit the wall. One more decision to make on purpose: do you fail open or closed if Redis itself is down? Failing open keeps the app serving (at the cost of the limit); failing closed protects the resource (at the cost of availability). For a security-critical endpoint, closed; for a general API, usually open — just decide it deliberately rather than discovering your default during an outage.
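
The fail-open versus fail-closed choice can be made explicit in one small wrapper. This is a sketch under assumptions: `guarded_allow`, its `check` callable, the 50 ms timeout, and the default `store_errors` tuple are all illustrative. With a real Redis client you would pass that library's own exception base class in `store_errors`, since it may not derive from the built-in `ConnectionError`.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type


async def guarded_allow(
    check: Callable[[], Awaitable[bool]],
    fail_open: bool,
    timeout_seconds: float = 0.05,
    store_errors: Tuple[Type[BaseException], ...] = (ConnectionError, OSError),
) -> bool:
    """Run the limiter check; if the store is down or slow, apply the chosen policy."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return fail_open      # store too slow: treat like an outage
    except store_errors:
        return fail_open      # store unreachable: a real app would also log this


async def _store_down() -> bool:
    raise ConnectionError("store unreachable")


async def _over_limit() -> bool:
    return False


general_api = asyncio.run(guarded_allow(_store_down, fail_open=True))    # keeps serving
login_route = asyncio.run(guarded_allow(_store_down, fail_open=False))   # rejects
healthy = asyncio.run(guarded_allow(_over_limit, fail_open=True))        # normal verdict wins
```

The policy only applies when the store fails; a healthy store's answer is always returned as is.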

A rate-limiting setup you can trust

1. Pick the algorithm per need: token bucket for bursty, sliding window for hard caps.
2. Key on the authenticated user where you can, IP where you can't, and read the client IP only from a trusted proxy header.
3. Enforce via a dependency on sensitive routes, optionally a global middleware floor.
4. Move to a shared atomic store (Redis) the moment you run more than one process.
5. Return 429 with Retry-After and RateLimit-* headers.
6. Exempt health checks.
7. Decide fail-open vs fail-closed up front.
8. Test it with a burst that exceeds the limit and assert the 429s.

The Bottom Line

Rate limiting is a small amount of code standing between you and a large class of incidents — abuse, brute-force, runaway cost, and the noisy neighbor. Choose token bucket when bursts are legitimate and sliding window when the cap is a promise; count against the user when you know who they are and the IP when you don't; reach for Redis the instant a second process exists; and always tell the client how to back off. Wire it in early, and the worst traffic you ever receive becomes a 429 in a log instead of a page at 3 a.m.

Rate Limiting, Already Wired In

Configurable rate limiting — with sane defaults on the auth endpoints and Redis-backed counters for multi-instance deploys — ships pre-built in ShipKit, our production-ready FastAPI boilerplate. See inside ShipKit's architecture for how it fits together.

Brandon Wigley

Founder of Wigley Studios. Building developer tools since 2018.
