Rate limiting is the seatbelt you install before the crash. Without it, a single misbehaving client — a runaway retry loop, a scraper, someone brute-forcing your login, or just one customer's batch job — can saturate your database, exhaust a metered downstream API, run up a cloud bill, and take the service down for everyone else. With it, that same flood hits a wall, gets a polite “slow down,” and your other users never notice. This is a practical, vendor-neutral guide to adding rate limiting to a FastAPI app: the two algorithms worth knowing, where to enforce them, what to key on, and the mistake that quietly defeats the whole thing the moment you run a second process.
Why Rate-Limit at All
Four reasons, and most apps have all four:
- Protect resources. Your database, your worker pool, and any downstream service you call all have a ceiling. A limit keeps one caller from consuming the headroom everyone shares.
- Fairness. On a multi-tenant or public API, one heavy user shouldn't be able to starve the rest. Limiting is how “noisy neighbor” stops being your on-call problem.
- Security. The login, password-reset, and OTP endpoints are brute-force targets. A tight limit on those turns credential-stuffing from a threat into a non-event.
- Cost control. When a request costs you money — an LLM call, an SMS, a third-party lookup — a limit is the difference between a bad day and a bad invoice.
The Two Algorithms That Matter
You'll see a half-dozen rate-limiting algorithms in the wild, but two cover almost every real need, and they make opposite trade-offs about bursts.
Token bucket is the burst-friendly one. Picture a bucket that holds up to capacity tokens and refills at a steady refill_rate per second. Each request spends one token; an empty bucket means rejection. Because a full bucket can be spent all at once, it permits short bursts up to the bucket size while still enforcing the average rate over time:
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: int        # max tokens = the biggest burst you allow
    refill_rate: float   # tokens added per second = the steady-state rate
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        # Start full and start the clock now; time.monotonic() has no defined
        # zero point, so 0.0 is not a safe initial timestamp
        self.tokens = float(self.capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Add tokens for the time elapsed, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 10-request burst, refilling at 5 requests/second sustained
bucket = TokenBucket(capacity=10, refill_rate=5)
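To see the burst-then-steady behaviour without waiting on a real clock, here is a small sketch that replays arrival times through the same refill arithmetic. The simulate helper and its parameters are illustrative names, not part of the class above, and it assumes the bucket starts full.

```python
def simulate(capacity: int, refill_rate: float, arrivals: list[float]) -> list[bool]:
    """Replay request arrival times (in seconds) through a token bucket.

    Assumes the bucket starts full and arrivals are sorted ascending.
    """
    tokens = float(capacity)
    updated = arrivals[0] if arrivals else 0.0
    decisions = []
    for now in arrivals:
        # Same refill rule as the class: elapsed time * rate, capped at capacity
        tokens = min(capacity, tokens + (now - updated) * refill_rate)
        updated = now
        if tokens >= 1:
            tokens -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


# 12 requests at t=0, then 6 more one second later
burst = simulate(10, 5, [0.0] * 12 + [1.0] * 6)
print(burst.count(True))  # 15: the 10-token burst plus 5 refilled tokens
```

The first ten requests drain the bucket, the next two are rejected, and the second that passes buys exactly five more.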
Sliding window is the strict one. Instead of allowing bursts, it counts how many requests happened in the trailing window — say the last 60 seconds — and rejects anything over the cap. The simplest faithful version keeps a timestamp log and trims anything older than the window:
import time
from collections import deque


class SlidingWindow:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False
Why not the naive fixed window, a counter that resets every 60 seconds on the clock? Because it allows a double-rate burst across the boundary: a client can fire the full limit at 0:59 and the full limit again at 1:00, landing 2× your cap in two seconds. The sliding window exists specifically to close that gap. (In production the timestamp log, which stores one entry per accepted request, is usually swapped for a sliding-window counter in Redis: an approximation of the same idea that needs only O(1) memory per key, which we'll get to.)
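The boundary burst is easy to reproduce. The sketch below replays the 0:59 / 1:00 scenario through both schemes; the function names are made up for this illustration, and arrival times are passed in as plain seconds rather than read from a clock.

```python
from collections import deque


def fixed_window_allowed(times: list[float], limit: int, window: float) -> int:
    """Count accepted requests under a counter that resets on window boundaries."""
    counts: dict[int, int] = {}
    accepted = 0
    for t in times:
        bucket = int(t // window)  # which clock-aligned window t falls in
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            accepted += 1
    return accepted


def sliding_window_allowed(times: list[float], limit: int, window: float) -> int:
    """Count accepted requests under a trailing-window timestamp log."""
    hits: deque[float] = deque()
    accepted = 0
    for t in times:
        # Drop timestamps that have aged out of the trailing window
        while hits and hits[0] <= t - window:
            hits.popleft()
        if len(hits) < limit:
            hits.append(t)
            accepted += 1
    return accepted


# 100 requests at 0:59 and 100 more at 1:00, against a 100-per-minute cap
arrivals = [59.0] * 100 + [60.0] * 100
print(fixed_window_allowed(arrivals, 100, 60.0))    # 200
print(sliding_window_allowed(arrivals, 100, 60.0))  # 100
```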
Which one to pick
Use token bucket when natural, bursty traffic is legitimate and you care about the average rate — a dashboard that fires ten calls on load, then idles. Use sliding window when the cap is a hard promise per period — “100 requests per minute, full stop,” the shape most public API plans are sold in. When unsure, token bucket is the more forgiving default for first-party clients.
Where to Enforce It in FastAPI
FastAPI gives you two clean insertion points, and they're good at different jobs. A dependency is per-route and granular — perfect for putting a tight limit on exactly the login endpoint. Middleware is global and runs on every request — perfect for a baseline ceiling across the whole app. A dependency reads naturally:
from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()


def rate_limit(request: Request) -> None:
    # client_key and limiter_for are your own helpers, defined elsewhere
    key = client_key(request)  # per-IP or per-user; see below
    if not limiter_for(key).allow():
        raise HTTPException(
            status_code=429,
            detail="Too Many Requests",
            headers={"Retry-After": "1"},
        )


@app.post("/login", dependencies=[Depends(rate_limit)])
async def login():  # request parameters elided
    ...
Many apps run both: a permissive global middleware floor that catches obvious abuse everywhere, plus a strict dependency on the sensitive handful of endpoints. Keep your health-check and readiness routes exempt — nothing is more embarrassing than rate-limiting your own load balancer into marking the service unhealthy.
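The rate_limit dependency above calls limiter_for(key) without showing it. One plausible shape, sketched here under assumptions, is a registry that lazily creates one limiter per key. LimiterRegistry, factory, and max_keys are hypothetical names for this example, and the state is in-memory, so it only holds for a single process.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

L = TypeVar("L")


class LimiterRegistry(Generic[L]):
    """Hands out one limiter per key, creating it on first use.

    Assumption: `factory` builds a fresh limiter (for example a token bucket).
    The registry is capped at `max_keys` and evicts the least recently used
    key, so a flood of unique keys cannot grow memory without bound.
    """

    def __init__(self, factory: Callable[[], L], max_keys: int = 10_000):
        self._factory = factory
        self._max_keys = max_keys
        self._limiters: OrderedDict[str, L] = OrderedDict()

    def limiter_for(self, key: str) -> L:
        limiter = self._limiters.get(key)
        if limiter is None:
            limiter = self._factory()
            self._limiters[key] = limiter
            if len(self._limiters) > self._max_keys:
                self._limiters.popitem(last=False)  # drop the least recently used key
        else:
            self._limiters.move_to_end(key)  # mark as recently used
        return limiter
```

Eviction hands the evicted client a fresh limiter on its next request, so max_keys should sit comfortably above the number of clients you expect to be active at once.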
What to Key On: Per-IP vs Per-User
A limiter is only as good as the key it counts against. The three common choices:
- Per-user / per-API-key. The best key when the request is authenticated — it's fair (a user can't get more capacity by changing networks) and precise. Key off the user id or API key from the validated token.
- Per-IP. Your only option for anonymous traffic like login and signup. Workable, but imperfect: an entire office behind one NAT shares an IP, and mobile carriers rotate them. Use it for unauthenticated endpoints, prefer the user key everywhere else.
- Per-endpoint. Layered on top of either — the login route deserves a far tighter limit than a read-only list endpoint.
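Those three choices compose into a single string key. A minimal sketch, assuming the caller has already authenticated the request and resolved the client IP; build_key and the "rl:" prefix are illustrative choices, not a convention any library requires.

```python
from typing import Optional


def build_key(route: str, user_id: Optional[str], client_ip: str) -> str:
    """Compose a rate-limit key: the user when known, the IP otherwise.

    Assumption: the caller has already validated the token that produced
    `user_id` and resolved `client_ip` from a trusted source.
    """
    identity = f"user:{user_id}" if user_id else f"ip:{client_ip}"
    # Prefixing with the route gives each endpoint its own counter
    return f"rl:{route}:{identity}"


print(build_key("/login", None, "203.0.113.7"))    # rl:/login:ip:203.0.113.7
print(build_key("/items", "u_42", "203.0.113.7"))  # rl:/items:user:u_42
```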
Don't Trust X-Forwarded-For Blindly
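Behind a proxy or load balancer, request.client.host is the proxy's IP, so you reach for the X-Forwarded-For header instead. But a client can send that header too. If you trust it raw, an attacker puts a different fake IP in it on every request, each request lands in a fresh counter, and your per-IP limit never triggers. Only trust the entries your own proxies append, which sit at the right-hand end of the list: count in from the right by the number of proxies you control, or let the server resolve it for you (Uvicorn's ProxyHeadersMiddleware, enabled with --proxy-headers and restricted with --forwarded-allow-ips, rewrites the client address only for connections from proxies you list). Never use the client-supplied portion.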
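If you parse the header yourself, the safe direction is right to left. The sketch below assumes a known, fixed number of proxies you control, each of which appends the address it received the connection from; client_ip_from_forwarded and trusted_hops are names invented for this example.

```python
import ipaddress


def client_ip_from_forwarded(header: str, peer_ip: str, trusted_hops: int = 1) -> str:
    """Pick the client address from X-Forwarded-For, counting from the right.

    Assumptions: exactly `trusted_hops` proxies you control sit in front of
    the app, and each one appends the address it received the connection
    from. Anything further left was supplied by the client and is ignored.
    """
    entries = [part.strip() for part in header.split(",") if part.strip()]
    if trusted_hops < 1 or len(entries) < trusted_hops:
        return peer_ip  # header missing or too short: use the socket peer
    candidate = entries[-trusted_hops]
    try:
        ipaddress.ip_address(candidate)  # reject anything that is not an IP
    except ValueError:
        return peer_ip
    return candidate


# The client forged "1.2.3.4"; the trusted proxy appended the real address
print(client_ip_from_forwarded("1.2.3.4, 203.0.113.7", peer_ip="10.0.0.2"))  # 203.0.113.7
```

Falling back to the socket peer when the header looks wrong means a misconfiguration shows up as everyone sharing one bucket, which is loud, rather than as a limit that silently never fires.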
Going Distributed: Why In-Memory Breaks
The classes above store state in a process. That's fine in development with one worker, and quietly wrong in production. The moment you run multiple Uvicorn workers or more than one instance behind a load balancer, each process keeps its own bucket, so a “100/minute” limit silently becomes “100/minute per process.” Run four instances with four workers each, and your real limit is up to 16× what you intended.
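The fix is a shared store every instance talks to, almost always Redis, with the counter updated atomically so two simultaneous requests can't both read “99” and both proceed. The simplest shared limiter is a fixed-window counter built on INCR, which Redis executes atomically. It still has the boundary burst described earlier; a sliding-window counter closes most of that gap by keeping one such count per window and weighting the previous window's count by how much of it the trailing window still overlaps. The fixed-window version is a couple of operations:
# Fixed-window counter with a TTL. INCR is atomic and returns the new value;
# the first request in a window also sets the expiry.
async def allow(redis, key: str, limit: int, window_seconds: int) -> bool:
    count = await redis.incr(key)
    if count == 1:
        # Separate round-trip: if this call is lost, the key never expires
        await redis.expire(key, window_seconds)
    return count <= limit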
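The sliding-window counter mentioned earlier approximates the trailing window from two fixed-window counts. The sketch below shows only that arithmetic, in memory and with the time passed in so it can be exercised deterministically; in a shared store the two counts would live in two keys per client. The class and attribute names are illustrative.

```python
class SlidingWindowCounter:
    """Approximate a trailing window from two fixed-window counts.

    In-memory sketch of the arithmetic only. Time is passed in explicitly
    so the behaviour is easy to test.
    """

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0   # which clock-aligned window we are in
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.current_index:
            # Roll forward; the old count only carries over from the adjacent window
            self.previous_count = self.current_count if index == self.current_index + 1 else 0
            self.current_count = 0
            self.current_index = index
        # Weight the previous window by how much of it the trailing window still covers
        elapsed_fraction = (now % self.window) / self.window
        estimate = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if estimate < self.limit:
            self.current_count += 1
            return True
        return False
```

Replaying the boundary scenario against a 100-per-minute cap: 100 requests at 0:59 are accepted, 100 more at 1:00 are all rejected, and by 1:30 half of the previous window has aged out, so 50 get through. The estimate assumes the previous window's requests were evenly spread, which is why this is an approximation rather than an exact count.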
For a token bucket in Redis, fold the read-refill-write into a single Lua script (or an optimistic WATCH/MULTI/EXEC transaction that retries on conflict) so the whole check is atomic. A bare MULTI can't do it on its own, because commands queued inside the transaction can't act on a value you read from it. Doing it in three round-trips reintroduces the race you came to Redis to avoid. Plenty of maintained libraries package exactly this; the value of writing it once yourself is understanding what they're doing when the limit misbehaves.
Return a Proper 429
When you reject, do it like a good API citizen. The status code is 429 Too Many Requests, and the response should tell the client when to come back so well-behaved callers can back off instead of hammering:
from fastapi import Response


def too_many_requests(retry_after: int, limit: int, remaining: int, reset: int) -> Response:
    return Response(
        status_code=429,
        content='{"detail": "Too Many Requests"}',
        media_type="application/json",
        headers={
            "Retry-After": str(retry_after),  # seconds until a request will succeed
            "RateLimit-Limit": str(limit),
            "RateLimit-Remaining": str(remaining),
            "RateLimit-Reset": str(reset),  # seconds until the window resets
        },
    )
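The retry_after value has to come from the limiter's own state. Two small helpers, sketched under the assumption that the limiters expose the fields used in the classes earlier (tokens and refill_rate for the bucket, the oldest logged timestamp for the window); the helper names are hypothetical. Both round up so a client that waits exactly the stated time is not rejected again.

```python
import math


def retry_after_token_bucket(tokens: float, refill_rate: float) -> int:
    """Whole seconds until a token bucket holds at least one token again.

    Assumes refill_rate is positive.
    """
    if tokens >= 1:
        return 0
    return math.ceil((1 - tokens) / refill_rate)


def retry_after_sliding_window(oldest_hit: float, window_seconds: float, now: float) -> int:
    """Whole seconds until the oldest logged hit ages out of the window."""
    return max(0, math.ceil(oldest_hit + window_seconds - now))


print(retry_after_token_bucket(tokens=0.25, refill_rate=5))                       # 1
print(retry_after_sliding_window(oldest_hit=10.0, window_seconds=60, now=42.5))   # 28
```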
Sending the RateLimit-* headers on successful responses too is a kindness: clients can read their remaining budget and self-throttle before they ever hit the wall. One more decision to make on purpose: do you fail open or closed if Redis itself is down? Failing open keeps the app serving (at the cost of the limit); failing closed protects the resource (at the cost of availability). For a security-critical endpoint, closed; for a general API, usually open — just decide it deliberately rather than discovering your default during an outage.
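One way to make the fail-open or fail-closed decision explicit is to wrap the store call and name the policy at the call site. This is a sketch, not a library API: allow_with_fallback, check, and timeout_seconds are illustrative names, and the exception list would need your Redis client's own error types added, since those do not necessarily derive from the built-in OSError.

```python
import asyncio
from typing import Awaitable, Callable


async def allow_with_fallback(
    check: Callable[[], Awaitable[bool]],
    fail_open: bool,
    timeout_seconds: float = 0.1,
) -> bool:
    """Run a limiter check, applying an explicit policy if the store fails.

    Assumption: `check` is any coroutine function that asks the shared store
    and returns True when the request is within the limit.
    """
    try:
        return await asyncio.wait_for(check(), timeout=timeout_seconds)
    except (asyncio.TimeoutError, OSError):
        # Store unreachable or too slow: fall back to the chosen policy.
        # Add your client library's connection and timeout errors here.
        return fail_open
```

A login route would call it with fail_open=False and a general read endpoint with fail_open=True, so the choice is visible in code review instead of buried in an exception handler.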
A rate-limiting setup you can trust
1. Pick the algorithm per need: token bucket for bursty, sliding window for hard caps.
2. Key on the authenticated user where you can, IP where you can't, and read the client IP only from a trusted proxy header.
3. Enforce via a dependency on sensitive routes, optionally a global middleware floor.
4. Move to a shared atomic store (Redis) the moment you run more than one process.
5. Return 429 with Retry-After and RateLimit-* headers.
6. Exempt health checks.
7. Decide fail-open vs fail-closed up front.
8. Test it with a burst that exceeds the limit and assert the 429s.
The Bottom Line
Rate limiting is a small amount of code standing between you and a large class of incidents — abuse, brute-force, runaway cost, and the noisy neighbor. Choose token bucket when bursts are legitimate and sliding window when the cap is a promise; count against the user when you know who they are and the IP when you don't; reach for Redis the instant a second process exists; and always tell the client how to back off. Wire it in early, and the worst traffic you ever receive becomes a 429 in a log instead of a page at 3 a.m.