Rate Limiter — System Design from First Principles

Step I·Clarify Scope

A limiter is a tax on every request.

Most systems can be slow on the cold path. A rate limiter has no cold path — every request that enters the API touches it first. Scope is therefore an exercise in keeping that touch impossibly cheap.

Q · Functional Boundary

What are we actually building?

Given (user, endpoint) and the current moment, return allow or deny — and update the count. Per-user, per-IP, per-endpoint limits. Not billing tiers, not abuse detection, not adaptive throttling. Those are products built on a rate limiter — not the rate limiter itself.

Q · Read / Write Ratio

Which direction does the traffic flow?

One-to-one. Every request reads the current count and writes the new one. The limiter is the traffic. Whatever latency you spend on the limiter is added to every API call behind it — a 5 ms limiter on a 30 ms endpoint is a 17 % tax you charged yourself.

Q · Consistency Need

Must the count be exact?

Approximate is fine for nearly everything: ads, search, public APIs. If a user briefly slips two requests over the limit during a race, no one is harmed. The exceptions — login attempts, payment retries, fraud signals — are security boundaries, and for those we trade latency for precision deliberately.

Why this matters

The first instinct is to make the limiter fair — global counts, perfect precision, instant updates. That instinct is wrong. Fairness costs coordination, coordination costs milliseconds, and milliseconds on every request add up to the limiter being slower than what it protects. The senior move is to accept small overages so the limiter can keep up.

The senior move

Name what you are not limiting. "This is per-user, per-endpoint, per-region. Cross-region global quota is a separate design with a separate budget." Saying this out loud at the start removes 80 % of the ambiguity that bogs the rest of the interview down.

Step II·Establish Scale

A budget measured in microseconds.

A rate limiter that takes 10 milliseconds is a worse-than-useless rate limiter. The latency goal is not a comfort target — it is the spec. Run the arithmetic and the constraints fall out by themselves.

Workload

1 M requests / sec in front of a busy API · the limiter sees all of it

Decision rate

1 M allow/deny calls / sec sustained · 3 M peak

Counter writes

1 M increments / sec · same as the decision rate

Storage

≈ 10 GB for users × rules × active windows · all in RAM

Latency target

< 1 ms p99 (allow/deny round-trip)

// the arithmetic that shapes the design

Per-request budget
→ ~ 1 ms p99 across the whole hop

SSD seek alone
→ 100 µs — already 10 % of the budget. Cannot land on disk.

Cross-AZ round-trip
→ ~ 1 ms — eats the entire budget if every request makes one.

Per-Redis-node throughput
→ ~ 100 K decisions/sec · 1 M QPS → 10-node shard.

The number to remember

One millisecond. The limiter must be faster than anything it protects. Every design choice from here is a fight against that single number.

III

Step III·Identify the Walls

Which physical wall are we hitting first?

A rate limiter does almost nothing per request — read a counter, compare, increment, return. The question is not what to compute. The question is how many of those nothings the fleet can do per second, and at what tail latency.

Compute
& I/O

Storage
Hierarchy

Coordination
Cost

Time &
Ordering

Compute & I/O dominates. One million decisions per second, each under a millisecond, each touching a shared counter. The CPU work is trivial — a compare and an increment. The hard part is sustaining that rate without queuing, without contention, without any tail-latency event that punches through the budget. This is a throughput-and-tail-latency problem, not a computation problem.

Storage Hierarchy sits second, not first, because the dataset itself is small — ten gigabytes of counters fits comfortably in one machine's RAM. Storage matters as a floor: the counters must live in memory; SSD is not an option. But once that floor is met, storage stops pressing.

Coordination Cost shows up only when a limit must hold globally across regions — and most do not. Time & Ordering matters only at the algorithm level (sliding windows need timestamps), and even then a sloppy clock costs a few extra requests, not correctness.

Step IV·Map Patterns

Each axis unlocks its own patterns.

Three patterns carry the system. The first is the algorithm — what counts as a request. The second is the substrate — where the count lives. The third is the trick — how the limiter avoids becoming the thing it protects against.

Pattern 01 Wall II · Compute & I/O

Token bucket

Picture an arcade with a token machine on the wall. The machine drops one token into your cup every few seconds, up to a small maximum it can hold. Every game costs one token. A kid who has been waiting can play several games back-to-back — burning the saved-up tokens — then has to slow down to the drip rate. Burst when you can, average when you must. The machine never has to ask anyone's permission.

For each (user, endpoint) the limiter keeps two numbers: the token count and the last-refill timestamp. On each request, the system advances the timestamp, adds the tokens that would have dripped in since the last visit (capped at the bucket size), and — if at least one token remains — spends it and allows the request. Two numbers. One read. One write. No history to scan, no log to truncate.

Wall II rewards constant-time work. The token bucket is O(1) in both compute and memory — the decision touches a fixed amount of state regardless of how many requests have arrived. The cousin algorithm, the leaky bucket, smooths bursts to zero instead of allowing them; it is stricter, but token buckets win nearly every real interview because clients do burst, and refusing to admit that costs goodwill.

The fixed-window counter is the seductive shortcut: one counter per minute, increment-and-compare. It looks simpler — and it is — but it permits a notorious boundary burst. A client can spend its full quota in the last second of one window and another full quota in the first second of the next: 2× the rate, for one fleeting second. Token bucket has no such edge.

Pattern 02 Wall I · Storage Hierarchy

Atomic counters,
in RAM

A stadium turnstile. One mechanism, one click, one count. Two fans cannot squeeze through at the same instant — the turnstile is physically single-file, and that is exactly why the count is right. You do not need locks when the geometry of the thing already forbids overlap.

The bucket state lives in Redis — an in-memory key/value store that runs each command to completion before starting the next. Because Redis processes one command at a time, primitives like INCR (atomic increment) are race-free without any locking. To keep multi-step token-bucket logic equally race-free, the whole sequence is bundled into a Lua script — a small program Redis runs in one shot. The script reads both numbers, computes the refill, decides, writes back. One round-trip, no race. Each Redis node sustains roughly 100 K decisions per second; we shard by user-ID across ten nodes to absorb the million-QPS workload.

Wall I sets the price list. RAM at a hundred nanoseconds is breathable; SSD at a hundred microseconds is already a tenth of our budget — and that is the access alone, before contention. With ten gigabytes of counter state and a one-millisecond budget, the only ledge to stand on is RAM. The atomicity is free, because Redis is single-threaded; the speed is free, because the counters never leave memory.

Pattern 03 Wall III · Coordination Cost

Local counters,
then reconcile

Border checkpoints with a shared daily list. Each guard checks travelers on the spot, against the copy of the list they hold. Every few minutes, the booths phone in their counts to headquarters, which updates the master and broadcasts the new totals back. Most decisions are made locally and instantly; the central count is allowed to lag a little. No traveler waits for a long-distance call.

Each API node keeps an in-process counter for the users it has recently seen. Allow/deny is answered from local RAM in microseconds, with no network hop at all. Every 100 ms — or every N requests — the node batches its deltas to the central Redis cluster, pulls back the current global count, and adjusts. If Redis is unreachable, the node keeps deciding from its last-known state: fail-open. The cost is approximation — during a sync gap, a hot user can briefly exceed the limit by roughly (nodes × local budget). For a 100-RPS user across 50 nodes, that's a handful of extra requests per second. For ad APIs, this is invisible. For payment APIs, this is unacceptable, and the trade flips.

Wall III is about what it costs N machines to agree. Forcing every request through a single central counter turns the limiter into a coordination bottleneck — every decision a network round-trip, every shard a hotspot, every Redis pause a cascade. By pushing the decision out to the request-handling node itself, we convert one million coordinated operations per second into one million local operations plus a small reconciliation stream. The limiter never becomes the slowest hop in the request.

Step V·State the Tradeoffs

Every choice has a cost.

The patterns above are not free. Each one trades one property for another. Saying the trade out loud is the difference between a design and a decision.

Latency vs. Global precision

An exact global limit — "no user exceeds 100 RPS anywhere on Earth" — requires every region to consult a single source of truth on every request. Cross-region round-trips are 50–100 ms; that is fifty to a hundred times our entire latency budget. The pragmatic answer is per-region or per-node approximation, with a small reconciliation stream. Choose precision only when the limit is a security boundary — login throttling, payment retries, fraud counters — and the cost of being wrong outweighs the cost of being slow.

Algorithm precision vs. Memory cost

The sliding window log stores every request's timestamp and re-counts on each call — the most accurate algorithm there is, but memory grows with traffic. Fixed windows use one counter per minute — cheap, but they suffer the 2× boundary burst. Token buckets hold two numbers per user — no log, no boundary edge. For nearly every real workload, token bucket is the answer; sliding-window log is reserved for security-grade limits where you must be able to point at exact timestamps. Memory is the price of being able to do that.

Fail-open vs. Fail-closed

If the limiter itself crashes — Redis is unreachable, a shard is down — the system must still answer something. Fail-open lets every request through: the API absorbs the surge, possibly cascading into overload, but stays available. Fail-closed blocks everything: the API takes an outage, but no abuse leaks. Public APIs almost universally fail-open — being briefly un-rate-limited beats being down. Payment and authentication APIs fail-closed — the cost of one abuse window exceeds the cost of a short outage. The choice is not a default; it is a deliberate policy stated upfront.

System № 02 · Thesis Landing

And so Rate Limiter is,
at the bottom, a particular treaty
negotiated against
Compute & I/O.

Given our primary constraint is Compute & I/O — the limiter sits on the hot path of every request, one million per second, under a one-millisecond budget — we use token buckets stored as atomic counters in a sharded Redis cluster, fronted by per-node local approximations that reconcile every 100 ms. The cost is small overages during reconciliation and a deliberate fail-open posture; the win is sub-millisecond decisions that scale linearly with the API itself. Global precision is sacrificed because cross-region coordination is slower than the system the limiter exists to protect.

❦ · ❦ · ❦