Start by shrinking the problem.
Interviewers leave scope vague on purpose. Three concrete answers — boundary, ratio, consistency — pre-select which walls will dominate. Say no early. Say no often.
Two actions. Create a short code from a long URL. Redirect a short code back to its long URL. No analytics, no auth, no expiry, no custom aliases for v1. Anything else is a feature, not the system.
Brutally read-heavy. A hundred reads for every write is conservative — viral links push it higher. Treat this as a read-optimization problem wearing a write's clothing: the create path can be slow; the redirect path cannot.
Eventual is fine. If a just-created link takes fifty milliseconds to propagate to every region, no human notices. We will spend that latitude — generously — on the read path.
A system built without first collapsing scope optimizes for every imaginable workload — which means optimizing for none. Every "what if" you entertain in step 1 is a constraint you carry through steps 2–5.
After answering these three, restate the problem in one sentence. "This is a read-heavy, eventually-consistent system whose main job is to redirect short codes to long URLs." If you can't say that sentence, you haven't scoped it yet.
Numbers, not adjectives.
A 100 RPS system and a 1M RPS system are not the same system made bigger. They are different designs. Concrete numbers tell you which axis hits its wall first.
→ ~ 100 : 1
→ 3 TB
→ app + lookup + 301 ≤ 50 ms
→ +100 µs / req, fine in isolation, fragile under load.
Writes are a rounding error. The whole system is a redirect cache with a small write-attached database hanging off the side.
Which physical wall are we hitting first?
A URL Shortener is, at its core, a single read path: look up a shortcode, return a URL. At seventy thousand lookups per second, the question is not how fast you can compute — it is where the answer lives.
Hierarchy
& I/O
Ordering
Cost
Storage Hierarchy dominates. Three terabytes of index must be reachable inside a fifty-millisecond budget. RAM at a hundred nanoseconds is air. SSD at a hundred microseconds is breathable, but only if you shard wide and keep the request path short. The primary wall is not chosen — it is revealed.
Compute & I/O sits second, not first: 70K QPS is sustained but not exotic. Stateless edge nodes scale linearly. The system's CPU does almost nothing except hand the request a pointer.
Coordination Cost registers nearly zero — no node need consult another per redirect. Time & Ordering matters only on writes, where short codes must be unique without round-tripping a central registry.
Each axis unlocks its own patterns.
Three patterns carry the system. Each is a workaround for a specific wall. Together, they make seventy thousand redirects per second feel like nothing.
KV Store
in RAM
Imagine a room where every address in the city is printed on a card — and the cards are kept not in a filing cabinet but held in your hands, fanned open, instantly readable. No drawer to open. No index to consult. The answer is already there.
The full mapping — shortcode to long URL, roughly three terabytes at five hundred million pairs — lives in a sharded Redis cluster. Not on SSD. Not on a spinning disk. In RAM. The working set is the whole set; there are no cache misses, because there is no slower tier to miss into. On a create, the mapping is written to a durable database and simultaneously warmed into the cache — so the very next redirect finds it without hesitation.
Wall I sets the price list. L1 cache: one nanosecond. RAM: one hundred nanoseconds. SSD: one hundred microseconds. At seventy thousand redirects per second across a sharded fleet, SSD is workable but brittle — every contended block, every garbage-collection pause, every tail-latency event arrives at the user. RAM erases all three. The storage hierarchy is not a scale. It is a cliff. The entire design is a decision about which ledge to stand on.
HTTP 301 & the edge cache
The best way to answer a question is to teach the asker — once — so they never ask again. A great reference librarian does not fetch the same book for the same patron twice. She shows him where it lives. After that, he goes himself.
The redirect service returns HTTP 301 — Permanent Redirect — with a long Cache-Control: max-age. The browser caches the destination. The user's ISP caches it. Corporate proxies cache it. Popular shortcodes are pinned at a CDN edge. Every subsequent visit from the same client, to the same code, never reaches the origin. The server has delegated its work to the edge of the network — permanently, at zero ongoing cost.
Wall II asks: how many operations per second can the fleet sustain? The 301 is the most elegant answer to that constraint — push the work to the client, to the ISP, to infrastructure you do not operate. At a hundred reads per write, making reads disappear is the only move that scales without a ceiling.
This is the production tradeoff worth naming: 301 is cacheable for the lifetime of the cache. If the URL is later disabled — abuse, takedown, or the user simply edits the destination — the browser's cached 301 will still resolve to the old target. Many real-world shorteners therefore use 302 Found instead, accepting more origin traffic for the right to revoke. The choice is a thermostat, not a switch: short max-age + 301, or 302 with origin-side hot cache.
Counter, then base 62
A ticket machine with a number that only goes forward. No two tickets share a number. No one must ask what the last ticket was — the machine knows, because the machine is the guarantee. The counter and the identity are the same thing.
Hand each new link the next whole number from a counter and uniqueness comes for free: a number that only climbs can never repeat. The trouble is length — the billionth link would carry a ten-digit number, and we promised a short code. So we spell the same number in a denser alphabet. Everyday counting uses ten symbols, the digits zero through nine; here we use sixty-two — those ten digits plus both cases of the alphabet, the characters that ride through a URL untouched. More symbols in each slot pack more numbers into fewer slots, and that is all base 62 is: the same count, written compactly. Seven slots hold ~3.5 trillion codes; six hold ~57 billion. At 500 M new links a month, seven characters are forever.
But who owns the counter? Put it on one machine and every new link must stop to ask it for the next number — that machine becomes the single part whose failure halts everything (a single point of failure), and a queue everyone waits in. So we ask it less often. Each web server takes out a lease — numbers 3,000 through 3,999 are mine — and hands those out from its own memory with a local + 1, touching the shared counter not once per link but once per thousand. Consulted that rarely, the counter can afford to be careful: it is kept on a handful of machines that must agree on the last block handed out before granting the next, so one machine dying never forgets a number or hands the same block to two servers. That agreement procedure has a name — Raft — but the idea beneath it is plain: several copies that are never permitted to disagree.
Wall IV is the wall of ordering and uniqueness, and its law here is blunt: to promise that no two codes ever collide, something must be agreed upon. The lease agrees on a number. A second road agrees on time instead — each server builds an ID by gluing together the current millisecond, its own machine number, and a small counter for IDs minted inside that same millisecond. No two servers share a machine number, so the IDs cannot collide, and no server ever asks anyone's permission. Twitter named this Snowflake. Its weakness is the clock it leans on: servers keep their clocks aligned over the network through a service called NTP, and should two clocks drift apart — or one tick backward — a server may re-stamp a millisecond it already used and mint a duplicate that nothing catches. A number you must ask for, or a clock you must trust: Wall IV lets you escape one, never both.
Every choice has a cost.
The patterns above are not free. Each one trades one property for another. Saying the trade out loud is the difference between a design and a decision.
Keeping the full index in memory costs roughly $300 / TB / month in a cloud Redis tier; SSD is closer to $20. At three terabytes that's about $10K / year extra. Trivial against the latency win — and trivial against the engineering cost of building a tiered cache that gets the same p99 under load.
Edge and browser caching means a deleted or repointed URL may still resolve for up to its TTL. For abuse-takedown and analytics this matters. The lever is the max-age header — short TTLs (minutes) keep takedowns fast at the cost of more origin traffic; long TTLs (days) maximize cache hit rate at the cost of revocation latency. A common compromise: a short edge TTL + an origin override path for abuse cases.
Wall IV won't hand over uniqueness for free — something must be coordinated. Choosing the counter means accepting a sliver of coordination: one lease covers a thousand links, so the price is a single quick request per thousand, plus a few numbers discarded when a server restarts mid-block — harmless, since codes must be unique, not consecutive. Choosing the clock — the Snowflake build of timestamp + machine number + counter — means accepting that correctness now rides on every clock staying in step; let two drift into the same millisecond and you mint a duplicate no one sees. Most teams take the counter: skipped numbers are cheap, silent collisions are not. The clock wins only when even one request per thousand is one too many.
And so URL Shortener is,
at the bottom, a particular treaty
negotiated against
Storage Hierarchy.
Given our primary constraint is Storage Hierarchy — a 3 TB index serving 70K QPS reads under a 50 ms budget — we keep the full mapping in a sharded Redis cluster fronted by a CDN. The cost is RAM spend; the win is hundred-nanosecond lookups at infinite horizontal scale. Writes use a leased-block counter to avoid coordination on every create, and a 301 redirect with a tuned max-age pushes the read path to the edge. At a hundred-to-one read/write ratio, consistency is eventual — and nobody, in the time it takes to resolve a redirect, ever notices.