The CDN bug that taught me to take cache keys seriously: a marketing team's logged-in user dashboard was being cached and served to other logged-in users. Different users were seeing each other's profile data on the home page for the first three seconds after navigation, then JavaScript would fetch the right data and overwrite the screen. Three seconds of data leakage between accounts is enough to be a security incident. The root cause was a single line of CDN configuration: the cache key did not include the session cookie. From the CDN's perspective, every request to /dashboard was the same request, so the response from the first user got served to the next thousand.
We fixed it in five minutes (add the session cookie to the cache key, set Cache-Control: private on logged-in pages, deploy). The conversation that came out of it took five weeks: how does a CDN actually work, what is in a cache key, what is an origin shield, and why are the defaults dangerous? This article is the version of that conversation I would write for the next team.
My stance: the cache key is the most consequential CDN configuration value, more consequential than TTL or geographic distribution. The TTL controls how long a wrong answer persists; the cache key controls whether the answer is right in the first place. Most CDN incidents I have seen trace back to a cache key that did not include enough request properties to disambiguate users.
What a CDN actually does
A CDN is a network of cache servers distributed close to users. When a user requests an asset (an image, a JavaScript bundle, an HTML page), the request hits the nearest CDN edge node. If the node has a fresh copy in its cache, it serves that copy directly without contacting the origin (your application servers). If not, it fetches from the origin, stores the response, and serves it. Subsequent requests for the same asset within the TTL get served from cache.
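The hit-or-miss flow above fits in a few lines. This is a toy sketch, not how any real CDN is implemented: EdgeCache, fetch_origin, and the injectable clock are names I made up for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Entry:
    body: str
    stored_at: float


class EdgeCache:
    """Toy edge node: serve fresh entries locally, otherwise fetch from origin."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_origin = fetch_origin  # hypothetical origin call
        self._ttl = ttl_seconds
        self._clock = clock                # injectable so the TTL is testable
        self._entries: Dict[str, Entry] = {}

    def get(self, key: str) -> Tuple[str, str]:
        now = self._clock()
        entry = self._entries.get(key)
        # Fresh copy in cache: serve it without contacting the origin.
        if entry is not None and now - entry.stored_at < self._ttl:
            return entry.body, "HIT"
        # Missing or past the TTL: fetch, store, serve.
        body = self._fetch_origin(key)
        self._entries[key] = Entry(body, now)
        return body, "MISS"
```

Everything interesting about correctness hides in the `key` argument: the cache has no idea whether two requests that map to the same key actually deserve the same response.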
The wins are real and well-known: lower latency for users (the cache is geographically close), lower load on origin servers, lower bandwidth costs. The hidden cost is the cache itself: every cached response is correct only if the cache key uniquely identifies the response. A bad cache key means user A's response gets served to user B.
The anatomy of a cache key
A cache key is the string the CDN uses to look up a cached response. By default it is something like host + path + query string. So https://example.com/dashboard?id=42 and https://example.com/dashboard?id=43 get different cache entries because the query strings differ.
What is missing from the default key: cookies (including the session cookie) and request headers such as Accept-Language and User-Agent.
If your response varies by any of these and they are not in the cache key, the CDN will serve the wrong response. The dashboard incident was exactly this: the response varied by session cookie, but the cookie was not in the cache key.
The fix is to either include the cookie in the key (CloudFront's cache policies, Cloudflare's cache rules, Fastly's VCL) or to bypass the cache for cookied requests entirely (Cache-Control: private tells shared caches such as the CDN not to store the response).
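To make the dashboard incident concrete, here is a minimal sketch of a key builder. The "|" delimiter, the cookie name "session", and the function itself are assumptions for illustration; real CDNs build their keys internally and expose only configuration.

```python
from typing import Mapping, Sequence


def cache_key(host: str, path: str, query: str,
              cookies: Mapping[str, str],
              key_cookies: Sequence[str] = ()) -> str:
    """Build a lookup key; only cookies named in key_cookies take part."""
    parts = [host, path, query]
    for name in sorted(key_cookies):
        parts.append(f"{name}={cookies.get(name, '')}")
    return "|".join(parts)


# Two logged-in users (hypothetical cookie name and values).
alice = {"session": "alice-token"}
bob = {"session": "bob-token"}

# Default-style key: the cookie is ignored, so both users collide.
leaky_a = cache_key("example.com", "/dashboard", "", alice)
leaky_b = cache_key("example.com", "/dashboard", "", bob)

# Cookie-aware key: each session gets its own entry.
safe_a = cache_key("example.com", "/dashboard", "", alice, key_cookies=["session"])
safe_b = cache_key("example.com", "/dashboard", "", bob, key_cookies=["session"])
```

Here `leaky_a` and `leaky_b` are the same string, which is the whole bug. Putting the cookie in the key fixes correctness but gives every session its own entry, so for pages like this, bypassing the cache is usually the better of the two fixes.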
A subtler form of the cache-key problem is query parameter normalization. /products?id=42&color=red and /products?color=red&id=42 are the same request semantically, but a naive cache key treats them as different entries. Worse, /products?id=42&utm_source=newsletter and /products?id=42&utm_source=ads are the same response (the marketing tracking parameter does not change the content), but separate cache entries again. Both forms hurt hit rate. The fix is query-parameter normalization in the CDN config: sort the parameters alphabetically, drop irrelevant tracking parameters before computing the key, and treat empty values consistently. Most CDNs support this with a few lines of configuration. Without it, your hit rate can be 30% lower than it should be on URL-heavy traffic.
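A rough sketch of that normalization, using only the Python standard library. The list of tracking parameters is my assumption and would need to match whatever your marketing tooling actually appends; on a real CDN this logic lives in the vendor's configuration rather than in application code.

```python
from urllib.parse import parse_qsl, urlencode

# Assumed set of parameters that never change the response body.
TRACKING_PARAMS = frozenset({
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
})


def normalize_query(query: str) -> str:
    """Return a canonical query string suitable for use in a cache key."""
    # keep_blank_values=True makes "flag" and "flag=" parse identically.
    pairs = parse_qsl(query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if name not in TRACKING_PARAMS]
    # Stable sort by name only, so repeated parameters keep their relative order.
    kept.sort(key=lambda pair: pair[0])
    return urlencode(kept)
```

With this, `id=42&color=red` and `color=red&id=42` both normalize to `color=red&id=42`, and both `utm_source` variants collapse to `id=42`. Sorting by name alone is a deliberate choice: some backends treat `a=1&a=2` differently from `a=2&a=1`, so the sketch leaves repeated parameters in their original order.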
The Vary header: a weak gesture toward correctness
HTTP has a built-in mechanism for this: the Vary response header. Vary: Accept-Language tells caches that the response varies by the Accept-Language request header, so the cache should treat the same URL with different Accept-Language values as different entries.
Most CDNs respect Vary, but with caveats:
- Vary: * (vary by everything) is treated as "do not cache" by most CDNs.
- Vary: Cookie is technically valid but explodes the cache key space (every distinct cookie value becomes a separate entry); most CDNs explicitly do not honor it.
- CDNs often have their own configuration that overrides Vary (CloudFront's "forward headers to origin" setting, Cloudflare's cache rules).
I treat Vary as a hint, not as a contract. The CDN-side configuration is what actually controls the cache key. Vary is useful for downstream caches (browsers, intermediate proxies) but not load-bearing for the CDN itself.
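For reference, this is roughly what a cache that does honor Vary has to do. It is a simplified sketch of the idea, not any vendor's implementation; vary_key and the key layout are hypothetical, and real caches also normalize header values before comparing them.

```python
from typing import Mapping, Optional


def vary_key(base_key: str, vary_header: str,
             request_headers: Mapping[str, str]) -> Optional[str]:
    """Extend base_key with the request headers named in Vary.

    Returns None when the response says Vary: *, meaning "do not reuse".
    """
    names = [name.strip().lower() for name in vary_header.split(",") if name.strip()]
    if "*" in names:
        return None
    # Header names are case-insensitive, so compare them lowercased.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    parts = [base_key]
    for name in sorted(names):
        parts.append(f"{name}={lowered.get(name, '')}")
    return "|".join(parts)
```

The sketch also shows why Vary: Cookie is so expensive: the raw header value goes into the key, so every distinct cookie string becomes its own entry.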
TTL choices and what they actually mean
The TTL (max-age) is the time after which a cached response is stale. Once an entry is stale, the CDN can either serve it anyway (flagged as stale) or revalidate it against the origin with a conditional request (If-None-Match, If-Modified-Since).
Hashed filenames are the trick that makes long TTLs safe for static assets. A bundle named app.a3f2c1b9.js is content-addressed: changing the content changes the filename, so a long TTL on the old filename is harmless because nobody requests the old filename anymore. Build tools (webpack, vite, esbuild) support this, either by default or with a hashed output-filename pattern, and it is a major reason single-page apps can ship aggressive caching.
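The naming scheme itself is simple. Here is a sketch of the idea; the choice of SHA-256 and an eight-character digest are my assumptions, and each build tool has its own hash function and filename pattern.

```python
import hashlib


def hashed_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content digest before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension: append the digest at the end.
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"
```

Calling `hashed_name("app.js", bundle_bytes)` yields a name of the form `app.<digest>.js`. Identical bytes always produce the same name, and any change to the bytes produces a new one, which is what lets the old URL keep its long TTL safely.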
For HTML, TTL is a trade-off between freshness and origin load. Five minutes is a good default for marketing pages; longer than that and editorial changes feel slow to propagate. Less than that and you lose most of the CDN benefit.
For API responses, my default is to not cache them at all. The exceptions are public endpoints (a public catalog API) where the response is the same for every user and can tolerate seconds-of-staleness. Anything user-specific should be Cache-Control: private (cache in the user's browser only, not in shared caches).
Origin shields: the cache layer behind the cache
A common CDN feature is the origin shield: a designated cache layer that sits between the edge nodes and your origin. Every cache miss from any edge goes through the shield. The shield caches the origin's response and serves it back to the edge that asked for it; subsequent misses from other edges hit the shield instead of the origin.
The shield's job is to absorb cache miss traffic. Without a shield, a cold object is fetched from origin once per edge node. With twenty edge nodes and one cold object, that is twenty origin hits. With a shield, it is one origin hit. For high-traffic sites, this is the difference between origin handling 10,000 RPS and 200 RPS during a cache flush.
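The twenty-versus-one arithmetic can be checked with a toy model. Layer and origin_hits are illustrative names, and the model ignores TTLs, concurrency, and geography; it only counts how many times the origin is asked for one cold object.

```python
from typing import Callable, Dict


class Layer:
    """A cache layer that asks its upstream only on a miss."""

    def __init__(self, upstream: Callable[[str], str]) -> None:
        self._upstream = upstream
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._store:
            self._store[key] = self._upstream(key)
        return self._store[key]


def origin_hits(edge_count: int, use_shield: bool) -> int:
    """Count origin fetches when every edge requests the same cold object."""
    hits = 0

    def origin(key: str) -> str:
        nonlocal hits
        hits += 1
        return f"content of {key}"

    # With a shield, all edges share one upstream cache in front of the origin.
    upstream = Layer(origin).get if use_shield else origin
    edges = [Layer(upstream) for _ in range(edge_count)]
    for edge in edges:
        edge.get("/cold-object")
    return hits
```

`origin_hits(20, use_shield=False)` returns 20 and `origin_hits(20, use_shield=True)` returns 1, matching the numbers above.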
The trade-off is that the shield adds a hop for cache-miss requests, increasing miss latency by a few milliseconds. For mostly-cache-hit traffic, that latency is invisible (the hit path does not go through the shield). For miss-heavy traffic, the shield is paying for itself by reducing origin load.
Most large CDNs offer this as a configuration option (CloudFront's Origin Shield, Cloudflare's Tiered Cache, Fastly's shielding). Enabling it is a two-line config change with a real win for any site that has more than a handful of edge regions. I would enable it by default unless I had a specific reason not to.
Five ways CDN configs break
Five failure modes I have seen:
- Cache key missing a relevant request property. The dashboard incident is the canonical case. Audit your cache rules: for every endpoint that returns user-specific data, the cache key must include something user-identifying or the endpoint must be marked uncacheable.
- Cache key including an irrelevant request property. If the cache key includes the User-Agent header, every browser version gets a separate cache entry. Hit rate plummets, origin load rises. The fix is to normalize User-Agent into broader buckets (mobile vs desktop, by major version) or omit it from the key.
- TTL too long for the freshness requirement. A marketing page with a one-day TTL feels slow when content is updated; users see stale data for up to a day. The fix is shorter TTL plus an explicit cache invalidation API call on content publish.
- Cache invalidation that does not actually purge. Most CDNs offer a purge API; some are eventually consistent and take minutes to propagate. If you publish content and immediately tell the CDN to purge, the purge may not be effective for a few minutes. Plan for it; do not assume purges are instant.
- Cookies leaking through the cache. Default cache configurations often forward all cookies but do not include them in the cache key. This means the cached response includes one user's cookie in the Set-Cookie header, served to other users. The fix is to strip cookies from cached responses (Cache-Control: no-store for any response that sets cookies) or to bypass the cache for cookied requests.
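The last failure mode lends itself to a guard you can run in tests against your own origin responses. This is a deliberately conservative sketch of a policy check, not a description of what any CDN does by default; shared_cache_may_store is a hypothetical helper.

```python
from typing import Mapping


def shared_cache_may_store(response_headers: Mapping[str, str]) -> bool:
    """Conservative check: should a shared cache keep this response?"""
    headers = {name.lower(): value for name, value in response_headers.items()}
    # A response that sets a cookie is treated as user-specific.
    if "set-cookie" in headers:
        return False
    directives = {
        part.strip().lower().split("=", 1)[0]
        for part in headers.get("cache-control", "").split(",")
    }
    # private and no-store both forbid storage in shared caches.
    return not ({"private", "no-store"} & directives)
```

Running a check like this over every route in an integration test catches the "cookie in a cached response" case before the CDN does.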
One more failure mode worth calling out: the cache stampede. A popular asset's TTL expires, and a thousand concurrent requests all miss the cache simultaneously. Without any protection, the CDN forwards a thousand parallel fetches to the origin. The origin sees a sudden 1000x spike and may fall over. The standard mitigation is request coalescing (some CDNs call it "request collapsing" or "single connection"): when many concurrent requests arrive for the same uncached URL, only one is forwarded to the origin and the rest wait for the response. Most modern CDNs do this automatically; self-managed caches may need it switched on explicitly (nginx, for example, only collapses concurrent cache fills when proxy_cache_lock is enabled). Verify your CDN's behavior under stampede before you find out the hard way.
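Request coalescing is easier to reason about with code in front of you. Below is a minimal thread-based sketch of the technique; CoalescingFetcher and _Flight are names I invented, and a real cache would also store the result, which this sketch leaves out on purpose to isolate the coalescing step.

```python
import threading
from typing import Callable, Dict, Optional


class _Flight:
    """State shared by everyone waiting on one in-progress origin fetch."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[str] = None
        self.error: Optional[BaseException] = None
        self.waiters = 0  # followers collapsed onto this fetch


class CoalescingFetcher:
    def __init__(self, fetch_origin: Callable[[str], str]) -> None:
        self._fetch_origin = fetch_origin  # hypothetical origin call
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Flight] = {}

    def get(self, key: str) -> str:
        with self._lock:
            flight = self._in_flight.get(key)
            leader = flight is None
            if leader:
                flight = _Flight()
                self._in_flight[key] = flight
            else:
                flight.waiters += 1
        if leader:
            # Only the first request for this key goes to the origin.
            try:
                flight.result = self._fetch_origin(key)
            except Exception as exc:
                flight.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                flight.done.set()
        else:
            # Everyone else waits for the leader's answer.
            flight.done.wait()
        if flight.error is not None:
            raise flight.error
        return flight.result
```

Note that an origin error is shared with every waiter too. That is usually what you want during a stampede, since retrying a thousand times against a failing origin makes things worse.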
Cache-Control directives that actually matter
The Cache-Control response header is how the origin tells the CDN (and browsers) how to cache. The directives I use most are public, private, no-store, max-age, immutable, and stale-while-revalidate.
stale-while-revalidate is underused and worth highlighting. It tells the CDN: "if the cached entry is stale, serve it anyway and revalidate in the background." The user gets a fast response (no waiting for revalidation); the cache gets refreshed for the next user. This is the mechanism behind the snappy feel of well-tuned content sites.
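The decision a cache makes for each stored entry can be sketched as a three-way classification. The parser is deliberately simplified (it ignores quoted values and directives such as s-maxage), and the state names are mine, not terms from any specification.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[int]]:
    """Parse 'a, b=10' into {'a': None, 'b': 10}. Simplified on purpose."""
    directives: Dict[str, Optional[int]] = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name] = int(arg) if arg.isdigit() else None
    return directives


def freshness(cache_control: str, age_seconds: int) -> str:
    """Classify a cached entry by its age."""
    directives = parse_cache_control(cache_control)
    max_age = directives.get("max-age") or 0
    swr = directives.get("stale-while-revalidate") or 0
    if age_seconds < max_age:
        return "fresh"            # serve from cache, no origin contact
    if age_seconds < max_age + swr:
        return "stale-revalidate" # serve stale now, refresh in background
    return "expired"              # must go to origin before answering
```

For `public, max-age=300, stale-while-revalidate=86400`, an entry that is 100 seconds old is fresh, one that is an hour old is served stale while a background refresh runs, and one older than 86,700 seconds has to wait for the origin.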
What I would set up for a fresh site
If I were configuring a CDN for a new site today:
- Static assets with hashed filenames: Cache-Control: public, max-age=31536000, immutable. One year, cached aggressively.
- HTML for marketing pages: Cache-Control: public, max-age=300, stale-while-revalidate=86400. Five-minute fresh window, day-long stale window during which the CDN serves stale and revalidates in the background.
- API responses (public): Cache-Control: public, max-age=30, stale-while-revalidate=600 if the data tolerates 30-second staleness; otherwise no-store.
- API responses (private): Cache-Control: private, no-store. Do not let shared caches near user-specific data.
- Origin shield: enabled.
- Cache key: host + path + normalized-query, with explicit per-route overrides where headers or cookies matter.
- Purge API: integrated into the content publishing pipeline.
That configuration takes about an hour to set up on most CDNs and prevents most of the incidents I described above.
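On the origin side, the header part of that setup can be expressed as one small function. The /api/ prefix, the hashed-asset regex, and the authenticated flag are assumptions about a hypothetical app; the header values are the ones listed above.

```python
import re

# Assumed convention: hashed assets look like name.<8+ hex chars>.<ext>
HASHED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_control_for(path: str, authenticated: bool) -> str:
    """Pick a Cache-Control value for a response (illustrative policy)."""
    # Content-addressed assets are identical for every user.
    if HASHED_ASSET.search(path):
        return "public, max-age=31536000, immutable"
    # Anything tied to a logged-in user stays out of shared caches.
    if authenticated:
        return "private, no-store"
    # Public API: short fresh window, modest stale window.
    if path.startswith("/api/"):
        return "public, max-age=30, stale-while-revalidate=600"
    # Public HTML: five minutes fresh, a day of stale-while-revalidate.
    return "public, max-age=300, stale-while-revalidate=86400"
```

Keeping the policy in one function makes it reviewable: a change that widens the cacheable surface shows up as a diff to a few lines rather than as scattered header tweaks.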
A position to defend
CDNs are sold as performance products and they deliver on that, but the configuration surface that determines correctness (cache keys, Cache-Control, purge semantics) is the part teams skip. You can run a CDN with default settings and have it work well for static content; you cannot run a CDN with default settings and have it work safely for dynamic content with cookies or session data. Any team adding a CDN to an authenticated app should treat the cache configuration with the same review rigor as a database migration: incorrect changes have user-visible blast radius and "we'll fix it later" is not a real plan when "later" is after a data leak. Start with private, no-store for everything authenticated and grow the cacheable surface deliberately, not the other way around.
