Most teams I have joined treated feature flags as a single concept and then drowned in flag debt within eighteen months. Once we started naming the patterns separately, treating each one's lifecycle differently, and writing the cleanup step into the original ticket, the system stopped feeling like a graveyard and started feeling like infrastructure.
This article covers the three flag patterns I keep reusing across teams: the release flag, the kill switch, and the experiment flag. They sound interchangeable. They are not. Each has a different lifetime, a different rollback rule, a different audit requirement, and a different cleanup discipline.
Why "feature flag" is too generic a word
A feature flag, as a primitive, is a runtime conditional whose value comes from somewhere outside the deployed binary. That's it. Everything else (who controls it, how long it lives, whether you can lose it without noticing) depends on the pattern. When teams treat the primitive as the entire concept, three failure modes show up:
- Flags that were meant to last a week stay in the codebase for two years, branching code paths that nobody remembers the purpose of.
- Kill switches that were rarely tested fail when actually flipped because the "off" path has rotted.
- Experiment flags pollute the production code with branches whose owner has long since left the company.
None of those are feature-flag failures. They are pattern-confusion failures. Each pattern needs its own playbook.
Pattern 1: the release flag (short-lived, code-owned)
A release flag wraps a new feature so I can ship the code dark, turn it on for a small percentage, then for everyone, and then delete the flag.
Lifetime: one to four weeks. The flag exists for the time between merging the feature and being confident in it. After that, both the flag and the old code path are deleted.
The shape, in the code I would actually ship:
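As a minimal sketch of that shape, assuming a hypothetical flag lookup: the names `ReleaseFlag`, `isEnabled`, `newCheckout`, and `legacyCheckout` are illustrative and do not come from any particular flag platform. The flag is default-off when it cannot be read, internal users can be switched on first, and the percentage ramp hashes the user id so a given user stays on one side of the ramp.

```typescript
// Illustrative sketch only: these names are assumptions, not a real flag platform's API.
interface User {
  id: string;
  isInternal: boolean; // employees and beta testers
}

interface ReleaseFlag {
  internal: boolean; // when true, internal users always get the new path
  percent: number;   // 0..100, share of all users on the new path
}

// Simple polynomial string hash, kept in the unsigned 32-bit range.
// A real system would likely want a better-distributed hash.
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function isEnabled(name: string, flag: ReleaseFlag | undefined, user: User): boolean {
  if (flag === undefined) return false; // default-off: production runs the old path
  if (flag.internal && user.isInternal) return true;
  return hashString(name + ":" + user.id) % 100 < flag.percent;
}

function legacyCheckout(user: User): string {
  return "legacy checkout for " + user.id;
}

function newCheckout(user: User): string {
  return "new checkout for " + user.id;
}

function checkout(flag: ReleaseFlag | undefined, user: User): string {
  if (isEnabled("new-checkout", flag, user)) {
    return newCheckout(user);
  }
  return legacyCheckout(user); // deleted together with the flag in the cleanup PR
}
```

Keeping the conditional in exactly one place is a deliberate choice in this sketch: the cleanup PR then only has to remove `isEnabled`'s call site and the legacy function.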
The rollout sequence I follow for every release flag:
1. Ship the code with the flag default-off. Production runs the old path. The new code is exercised by tests in CI.
2. Turn the flag on for internal users (employees, beta testers). Watch error rates and latency on the new path for at least 24 hours.
3. Ramp to 1%, 5%, 25%, 50%, 100% over a few days. At each step, check the relevant metrics on the new path against the old.
4. After the flag has been at 100% for a week with no regressions, delete the flag and the old code path in a single PR.
Step 4 is the one that fails. The feature is launched, the flag works, the team has moved on, and the conditional sits in the code forever. The fix that has worked for me: the original feature ticket is not closed until the cleanup PR is merged. The team treats the cleanup as part of the feature, not as a follow-up.
The rollback rule for release flags is the simplest of the three patterns: turn the flag off. The old path is still in the code, still tested, still working. I have rolled back release flags twice in production, both times with sub-minute mean time to recovery. That is the entire reason this pattern exists.
Pattern 2: the kill switch (long-lived, ops-owned)
A kill switch is a flag that lets me disable a feature, an integration, or a code path without a deploy, when something goes wrong in production.
Lifetime: forever. Or at least, as long as the underlying feature exists. Kill switches are not for ramping new code; they are for damping operational fires.
The canonical example is integration with a flaky third party. If the upstream is timing out and degrading my service, I want to be able to flip the flag, fall back to a cached or default value, and not be paged at 3am while the upstream sorts itself out.
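A minimal sketch of that fallback, assuming a hypothetical `FlagReader` that returns `undefined` when the flag service cannot be reached. The switch name, the `getRate` function, and the cached value are all illustrative; a real integration would almost certainly be asynchronous.

```typescript
// Illustrative sketch: `FlagReader` stands in for a flag-service lookup; all names are assumptions.
type FlagReader = (name: string) => boolean | undefined; // undefined: flag service unreachable

interface KillSwitch {
  name: string;
  default: boolean; // explicit fail-safe value, used when the flag cannot be read
}

const ratesSwitch: KillSwitch = { name: "exchange-rates-upstream", default: true };

function isOn(sw: KillSwitch, readFlag: FlagReader): boolean {
  const value = readFlag(sw.name);
  return value === undefined ? sw.default : value;
}

function getRate(readFlag: FlagReader, fetchUpstream: () => number, cachedRate: number): number {
  if (!isOn(ratesSwitch, readFlag)) {
    return cachedRate; // off path: serve the cached value and skip the third party
  }
  return fetchUpstream();
}
```

Flipping the switch off changes behavior on the next call, with no deploy involved.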
Three disciplines that make kill switches actually work in an incident:
The off path is exercised regularly. I run a weekly chaos test that flips each kill switch off in staging, runs the smoke suite, and flips it back on. If the off path has rotted (the code no longer compiles, the fallback returns the wrong shape, a downstream consumer chokes on the missing field), I find out on a Tuesday afternoon, not at 3am during an actual incident.
The default value is fail-safe. If the flag service itself is down and my service cannot read the flag, what does the code do? For kill switches, it should do whatever degrades least. Defaulting to "feature on" is usually right (the third party is probably fine), but for some kill switches (particularly anything that protects from runaway cost) the safe default is "feature off". The default: true in the code above is explicit; I would not rely on the flag library's implicit default.
The on/off transition is logged and audited. Every flip of a kill switch is a high-signal event: who flipped it, when, why, what alert was firing. Most flag platforms log this for free; if yours does not, wrap it. The audit log is the thing the postmortem cites three weeks later.
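The first and third disciplines can be sketched together, under stated assumptions: `SwitchStore`, `FlipRecord`, `flipWithAudit`, and `exerciseOffPaths` are hypothetical names, the audit log is a plain array, and the smoke suite is reduced to a function that returns pass or fail. The point is only that every flip goes through one wrapper, including the flips made by the weekly test.

```typescript
// Illustrative sketch: the store, record shape, and function names are assumptions.
interface SwitchStore {
  get(name: string): boolean;
  set(name: string, on: boolean): void;
}

interface FlipRecord {
  flag: string;
  to: boolean;
  actor: string;        // who flipped it
  at: string;           // when, as an ISO 8601 timestamp
  reason: string;       // why
  alert: string | null; // which alert was firing, if any
}

// Wraps every flip so the audit record is written before the value changes.
function flipWithAudit(
  store: SwitchStore,
  auditLog: FlipRecord[],
  flag: string,
  to: boolean,
  actor: string,
  reason: string,
  alert: string | null
): void {
  auditLog.push({ flag: flag, to: to, actor: actor, at: new Date().toISOString(), reason: reason, alert: alert });
  store.set(flag, to);
}

// Staging job: flip each switch off, run the smoke suite, restore the previous value.
// Returns the names of the switches whose off path failed.
function exerciseOffPaths(
  names: string[],
  store: SwitchStore,
  auditLog: FlipRecord[],
  runSmokeSuite: () => boolean
): string[] {
  const rotted: string[] = [];
  for (const name of names) {
    const wasOn = store.get(name);
    flipWithAudit(store, auditLog, name, false, "weekly-chaos-test", "exercise off path", null);
    try {
      if (!runSmokeSuite()) rotted.push(name);
    } catch (err) {
      rotted.push(name); // a thrown error also counts as a rotted off path
    } finally {
      flipWithAudit(store, auditLog, name, wasOn, "weekly-chaos-test", "restore after test", null);
    }
  }
  return rotted;
}
```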
The rollback rule for a kill switch is structural: it is the rollback. It is the thing I reach for in an incident; rolling it back means flipping it the other way.
Pattern 3: the experiment flag (lifetime equal to the experiment)
An experiment flag splits traffic between two or more variants of a feature so I can compare metrics. Typical uses are an A/B test, a multi-armed bandit, or a gradual personalization rollout. The flag is owned by the data team or the product team running the test.
Lifetime: equal to the test, plus one cleanup PR. Most A/B tests run for two to six weeks. After the winning variant is decided, the losing branches and the flag are deleted in the cleanup PR.
Two non-obvious things about experiment flags:
Bucket by stable identifier, never by request. If I bucket by userId, the same user always gets the same variant on every page they visit, every session, every device. If I bucket by request, the user sees variant A on one page and variant B on the next, which both invalidates the experiment and confuses the user. The bucketing function is usually a hash of userId + experimentName modulo 100, mapped to variants by their assigned weights.
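The bucketing rule above can be sketched as follows. The hash is a simple polynomial string hash chosen to keep the example self-contained, not a recommendation; `Variant` and `assignVariant` are hypothetical names, and the weights are assumed to be percentages that sum to 100.

```typescript
// Illustrative sketch: names and the choice of hash are assumptions.
interface Variant {
  name: string;
  weight: number; // percentage of traffic; all weights are assumed to sum to 100
}

// Simple polynomial string hash, kept in the unsigned 32-bit range.
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (hash * 31 + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function assignVariant(userId: string, experimentName: string, variants: Variant[]): string {
  // Same user and experiment always hash to the same bucket in 0..99.
  const bucket = hashString(userId + ":" + experimentName) % 100;
  let upperBound = 0;
  for (const variant of variants) {
    upperBound += variant.weight;
    if (bucket < upperBound) return variant.name;
  }
  throw new Error("variant weights must sum to 100");
}
```

Including the experiment name in the hash input means a user's bucket in one experiment does not determine their bucket in another.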
Experiment flags are not kill switches. I have seen teams use the experiment-flag system for incident response ("the new variant is broken, drop its weight to zero"). It works, but it muddles the audit. After the incident, was the experiment cancelled or did it complete? Was the weight change part of the test or the rollback? The answer is usually "both, and we lost track". Use a kill switch in front of the experiment flag if you need an incident-response off button.
The cleanup rule for experiment flags is the strictest of the three: when the test ends, the losing variants and the flag itself are deleted before the next test in the same area starts. The thing that goes wrong here is overlapping experiments on the same surface: the team decides to keep the losing variant "in case we want to revisit", a new experiment lands on the same code, and now the page is testing four things at once and nobody can interpret the results.
A side-by-side that lives on my wiki
The value of the table is not the rows. It is forcing yourself to pick one column for every flag you create. If you cannot say which pattern a flag fits, the flag should not exist yet.
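One way to make that forcing function concrete is a typed lookup that refuses any flag without a pattern label. The entries below are collected from the three pattern sections above; the field names and the `playbookFor` helper are assumptions made for this sketch.

```typescript
// Illustrative sketch: field names and the helper are assumptions; the entries
// summarize the three pattern sections of this article.
type FlagPattern = "release" | "kill-switch" | "experiment";

interface Playbook {
  lifetime: string;
  owner: string;
  rollback: string;
  cleanup: string;
}

const PLAYBOOK: Record<FlagPattern, Playbook> = {
  "release": {
    lifetime: "one to four weeks",
    owner: "code-owned",
    rollback: "turn the flag off; the old path is still in the code",
    cleanup: "delete the flag and the old path in a single PR after a week at 100%",
  },
  "kill-switch": {
    lifetime: "as long as the underlying feature exists",
    owner: "ops-owned",
    rollback: "the switch is the rollback; flip it the other way",
    cleanup: "none while the feature exists; the off path is exercised weekly",
  },
  "experiment": {
    lifetime: "the test, plus one cleanup PR",
    owner: "the data or product team running the test",
    rollback: "a kill switch in front of the experiment, not a weight change",
    cleanup: "delete the losing variants and the flag before the next test in the same area",
  },
};

// A flag that does not fit one of the three patterns is rejected outright.
function playbookFor(label: string): Playbook {
  if (label === "release" || label === "kill-switch" || label === "experiment") {
    return PLAYBOOK[label];
  }
  throw new Error("flag has no recognized pattern: " + label);
}
```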
The cleanup discipline that finally worked
Flag debt is the single biggest problem with this whole system. Three rules I have settled on for keeping it under control:
Rule 1: every flag has an expiry date written into its definition. Most flag platforms support this metadata. Release flags get 30 days, experiment flags get the test duration plus 14 days for cleanup, kill switches get "never" but require a justification field. A weekly job lists flags past their expiry; the originating team has one sprint to clean them up or extend the date with a reason.
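Rule 1 can be sketched as a flag definition plus the check the weekly job would run. The `FlagDefinition` shape and the function names are hypothetical; the 30-day and 14-day figures are the ones stated in the rule, and "never" is modeled as `null`.

```typescript
// Illustrative sketch: the definition shape and function names are assumptions.
type FlagPattern = "release" | "kill-switch" | "experiment";

interface FlagDefinition {
  name: string;
  pattern: FlagPattern;
  owner: string;
  expires: Date | null;         // null means "never" (kill switches only)
  justification: string | null; // required when expires is null
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Release flags get 30 days; experiment flags get the test duration plus 14 days;
// kill switches never expire.
function defaultExpiry(pattern: FlagPattern, created: Date, testDays: number = 0): Date | null {
  if (pattern === "release") return new Date(created.getTime() + 30 * DAY_MS);
  if (pattern === "experiment") return new Date(created.getTime() + (testDays + 14) * DAY_MS);
  return null;
}

// Rejects definitions that skip the expiry or the justification field.
function validateFlag(flag: FlagDefinition): void {
  if (flag.expires !== null) return;
  if (flag.pattern !== "kill-switch") {
    throw new Error(flag.name + ": only kill switches may have no expiry");
  }
  if (flag.justification === null || flag.justification.trim() === "") {
    throw new Error(flag.name + ": a kill switch with no expiry needs a justification");
  }
}

// The weekly job: list every flag whose expiry date has passed.
function pastExpiry(flags: FlagDefinition[], today: Date): FlagDefinition[] {
  return flags.filter(function (flag) {
    return flag.expires !== null && flag.expires.getTime() < today.getTime();
  });
}
```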
Rule 2: the original feature ticket is not closed until the cleanup PR is merged. Project management cares about the ticket-closing metric; this hooks the cleanup work into a metric the team already cares about. I have seen teams flip from "flag debt is a thing we mean to clean up" to "flag debt is part of definition of done" with literally just this rule.
Rule 3: a quarterly flag census. The owning team for each service walks through its flags, classifies each as release, kill switch, or experiment, and either confirms its purpose or schedules the cleanup. The first census is painful. The fourth takes 20 minutes.
Those three rules cut my last team's active-flag count from 240 to 60 over six months, with no functionality lost.
The classification I would force every team to do
If I were to leave a single sentence on the wall of every team that uses feature flags, it would be: every flag you create needs a pattern label, an owner, and an expiry. Without the label, you do not know how to operate the flag. Without the owner, you do not know who reads the audit log when it flips. Without the expiry, the flag becomes a permanent fork in the code that survives reorgs. The tooling is easy to add (most flag platforms support all three fields out of the box; you just have to make them required). The discipline is the actual product. Once those three fields are mandatory, the graveyard problem mostly solves itself.
