CI/CD Pipelines: Stop Letting Them Rot

The maintenance habits that have kept my pipelines fast and trusted for years, the seven categories of rot I have actually seen, and the budget I run so the pipeline is treated as production code.

By @sanjayward

January 21, 2026

Updated May 20, 2026

Every team I have joined had a CI pipeline that looked fine on the surface and was rotting underneath. Tests took 14 minutes to run when they used to take 4. The deploy job had a flaky retry loop nobody had touched in a year. Half the steps emitted warnings that everyone scrolled past. The shape is universal: pipelines start tidy and decay slowly, and the decay is invisible until somebody new joins and asks why a build takes longer than the actual work.

This article describes the maintenance discipline that has kept my pipelines fast and trusted across several teams: the seven categories of rot I keep finding, the lightweight habits that prevent each, and a small monthly budget so the pipeline is treated like the production code it actually is.

Why pipelines rot

A CI pipeline is code that runs on a schedule, has no real on-call, and serves an audience (engineers waiting on green) that has limited leverage to fix it. That combination produces the same kind of decay you see in any code that nobody owns: small inefficiencies accumulate, flakiness gets papered over with retries, and after eighteen months the pipeline is the slowest, least-trusted part of the engineering system.

The fix is not heroic rewrites. It is a small set of habits applied weekly so the rot never builds up.

Seven categories of rot I keep finding

Rot 1: warnings everyone has learned to ignore

The pipeline output has 80 lines of yellow text. None of it is acted on. New warnings get added because nobody notices one more when there are 80 already. After a year, a real warning (a deprecated API that will break next quarter) is invisible in the pile.

Fix: warnings are errors or they are deleted. I run a quarterly pass where every warning either becomes an error (forcing a fix or an explicit allow-list) or gets suppressed at the source with a one-line justification. The pipeline output goes from 80 lines to about 8, and the remaining ones get attention.
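A minimal sketch of the "error or allow-listed" gate, assuming the pipeline log is available as lines of text and that warnings contain the word "warning". The `ALLOWED` table, its pattern, and the function name are hypothetical; real tools usually have their own warnings-as-errors switch, which is preferable where it exists.

```python
import re

# Hypothetical allow-list: pattern -> one-line justification.
ALLOWED = {
    r"DeprecationWarning: legacy_auth": "removed with the auth migration",
}

# Assumption: any line containing "warning" (any case) counts as a warning.
WARNING_LINE = re.compile(r"warning", re.IGNORECASE)


def unexpected_warnings(log_lines, allowed=ALLOWED):
    """Return warning lines that match no allow-listed pattern."""
    offenders = []
    for line in log_lines:
        if not WARNING_LINE.search(line):
            continue  # not a warning at all
        if any(re.search(pattern, line) for pattern in allowed):
            continue  # explicitly allowed, with a written justification
        offenders.append(line.rstrip("\n"))
    return offenders
```

A CI step could fail the build whenever `unexpected_warnings(...)` returns a non-empty list, which forces every new warning to be either fixed or added to the allow-list with a reason.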

Rot 2: silent test slowdowns

The test suite ran in 4 minutes a year ago. It runs in 14 today. Nobody noticed; it crept up by 30 seconds a month. The slowdown is almost always a small number of tests that did something expensive (spun up a real DB, called a real API, slept).

Fix: a ranked list of the 20 slowest tests with their durations, posted to a Slack channel weekly. If a test is in the top 20 and takes over 5 seconds, it is either fixed or deliberately tagged as slow and moved into a separate, slower-running suite. The weekly habit is what makes this work; once-a-quarter audits are too late.

A tiny script (the actual one I have shipped on three teams now) that computes the top 20 from a JUnit XML file:

# top-20 slowest tests from the last CI run
xmlstarlet sel -t \
  -m '//testcase' \
  -v '@time' -o ' ' -v '@classname' -o ' ' -v '@name' -n \
  test-results/*.xml \
  | sort -rn | head -20

Not fancy. Catches 80% of the slowdowns.
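If you want the 5-second rule applied automatically rather than by eye, the same idea can be sketched in Python. This assumes JUnit-style XML with `time`, `classname`, and `name` attributes on each `testcase`; the function names and the `tagged_slow` set are hypothetical.

```python
import xml.etree.ElementTree as ET

SLOW_THRESHOLD_S = 5.0  # the "over 5 seconds" rule from the text


def slowest_tests(junit_xml, top_n=20):
    """Return (seconds, test id) pairs for the top_n slowest test cases."""
    root = ET.fromstring(junit_xml)
    timings = []
    for case in root.iter("testcase"):
        # A missing or empty time attribute is treated as 0 seconds.
        seconds = float(case.get("time") or 0.0)
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        timings.append((seconds, test_id))
    timings.sort(reverse=True)
    return timings[:top_n]


def needs_attention(junit_xml, tagged_slow=frozenset(), top_n=20):
    """Top-N tests over the threshold that are not deliberately tagged slow."""
    return [
        test_id
        for seconds, test_id in slowest_tests(junit_xml, top_n)
        if seconds > SLOW_THRESHOLD_S and test_id not in tagged_slow
    ]
```

The output of `needs_attention` is the list worth posting: everything else in the top 20 is either fast enough or already acknowledged as slow.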

Rot 3: flaky tests with auto-retries

A test fails 3% of the time. Somebody adds a retry. Two independent 3% failures in a row happen on roughly 0.09% of runs, so the build is almost always green. The flakiness is now permanent, eats CI budget, and hides real regressions when the underlying issue gets worse.

Fix: the retry counter is a metric. Every retry increments a counter labelled by test name. The top 5 retried tests are quarantined every week (moved to a flaky suite that runs nightly, not on every PR), and the team that owns each test is responsible for either fixing the flake or deleting the test. The rule that has worked for me: a retry is a temporary mitigation, not a permanent fix.

Most flakes turn out to be a leaked async side effect, a test that depends on the system clock, or an order-dependent assertion. Once or twice a year a flake turns out to be a real bug masked by lucky scheduling, and the retry was actively dangerous.
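The weekly quarantine pick can be sketched in a few lines, assuming the CI system can export one record per retry containing the test name. The function name and the event format are hypothetical.

```python
from collections import Counter


def quarantine_candidates(retry_events, top_n=5):
    """Rank tests by retry count.

    retry_events: iterable of test names, one entry per retry.
    Returns (test name, retry count) pairs, most-retried first.
    """
    counts = Counter(retry_events)
    # Sort by count descending, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

Feeding it a week of retry events gives the list to move into the nightly flaky suite, along with the counts that justify the move to the owning team.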

Rot 4: caches that miss more than they hit

The pipeline has a cache step. Cache misses every run. Nobody noticed because the build still works; it just takes 4 minutes longer than it should.

Fix: log the cache hit rate explicitly. GitHub Actions, GitLab, and Buildkite all report cache restores somewhere in the job output (on GitHub Actions, actions/cache sets a cache-hit output on an exact key match); if yours does not, wrap it.

# the cache key shape that tends to actually hit
- uses: actions/cache@v4
  with:
    path: |
      ~/.yarn/cache
      .yarn/cache
    key: yarn-${{ runner.os }}-${{ hashFiles('yarn.lock') }}
    restore-keys: |
      yarn-${{ runner.os }}-

The restore-keys fallback is the part most teams get wrong. Without it, every change to yarn.lock (which is most weeks) starts from an empty cache; with it, the run restores the most recent cache whose key matches the prefix and only downloads the packages that changed. A lockfile change goes from a full rebuild to a partial restore, so every run starts warm.
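To turn "log the cache hit rate" into a number, one option is to record one outcome per run and aggregate. This sketch assumes you have already classified each run as an exact key match, a partial restore through a fallback key, or a miss; the labels and the function name are hypothetical.

```python
OUTCOMES = ("exact", "partial", "miss")  # hypothetical labels


def cache_hit_rate(outcomes):
    """Return the fraction of runs per outcome.

    outcomes: iterable of "exact", "partial", or "miss", one per run.
    """
    outcomes = list(outcomes)
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    if not outcomes:
        return {label: 0.0 for label in OUTCOMES}
    total = len(outcomes)
    return {label: outcomes.count(label) / total for label in OUTCOMES}
```

Tracking exact and partial separately matters here: a high partial rate means the fallback is doing its job, while a high miss rate means the key shape is wrong.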

Rot 5: parallel jobs that are not actually parallel

The pipeline says it runs four shards in parallel. In practice, three shards finish in 45 seconds and one runs for 8 minutes. The slow shard always contains the same handful of integration tests. Total wall-clock time is the slow shard's time, not the average.

Fix: shard by historical runtime, not by file count. Most CI platforms support this; the tests are bucketed into shards of equal expected duration based on the last successful run's timing. The slow shard goes from 8 minutes to about 2.5, which is what the arithmetic predicts: three shards at 45 seconds plus one at 8 minutes is roughly 10 minutes of work, spread evenly over four shards.

If your platform does not support runtime-balanced sharding natively, a simple approach that almost works: sort tests by their last duration, descending, and deal them round-robin into N shards. The longest test goes to shard 0, the second longest to shard 1, and so on. The shards end up roughly balanced.
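A sketch of that round-robin split, assuming you have a mapping from test name to its last recorded duration. The function names are hypothetical.

```python
def shard_round_robin(durations, n_shards):
    """Deal tests into shards, longest first, round-robin.

    durations: dict of test name -> last duration in seconds.
    Returns a list of n_shards lists of test names.
    """
    # Longest first; break ties by name so the split is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    shards = [[] for _ in range(n_shards)]
    for index, (name, _seconds) in enumerate(ordered):
        shards[index % n_shards].append(name)
    return shards


def shard_totals(durations, shards):
    """Expected wall-clock seconds per shard."""
    return [sum(durations[name] for name in shard) for shard in shards]
```

Round-robin leans slightly toward the low-numbered shards because they always receive the longer test of each round. If the imbalance matters, a common refinement is to place each test on whichever shard currently has the smallest total instead.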

Rot 6: pipeline definition that is one giant file

The single .github/workflows/main.yml started at 80 lines and now is 1100 lines, with 14 jobs and conditional branches based on path filters that no one fully understands. Reviewing a pipeline change has become "hope it works in CI" because reading the file in full is impractical.

Fix: split the pipeline into composite actions or reusable workflows, one per concern. Build, test, lint, deploy each become small pieces that can be reviewed in isolation. The top-level file is short and orchestrates the composites. The first time you split a 1000-line CI file, it feels like overkill; six months later, when you need to change the test command, the diff is 4 lines instead of 80.

Rot 7: secrets and credentials with no rotation cadence

The deploy key has been in the CI vault for three years. The last person to rotate it left two years ago. Nobody knows where else it is referenced. The day it leaks, it has access to everything.

Fix: every secret has an owner, an expiry, and a documented rotation procedure. Rotation happens on a schedule (annually for most things, more frequently for high-risk credentials), and the rotation procedure is written down before the secret is first added. The vault becomes a registry of credentials with metadata, not a junk drawer of unsorted strings. The one-week project to do this on a stale codebase is one of the highest-leverage things a tech lead can sponsor.
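The registry idea can be sketched as a small record type plus a check that runs on a schedule. Everything here is hypothetical: the field names, the 30-day warning window, and the assumption that the metadata lives somewhere you can load it from (it holds no secret values, only information about them).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SecretRecord:
    name: str
    owner: str      # team or person responsible for rotation
    expires: date   # date by which the secret must be rotated
    runbook: str    # where the rotation procedure is written down


def due_for_rotation(records, today, warn_days=30):
    """Names of secrets that are expired or expire within warn_days."""
    return sorted(
        record.name
        for record in records
        if (record.expires - today).days <= warn_days
    )
```

Running `due_for_rotation` weekly and filing a ticket against each record's owner keeps rotation from depending on anyone's memory.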

A pipeline-health dashboard that fits on one screen

Weekly pipeline-health snapshot, the seven numbers I track
  Build duration p50 / p95               (target: p95 < 8 min)
  Test duration p50 / p95                (target: p95 < 5 min)
  Failed runs / total runs               (target: < 5%, excluding real test failures)
  Auto-retry count / total runs          (target: < 2%)
  Cache hit rate                         (target: > 90%)
  Top 5 slowest tests, their duration    (target: each < 10s, or tagged slow)
  Top 5 flakiest tests, their flake rate (target: each < 0.5%, or quarantined)

Every one of those is in a Slack message that goes out automatically every Monday. The conversation about pipeline rot moves from "why is CI slow this week" to "why is the cache hit rate down to 70%". That is a much more actionable conversation.
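As one example of how a dashboard line might be computed, here is a sketch of the build-duration check using a nearest-rank percentile. It assumes you can export build durations in minutes for the week; the function names are hypothetical and the 8-minute default mirrors the p95 target above.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def build_duration_line(build_minutes, p95_target=8.0):
    """Summarize p50/p95 and whether p95 is within target."""
    p50 = percentile(build_minutes, 50)
    p95 = percentile(build_minutes, 95)
    return {"p50": p50, "p95": p95, "ok": p95 < p95_target}
```

The other six numbers follow the same shape: compute the value, compare it to the target, and only call attention to the lines where `ok` is false.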

A monthly maintenance budget

The pipeline is production code. It deserves time on the team's calendar. The shape that has worked for me:

  • Weekly (5 minutes): read the dashboard. If anything is out of band, file a ticket.
  • Monthly (90 minutes): one engineer's afternoon. Tackle the top item from the dashboard. Examples I have actually done: rewrote the cache key shape, fixed three flakes, deleted three quarantined tests, split a 1000-line pipeline into composites.
  • Quarterly (4 hours): the warning audit, the secret-rotation review, the pipeline-shape review. One person blocks an afternoon, the rest of the team reviews the resulting PRs.

Total cost: roughly 9 to 10 engineer-hours per quarter across the team (13 weekly reads at 5 minutes, three 90-minute sessions, one 4-hour block), which works out to under half a day per engineer on a team of three or more. The payback is a pipeline that does not gradually become the slowest, least-trusted part of the engineering system. I have not had a team adopt this discipline and regret it; I have had several teams adopt it and discover their CI bill went down by 30% within two quarters because the wasted work (failing-then-retrying jobs, redundant cache rebuilds, slow-shard wall time) just stops happening.

A short manifesto I would put on the pipeline repo's README

The pipeline is a service. It has users (every engineer who waits on green), an SLO (build under 10 minutes, p99), and a maintenance owner. Every step in it is code that someone wrote on purpose; every retry, every cache, every shard exists because of a decision that needs to be revisited as the code base changes. Treat the pipeline like the rest of production: instrument it, alert on it, write down its operating manual, and rotate its credentials. The teams that do this have CI pipelines that age well. The ones that do not eventually rebuild from scratch every two or three years, paying the same cost in a heroic project that they could have paid in 90 minutes a month.
