CI Fails at Scale, Not Just Speed

CI/CD pipelines that worked fine at 20 engineers start to break down at 200. The failure modes of CI at scale are consistent across organisations.

A 45-minute CI pipeline gives developers a weak feedback loop. They stop waiting for CI before merging and start merging and hoping. The erosion is gradual: CI grows from 10 minutes to 15 to 20 and beyond, and nobody treats it as a problem because each addition was small. To fix this, organisations should make build time a reported metric, set a target of 10 minutes or less for the majority of PRs, and treat regressions as incidents.

In the teams I ran, we started treating build latency like any other production metric. We pushed the numbers into Prometheus, built a Grafana dashboard, and set an SLO that 90 % of PR builds must finish under ten minutes. When a nightly regression added two minutes to the average, the alert fired and we opened a post‑mortem. The culprit turned out to be a new static analysis step that pulled a 300 MB artifact from an internal Nexus repo; rolling it back restored the original timing. The incident cost us a day of delayed merges because developers stopped waiting and started cherry‑picking. The lesson is that you have to surface the latency as an incident, not as a nice‑to‑have number.
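The SLO check described above is simple enough to sketch. The following is a minimal illustration, not the alerting rule we actually ran: the function names, the list of durations, and the idea of evaluating the objective in a script rather than in the monitoring system are all assumptions made for the example.

```python
from typing import Sequence


def slo_compliance(durations_minutes: Sequence[float],
                   target_minutes: float = 10.0) -> float:
    """Return the fraction of builds that finished under the target."""
    if not durations_minutes:
        raise ValueError("no builds to evaluate")
    within = sum(1 for d in durations_minutes if d < target_minutes)
    return within / len(durations_minutes)


def slo_breached(durations_minutes: Sequence[float],
                 target_minutes: float = 10.0,
                 objective: float = 0.90) -> bool:
    """True when fewer than `objective` of builds met the target."""
    return slo_compliance(durations_minutes, target_minutes) < objective


if __name__ == "__main__":
    # Hypothetical PR build durations in minutes for one day.
    builds = [6, 7, 8, 9, 9.5, 8, 7, 6, 11, 12]
    print(f"compliance: {slo_compliance(builds):.0%}")
    print(f"breached:   {slo_breached(builds)}")
```

Whatever evaluates the objective, the useful property is that a breach is a yes-or-no answer that can open an incident, rather than a chart somebody has to remember to look at.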

The fastest path to reducing CI time is usually parallelisation. If 200 unit tests take 10 minutes to run sequentially in a single container, the same tests split evenly across 10 parallel containers take about 1 minute plus parallelisation overhead. Most CI platforms, including GitHub Actions, Azure DevOps, and CircleCI, support matrix or parallel jobs that can be used for test sharding. The investment in parallelisation pays back immediately in developer cycle time.
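The one-minute figure only holds if the shards are balanced, so how tests are assigned to shards matters. Below is a sketch of one common approach, greedy assignment by historical duration (longest test first, always into the least-loaded shard). The function name and the duration map are hypothetical, and CI platforms that offer built-in timing-based splitting may do this differently.

```python
import heapq
from typing import Dict, List


def shard_tests(durations: Dict[str, float], shard_count: int) -> List[List[str]]:
    """Assign tests to shards, longest first, into the least-loaded shard."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    # Min-heap of (accumulated seconds, shard index).
    heap = [(0.0, i) for i in range(shard_count)]
    heapq.heapify(heap)
    # Sort by duration descending, then by name for a deterministic result.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


if __name__ == "__main__":
    # 200 hypothetical tests of 3 seconds each: 10 minutes sequentially.
    timings = {f"test_{i:03d}": 3.0 for i in range(200)}
    plan = shard_tests(timings, 10)
    for index, tests in enumerate(plan):
        print(index, len(tests), sum(timings[t] for t in tests))
```

Each parallel job would then run only the list at its own index. Because the assignment is deterministic for a given set of timings, every job can compute the same plan independently without coordinating with the others.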

Parallelising tests is not free. When we moved from a single 8‑core runner to ten parallel containers on our self‑hosted Kubernetes pool, CPU usage jumped from 30 % to 85 % and we started seeing node OOM kills. We mitigated it by capping the number of shards to the number of physical cores and by reusing the same JVM for the JUnit tests within each shard. Pulling the Docker image for each shard added about eight seconds per container, which we offset by baking a base image with the JDK and test runner preinstalled. The trade‑off was higher cloud spend, but the reduction from a 12‑minute suite to a 1.5‑minute feedback loop paid for itself in developer productivity.
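A back-of-the-envelope model helps when deciding how many shards are worth paying for. The sketch below is an idealised estimate, assuming perfectly balanced shards and a fixed per-shard overhead; both function names are made up for illustration and nothing here measures a real cluster.

```python
def capped_shards(requested: int, physical_cores: int) -> int:
    """Never run more shards than there are physical cores, and at least one."""
    return max(1, min(requested, physical_cores))


def estimated_wall_seconds(sequential_seconds: float,
                           shards: int,
                           per_shard_overhead_seconds: float = 0.0) -> float:
    """Ideal wall-clock time: an even split of the work plus fixed overhead."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return sequential_seconds / shards + per_shard_overhead_seconds


if __name__ == "__main__":
    # A 12-minute suite on ten shards with an eight-second image pull each.
    print(estimated_wall_seconds(12 * 60, 10, 8.0))
    print(capped_shards(requested=16, physical_cores=8))
```

For the 12-minute suite with an eight-second image pull, the model gives 80 seconds on ten shards. Real runs land above the ideal because shards are never perfectly balanced and container startup is not free, which is why the estimate is best read as a lower bound.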

Dependency caching can reduce CI times by 50-70% for incremental runs. In Docker-based builds this means structuring the Dockerfile so that dependency installation, which changes infrequently and should be cached, comes before the step that copies application code, which changes on every commit and is expected to miss the cache. A common CI performance anti-pattern is an unnecessarily broad COPY statement placed early in the Dockerfile, which invalidates every later layer whenever any file in the source tree changes.

Cache management became a hidden source of failures. We relied on CircleCI’s remote cache backed by S3, and after a month of incremental builds the cache ballooned to 250 GB, causing storage throttling and occasional checksum mismatches that broke builds for unrelated branches. The fix was to add a cache‑prune step that runs nightly and to scope the cache key to the lockfile hash rather than the whole source tree. In another project we switched Dockerfiles to a multi‑stage build where the first stage installs Maven dependencies and writes them to /root/.m2; that layer is now cached and only refreshed when the pom.xml changes, cutting incremental build time from eight minutes to under three. The downside is you have to be disciplined about keeping the lockfile in sync, otherwise you get subtle version drift.
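Scoping the cache key to the lockfile comes down to hashing one file instead of the whole tree. Here is a minimal sketch of the idea, assuming a script computes the key before the restore step. The `deps-v1` prefix and the function name are hypothetical, and most CI platforms provide an equivalent checksum helper in their own config syntax.

```python
import hashlib
from pathlib import Path


def cache_key(lockfile: Path, prefix: str = "deps-v1") -> str:
    """Build a cache key that changes only when the lockfile's bytes change."""
    digest = hashlib.sha256(lockfile.read_bytes()).hexdigest()
    # A shortened digest keeps the key readable in CI logs.
    return f"{prefix}-{digest[:16]}"


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        lock = Path(tmp) / "deps.lock"  # hypothetical lockfile name
        lock.write_text("libfoo==1.2.3\n")
        print(cache_key(lock))
        lock.write_text("libfoo==1.2.4\n")
        print(cache_key(lock))  # differs: a dependency version changed
```

Bumping the prefix by hand (for example to `deps-v2`) gives a cheap way to invalidate every existing entry at once when a cache is suspected of being corrupt.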

Flaky tests, which fail intermittently without code changes, are a systemic CI reliability problem at scale. With 100 tests that each fail spuriously 1% of the time, a CI run produces one spurious failure on average, and if the failures are independent, roughly 63% of runs contain at least one. Without explicit ownership, the flaky test backlog grows faster than it is resolved. Systematic approaches include detecting flakiness by running tests multiple times on the same commit, tagging known-flaky tests and quarantining them, and tracking flakiness rates by test file with an assigned owner.
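Both the arithmetic and the detection approach fit in a few lines. The sketch below assumes test failures are independent, which real suites only approximate, and it assumes results are available as (commit, test name, passed) records; that record shape and both function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def prob_any_flaky_failure(test_count: int, flake_rate: float) -> float:
    """Chance that at least one of `test_count` independent tests flakes."""
    return 1.0 - (1.0 - flake_rate) ** test_count


def find_flaky(results: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for commit, test, passed in results:
        outcomes[(commit, test)].add(passed)
    flaky = {test for (_, test), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)


if __name__ == "__main__":
    # 100 tests at 1% flakiness: expected failures per run is 100 * 0.01 = 1.
    print(f"{prob_any_flaky_failure(100, 0.01):.1%} of runs hit a flake")

    reruns = [
        ("abc123", "test_login", True),
        ("abc123", "test_login", False),     # same commit, different outcome
        ("abc123", "test_checkout", False),
        ("abc123", "test_checkout", False),  # consistently failing, not flaky
    ]
    print(find_flaky(reruns))
```

A test that fails every time on a commit is a real failure, not a flake, so only tests with mixed outcomes on identical code are candidates for quarantine.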