From aed75ae13cf03d3fc54fc2ef4965733a663668ee Mon Sep 17 00:00:00 2001
From: Ralf Anton Beier
Date: Mon, 18 May 2026 07:48:14 +0200
Subject: [PATCH] ci(mutants): drop --jobs 4 → --jobs 2 to fit 32G lean-mem cgroup
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Smithy operator status (2026-05-18) showed lean-mem runners approaching
the 32G cgroup ceiling again, just 16 hours after the 24G→32G cgroup
bump:

  runner3  31.0G/32G  (996MB headroom)  swap 2.1G  peak 2.8G
  runner4  30.4G/32G  (1.5G headroom)   swap 2.1G  peak 2.3G

This is the same death-spiral pattern that triggered the prior bump. The
cgroup fix bought time; it was not a durable answer.

Workflow-side lever: cargo-mutants runs `--jobs N` parallel mutants per
shard. Each worker compiles a fresh target dir and runs the full test
suite. At 4-way that leaves ~8G/worker on a 32G runner, which is at or
below rivet-core's compile peak. Halving to `--jobs 2` gives
~16G/worker, enough for the compile peak with comfortable headroom for
the test suite.

Trade-off: each shard takes ~2× longer (was 12-20 min, now 20-40 min).
The net effect on the lean-mem queue should be similar, because mutants
stop OOM-cycling and no longer have to be retried.

This addresses smithy's flagged escalation tier ("the next fix is on the
workflow side"). It complements rather than replaces issue #299
(decoupling playwright/kani/rocq from `needs: [test]`).

Co-Authored-By: Claude Opus 4.7
---
 .github/workflows/ci.yml | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index bf0d3be..01dc2d3 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -410,7 +410,21 @@ jobs:
         # "x3 baseline" rule — anything slower than that is reported as
         # `timeout` (counts as caught) rather than `missed`. With 30s
         # cap and 16-way sharding, each shard finishes in ~12-20 min.
-        run: cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 4 --output mutants-out -- --lib || true
+        #
+        # `--jobs 2` (was 4): smithy operator flagged on 2026-05-18 that
+        # lean-mem runners were hitting their 32G cgroup ceiling
+        # (runner3 at 31.0G/32G with 996MB headroom, swap peaking at
+        # 2.8G). Each cargo-mutants worker compiles a fresh target dir
+        # and runs the full test suite. At 4-way parallel that leaves
+        # ~8G per worker on a 32G runner, which is at or below rustc's
+        # compile peak for rivet-core and triggers the same swap
+        # death-spiral pattern the cgroup bump was meant to break.
+        # Halving concurrency gives ~16G per worker, enough for the
+        # compile peak with headroom for the test suite. Trade-off:
+        # each shard takes ~2x as long (was 12-20 min, now 20-40 min),
+        # but the lean-mem pool stops needing repeated emergency
+        # cgroup-ceiling bumps.
+        run: cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true
       - name: Check surviving mutants
         run: |
           MISSED=0