Problem
When the self-hosted rust-cpu pool is saturated (rivet-core mutation testing held it for hours on 2026-05-17; see smithy commit 70401cd), the entire downstream CI tree blocks, because 8 jobs declare needs: [test]:
- playwright (ubuntu-latest)
- vscode-extension (ubuntu-latest)
- coverage (rust-cpu; the dependency is legitimate, since it runs the same suite under llvm-cov)
- mutants (lean-mem; the dependency is legitimate)
- kani (ubuntu-latest; runs Kani harnesses, independent of test)
- verus (lean-mem; Verus proofs, independent of test passing)
- rocq (ubuntu-latest; builds Rocq theorems, independent of test)
- release-results (lightweight, but needs 5 upstream jobs)
Smithy operator note (2026-05-17): ubuntu-latest-labeled jobs had their own runner capacity available, but couldn't start because they were queued behind the bottleneck.
Proposal — Phase 1
Remove needs: [test] from playwright, kani, and rocq. None of them depends on cargo-test passing.
Keep needs: [test] on coverage and mutants. Defer verus to Phase 2.
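A minimal sketch of the resulting dependency graph, assuming a standard GitHub Actions workflow. Job names and runner labels are taken from the list above; the runner label on test and every step are placeholders, not the real workflow contents.

```yaml
# Sketch only: shows the Phase 1 `needs` edges, not the actual workflow file.
# All `run` steps are placeholders (assumptions), as is the test job's label.
jobs:
  test:
    runs-on: rust-cpu            # assumed: the self-hosted pool that saturates
    steps:
      - run: echo "cargo test placeholder"

  # Phase 1: no `needs`, so these start as soon as an ubuntu-latest runner is free.
  playwright:
    runs-on: ubuntu-latest
    steps:
      - run: echo "playwright placeholder"
  kani:
    runs-on: ubuntu-latest
    steps:
      - run: echo "kani placeholder"
  rocq:
    runs-on: ubuntu-latest
    steps:
      - run: echo "rocq placeholder"

  # Unchanged: still gated on test.
  coverage:
    needs: [test]
    runs-on: rust-cpu
    steps:
      - run: echo "llvm-cov placeholder"
  mutants:
    needs: [test]
    runs-on: lean-mem
    steps:
      - run: echo "cargo-mutants placeholder"
```

With this shape, a saturated rust-cpu pool delays only test and coverage (and mutants, which waits on test); the three ubuntu-latest jobs no longer wait on it.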
Phase 2 (out of scope)
- Audit vscode-extension dep on test.
- Reconsider release-results five-way fan-in.
- Lower cargo-mutants --jobs or shard the rivet-core mutation suite.
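For the last item, one possible shape is a matrix that splits the mutation run into shards and caps per-shard parallelism. This is a sketch under assumptions: cargo-mutants is already installed on the runner, rivet-core is the Cargo package name, and the shard count (4) and --jobs value (2) are illustrative rather than tuned.

```yaml
# Sketch only: shard count, --jobs value, and package name are assumptions.
jobs:
  mutants:
    needs: [test]
    runs-on: lean-mem
    strategy:
      fail-fast: false           # let the other shards finish if one fails
      matrix:
        shard: [0, 1, 2, 3]      # cargo-mutants shard indices are zero-based
    steps:
      - uses: actions/checkout@v4
      # Each shard tests roughly a quarter of the mutants, two at a time.
      - run: cargo mutants --package rivet-core --shard ${{ matrix.shard }}/4 --jobs 2
```

Sharding shortens how long any single job holds a runner, but it does not reduce total runner time, so it helps most when the pool has more than one runner available.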
Refs
- 2026-05-17 incident: 4+ hour CI stall. smithy 70401cd bumped the lean-mem cgroup from 24G to 32G; this issue addresses the workflow-level lever.