fix(test-infra): bound test invocations with a fail-fast watchdog + CI timeouts#1669
Conversation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
U1/U2/KTD-6: extract the dashboard heap-runner's process-group kill lifecycle into a shared scripts/lib/run-vitest-watchdog.mjs with per-class budget bands (timings only tighten within a generous ceiling) and an inline hang-diagnostics snapshot. Delegate run-vitest-with-heap.mjs to it (behavior preserved). Add timeout-minutes backstops to all full-suite.yml test jobs so a wedged run can no longer hang to GitHub's 6h ceiling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Convert ci-test-shard.mjs's vitest invocation path from blocking spawnSync to async spawn through scripts/lib/run-vitest-watchdog.mjs. Each shard command gets a per-class budget (shard ceiling 30min) tightened by aggregated timings only when the snapshot is fresh; quick non-test commands stay synchronous. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g (U1) Convert test-changed.mjs's invocation path from blocking spawnSync to async spawn through the shared watchdog. runWatchedTest throws on failure/timeout with .exitCode so the existing catch and the runMaybeIsolated finally (prune + isolation post-check) still run — a watchdog kill reaps any leaked isolated HOME instead of leaving it for the guard to flag. Full/affected runs use a generous 60min backstop (dashboard lanes are already inner-watchdog'd); the quick gate uses the changed-class ceiling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a shared Vitest watchdog, switches dashboard and CI/local test runners to async watchdog execution with derived budgets, adds CI job timeout backstops, updates watchdog-focused tests, and introduces a detailed timeout-reliability plan document. ChangesTest hang prevention and runner consolidation
Sequence Diagram(s)sequenceDiagram
participant CI as ci-test-shard.mjs
participant Watchdog as runWithWatchdog
participant Vitest as vitest child group
CI->>Watchdog: run command with label and budget
Watchdog->>Vitest: spawn detached process group
Watchdog->>Watchdog: track heartbeat and timeout budget
alt command finishes in time
Vitest-->>Watchdog: close(code=0 or nonzero)
Watchdog-->>CI: result with code/signal
else budget exceeded
Watchdog->>Vitest: SIGTERM
Watchdog->>Watchdog: log hang diagnostics
Watchdog->>Vitest: SIGKILL after grace
Vitest-->>Watchdog: close(timeout exit)
Watchdog-->>CI: timedOut result
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Greptile SummaryThis PR introduces a shared watchdog runner (
Confidence Score: 5/5Safe to merge — the watchdog adds a new guardrail layer without changing any test assertions or broadening timeouts, and all three converted runners fall back to the conservative per-class ceiling when timings are absent or stale. The core watchdog logic is well-isolated behind injected dependencies (spawn, killGroup, now) with 11 unit tests covering the full lifecycle. The spawnSync→async conversions preserve every existing exit/signal/cleanup contract, and the CI timeout values are sized safely above the L2 watchdog ceiling. The only committed noise is a development-artifact comment in test-changed.mjs that has no runtime effect. scripts/test-changed.mjs — contains an accidentally committed FNXC:TestInfrastructure annotation that should be removed before merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant Runner as ci-test-shard / test-changed
participant WD as run-vitest-watchdog
participant Child as vitest process group
participant CI as GitHub Actions (L3)
Runner->>WD: runWithWatchdog(command, budgetMs)
WD->>Child: spawn detached (own pgid)
WD-->>WD: arm watchdog timer (budget)
WD-->>WD: install SIGINT/SIGTERM/SIGHUP forwarders
alt child exits within budget
Child-->>WD: close(code, signal)
WD-->>Runner: "resolve { code, signal, timedOut:false }"
Runner->>Runner: exit(code)
else watchdog fires (budget exceeded)
WD-->>WD: captureHangDiagnostics → log
WD->>Child: SIGTERM (process group)
WD-->>WD: arm grace timer
WD->>Child: SIGKILL after grace
Child-->>WD: close(null, SIGKILL)
WD-->>Runner: "resolve { code:124, timedOut:true }"
Runner->>Runner: exit(124)
else external SIGTERM/SIGINT reaches wrapper
Runner-->>WD: signal forwarded
WD->>Child: forward signal to group
WD-->>WD: arm grace timer → SIGKILL
Child-->>WD: close(null, signal)
WD-->>Runner: "resolve { signal }"
else L2 watchdog itself wedges
CI-->>Runner: timeout-minutes fires
CI->>Runner: job cancelled
end
Reviews (4): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
packages/dashboard/scripts/__tests__/run-vitest-with-heap.test.ts (1)
150-154: ⚡ Quick winKeep this timeout-path spec off a 2s real-clock wait.
Bumping the wrapper timeout to
2000msmakes this case sit on wall-clock time before it can pass. The shared watchdog already covers timeout semantics directly, so this wrapper spec should stay at a narrower seam — e.g. a stub that becomes observable immediately with a sub-second budget, or a helper-level timeout assertion instead of a 2s end-to-end wait.As per coding guidelines,
**/*.test.{ts,tsx,js}: "Do not add slow tests. Prefer narrow seams, in-memory fakes, shared harnesses, and targeted assertions. Prefer fake timers over real polling/time waits."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/dashboard/scripts/__tests__/run-vitest-with-heap.test.ts` around lines 150 - 154, The test should not rely on a 2s real-clock wait; replace the long end-to-end timeout with a narrow seam: either (A) reduce FUSION_RUN_VITEST_TIMEOUT_MS to a sub-second value (e.g., 200–500) and keep FUSION_RUN_VITEST_KILL_GRACE_MS proportionally small so the stub process becomes observable immediately, or (B) avoid real waiting by using fake timers or a test helper to simulate/advance the watchdog timeout and assert the timeout->exit behavior; update the environment variables FUSION_RUN_VITEST_TIMEOUT_MS and FUSION_RUN_VITEST_KILL_GRACE_MS (or convert the test to use fake timers) in run-vitest-with-heap.test.ts accordingly.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/dashboard-guide.md`:
- Line 731: The docs string that lists steerable in-review statuses currently
reads "(reviewing/merging/fixing)"; update that phrase to include the missing
status `merging-fix` (for example "(reviewing/merging/merging-fix/fixing)") so
the documentation matches the implementation. Locate the exact sentence
containing the parenthetical and insert `merging-fix` into the list.
In `@docs/plans/2026-06-13-001-fix-test-timeout-failures-plan.md`:
- Around line 115-116: Update the plan text to match the implemented contract in
deriveBudgetMs: replace references to a “median fallback” with a per-class
ceiling fallback and explicitly state that deriveBudgetMs defaults to the
per-class ceiling when timings are missing and only tightens within the
per-class floor/ceiling band when fresh timings exist; also call out that
timelines must be refreshed via scripts/ci-test-shard.mjs --write-timings before
U1 so budgets are derived from fresh data.
- Around line 131-151: The fenced code block with the project tree is missing a
language/tag for the Markdown fence (tripled backticks); fix by changing the
opening ``` to a labelled fence such as ```text (or another appropriate
language) so markdownlint MD040 is satisfied — update the unlabeled
triple-backtick fence that precedes the block starting with "scripts/" to use
```text.
- Around line 21-24: Summary: the paragraph reads as if watchdogs/backstops are
not yet present but those changes are already landing; either mark it explicitly
as historical context or update to the landed state. Fix: edit the paragraph to
(a) if keeping as background, prefix or reword it to clearly state it's
historical context (use past tense, e.g. "Historically, only the dashboard
runner had a watchdog; since then we added a shared watchdog and job-level
backstops elsewhere"), or (b) if making it current, update wording to
present/perfect tense to reflect that the shared watchdog and job-level
backstops are now in place and remove the implication they're missing; keep the
technical references (shared watchdog, job-level backstops, shared WORKER_ROOT
temp-isolation mechanism) so readers can correlate to the rest of the plan.
Ensure the final paragraph unambiguously states whether the described behavior
is corrected (landed) or retained as background.
In `@packages/dashboard/scripts/run-vitest-with-heap.mjs`:
- Around line 21-22: The parsed env values for timeoutMs and graceMs can become
NaN and silently disable the watchdog; update the parsing logic so that after
Number.parseInt you validate and clamp the values before they are used by
runWithWatchdog(): for timeoutMs, if the parsed value is NaN or <= 0, set it to
the default 900000 (and ensure it is finite); for graceMs, if the parsed value
is NaN or < 0, set it to the default 5000 (and ensure it is finite); then pass
these validated/clamped timeoutMs and graceMs into runWithWatchdog() so budgetMs
is always a positive finite number.
In `@scripts/lib/run-vitest-watchdog.mjs`:
- Around line 222-230: The custom signal handlers installed in forwardedSignals
only forward the signal to the child group (signalGroup) and prevent Node's
default termination behavior, which can hang runWithWatchdog(); update the
handler (the function created in the loop that is stored in signalHandlers) so
that after forwarding to signalGroup it also restores default behavior and
re-emits the original signal to the process (remove each handler from process
via process.removeListener or process.off using signalHandlers, then re-send the
same signal with process.kill(process.pid, sig) so Node's default exit semantics
apply); if you need to preserve the existing grace period logic, re-emit the
signal after the same watchdog grace timeout used for child termination (use
setTimeout with that timeout) so external cancellation follows the same grace
ladder.
In `@scripts/test-changed.mjs`:
- Around line 211-233: The new async path drops the child's cwd so tests run
from process.cwd() instead of the intended rootDir; fix by preserving and
forwarding cwd through the call chain: ensure runMaybeIsolated and
runWatchedTest (and the internal run() call they use) accept and pass through a
cwd option (defaulting to the existing rootDir when present) instead of
discarding spawnOptions, and update the shared helper (run-vitest-watchdog) to
accept a cwd parameter and pass it into its internal spawn(...) invocation so
spawned processes run from rootDir when provided.
---
Nitpick comments:
In `@packages/dashboard/scripts/__tests__/run-vitest-with-heap.test.ts`:
- Around line 150-154: The test should not rely on a 2s real-clock wait; replace
the long end-to-end timeout with a narrow seam: either (A) reduce
FUSION_RUN_VITEST_TIMEOUT_MS to a sub-second value (e.g., 200–500) and keep
FUSION_RUN_VITEST_KILL_GRACE_MS proportionally small so the stub process becomes
observable immediately, or (B) avoid real waiting by using fake timers or a test
helper to simulate/advance the watchdog timeout and assert the timeout->exit
behavior; update the environment variables FUSION_RUN_VITEST_TIMEOUT_MS and
FUSION_RUN_VITEST_KILL_GRACE_MS (or convert the test to use fake timers) in
run-vitest-with-heap.test.ts accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: c27a3d7b-d3a6-4a2f-87df-6359c6fa5f22
📒 Files selected for processing (12)
.github/workflows/full-suite.ymldocs/dashboard-guide.mddocs/plans/2026-06-13-001-fix-test-timeout-failures-plan.mdpackages/dashboard/app/components/TaskChatTab.csspackages/dashboard/app/components/TaskChatTab.tsxpackages/dashboard/app/components/__tests__/TaskChatTab.test.tsxpackages/dashboard/scripts/__tests__/run-vitest-with-heap.test.tspackages/dashboard/scripts/run-vitest-with-heap.mjsscripts/__tests__/run-vitest-watchdog.test.mjsscripts/ci-test-shard.mjsscripts/lib/run-vitest-watchdog.mjsscripts/test-changed.mjs
- watchdog: escalate forwarded SIGINT/SIGTERM/SIGHUP to SIGKILL after grace so external cancellation can't hang for the full budget (coderabbit major) - watchdog: route onProcExit through signalGroup for injection consistency (greptile) - watchdog: add cwd option; test-changed passes rootDir so pnpm runs from repo root regardless of invocation cwd (coderabbit major — preserved original run() cwd) - dashboard runner: validate/clamp FUSION_RUN_VITEST_* env so a malformed value can't NaN-disable the watchdog (coderabbit) - tests: verify exit-listener cleanup, forwarded-signal escalation, cwd passthrough - plan doc: per-class-ceiling fallback wording (not median); label Output Structure fence Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/dashboard/scripts/run-vitest-with-heap.mjs (1)
37-58: 💤 Low valueConsider wrapping
JSON.parsefor clearer error messaging.If
FUSION_RUN_VITEST_SPAWN_OVERRIDEcontains malformed JSON,JSON.parse()throws a genericSyntaxErrorthat doesn't mention the env var name. Since this is a test seam and the outer.catch()will still log and exit 1, this is minor—but a wrapped parse would make debugging easier when the override is misconfigured.🔧 Optional improvement
const override = process.env.FUSION_RUN_VITEST_SPAWN_OVERRIDE; if (!override) { return { command: "pnpm", args: ["exec", "vitest", ...vitestArgs] }; } - const parsedOverride = JSON.parse(override); + let parsedOverride; + try { + parsedOverride = JSON.parse(override); + } catch (err) { + throw new Error(`FUSION_RUN_VITEST_SPAWN_OVERRIDE is not valid JSON: ${err.message}`); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/dashboard/scripts/run-vitest-with-heap.mjs` around lines 37 - 58, Wrap the JSON.parse call in resolveSpawnCommand so malformed FUSION_RUN_VITEST_SPAWN_OVERRIDE values throw a clearer Error that names the env var; keep the existing validation and return shape unchanged, and make the new parse failure message more actionable than the raw SyntaxError.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/dashboard/scripts/run-vitest-with-heap.mjs`:
- Around line 37-58: Wrap the JSON.parse call in resolveSpawnCommand so
malformed FUSION_RUN_VITEST_SPAWN_OVERRIDE values throw a clearer Error that
names the env var; keep the existing validation and return shape unchanged, and
make the new parse failure message more actionable than the raw SyntaxError.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 3c3d8f52-9f39-4785-a315-453787e27866
📒 Files selected for processing (5)
docs/plans/2026-06-13-001-fix-test-timeout-failures-plan.mdpackages/dashboard/scripts/run-vitest-with-heap.mjsscripts/__tests__/run-vitest-watchdog.test.mjsscripts/lib/run-vitest-watchdog.mjsscripts/test-changed.mjs
✅ Files skipped from review due to trivial changes (1)
- docs/plans/2026-06-13-001-fix-test-timeout-failures-plan.md
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/lib/run-vitest-watchdog.mjs
- scripts/test-changed.mjs
…ures # Conflicts: # scripts/test-changed.mjs
…ures # Conflicts: # scripts/test-changed.mjs
Summary
Reliability-focused work from
docs/plans/2026-06-13-001-fix-test-timeout-failures-plan.md: make test timeout failures fail fast and diagnosably instead of hanging to the CI 6h ceiling, and lay the groundwork to root-cause the cross-package flake cluster.This PR lands the guardrail layer (plan units U1, U2, and KTD-6). The higher-risk surgery (U3 WORKER_ROOT fix + flake rescue, U4 guard attribution, U5 serialized-project containment, U6 shard guardrail) is recorded as residual follow-up work below — it needs iterative full-suite verification that shouldn't be rushed into one autonomous pass.
What's in this PR (verified)
scripts/lib/run-vitest-watchdog.mjs, U1/U2): generalizes the dashboard heap-runner's detached-process-groupSIGTERM→SIGKILLlifecycle. Per-class budget bands (shard/changed/dashboard-lane) are the safety net; timings only tighten within a generous ceiling, and only when the snapshot is fresh — a staletest-timings.jsoncan never produce a too-tight (false-kill) budget. On timeout it emits inline hang diagnostics (elapsed, last heartbeat, wrapper handle summary).spawnSyncto async watchdog spawn (U1):scripts/ci-test-shard.mjs(per-command budget from aggregated package weights) andscripts/test-changed.mjs(full/affected runs get a generous 60min backstop; the quick gate uses the changed ceiling). Intest-changed.mjsthe watched runner throws on failure/timeout with.exitCodeso the existingfinallycleanup (isolated-HOME prune + isolation post-check) still reaps any leaked HOME on a watchdog kill.run-vitest-with-heap.mjs), behavior preserved (6144MiB heap, 15min default, 5s grace, signal forwarding, exit 124).timeout-minutesbackstops on allfull-suite.ymltest jobs (KTD-6):test-shards/test-slow= 60,test-inventory-guard= 20 — strictly above the L2 watchdog ceiling so L2 fires first, and far below the silent 6h default.Verification
scripts/__tests__/run-vitest-watchdog.test.mjs— 11/11 (budget derivation, timeout→SIGTERM→SIGKILL→124, signal forwarding, listener cleanup).ci-test-shard— 37/38 (the 1 failure,--dry-rundashboard-lane expansion, is pre-existing onmain).test-changed— 84/84.pnpm testran the gate suite (664 tests) through the new async watchdog path and passed, with the[watchdog] still runningheartbeat confirming the wrapper is active.Residual / follow-up work (remaining plan units)
WORKER_ROOTcluster (highest value). The smoke run deterministically reproduced the leak: a gate run left 6 leaked/tmp/fusion-test-workers-*dirs, tripping the isolation guard. This is the pre-existing bug (documented inscripts/lib/test-quarantine.jsononmain, dated 2026-06-12, across 5 core + several engine entries), not a regression from this PR — the mechanism lives entirely inside the vitest process tree (packages/core/src/__test-utils__/vitest-teardown.tsglobalSetup + per-forkWORKER_ROOTinvitest-setup.ts), unaffected by whether the parentpnpmis spawned detached. Worker forks that don't inheritFUSION_TEST_WORKER_ROOTmint their own roots, and busy-dirrmSyncfailures on teardown leak them. Fix: make the worker-root lifetime reliably per-invocation-shared and robustly cleaned, then rescue/delete the quarantined cluster per the ratchet.vitest-setup.ts(structured record naming the owning test/argv/elapsed on a kill; budget scaling stays deferred as a data-gated follow-up per the plan's anti-appeasement constraint).engine-reliability/engine-slow); decision-gated, may be zero-code if the U1 watchdog already contains the wedge.DEFAULT_BALANCE_TOLERANCE/TIMINGS_STALENESS_DAYS/--check-timings-stalenessmachinery inci-test-shard.mjs.Notes
mainthat was 2 commits ahead oforigin/main(FN-6344, FN-6339); those will reconcile when local main lands.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Documentation