feat(watch): health check — show running watch instance status (#808)#809
feat(watch): health check — show running watch instance status (#808)#809tamirdresher wants to merge 4 commits intodevfrom
Conversation
Adds --health flag to squad watch that reports: - Whether a watch instance is running (PID tracking) - Uptime, auth account, interval, capabilities - Auth drift detection (expected vs current gh account) - Stale PID cleanup for crashed instances Writes .squad/.watch-pid.json at startup, cleaned on exit/SIGINT. Auth guard captures active gh user at startup and restores on drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🛫 PR Readiness Check
|
| Status | Check | Details |
|---|---|---|
| ❌ | Single commit | 4 commits — consider squashing before review |
| ✅ | Not in draft | Ready for review |
| ✅ | Branch up to date | Up to date with dev |
| ❌ | Copilot review | No Copilot review yet — it may still be processing |
| ✅ | Changeset present | Changeset file found |
| ✅ | Scope clean | No .squad/ or docs/proposals/ files |
| ✅ | No merge conflicts | No merge conflicts |
| ❌ | Copilot threads resolved | 7 unresolved Copilot thread(s) — fix and resolve before merging |
| ❌ | CI passing | 15 check(s) still running |
This check runs automatically on every push. Fix any ❌ items and push again.
See CONTRIBUTING.md and PR Requirements for details.
There was a problem hiding this comment.
Pull request overview
Adds a watch “health check” implementation for monitoring a running squad watch instance by persisting runtime metadata to a PID file and reporting status/auth drift.
Changes:
- Introduces
watch/health.tsfor PID-file read/write, liveness checks, and formatted health output. - Writes
.squad/.watch-pid.jsonon watch startup and adds a GitHub auth “drift” guard inside the watch loop. - Adds unit tests, a new package export for the health module, and ignores the runtime PID file via
.gitignore.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
test/cli/watch-health.test.ts |
Adds unit tests for PID-file I/O and health status string output. |
packages/squad-cli/src/cli/commands/watch/index.ts |
Captures expected GH user, writes PID file on startup, adds auth drift restore, and re-exports health helpers. |
packages/squad-cli/src/cli/commands/watch/health.ts |
Implements PID-file schema + health reporting + process liveness checks. |
packages/squad-cli/src/cli-entry.ts |
Updates CLI help text (note: no --health dispatch added in the shown changes). |
packages/squad-cli/package.json |
Exposes ./commands/watch/health export entry. |
.gitignore |
Ignores .squad/.watch-pid.json runtime file. |
.changeset/watch-health.md |
Adds release notes for the new watch health feature. |
| * always runs — it is not an opt-in capability. | ||
| */ | ||
|
|
||
| import fs from 'node:fs'; |
There was a problem hiding this comment.
fs is imported but never used in this module, which adds noise and can trip lint rules in some setups. Remove the unused import (or start using it) to keep the file clean.
| import fs from 'node:fs'; |
| export function getActiveGhUser(): string | undefined { | ||
| try { | ||
| const result = execFileSync('gh', ['auth', 'status', '--active'], { | ||
| encoding: 'utf-8', | ||
| stdio: ['pipe', 'pipe', 'pipe'], | ||
| }); | ||
| // gh auth status outputs: "Logged in to github.com account XXXXX ..." | ||
| // Capture stderr too — gh writes status info to both streams | ||
| const match = result.match(/account\s+(\S+)/); | ||
| return match?.[1]; |
There was a problem hiding this comment.
execFileSync() only returns stdout; the comment says it captures stderr too, but stderr output from a successful gh auth status invocation won't be included in result. If gh writes the account info to stderr with exit code 0, this will fail to detect the active user—consider using spawnSync() and combining stdout+stderr (or call gh api user -q .login).
| // Pre-create squad member labels so addTag never fails on missing labels | ||
| if (adapter.ensureTag) { | ||
| // eslint-disable-next-line @typescript-eslint/no-explicit-any | ||
| const adapterAny = adapter as any; | ||
| if (adapterAny.ensureTag) { | ||
| for (const member of roster) { | ||
| try { | ||
| await adapter.ensureTag(member.label, { color: 'd4c5f9', description: `Squad triage: ${member.name}` }); | ||
| await adapterAny.ensureTag(member.label, { color: 'd4c5f9', description: `Squad triage: ${member.name}` }); | ||
| } catch { /* best-effort — continue if label creation fails */ } |
There was a problem hiding this comment.
Casting adapter to any and disabling the lint rule is unnecessary here: PlatformAdapter already defines ensureTag?. Prefer if (adapter.ensureTag) await adapter.ensureTag(...) (or optional chaining) to keep type-safety and avoid masking real adapter-shape issues.
| process.on('exit', cleanupPidFile); | ||
| process.on('SIGINT', () => { cleanupPidFile(); process.exit(0); }); | ||
| process.on('SIGTERM', () => { cleanupPidFile(); process.exit(0); }); |
There was a problem hiding this comment.
The SIGINT/SIGTERM handlers call process.exit(0), which discards the standard signal exit codes (130/143) and can make scripted supervision think the watch exited cleanly. Prefer setting process.exitCode (or exiting with the conventional codes) and avoid immediate process.exit() so other cleanup/flush handlers have a chance to run.
| process.on('exit', cleanupPidFile); | |
| process.on('SIGINT', () => { cleanupPidFile(); process.exit(0); }); | |
| process.on('SIGTERM', () => { cleanupPidFile(); process.exit(0); }); | |
| const handleSigint = () => { | |
| process.off('SIGINT', handleSigint); | |
| cleanupPidFile(); | |
| process.kill(process.pid, 'SIGINT'); | |
| }; | |
| const handleSigterm = () => { | |
| process.off('SIGTERM', handleSigterm); | |
| cleanupPidFile(); | |
| process.kill(process.pid, 'SIGTERM'); | |
| }; | |
| process.on('exit', cleanupPidFile); | |
| process.on('SIGINT', handleSigint); | |
| process.on('SIGTERM', handleSigterm); |
| */ | ||
| export function isProcessAlive(pid: number): boolean { | ||
| try { | ||
| process.kill(pid, 0); | ||
| return true; | ||
| } catch { | ||
| return false; |
There was a problem hiding this comment.
isProcessAlive() treats all errors from process.kill(pid, 0) as "dead". On some platforms this can throw EPERM for a live process you don't have permission to signal, which would incorrectly mark the PID as stale and trigger PID-file cleanup. Handle EPERM as "alive" and only treat ESRCH as "not running".
| */ | |
| export function isProcessAlive(pid: number): boolean { | |
| try { | |
| process.kill(pid, 0); | |
| return true; | |
| } catch { | |
| return false; | |
| * `EPERM` means the process exists but cannot be signaled by this user. | |
| */ | |
| export function isProcessAlive(pid: number): boolean { | |
| try { | |
| process.kill(pid, 0); | |
| return true; | |
| } catch (error: unknown) { | |
| const err = error as NodeJS.ErrnoException; | |
| if (err.code === 'EPERM') { | |
| return true; | |
| } | |
| if (err.code === 'ESRCH') { | |
| return false; | |
| } | |
| throw error; |
| // Process is alive — build status report | ||
| const uptime = formatUptime(Date.now() - new Date(info.startedAt).getTime()); | ||
|
|
There was a problem hiding this comment.
startedAt is not validated before computing uptime. If the PID file is missing startedAt or contains an invalid timestamp, new Date(info.startedAt).getTime() becomes NaN and the output will show NaNm. Consider validating startedAt and falling back to an "unknown" uptime (or treating the PID file as invalid).
| feat(watch): health check — show running watch instance status (#808) | ||
|
|
||
| Adds `squad watch --health` to display the status of a running watch | ||
| instance: PID, uptime, auth account, capabilities, and auth drift | ||
| detection. Writes `.squad/.watch-pid.json` at startup for instance |
There was a problem hiding this comment.
This changeset claims the CLI adds squad watch --health, but there is currently no argument parsing/dispatch for --health in packages/squad-cli/src/cli-entry.ts (and no other CLI router references it). Either wire up the flag/command as described, or adjust the release note to match the implemented API surface.
| it('isProcessAlive returns true for current process', () => { | ||
| expect(isProcessAlive(process.pid)).toBe(true); | ||
| }); | ||
|
|
||
| it('isProcessAlive returns false for non-existent PID', () => { | ||
| expect(isProcessAlive(999999999)).toBe(false); | ||
| }); |
There was a problem hiding this comment.
Current tests cover "alive" and "dead" PIDs, but they don't cover the important EPERM case (live process but insufficient permissions), which should be treated as alive after adjusting isProcessAlive(). Consider stubbing process.kill in this suite to throw an EPERM-shaped error and asserting the function returns true.
| function probeCurrentGhUser(): string | undefined { | ||
| try { | ||
| const result = execFileSync('gh', ['auth', 'status', '--active'], { | ||
| encoding: 'utf-8', | ||
| stdio: ['pipe', 'pipe', 'pipe'], | ||
| }); | ||
| const match = result.match(/account\s+(\S+)/); | ||
| return match?.[1]; |
There was a problem hiding this comment.
probeCurrentGhUser() has the same stdout/stderr issue as getActiveGhUser(): execFileSync() does not include stderr when the command succeeds. If gh auth status --active prints the active account to stderr with exit code 0, auth drift detection here will silently fail. Consider switching to spawnSync() and merging outputs (or using a more reliable gh api query).
Moves .squad/.watch-pid.json → {os.tmpdir()}/squad-watch-{hash}.json
to prevent accidental git commits. Uses MD5 hash of repo path for
worktree/multi-clone safety. Cross-platform via os.tmpdir().
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused fs import from watch/index.ts - Fix SIGINT exit code to 128+2 (130), SIGTERM to 128+15 (143) - Replace (adapter as any) with proper type narrowing via optional method check - Handle EPERM in isProcessAlive (process exists but no permission) - Add startedAt validation guard before computing uptime - Wire --health flag in cli-entry.ts - Add doc comment for probeCurrentGhUser stderr handling - Add EPERM test case for isProcessAlive Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🏗️ Architectural Review
Automated architectural review — informational only. |
|
Superseded by #830 — combined into a single watch next-gen PR for easier review. |
|
Closing — superseded by #830 (watch-next-gen combined PR) |
What
Implements #808 — \squad watch --health\ command for watch instance monitoring.
Why
When multiple watch instances run (or one gets stuck), there's no way to check status. Auth switches silently break the watch with no visibility.
How
Files changed
Testing
Exports
emovePidFile, \getPidPath, \isProcessAlive, \WatchPidInfo\ re-exported from @bradygaster/squad-cli/commands/watch\
Breaking changes
None