Skip to content

feat(watch): health check — show running watch instance status (#808)#809

Closed
tamirdresher wants to merge 4 commits intodevfrom
squad/808-watch-health
Closed

feat(watch): health check — show running watch instance status (#808)#809
tamirdresher wants to merge 4 commits intodevfrom
squad/808-watch-health

Conversation

@tamirdresher
Copy link
Copy Markdown
Collaborator

What

Implements #808 — \squad watch --health\ command for watch instance monitoring.

Why

When multiple watch instances run (or one gets stuck), there's no way to check status. Auth switches silently break the watch with no visibility.

How

  • PID file (.squad/.watch-pid.json) written at watch startup with pid, auth user, capabilities, interval
  • --health\ flag reads PID file, verifies process is alive, checks auth drift
  • Stale PID cleanup when process has died
  • Exit handlers (SIGINT/SIGTERM/exit) clean up PID file
  • Auth guard captures active gh user at startup and restores on drift during each round

Files changed

  • \packages/squad-cli/src/cli/commands/watch/health.ts\ — new health check module
  • \packages/squad-cli/src/cli/commands/watch/index.ts\ — PID file write at startup, auth guard, re-exports
  • \packages/squad-cli/src/cli-entry.ts\ — --health\ flag dispatch
  • \packages/squad-cli/package.json\ — new ./commands/watch/health\ export
  • .gitignore\ — ignore .squad/.watch-pid.json\
  • \ est/cli/watch-health.test.ts\ — 10 unit tests

Testing

  • Unit tests for health check parsing, stale detection, auth drift, edge cases (corrupt file, missing fields)
  • All 10 new tests pass, all 6 existing watch tests pass

Exports

  • \getWatchHealth, \writePidFile,
    emovePidFile, \getPidPath, \isProcessAlive, \WatchPidInfo\ re-exported from @bradygaster/squad-cli/commands/watch\

Breaking changes

None

Adds --health flag to squad watch that reports:
- Whether a watch instance is running (PID tracking)
- Uptime, auth account, interval, capabilities
- Auth drift detection (expected vs current gh account)
- Stale PID cleanup for crashed instances

Writes .squad/.watch-pid.json at startup, cleaned on exit/SIGINT.
Auth guard captures active gh user at startup and restores on drift.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 4, 2026 12:18
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

🛫 PR Readiness Check

ℹ️ This comment updates on each push. Last checked: commit f2c734b

⚠️ 4 item(s) to address before review

Status Check Details
Single commit 4 commits — consider squashing before review
Not in draft Ready for review
Branch up to date Up to date with dev
Copilot review No Copilot review yet — it may still be processing
Changeset present Changeset file found
Scope clean No .squad/ or docs/proposals/ files
No merge conflicts No merge conflicts
Copilot threads resolved 7 unresolved Copilot thread(s) — fix and resolve before merging
CI passing 15 check(s) still running

This check runs automatically on every push. Fix any ❌ items and push again.
See CONTRIBUTING.md and PR Requirements for details.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a watch “health check” implementation for monitoring a running squad watch instance by persisting runtime metadata to a PID file and reporting status/auth drift.

Changes:

  • Introduces watch/health.ts for PID-file read/write, liveness checks, and formatted health output.
  • Writes .squad/.watch-pid.json on watch startup and adds a GitHub auth “drift” guard inside the watch loop.
  • Adds unit tests, a new package export for the health module, and ignores the runtime PID file via .gitignore.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/cli/watch-health.test.ts Adds unit tests for PID-file I/O and health status string output.
packages/squad-cli/src/cli/commands/watch/index.ts Captures expected GH user, writes PID file on startup, adds auth drift restore, and re-exports health helpers.
packages/squad-cli/src/cli/commands/watch/health.ts Implements PID-file schema + health reporting + process liveness checks.
packages/squad-cli/src/cli-entry.ts Updates CLI help text (note: no --health dispatch added in the shown changes).
packages/squad-cli/package.json Exposes ./commands/watch/health export entry.
.gitignore Ignores .squad/.watch-pid.json runtime file.
.changeset/watch-health.md Adds release notes for the new watch health feature.

* always runs — it is not an opt-in capability.
*/

import fs from 'node:fs';
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs is imported but never used in this module, which adds noise and can trip lint rules in some setups. Remove the unused import (or start using it) to keep the file clean.

Suggested change
import fs from 'node:fs';

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +59
export function getActiveGhUser(): string | undefined {
try {
const result = execFileSync('gh', ['auth', 'status', '--active'], {
encoding: 'utf-8',
stdio: ['pipe', 'pipe', 'pipe'],
});
// gh auth status outputs: "Logged in to github.com account XXXXX ..."
// Capture stderr too — gh writes status info to both streams
const match = result.match(/account\s+(\S+)/);
return match?.[1];
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execFileSync() only returns stdout; the comment says it captures stderr too, but stderr output from a successful gh auth status invocation won't be included in result. If gh writes the account info to stderr with exit code 0, this will fail to detect the active user—consider using spawnSync() and combining stdout+stderr (or call gh api user -q .login).

Copilot uses AI. Check for mistakes.
Comment on lines 698 to 705
// Pre-create squad member labels so addTag never fails on missing labels
if (adapter.ensureTag) {
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const adapterAny = adapter as any;
if (adapterAny.ensureTag) {
for (const member of roster) {
try {
await adapter.ensureTag(member.label, { color: 'd4c5f9', description: `Squad triage: ${member.name}` });
await adapterAny.ensureTag(member.label, { color: 'd4c5f9', description: `Squad triage: ${member.name}` });
} catch { /* best-effort — continue if label creation fails */ }
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting adapter to any and disabling the lint rule is unnecessary here: PlatformAdapter already defines ensureTag?. Prefer if (adapter.ensureTag) await adapter.ensureTag(...) (or optional chaining) to keep type-safety and avoid masking real adapter-shape issues.

Copilot uses AI. Check for mistakes.
Comment on lines +763 to +765
process.on('exit', cleanupPidFile);
process.on('SIGINT', () => { cleanupPidFile(); process.exit(0); });
process.on('SIGTERM', () => { cleanupPidFile(); process.exit(0); });
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SIGINT/SIGTERM handlers call process.exit(0), which discards the standard signal exit codes (130/143) and can make scripted supervision think the watch exited cleanly. Prefer setting process.exitCode (or exiting with the conventional codes) and avoid immediate process.exit() so other cleanup/flush handlers have a chance to run.

Suggested change
process.on('exit', cleanupPidFile);
process.on('SIGINT', () => { cleanupPidFile(); process.exit(0); });
process.on('SIGTERM', () => { cleanupPidFile(); process.exit(0); });
const handleSigint = () => {
process.off('SIGINT', handleSigint);
cleanupPidFile();
process.kill(process.pid, 'SIGINT');
};
const handleSigterm = () => {
process.off('SIGTERM', handleSigterm);
cleanupPidFile();
process.kill(process.pid, 'SIGTERM');
};
process.on('exit', cleanupPidFile);
process.on('SIGINT', handleSigint);
process.on('SIGTERM', handleSigterm);

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +51
*/
export function isProcessAlive(pid: number): boolean {
try {
process.kill(pid, 0);
return true;
} catch {
return false;
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isProcessAlive() treats all errors from process.kill(pid, 0) as "dead". On some platforms this can throw EPERM for a live process you don't have permission to signal, which would incorrectly mark the PID as stale and trigger PID-file cleanup. Handle EPERM as "alive" and only treat ESRCH as "not running".

Suggested change
*/
export function isProcessAlive(pid: number): boolean {
try {
process.kill(pid, 0);
return true;
} catch {
return false;
* `EPERM` means the process exists but cannot be signaled by this user.
*/
export function isProcessAlive(pid: number): boolean {
try {
process.kill(pid, 0);
return true;
} catch (error: unknown) {
const err = error as NodeJS.ErrnoException;
if (err.code === 'EPERM') {
return true;
}
if (err.code === 'ESRCH') {
return false;
}
throw error;

Copilot uses AI. Check for mistakes.
Comment on lines +115 to +117
// Process is alive — build status report
const uptime = formatUptime(Date.now() - new Date(info.startedAt).getTime());

Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startedAt is not validated before computing uptime. If the PID file is missing startedAt or contains an invalid timestamp, new Date(info.startedAt).getTime() becomes NaN and the output will show NaNm. Consider validating startedAt and falling back to an "unknown" uptime (or treating the PID file as invalid).

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +9
feat(watch): health check — show running watch instance status (#808)

Adds `squad watch --health` to display the status of a running watch
instance: PID, uptime, auth account, capabilities, and auth drift
detection. Writes `.squad/.watch-pid.json` at startup for instance
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changeset claims the CLI adds squad watch --health, but there is currently no argument parsing/dispatch for --health in packages/squad-cli/src/cli-entry.ts (and no other CLI router references it). Either wire up the flag/command as described, or adjust the release note to match the implemented API surface.

Copilot uses AI. Check for mistakes.
Comment on lines +118 to +124
it('isProcessAlive returns true for current process', () => {
expect(isProcessAlive(process.pid)).toBe(true);
});

it('isProcessAlive returns false for non-existent PID', () => {
expect(isProcessAlive(999999999)).toBe(false);
});
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current tests cover "alive" and "dead" PIDs, but they don't cover the important EPERM case (live process but insufficient permissions), which should be treated as alive after adjusting isProcessAlive(). Consider stubbing process.kill in this suite to throw an EPERM-shaped error and asserting the function returns true.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +76
function probeCurrentGhUser(): string | undefined {
try {
const result = execFileSync('gh', ['auth', 'status', '--active'], {
encoding: 'utf-8',
stdio: ['pipe', 'pipe', 'pipe'],
});
const match = result.match(/account\s+(\S+)/);
return match?.[1];
Copy link

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probeCurrentGhUser() has the same stdout/stderr issue as getActiveGhUser(): execFileSync() does not include stderr when the command succeeds. If gh auth status --active prints the active account to stderr with exit code 0, auth drift detection here will silently fail. Consider switching to spawnSync() and merging outputs (or using a more reliable gh api query).

Copilot uses AI. Check for mistakes.
Moves .squad/.watch-pid.json → {os.tmpdir()}/squad-watch-{hash}.json
to prevent accidental git commits. Uses MD5 hash of repo path for
worktree/multi-clone safety. Cross-platform via os.tmpdir().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tamirdresher and others added 2 commits April 4, 2026 18:29
- Remove unused fs import from watch/index.ts
- Fix SIGINT exit code to 128+2 (130), SIGTERM to 128+15 (143)
- Replace (adapter as any) with proper type narrowing via optional method check
- Handle EPERM in isProcessAlive (process exists but no permission)
- Add startedAt validation guard before computing uptime
- Wire --health flag in cli-entry.ts
- Add doc comment for probeCurrentGhUser stderr handling
- Add EPERM test case for isProcessAlive

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

🏗️ Architectural Review

⚠️ Architectural review: 1 warning(s).

Severity Category Finding Files
🟡 warning bootstrap-area 1 file(s) in the bootstrap area (packages/squad-cli/src/cli/core/) were modified. These files must maintain zero external dependencies. Review carefully. packages/squad-cli/src/cli/core/upgrade.ts

Automated architectural review — informational only.

@tamirdresher
Copy link
Copy Markdown
Collaborator Author

Superseded by #830 — combined into a single watch next-gen PR for easier review.

@tamirdresher
Copy link
Copy Markdown
Collaborator Author

Closing — superseded by #830 (watch-next-gen combined PR)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants