Skip to content

fix(tool-server): prevent duplicate tool-servers after an nvm node switch#385

Open
filip131311 wants to merge 1 commit into
mainfrom
fix/tool-server-duplicate-on-node-switch
Open

fix(tool-server): prevent duplicate tool-servers after an nvm node switch#385
filip131311 wants to merge 1 commit into
mainfrom
fix/tool-server-duplicate-on-node-switch

Conversation

@filip131311

Copy link
Copy Markdown
Collaborator

Problem

Switching Node versions with nvm can leave two (or more) tool-server processes alive at once while ~/.argent/tool-server.json tracks only one. The extra servers are orphaned (untracked, on leaked ports) until their 30‑minute idle timeout.

This was reproduced live: a machine had three node …/dist/tool-server.cjs start processes running, all spawned within the same minute, with state tracking only the last one.

Root cause

@argent/tools-client ensureToolsServer() ran a non-atomic
read state → check alive+healthy → clearState → spawn → writeState sequence with:

  1. No cross-process serialization. An nvm switch relaunches argent mcp under the new Node while the previous MCP's 30s health monitor reconnects — several launchers run ensureToolsServer() at nearly the same time, all observe "no healthy server", and each spawns its own detached server on a fresh free port. Last‑writer‑wins on the state file orphans the rest.
  2. No kill-before-respawn. When the tracked server was alive but unhealthy, the old code just clearState()d and spawned a new one, leaving the old detached process running untracked.

Fix (packages/argent-tools-client/src/launcher.ts)

  • O_EXCL spawn lock (~/.argent/tool-server.lock) serialises the spawn decision across processes: per-acquisition nonce, stale‑steal (dead holder / aged lock), bounded wait, and a hot‑spin‑proof reclaim loop. Inside the lock, ensureToolsServer double‑checks for a peer‑spawned healthy server before spawning, so a burst of concurrent launches collapses to a single server.
  • Kill-before-respawn terminates a wedged server we own before spawning its replacement, with three guards so it can never signal the wrong process:
    • managed === "autospawn" — a user's argent server start server is now tagged managed: "cli" and is never killed by the auto‑spawn path;
    • a structural node <bundlePath> start command-line identity match (not a bare substring);
    • a re-check of that identity immediately before each signal (covers PID reuse across the SIGTERM grace window).
  • spawnToolsServer reaps its detached child on every reject path, so a readiness timeout can no longer leak a child that binds its port moments later as an untracked second server.

The lock-free fast path (reuse a healthy tracked server) is unchanged, so steady‑state behaviour and latency are unaffected; the lock is only taken when a spawn is actually needed.

Tests

New packages/argent-tools-client/test/launcher-duplicate-spawn.test.ts:

  • a 5× concurrent launch burst collapses to one server;
  • an alive‑but‑unhealthy autospawn server is terminated (not orphaned) on respawn;
  • a recycled pid whose command line doesn't match is never signalled;
  • a managed: "cli" server is left alive instead of being killed;
  • a readiness timeout reaps the spawned child (no orphan).

Each case was verified to fail against the pre-fix code and pass with the fix. Existing launcher-spawn / launcher-state suites are unchanged and green.

Note: the @swmansion/argent dispatcher › forwards SIGTERM to the child binary unit test is a pre-existing flake (alternates pass/fail independent of this change, which doesn't touch that package).

🤖 Generated with Claude Code

…itch

Switching node versions via nvm could leave two or more tool-server
processes alive at once while ~/.argent/tool-server.json tracked only one.

Root cause in @argent/tools-client `ensureToolsServer()`: the
read-state -> check-alive+healthy -> clearState -> spawn -> writeState
sequence had no cross-process serialization (two launchers both spawn) and
never terminated the previously-tracked detached server on respawn (orphan).
An nvm switch relaunches `argent mcp` under the new node while the previous
MCP's health monitor reconnects, so several launchers race and each spawns
its own detached server on a fresh port; last-writer-wins state leaves the
rest orphaned until their 30-min idle timeout.

- Add an O_EXCL spawn lock (~/.argent/tool-server.lock) with a
  per-acquisition nonce, stale-steal (dead holder / aged lock), a bounded
  wait, and a hot-spin-proof reclaim loop. ensureToolsServer double-checks
  for a peer-spawned healthy server inside the lock before spawning.
- Kill-before-respawn: terminate a wedged/unhealthy server we own before
  spawning its replacement, gated by (a) managed === "autospawn", (b) a
  structural `node <bundlePath> start` command-line identity match, and
  (c) a re-check immediately before each signal -- so a recycled pid or a
  user's `argent server start` (now tagged managed: "cli") is never killed.
- spawnToolsServer reaps its detached child on every reject path so a
  readiness timeout can't leak an orphan that binds its port late.

Adds launcher-duplicate-spawn.test.ts (burst-collapse, unhealthy-orphan,
recycled-pid safety, CLI-server preserved, spawn-timeout reap); each case
was verified to fail against the pre-fix code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant