fix(daemon): bind socket atomically to survive a concurrent-start herd #94

Merged
rgao-coreweave merged 3 commits into main from fix/daemon-startup-race on Jul 2, 2026

@rgao-coreweave rgao-coreweave commented Jun 12, 2026

Contributor

Summary

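The herd guard probed and then unlinked the socket before calling listen(), which left a race: two daemons that both saw no socket reached listen() together, and the loser crashed with EEXIST/EADDRINUSE (exit 1) instead of exiting cleanly. Seven such crashes appeared in one local log; a herd of 12 concurrent starts reproduces the crash in roughly 2 of 3 runs.

The daemon now listens first. On EADDRINUSE/EEXIST it re-probes the socket and, if a live daemon owns it, exits with status 0; it reclaims only a confirmed-stale inode. A late starter can no longer unlink the winner's live socket, which previously split the teamMembers map across two daemons and broke cross-session nesting.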
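
For readers without the diff open, here is a minimal sketch of the listen-first pattern described above, not the code in this PR. The names `bindOrYield`, `tryListen`, and `isLive` are hypothetical, and the 500 ms probe timeout and retry count are assumptions. The sketch treats "a connection attempt fails" as proof of staleness and does not check inode identity, so it is looser than the "confirmed-stale inode" check this PR describes.

```typescript
import * as net from "node:net";
import * as fs from "node:fs";

type BindResult = "listening" | "yielded";

// Hypothetical probe: resolves true if some process accepts connections on socketPath.
function isLive(socketPath: string, timeoutMs = 500): Promise<boolean> {
  return new Promise((resolve) => {
    const client = net.connect(socketPath);
    const finish = (live: boolean) => {
      client.destroy();
      resolve(live);
    };
    client.setTimeout(timeoutMs, () => finish(false));
    client.once("connect", () => finish(true));
    client.once("error", () => finish(false)); // e.g. ECONNREFUSED or ENOENT
  });
}

// Attempt to bind; resolve with the listen error instead of letting it crash the process.
function tryListen(
  server: net.Server,
  socketPath: string,
): Promise<NodeJS.ErrnoException | null> {
  return new Promise((resolve) => {
    const onError = (err: NodeJS.ErrnoException) => {
      server.removeListener("listening", onListening);
      resolve(err);
    };
    const onListening = () => {
      server.removeListener("error", onError);
      resolve(null);
    };
    server.once("error", onError);
    server.once("listening", onListening);
    server.listen(socketPath);
  });
}

// Listen first. Only when the bind fails do we probe, and only a dead socket is unlinked.
async function bindOrYield(
  server: net.Server,
  socketPath: string,
  retries = 3, // assumption: a small bound so a persistent failure surfaces as an error
): Promise<BindResult> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const err = await tryListen(server, socketPath);
    if (err === null) return "listening";
    if (err.code !== "EADDRINUSE" && err.code !== "EEXIST") throw err;

    // Someone else holds the path. If they answer, they won the race.
    if (await isLive(socketPath)) return "yielded";

    // Nobody answered: treat the path as stale, remove it, and try again.
    try {
      fs.unlinkSync(socketPath);
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== "ENOENT") throw e;
    }
  }
  throw new Error(`could not bind ${socketPath} after ${retries + 1} attempts`);
}
```

A caller would presumably call `process.exit(0)` when the result is `"yielded"` and continue serving when it is `"listening"`.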

Test plan

  • node --import tsx --test tests/daemon-startup-race.test.ts (herd of 12 concurrent starts: zero crashes, exactly one listener)

@rgao-coreweave rgao-coreweave force-pushed the fix/daemon-idle-during-inflight branch from a6b64d0 to f576eda Compare July 1, 2026 16:55
@rgao-coreweave rgao-coreweave force-pushed the fix/daemon-startup-race branch from d993f9b to 3338a70 Compare July 1, 2026 16:57
@rgao-coreweave rgao-coreweave marked this pull request as ready for review July 1, 2026 17:09
@rgao-coreweave rgao-coreweave requested a review from a team as a code owner July 1, 2026 17:09
@rgao-coreweave rgao-coreweave force-pushed the fix/daemon-idle-during-inflight branch from f576eda to be99a66 Compare July 1, 2026 17:20
rgao-coreweave and others added 3 commits July 1, 2026 10:33
The herd guard probed then unlinked the socket before listen(), leaving a
race: two daemons that both saw no socket reached listen() together and the
loser crashed with EEXIST/EADDRINUSE (exit 1) instead of yielding. Seven such
crashes appeared in one local log; a herd of 12 concurrent starts reproduces
it ~2 of 3 runs.

Listen first; on EADDRINUSE/EEXIST re-probe the socket and yield (exit 0) if a
live daemon owns it, reclaiming only a confirmed-stale inode. A late starter
can no longer unlink the winner's live socket, which previously split the
teamMembers map across two daemons and broke cross-session nesting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

'yield' reads as the JS keyword / 'yield the event loop' in a .ts file; the
loser actually process.exit(0)s. Say so plainly, and note the loser's hook
event still reaches the winning daemon over the socket.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgao-coreweave rgao-coreweave force-pushed the fix/daemon-startup-race branch from 69ec875 to e473d7d Compare July 1, 2026 17:34
@rgao-coreweave rgao-coreweave changed the base branch from fix/daemon-idle-during-inflight to main July 1, 2026 17:34

@drtangible drtangible left a comment

Collaborator

👍 👍 👍

@rgao-coreweave rgao-coreweave merged commit e46af22 into main Jul 2, 2026
4 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jul 2, 2026


2 participants