Skip to content

fix: make peek after kill deterministically report no session#46

Merged
kylecarbs merged 2 commits into
mainfrom
fix/kill-peek-race
Jun 11, 2026
Merged

fix: make peek after kill deterministically report no session#46
kylecarbs merged 2 commits into
mainfrom
fix/kill-peek-race

Conversation

@kylecarbs

Copy link
Copy Markdown
Member

Fixes the flaky agent loop: new, send, wait, peek, kill integration test that failed on macOS CI (run 27382774714) with boo peek: wanted exit 3, got .{ .Exited = 1 }: error: ConnectionLost.

Problem

The daemon acked quit while its socket file and listener still existed; the socket was unlinked only later during teardown, after Window.destroy reaped the child. A peek issued right after kill returned could connect in that window, read EOF, and exit 1 with error.ConnectionLost instead of the documented no-session exit 3.

Fix

  • daemon: retire the listener (close it and unlink the socket file) before acking quit. By the time kill returns, no new client can find the socket, making the sequential kill-then-peek case deterministic.
  • client: map error.ConnectionLost in mustControl to the no-session failure (exit 3). An EOF on a control exchange means the daemon died before replying, which also covers control connections racing a natural command exit.
  • test: regression test repeating the kill-then-peek sequence 10 times.

Verification

A/B with a temporary 100ms sleep at the top of daemon teardown to widen the race window, driving a 30-iteration new/kill/peek loop:

build result
main + 100ms window 30/30 failed, error: ConnectionLost, exit 1 (exact CI failure)
this PR + 100ms window 0/30 failed

Also zig build test-all: 132/132 passed, and a 300-iteration stress loop on the release build: 0 failures.

Investigation notes
  • sessionInfo already treats any control error as "daemon gone", so ls and wait --idle were robust; mustControl callers (send, peek, wait --text, kill, rename) were not.
  • The window was wide enough to hit on loaded macOS runners because deinit runs Window.destroy (child reaping) before unlinking the socket.
  • Daemon.deinit now reuses retireListener, which guards against double-close after an early retire.
  • The race did not reproduce naturally on a fast Linux box (0/300 on main), hence the deterministic widened-window A/B.

This PR was generated by Coder Agents on behalf of @kylecarbs.

The daemon acked quit while its socket file and listener still
existed; teardown unlinked the socket only afterwards. A control
command issued right after kill returned could connect in that
window, read EOF, and die with error.ConnectionLost (exit 1) instead
of the documented no-session exit 3. This flaked the agent-loop
integration test on macOS CI.

Retire the listener (close it and unlink the socket file) before
acking quit, so by the time kill returns no new client can find the
socket. Also map ConnectionLost in mustControl to the no-session
failure: an EOF on a control exchange means the daemon died before
replying, which callers should see as the session being gone. This
covers control connections racing a natural command exit as well.
Restores the daemon half of the race fix: close the listening socket
and unlink its file before replying to quit, so the socket is gone by
the time kill returns.
@kylecarbs kylecarbs merged commit 810dd76 into main Jun 11, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant