fix(daemon): nest agent-teams specialists via cross-session team correlation (0.2.8-rc.0)#86
Conversation
…elation
In the agent-teams (TeamCreate) model each teammate runs as its OWN Claude
Code session, so its TeammateIdle hook fires under the teammate's session_id,
not the coordinator's. The per-session SubagentTracking lookup therefore
misses and specialists' tool calls never nest under the run — they land on the
conversation root with 0 children (and tokens unattributed).
Fix: add a process-global cross-session team-correlation map keyed by
`${team_name}::${name}`. The coordinator's PreToolUse(Agent, team_name)
registers the invoke_agent span there; PostToolUse(Agent) no longer closes it
early; the teammate's TeammateIdle looks it up by team+name regardless of which
session fires it, emits the teammate's transcript chat spans under it, and
closes it. Single-daemon herd-prevention keeps the map in one process.
Verified: 42/42 tests pass incl. the span-tree assertion (chat span is a child
of invoke_agent). Reproduced live — specialists now nest per agent_name with
populated tokens (vs flat/0-children on published 0.2.7).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I have read the CLA Document and I hereby sign the CLA You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot. |
…open hardening
Addresses code review of the cross-session nesting fix:
- HIGH: `teamMembers` was keyed `${team}::${name}` to a single value, so a
re-spawn of the same agent (the TARS Sonnet→Opus flow) overwrote the first,
still-open invoke_agent span — leaking it and mis-attributing the first
teammate's transcript to the second span. Now a FIFO queue per key: re-spawn
appends; each TeammateIdle consumes the oldest not-yet-emitted entry (mirrors
SubagentTracking.findPendingTeammateIdle). New regression test spawns the same
key twice before either idles and asserts BOTH nest.
- MEDIUM: handlePostToolUseFailure now honors `teamName` (mirrors
handlePostToolUse), so a failed Agent dispatch no longer closes the
team-owned span early and then double-ends it at TeammateIdle.
- Fail-open hardening: handleTeammateIdle no longer early-returns on a missing
session (the teammate's session may be unknown to this daemon) — session is
optional for the cross-session path, required only for the per-session
fallback. A violated `teammate_name === Agent.name` invariant now logs loudly
with the registered keys instead of failing silently.
Tests: 43/43 pass (was 42; +1 re-spawn regression). Verified live: rebuilt
daemon nests /nest-test (queue depth 1→0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aemon's lifetime The cross-session teamMembers map is in-memory, so any daemon restart mid-run wipes it and breaks nesting (specialists find "no team entry"). Observed live: a real /triage failed to nest because the daemon restarted during a quiet window. The nesting logic is deterministic; the failures came from the daemon's *lifetime*. Prevent the two dominant restart triggers: - Inactivity guard (checkInactivity): do NOT shut down while any teamMembers entry is unemitted — agent-teams runs have quiet windows after spawn that would otherwise trip the 10-min timeout and wipe the map. Bounded by TEAM_INFLIGHT_MAX_MS (60 min idle ceiling) so a crashed teammate that never emits TeammateIdle can't pin the daemon forever (the shutdown backstop then flushes its span). Added WEAVE_INACTIVITY_MS override (testable + ops-tunable) and made the check interval responsive to a short timeout. - Detach the daemon (nohup + disown in hook-handler.sh): a terminal SIGHUP from the session that lazily spawned the daemon could kill it mid-run. macOS has no setsid, so nohup+disown is the portable detach. Daemon still self-reaps via the inactivity timeout. Deterministic, and strictly fewer orphaned traces than before. Residual orphan risk only from the rare herd race (a follow-up emit-on-miss fallback will degrade that to "emitted but not nested" rather than lost) and hard crash/OOM. Tests: 44/44 (was 43; +1 inactivity-guard regression using WEAVE_INACTIVITY_MS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rgao-coreweave
left a comment
There was a problem hiding this comment.
LGTM, thanks for working on this!
| @@ -1,4 +1,4 @@ | |||
| // BEGIN AUTO-MANAGED VERSION | |||
| // This section is maintained by release automation. Do not edit manually. | |||
| export const VERSION = '0.2.7'; | |||
There was a problem hiding this comment.
This section is maintained by release automation. Do not edit manually.
I think making it rc is good so that when you build a dev version you know the rc is running. Okay for this PR and I will follow up improving the versioning
| @@ -1,6 +1,6 @@ | |||
| { | |||
| "name": "weave-claude-code", | |||
| "version": "0.2.7", | |||
There was a problem hiding this comment.
This section is maintained by release automation. Do not edit manually.
I think making it rc is good so that when you build a dev version you know the rc is running. Okay for this PR and I will follow up improving the versioning
|
|
||
| if ! is_daemon_alive; then | ||
| weave-claude-code daemon >> "${ERROR_LOG}" 2>&1 & | ||
| # Detach the daemon from the spawning session. The daemon is started lazily by |
There was a problem hiding this comment.
This looks like a different issue. Would recommend breaking it into a different PR for the future. Okay for this PR.
…review @rgao-coreweave noted the hook-handler SIGHUP-detach is a separate concern from the cross-session daemon fix. Removing it here so #86 stays focused on the daemon (cross-session nesting + re-spawn FIFO + inactivity guard); detach moves to its own PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I have read the CLA Document and I hereby sign the CLA |
|
Thanks for the review @rgao-coreweave! Addressed:
So #86 is now scoped strictly to the daemon. Will sign the CLA to clear that check. |
What this does
Makes the Weave daemon nest agent-teams specialists under the run, reliably. In
the
TeamCreatemodel each teammate is its own Claude Code session, so itsTeammateIdlefires under the teammate'ssession_id, not the coordinator's; theper-session lookup misses → specialists land on the conversation root with 0
children and unattributed tokens. This adds a process-global cross-session
correlation map keyed by
${team_name}::${name}, and hardens it for real runs.Commits
PreToolUse(Agent, team_name)registers theinvoke_agentspan;PostToolUse/PostToolUseFailureno longer close it early; the teammate'sTeammateIdlelooks it up and closes it with the teammate's spans.${team}::${name}can be spawned twice (TARS Sonnet→Opus re-spawn). The map is a FIFO queue per key: re-spawn appends; each idle consumes the oldest unemitted entry. No overwrite/leak/mis-attribution. +PostToolUseFailurehonorsteamName; fail-open logging./triagefailed to nest after the daemon restarted in a quiet window). Prevent the two dominant restart triggers (below).Durability (commit 3)
The nesting logic is deterministic; the live failures came from the daemon's lifetime, not its logic. Two prevention fixes:
checkInactivitydoes not shut down while any team member is unemitted (agent-teams runs have quiet windows after spawn that would otherwise trip the 10-min timeout and wipe the map). Bounded by a 60-min idle ceiling so a crashed teammate can't pin the daemon (the shutdown backstop then flushes its span). NewWEAVE_INACTIVITY_MSenv override (testable + ops-tunable); check interval scales to a short timeout.nohup+disowninhook-handler.shso a terminal SIGHUP from the session that lazily spawned the daemon can't kill it mid-run (macOS has nosetsid). It still self-reaps via the inactivity timeout.Result: deterministic, strictly fewer orphaned traces. Residual orphan risk only from the rare herd race and hard crash/OOM (see deferred).
What works (verified)
/triage(8244b755) nests all six peragent_name; queried back via/agents/spans/queryit returns one tree (invoke_agent claude-code → invoke_agent ×6 → chat/tool). Rebuilt/nest-testconfirms (queue depth 1→0). Published 0.2.7 produced flat / 0-children / tokens=0 (4398fae7).Fail-open behavior (explicit)
handleTeammateIdledoes not early-return on a missing session (the teammate's session may be unknown to this daemon) — session is optional for the cross-session path, required only for the per-session fallback.teammate_name === Agent.nameinvariant logs loudly with the registered keys, not silently.Deferred (follow-ups, not required for this PR)
custom_attrs_stringso runs are queryable bysupp-NNNNNinstead of a session UUID. Spans are already queryable byconversation_id+agent_name.0/0while the agent span has real tokens — upstream UI issue.Test evidence
🤖 Generated with Claude Code