fix(daemon): nest agent-teams specialists via cross-session team correlation (0.2.8-rc.0)#86

Merged
titothedeveloper merged 4 commits into main from fix/cross-session-team-correlation on Jun 11, 2026
@titothedeveloper titothedeveloper commented Jun 11, 2026

What this does

Makes the Weave daemon nest agent-teams specialists under the run, reliably. In
the TeamCreate model each teammate is its own Claude Code session, so its
TeammateIdle fires under the teammate's session_id, not the coordinator's; the
per-session lookup misses → specialists land on the conversation root with 0
children and unattributed tokens. This adds a process-global cross-session
correlation map keyed by ${team_name}::${name}, and hardens it for real runs.
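
To make the miss concrete, here is a minimal sketch of the two lookups. The names `perSession`, `teamMembers`, `teamKey`, the helper functions, and the span shape are assumptions for illustration, not the daemon's actual code. It stores a single span per key for brevity; the re-spawn case described under Commits needs a queue per key instead.

```javascript
// Hypothetical illustration: names and shapes are assumptions, not the daemon's real API.
const perSession = new Map();  // session_id -> Map(agent name -> span)
const teamMembers = new Map(); // `${team_name}::${name}` -> span (process-global)

const teamKey = (teamName, name) => `${teamName}::${name}`;

// Coordinator's PreToolUse(Agent, team_name): register the span under both indexes.
function registerSpawn(sessionId, teamName, name) {
  const span = { name: 'invoke_agent', agent: name, children: [], open: true };
  if (!perSession.has(sessionId)) perSession.set(sessionId, new Map());
  perSession.get(sessionId).set(name, span);
  teamMembers.set(teamKey(teamName, name), span);
  return span;
}

// TeammateIdle arrives under the teammate's own session_id, not the coordinator's.
function lookupPerSession(sessionId, name) {
  return perSession.get(sessionId)?.get(name);
}

function lookupCrossSession(teamName, name) {
  return teamMembers.get(teamKey(teamName, name));
}

const spawned = registerSpawn('coordinator-session', 'triage', 'researcher');
const missed = lookupPerSession('teammate-session', 'researcher'); // undefined: wrong session
const found = lookupCrossSession('triage', 'researcher');          // the registered span
```

The cross-session lookup ignores which session fired the hook, which is why it works only while a single daemon process owns the map.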

Commits

  1. Cross-session nesting — coordinator PreToolUse(Agent, team_name) registers the invoke_agent span; PostToolUse/PostToolUseFailure no longer close it early; the teammate's TeammateIdle looks it up and closes it with the teammate's spans.
  2. Re-spawn safety (FIFO queue) — the same ${team}::${name} can be spawned twice (TARS Sonnet→Opus re-spawn). The map is a FIFO queue per key: re-spawn appends; each idle consumes the oldest unemitted entry. No overwrite/leak/mis-attribution. + PostToolUseFailure honors teamName; fail-open logging.
  3. Durability across the daemon's lifetime — the map is in-memory, so a daemon restart mid-run wiped it (observed live: a real /triage failed to nest after the daemon restarted in a quiet window). Prevent the two dominant restart triggers (below).
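
The FIFO behavior in commit 2 can be sketched as follows. This is an illustrative model under stated assumptions: `teamQueues`, `registerTeamMember`, `consumeOldestUnemitted`, and the `emitted` flag are hypothetical names, and `tars` is used as the agent name only to echo the re-spawn flow mentioned above.

```javascript
// Hypothetical sketch: not the daemon's actual data structure.
const teamQueues = new Map(); // `${team}::${name}` -> array of entries, oldest first

function registerTeamMember(team, name, span) {
  const key = `${team}::${name}`;
  if (!teamQueues.has(key)) teamQueues.set(key, []);
  teamQueues.get(key).push({ span, emitted: false }); // re-spawn appends, never overwrites
}

// Each idle consumes the oldest entry that has not been emitted yet.
function consumeOldestUnemitted(team, name) {
  const queue = teamQueues.get(`${team}::${name}`) ?? [];
  const entry = queue.find((e) => !e.emitted);
  if (!entry) return undefined; // duplicate or unknown idle: nothing left to emit
  entry.emitted = true;
  return entry.span;
}

registerTeamMember('triage', 'tars', { id: 'sonnet-span' });
registerTeamMember('triage', 'tars', { id: 'opus-span' }); // re-spawn of the same key

const firstIdle = consumeOldestUnemitted('triage', 'tars');  // the first spawn's span
const secondIdle = consumeOldestUnemitted('triage', 'tars'); // the re-spawn's span
const extraIdle = consumeOldestUnemitted('triage', 'tars');  // undefined
```

Marking entries as emitted rather than removing them is one way to keep a duplicate idle from emitting twice; whether the daemon shifts entries off the queue or flags them is not stated in this PR.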

Durability (commit 3)

The nesting logic is deterministic; the live failures came from the daemon's lifetime, not its logic. Two prevention fixes:

  • Inactivity guard: checkInactivity does not shut down while any team member is unemitted (agent-teams runs have quiet windows after spawn that would otherwise trip the 10-min timeout and wipe the map). Bounded by a 60-min idle ceiling so a crashed teammate can't pin the daemon (the shutdown backstop then flushes its span). New WEAVE_INACTIVITY_MS env override (testable + ops-tunable); check interval scales to a short timeout.
  • Daemon detach: nohup + disown in hook-handler.sh so a terminal SIGHUP from the session that lazily spawned the daemon can't kill it mid-run (macOS has no setsid). It still self-reaps via the inactivity timeout.
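
The inactivity guard's decision can be sketched as a pure function. The 10-minute and 60-minute values and the name `TEAM_INFLIGHT_MAX_MS` come from this PR; the function name `shouldShutDown`, its parameters, and `DEFAULT_INACTIVITY_MS` are assumptions made for illustration.

```javascript
// Hypothetical sketch of the shutdown decision; only the constant values come from the PR.
const DEFAULT_INACTIVITY_MS = 10 * 60 * 1000; // 10-min inactivity timeout
const TEAM_INFLIGHT_MAX_MS = 60 * 60 * 1000;  // 60-min idle ceiling

function shouldShutDown({ idleMs, inactivityMs = DEFAULT_INACTIVITY_MS, unemittedMembers }) {
  if (idleMs < inactivityMs) return false; // recent activity: keep running
  if (unemittedMembers === 0) return true; // nothing in flight: normal timeout applies
  return idleMs >= TEAM_INFLIGHT_MAX_MS;   // team in flight: hold until the ceiling
}

const quietWindow = shouldShutDown({ idleMs: 15 * 60 * 1000, unemittedMembers: 2 });
const normalIdle = shouldShutDown({ idleMs: 15 * 60 * 1000, unemittedMembers: 0 });
const crashedTeammate = shouldShutDown({ idleMs: 61 * 60 * 1000, unemittedMembers: 1 });
```

Here `quietWindow` is `false` (the map survives the quiet window), while `normalIdle` and `crashedTeammate` are both `true`, so a teammate that never idles cannot keep the daemon alive past the ceiling.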

Result: deterministic nesting and strictly fewer orphaned traces than before. Residual orphan risk comes only from the rare herd race and hard crash/OOM (see Deferred).

What works (verified)

  • 44/44 tests pass, incl. span-tree nesting, cross-session (TeammateIdle from a different session), re-spawn FIFO, and the inactivity-guard regression.
  • Live, end-to-end: a 6-specialist /triage (8244b755) nests all six per agent_name; queried back via /agents/spans/query it returns one tree (invoke_agent claude-code → invoke_agent ×6 → chat/tool). Rebuilt /nest-test confirms (queue depth 1→0). Published 0.2.7 produced flat / 0-children / tokens=0 (4398fae7).

Fail-open behavior (explicit)

  • handleTeammateIdle does not early-return on a missing session (the teammate's session may be unknown to this daemon) — session is optional for the cross-session path, required only for the per-session fallback.
  • A violated teammate_name === Agent.name invariant is logged loudly, with the registered keys, instead of failing silently.
  • Unclosed queue entries (teammate crash / daemon exit) are flushed by the shutdown backstop.
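
A rough sketch of the three fail-open behaviors above, under stated assumptions: the handler name `handleTeammateIdleSketch`, the `state` shape, the event field names, and the log wording are all hypothetical and only model the described control flow.

```javascript
// Hypothetical sketch: handler shape, field names, and log wording are assumptions.
function handleTeammateIdleSketch(event, state, log = console.warn) {
  const key = `${event.team_name}::${event.teammate_name}`;
  const entry = (state.teamMembers.get(key) ?? []).find((e) => !e.emitted);
  if (entry) {
    entry.emitted = true; // cross-session path: no session lookup required
    return { path: 'cross-session', span: entry.span };
  }
  // Likely a violated teammate_name === Agent.name invariant: report it with the known keys.
  log(`no team entry for ${key}; registered keys: ${[...state.teamMembers.keys()].join(', ')}`);
  const session = state.sessions.get(event.session_id);
  if (!session) return { path: 'miss' }; // only the per-session fallback needs a session
  return { path: 'per-session', session };
}

// Shutdown backstop: collect every unemitted entry so no span is left open.
function flushUnemitted(state) {
  const flushed = [];
  for (const queue of state.teamMembers.values()) {
    for (const e of queue) {
      if (!e.emitted) {
        e.emitted = true;
        flushed.push(e.span);
      }
    }
  }
  return flushed;
}

const demoLogs = [];
const demoState = {
  sessions: new Map(), // the teammate's session is unknown to this daemon
  teamMembers: new Map([
    ['triage::researcher', [{ span: { id: 's1' }, emitted: false }]],
    ['triage::writer', [{ span: { id: 's2' }, emitted: false }]],
  ]),
};
const idleHit = handleTeammateIdleSketch(
  { session_id: 'teammate-session', team_name: 'triage', teammate_name: 'researcher' },
  demoState,
  (msg) => demoLogs.push(msg),
);
const idleMiss = handleTeammateIdleSketch(
  { session_id: 'teammate-session', team_name: 'triage', teammate_name: 'Researcher' },
  demoState,
  (msg) => demoLogs.push(msg),
);
const flushedAtShutdown = flushUnemitted(demoState);
```

In this run the first idle resolves through the cross-session path despite the unknown session, the mis-cased name logs once with both registered keys, and the backstop flushes only the `writer` entry that never idled.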

Deferred (follow-ups, not required for this PR)

  • Emit-on-miss optimistic fallback — if a rare herd-race restart still misses, emit the teammate's span (grouped, not nested) instead of dropping it. Safe degradation; no cross-run guessing. (Full disk-reconstruction was rejected — it risks mis-attribution across concurrent triages.)
  • Run/ticket stamp in custom_attrs_string so runs are queryable by supp-NNNNN instead of a session UUID. Spans are already queryable by conversation_id + agent_name.
  • W&B "agents" UI conversation-level token rollup shows 0/0 while the agent span has real tokens — upstream UI issue.

Test evidence

ℹ tests 44  ℹ pass 44  ℹ fail 0
✔ TeammateIdle span tree: invoke_agent span contains chat child
✔ Cross-session: TeammateIdle from teammate session finds coordinator team member
✔ Cross-session: re-spawn of same team::name nests BOTH (FIFO queue, no overwrite)
✔ Inactivity guard: daemon stays up past timeout while team correlation is in flight
✔ Cross-session: duplicate TeammateIdle does not double-emit

🤖 Generated with Claude Code

…elation

In the agent-teams (TeamCreate) model each teammate runs as its OWN Claude
Code session, so its TeammateIdle hook fires under the teammate's session_id,
not the coordinator's. The per-session SubagentTracking lookup therefore
misses and specialists' tool calls never nest under the run — they land on the
conversation root with 0 children (and tokens unattributed).

Fix: add a process-global cross-session team-correlation map keyed by
`${team_name}::${name}`. The coordinator's PreToolUse(Agent, team_name)
registers the invoke_agent span there; PostToolUse(Agent) no longer closes it
early; the teammate's TeammateIdle looks it up by team+name regardless of which
session fires it, emits the teammate's transcript chat spans under it, and
closes it. Single-daemon herd-prevention keeps the map in one process.

Verified: 42/42 tests pass incl. the span-tree assertion (chat span is a child
of invoke_agent). Reproduced live — specialists now nest per agent_name with
populated tokens (vs flat/0-children on published 0.2.7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@titothedeveloper titothedeveloper requested a review from a team as a code owner June 11, 2026 17:59
@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

titothedeveloper and others added 2 commits June 11, 2026 15:04
…open hardening

Addresses code review of the cross-session nesting fix:

- HIGH: `teamMembers` was keyed `${team}::${name}` to a single value, so a
  re-spawn of the same agent (the TARS Sonnet→Opus flow) overwrote the first,
  still-open invoke_agent span — leaking it and mis-attributing the first
  teammate's transcript to the second span. Now a FIFO queue per key: re-spawn
  appends; each TeammateIdle consumes the oldest not-yet-emitted entry (mirrors
  SubagentTracking.findPendingTeammateIdle). New regression test spawns the same
  key twice before either idles and asserts BOTH nest.

- MEDIUM: handlePostToolUseFailure now honors `teamName` (mirrors
  handlePostToolUse), so a failed Agent dispatch no longer closes the
  team-owned span early and then double-ends it at TeammateIdle.

- Fail-open hardening: handleTeammateIdle no longer early-returns on a missing
  session (the teammate's session may be unknown to this daemon) — session is
  optional for the cross-session path, required only for the per-session
  fallback. A violated `teammate_name === Agent.name` invariant now logs loudly
  with the registered keys instead of failing silently.
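
The MEDIUM item can be illustrated with a small model of the double-end problem. Everything here is a sketch under assumptions: `makeSpan`, `closeAfterToolResult`, and the `teamName` option are hypothetical stand-ins for how the success and failure handlers might share the "leave team-owned spans open" rule.

```javascript
// Hypothetical sketch: span and handler shapes are assumptions for illustration.
function makeSpan(name) {
  return {
    name,
    endCount: 0,
    end() {
      this.endCount += 1;
    },
  };
}

// Shared by the success and failure paths: a team-owned span stays open
// because TeammateIdle is responsible for closing it.
function closeAfterToolResult(span, { teamName } = {}) {
  if (teamName) return false; // team-owned: leave open
  span.end();
  return true;
}

const teamSpan = makeSpan('invoke_agent');
const closedEarly = closeAfterToolResult(teamSpan, { teamName: 'triage' }); // failed dispatch
teamSpan.end(); // later, TeammateIdle closes it exactly once

const soloSpan = makeSpan('invoke_agent');
const closedNow = closeAfterToolResult(soloSpan); // no team: closed right away
```

With the rule applied on both paths, each span ends exactly once; without it, the failure path would end the team-owned span and TeammateIdle would end it again.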

Tests: 43/43 pass (was 42; +1 re-spawn regression). Verified live: rebuilt
daemon nests /nest-test (queue depth 1→0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aemon's lifetime

The cross-session teamMembers map is in-memory, so any daemon restart mid-run
wipes it and breaks nesting (specialists find "no team entry"). Observed live:
a real /triage failed to nest because the daemon restarted during a quiet
window. The nesting logic is deterministic; the failures came from the daemon's
*lifetime*. Prevent the two dominant restart triggers:

- Inactivity guard (checkInactivity): do NOT shut down while any teamMembers
  entry is unemitted — agent-teams runs have quiet windows after spawn that
  would otherwise trip the 10-min timeout and wipe the map. Bounded by
  TEAM_INFLIGHT_MAX_MS (60 min idle ceiling) so a crashed teammate that never
  emits TeammateIdle can't pin the daemon forever (the shutdown backstop then
  flushes its span). Added WEAVE_INACTIVITY_MS override (testable + ops-tunable)
  and made the check interval responsive to a short timeout.

- Detach the daemon (nohup + disown in hook-handler.sh): a terminal SIGHUP from
  the session that lazily spawned the daemon could kill it mid-run. macOS has no
  setsid, so nohup+disown is the portable detach. Daemon still self-reaps via the
  inactivity timeout.
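
One way the WEAVE_INACTIVITY_MS override and the responsive check interval could fit together is sketched below. Only the variable name and the 10-minute default come from this PR; the validation rules, the divide-by-four heuristic, and the 60-second cap are assumptions, not the daemon's actual values.

```javascript
// Hypothetical sketch: fallback rules and the interval formula are assumptions.
const DEFAULT_TIMEOUT_MS = 10 * 60 * 1000; // 10-min default from the PR
const MAX_CHECK_INTERVAL_MS = 60 * 1000;   // assumed upper bound on the check interval

// Read the override from an env-like object; fall back on missing or invalid values.
function resolveInactivityMs(env) {
  const parsed = Number(env.WEAVE_INACTIVITY_MS);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
}

// Check several times per timeout so a short override is honored promptly.
function resolveCheckIntervalMs(timeoutMs) {
  return Math.max(1, Math.min(MAX_CHECK_INTERVAL_MS, Math.floor(timeoutMs / 4)));
}

const testTimeout = resolveInactivityMs({ WEAVE_INACTIVITY_MS: '200' });
const testInterval = resolveCheckIntervalMs(testTimeout);
const prodTimeout = resolveInactivityMs({});
const prodInterval = resolveCheckIntervalMs(prodTimeout);
```

Passing an env object instead of reading `process.env` directly keeps the sketch testable; a regression test can then use a timeout of a few hundred milliseconds rather than waiting ten minutes.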

Deterministic, and strictly fewer orphaned traces than before. Residual orphan
risk only from the rare herd race (a follow-up emit-on-miss fallback will degrade
that to "emitted but not nested" rather than lost) and hard crash/OOM.

Tests: 44/44 (was 43; +1 inactivity-guard regression using WEAVE_INACTIVITY_MS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@rgao-coreweave rgao-coreweave left a comment

LGTM, thanks for working on this!

Comment thread src/version.mjs
@@ -1,4 +1,4 @@
// BEGIN AUTO-MANAGED VERSION
// This section is maintained by release automation. Do not edit manually.
export const VERSION = '0.2.7';

This section is maintained by release automation. Do not edit manually.

I think making it rc is good so that when you build a dev version you know the rc is running. Okay for this PR and I will follow up improving the versioning

Comment thread package.json
@@ -1,6 +1,6 @@
{
"name": "weave-claude-code",
"version": "0.2.7",

This section is maintained by release automation. Do not edit manually.

I think making it rc is good so that when you build a dev version you know the rc is running. Okay for this PR and I will follow up improving the versioning

Comment thread hooks/hook-handler.sh Outdated

if ! is_daemon_alive; then
weave-claude-code daemon >> "${ERROR_LOG}" 2>&1 &
# Detach the daemon from the spawning session. The daemon is started lazily by

This looks like a different issue. Would recommend breaking it into a different PR for the future. Okay for this PR.

…review

@rgao-coreweave noted the hook-handler SIGHUP-detach is a separate concern from the
cross-session daemon fix. Removing it here so #86 stays focused on the daemon
(cross-session nesting + re-spawn FIFO + inactivity guard); detach moves to its own PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@titothedeveloper

I have read the CLA Document and I hereby sign the CLA

@titothedeveloper

Thanks for the review @rgao-coreweave! Addressed:

  • version.mjs / package.json (rc bump): kept as-is per your note — agreed it's useful to know the rc is the one running in dev builds. Appreciate you taking the versioning-automation follow-up.
  • hook-handler.sh detach: agreed it's a separate concern. Pulled it out of this PR (now daemon-only: cross-session nesting + re-spawn FIFO queue + inactivity guard) and moved the SIGHUP-detach to its own PR → fix(hooks): detach daemon from spawning session so terminal SIGHUP can't kill it mid-run #87.

So #86 is now scoped strictly to the daemon. Will sign the CLA to clear that check.

@titothedeveloper titothedeveloper merged commit 00f7920 into main Jun 11, 2026
3 of 4 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 11, 2026