Skip to content

feat: tool timeouts + recovery signature + operator-cancel reaction#87

Open
spashii wants to merge 4 commits into
mainfrom
sam/timeout-awareness-and-state
Open

feat: tool timeouts + recovery signature + operator-cancel reaction#87
spashii wants to merge 4 commits into
mainfrom
sam/timeout-awareness-and-state

Conversation

@spashii
Copy link
Copy Markdown
Member

@spashii spashii commented May 25, 2026

What this changes

  • bash tool budget raised from 120s to 300s and now returns a structured TIMEOUT after Ns (wall-clock kill, process killed) payload with a hint to diagnose env instead of retrying the same syntax — adk_runner.py
  • Audit log records error_type: "timeout" automatically when a tool output starts with the TIMEOUT after prefix; greppable for the first time without string-matching the human-readable line — audit.py
  • daily-maintenance §1 gets a new timeout / repeat-failure scan that surfaces high-cost failure shapes (timeouts and ≥2 same-shape failures per session); independent of the friction scan above it — daily-maintenance/skill.md
  • _safety_net_journal_entry writes last_failure_signature: into the journal frontmatter when the audit log shows a repeat-failure pattern for the dead session — session.py
  • Recovered sessions get the prior failure signature in their preamble: "Prior failure mode — do NOT retry this shape: tool=bash error_type=timeout repeated=5×, command=pip install pip-audit ...". Cross-session state is no longer blind — session.py
  • New set_status tool replaces the bash-curl-to-setStatus pattern; calls assistant.threads.setStatus directly with channel+thread bound at closure time. Audited cleanly under tool: "set_status"adk_runner.py
  • respond() auto-clears the status indicator before posting the final reply. Sam never needs to remember to clear status; the runtime does it. Closed the recurring 2026-05-13 slip class — adk_runner.py
  • New capability src/capabilities/tool-timeouts.md names the contract: budgets per tool, what a TIMEOUT means and doesn't mean, the don't-retry-the-same-shape rule, diagnose-env steps, and the journal-stub → recovery-preamble flow — tool-timeouts.md
  • slack.md updated: SET-mid-session / CLEAR-on-respond contract, the set_status tool replaces bash-curl, and the new operator-cancel reaction — slack.md
  • Operator can stop an in-flight session by adding :no_entry: to the original Slack message. Event-driven via the existing reaction_added subscription, no polling. Cleanup: lifecycle stamps :no_entry: as terminal, daemon posts a brief "cancelled by operator" note, journal entry records status: cancelled with the last_failure_signature if any, cursor advances so boot-replay doesn't re-fire — daemon.py
  • 12 new tests covering: error_type inference, record_tool_call field, audit-log read-back, repeat-failure signature detection, bash TIMEOUT prefix contract, safety-net journal stub shape, prior-failure-signature lookup across journal files filtered by thread_ts — test_timeout_awareness.py

What Sam noticed that led to this

The 2026-05-25 incident: a session spent ~10 min retrying git clone / pip install pip-audit against an unreachable PyPI (5 bash timeouts at 120s each) before pivoting to the correct approach — writing the security-audit workflow file in the target repo's CI. Sam did self-correct, but slowly, and the operator saw the response only after ~1 hour. The Vertex _ResourceExhaustedError at 19:13:37Z (Gemini quota mid-journal-write) was incidental; the real cost was the retry-loop.

Five gaps surfaced from that incident, confirmed by reading the code:

  1. Bash timeout returned (command timed out after 120s) — plain string, no structure, no hint to diagnose env vs syntax.
  2. Audit log recorded the timeout as the output string only — no error_type field, no greppable structured shape.
  3. daily-maintenance aggregated tool calls by (tool, first_token) — successful and failed bash calls merged. No timeout-specific surfacing.
  4. _safety_net_journal_entry wrote a minimal stub with status only — no signature of what the prior session died doing.
  5. _format_recovered_preamble said "you're picking this up late" but didn't carry forward the prior session's failure mode — recovered sessions retried blind.

Separately, the operator's "i think adding ❌ reaction check is good to cancel it. polling not the right soln tho event driven with queue management is key" directive: the daemon already subscribes to reaction_added events; wiring :no_entry: from the principal operator to a task.cancel() is the minimum-viable event-driven cancel — no new polling loop required.

Tier

Tier 3 — touches src/runtime/{adk_runner,audit,daemon,session}.py. Tier 1 prose in src/capabilities/tool-timeouts.md (new), src/capabilities/slack.md, and src/skills/daily-maintenance/skill.md. The Tier 3 changes compose into one concept ("honest state across timeout/cancel boundaries") which the operator explicitly authorized bundling into a single PR.

What does NOT change

  • No new top-level config keys; BASH_TIMEOUT_SECONDS and DEFAULT_TOOL_TIMEOUT_SECONDS live in adk_runner.py as module constants. Can be promoted to config.py later if other call sites need them.
  • No backwards-compat shims: the (command timed out after 120s) literal is replaced wholesale. Any audit-log readers grepping that exact string will need to switch to TIMEOUT after (or to error_type=timeout); only daily-maintenance does this and it's updated in this PR.
  • No changes to other tool budgets (grep, fetch_url, etc.) — they stay at their existing values.
  • No in-session repeat-failure injection (the "synthetic system message at N=3" idea from earlier discussion) — that's a separate Tier 3 PR. This one preserves state across sessions; in-session intervention is the next concept.

Confidence

High on the audit-log + journal + recovery-preamble path — 10 tests cover the happy path, the negative cases, and the cross-thread filter. Medium on the operator-cancel path — the asyncio task-cancellation cleanup is exercised by integration only (no unit test for the daemon-worker race today). The cancel mechanism honors cancel at the next natural await boundary; in-flight bash calls run to their 300s ceiling before cancellation propagates. That's acceptable for v1; if sub-bash cancellation is needed, a follow-up PR can race process.wait() against a cancel event.

@spashii spashii added capabilities Touches src/capabilities/* runtime Touches src/runtime/* or top-level config skill Touches src/skills/* labels May 25, 2026
@spashii spashii enabled auto-merge May 25, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

capabilities Touches src/capabilities/* runtime Touches src/runtime/* or top-level config skill Touches src/skills/*

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant