feat: tool timeouts + recovery signature + operator-cancel reaction by spashii · Pull Request #87 · Dembrane/sam

spashii · 2026-05-25T19:28:47Z

What this changes

bash tool budget raised from 120s to 300s and now returns a structured TIMEOUT after Ns (wall-clock kill, process killed) payload with a hint to diagnose env instead of retrying the same syntax — adk_runner.py
Audit log records error_type: "timeout" automatically when a tool output starts with the TIMEOUT after prefix; greppable for the first time without string-matching the human-readable line — audit.py
daily-maintenance §1 gets a new timeout / repeat-failure scan that surfaces high-cost failure shapes (timeouts and ≥2 same-shape failures per session); independent of the friction scan above it — daily-maintenance/skill.md
_safety_net_journal_entry writes last_failure_signature: into the journal frontmatter when the audit log shows a repeat-failure pattern for the dead session — session.py
Recovered sessions get the prior failure signature in their preamble: "Prior failure mode — do NOT retry this shape: tool=bash error_type=timeout repeated=5×, command=pip install pip-audit ...". Cross-session state is no longer blind — session.py
New set_status tool replaces the bash-curl-to-setStatus pattern; calls assistant.threads.setStatus directly with channel+thread bound at closure time. Audited cleanly under tool: "set_status" — adk_runner.py
respond() auto-clears the status indicator before posting the final reply. Sam never needs to remember to clear status; the runtime does it. Closed the recurring 2026-05-13 slip class — adk_runner.py
New capability src/capabilities/tool-timeouts.md names the contract: budgets per tool, what a TIMEOUT means and doesn't mean, the don't-retry-the-same-shape rule, diagnose-env steps, and the journal-stub → recovery-preamble flow — tool-timeouts.md
slack.md updated: SET-mid-session / CLEAR-on-respond contract, the set_status tool replaces bash-curl, and the new operator-cancel reaction — slack.md
Operator can stop an in-flight session by adding :no_entry: to the original Slack message. Event-driven via the existing reaction_added subscription, no polling. Cleanup: lifecycle stamps :no_entry: as terminal, daemon posts a brief "cancelled by operator" note, journal entry records status: cancelled with the last_failure_signature if any, cursor advances so boot-replay doesn't re-fire — daemon.py
12 new tests covering: error_type inference, record_tool_call field, audit-log read-back, repeat-failure signature detection, bash TIMEOUT prefix contract, safety-net journal stub shape, prior-failure-signature lookup across journal files filtered by thread_ts — test_timeout_awareness.py

What Sam noticed that led to this

The 2026-05-25 incident: a session spent ~10 min retrying git clone / pip install pip-audit against an unreachable PyPI (5 bash timeouts at 120s each) before pivoting to the correct approach — writing the security-audit workflow file in the target repo's CI. Sam did self-correct, but slowly, and the operator saw the response only after ~1 hour. The Vertex _ResourceExhaustedError at 19:13:37Z (Gemini quota mid-journal-write) was incidental; the real cost was the retry-loop.

Five gaps surfaced from that incident, confirmed by reading the code:

Bash timeout returned (command timed out after 120s) — plain string, no structure, no hint to diagnose env vs syntax.
Audit log recorded the timeout as the output string only — no error_type field, no greppable structured shape.
daily-maintenance aggregated tool calls by (tool, first_token) — successful and failed bash calls merged. No timeout-specific surfacing.
_safety_net_journal_entry wrote a minimal stub with status only — no signature of what the prior session died doing.
_format_recovered_preamble said "you're picking this up late" but didn't carry forward the prior session's failure mode — recovered sessions retried blind.

Separately, the operator's "i think adding ❌ reaction check is good to cancel it. polling not the right soln tho event driven with queue management is key" directive: the daemon already subscribes to reaction_added events; wiring :no_entry: from the principal operator to a task.cancel() is the minimum-viable event-driven cancel — no new polling loop required.

Tier

Tier 3 — touches src/runtime/{adk_runner,audit,daemon,session}.py. Tier 1 prose in src/capabilities/tool-timeouts.md (new), src/capabilities/slack.md, and src/skills/daily-maintenance/skill.md. The Tier 3 changes compose into one concept ("honest state across timeout/cancel boundaries") which the operator explicitly authorized bundling into a single PR.

What does NOT change

No new top-level config keys; BASH_TIMEOUT_SECONDS and DEFAULT_TOOL_TIMEOUT_SECONDS live in adk_runner.py as module constants. Can be promoted to config.py later if other call sites need them.
No backwards-compat shims: the (command timed out after 120s) literal is replaced wholesale. Any audit-log readers grepping that exact string will need to switch to TIMEOUT after (or to error_type=timeout); only daily-maintenance does this and it's updated in this PR.
No changes to other tool budgets (grep, fetch_url, etc.) — they stay at their existing values.
No in-session repeat-failure injection (the "synthetic system message at N=3" idea from earlier discussion) — that's a separate Tier 3 PR. This one preserves state across sessions; in-session intervention is the next concept.

Confidence

High on the audit-log + journal + recovery-preamble path — 10 tests cover the happy path, the negative cases, and the cross-thread filter. Medium on the operator-cancel path — the asyncio task-cancellation cleanup is exercised by integration only (no unit test for the daemon-worker race today). The cancel mechanism honors cancel at the next natural await boundary; in-flight bash calls run to their 300s ceiling before cancellation propagates. That's acceptable for v1; if sub-bash cancellation is needed, a follow-up PR can race process.wait() against a cancel event.

…t re-queued on boot

…rincipal operator

feat: tool timeouts + recovery signature + operator-cancel reaction

9a38f5d

spashii added capabilities Touches src/capabilities/* runtime Touches src/runtime/* or top-level config skill Touches src/skills/* labels May 25, 2026

spashii enabled auto-merge May 25, 2026 19:28

spashii added 3 commits May 25, 2026 21:34

fix(daemon): add ⛔ to _TERMINAL_REACTIONS so cancelled messages aren'…

1d10fa6

…t re-queued on boot

feat: ack tool + synthetic audit entry on operator-cancel

93b03e2

fix(daemon): ⛔ cancel works for anyone in the channel, not just the p…

9f11e21

…rincipal operator

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: tool timeouts + recovery signature + operator-cancel reaction#87

feat: tool timeouts + recovery signature + operator-cancel reaction#87
spashii wants to merge 4 commits into
mainfrom
sam/timeout-awareness-and-state

spashii commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

spashii commented May 25, 2026

What this changes

What Sam noticed that led to this

Tier

What does NOT change

Confidence

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant