feat: tool timeouts + recovery signature + operator-cancel reaction#87
Open
spashii wants to merge 4 commits into
Open
feat: tool timeouts + recovery signature + operator-cancel reaction#87spashii wants to merge 4 commits into
spashii wants to merge 4 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this changes
bashtool budget raised from 120s to 300s and now returns a structuredTIMEOUT after Ns (wall-clock kill, process killed)payload with a hint to diagnose env instead of retrying the same syntax —adk_runner.pyerror_type: "timeout"automatically when a tool output starts with theTIMEOUT afterprefix; greppable for the first time without string-matching the human-readable line —audit.pydaily-maintenance§1 gets a new timeout / repeat-failure scan that surfaces high-cost failure shapes (timeouts and ≥2 same-shape failures per session); independent of the friction scan above it —daily-maintenance/skill.md_safety_net_journal_entrywriteslast_failure_signature:into the journal frontmatter when the audit log shows a repeat-failure pattern for the dead session —session.pypip install pip-audit ...". Cross-session state is no longer blind —session.pyset_statustool replaces the bash-curl-to-setStatus pattern; callsassistant.threads.setStatusdirectly with channel+thread bound at closure time. Audited cleanly undertool: "set_status"—adk_runner.pyrespond()auto-clears the status indicator before posting the final reply. Sam never needs to remember to clear status; the runtime does it. Closed the recurring 2026-05-13 slip class —adk_runner.pysrc/capabilities/tool-timeouts.mdnames the contract: budgets per tool, what a TIMEOUT means and doesn't mean, the don't-retry-the-same-shape rule, diagnose-env steps, and the journal-stub → recovery-preamble flow —tool-timeouts.mdslack.mdupdated: SET-mid-session / CLEAR-on-respond contract, theset_statustool replaces bash-curl, and the new operator-cancel reaction —slack.md:no_entry:to the original Slack message. Event-driven via the existingreaction_addedsubscription, no polling. Cleanup: lifecycle stamps:no_entry:as terminal, daemon posts a brief "cancelled by operator" note, journal entry recordsstatus: cancelledwith the last_failure_signature if any, cursor advances so boot-replay doesn't re-fire —daemon.pytest_timeout_awareness.pyWhat Sam noticed that led to this
The 2026-05-25 incident: a session spent ~10 min retrying
git clone/pip install pip-auditagainst an unreachable PyPI (5 bash timeouts at 120s each) before pivoting to the correct approach — writing the security-audit workflow file in the target repo's CI. Sam did self-correct, but slowly, and the operator saw the response only after ~1 hour. The Vertex_ResourceExhaustedErrorat 19:13:37Z (Gemini quota mid-journal-write) was incidental; the real cost was the retry-loop.Five gaps surfaced from that incident, confirmed by reading the code:
(command timed out after 120s)— plain string, no structure, no hint to diagnose env vs syntax.error_typefield, no greppable structured shape.daily-maintenanceaggregated tool calls by (tool, first_token) — successful and failed bash calls merged. No timeout-specific surfacing._safety_net_journal_entrywrote a minimal stub withstatusonly — no signature of what the prior session died doing._format_recovered_preamblesaid "you're picking this up late" but didn't carry forward the prior session's failure mode — recovered sessions retried blind.Separately, the operator's "i think adding ❌ reaction check is good to cancel it. polling not the right soln tho event driven with queue management is key" directive: the daemon already subscribes to
reaction_addedevents; wiring:no_entry:from the principal operator to atask.cancel()is the minimum-viable event-driven cancel — no new polling loop required.Tier
Tier 3 — touches
src/runtime/{adk_runner,audit,daemon,session}.py. Tier 1 prose insrc/capabilities/tool-timeouts.md(new),src/capabilities/slack.md, andsrc/skills/daily-maintenance/skill.md. The Tier 3 changes compose into one concept ("honest state across timeout/cancel boundaries") which the operator explicitly authorized bundling into a single PR.What does NOT change
BASH_TIMEOUT_SECONDSandDEFAULT_TOOL_TIMEOUT_SECONDSlive inadk_runner.pyas module constants. Can be promoted toconfig.pylater if other call sites need them.(command timed out after 120s)literal is replaced wholesale. Any audit-log readers grepping that exact string will need to switch toTIMEOUT after(or toerror_type=timeout); onlydaily-maintenancedoes this and it's updated in this PR.grep,fetch_url, etc.) — they stay at their existing values.Confidence
High on the audit-log + journal + recovery-preamble path — 10 tests cover the happy path, the negative cases, and the cross-thread filter. Medium on the operator-cancel path — the asyncio task-cancellation cleanup is exercised by integration only (no unit test for the daemon-worker race today). The cancel mechanism honors cancel at the next natural await boundary; in-flight bash calls run to their 300s ceiling before cancellation propagates. That's acceptable for v1; if sub-bash cancellation is needed, a follow-up PR can race
process.wait()against a cancel event.