fix(acp): override astep to keep post-prompt callbacks on caller thread#3358
fix(acp): override astep to keep post-prompt callbacks on caller thread#3358simonrosenberg wants to merge 1 commit into
Conversation
ACPAgent inherited the default AgentBase.astep, which routes the synchronous step() through loop.run_in_executor(None, self.step, ...). That moves every post-prompt callback (including telemetry's stats_callback, which re-enters state.lock) onto a worker thread — while LocalConversation.arun holds the conversation state's reentrant FIFOLock on the loop thread. The callback then blocks on a lock owned by the thread that is itself awaiting astep to return, deadlocking the conversation. See #3348 for the original symptom and #3350 for the architectural diagnosis. The fix overrides astep natively: the ACP conn.prompt(...) round-trip is scheduled on the portal loop via BlockingPortal.start_task_soon, and the result is awaited back on the caller's loop via asyncio.wrap_future. All post-prompt work — _record_usage, _finalize_successful_turn, on_event emission, status flips — runs on the caller thread, where the FIFOLock owner already lives, so re-entrant acquires succeed. Both step() and astep() now share extracted helpers (_do_acp_prompt, _finalize_successful_turn, _emit_turn_timeout, _emit_turn_error, _clear_turn_callbacks). Adds three regression tests in TestACPAgentAstep: - astep overrides AgentBase.astep (structural guarantee) - post-prompt on_event callbacks fire on caller thread, not portal - astep completes when invoked while caller holds state.lock Fixes #3348 Closes #3350 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟡 Acceptable. I found two concurrency gaps that should be addressed before merge. Because this changes ACP agent execution behavior, I’m leaving a COMMENT for human maintainer/eval follow-up rather than approving.
Risk: 🟡 MEDIUM — ACP async execution/interrupt behavior can affect live conversations.
AI-generated review by OpenHands on behalf of the requester.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26260099305
|
|
||
| if not response_text: | ||
| response_text = "(No response from ACP server)" | ||
| self._reset_client_for_turn(on_token, on_event) |
There was a problem hiding this comment.
🟠 Important: astep() still wires the raw on_event into _OpenHandsACPBridge before scheduling the portal-loop prompt. session_update() emits ACPToolCallEvents synchronously on the portal thread, so an ACP turn that starts/progresses a tool call can still run conversation callbacks from a different thread while arun() owns state.lock on the caller loop. That leaves the same deadlock/data-race class for live tool events, and the new tests don’t exercise it because the fake prompt only appends text. For the async path, please marshal bridge on_event/on_token back onto the caller loop (preserving ordering), or otherwise avoid live portal-thread state callbacks.
| self._finalize_successful_turn(response, elapsed, state, on_event) | ||
| except TimeoutError: | ||
| self._emit_turn_timeout(time.monotonic() - t0, state, on_event) | ||
| except Exception as e: |
There was a problem hiding this comment.
🟠 Important: asyncio.CancelledError bypasses both handlers here and goes straight to finally, which clears callbacks without calling _cancel_inflight_tool_calls(). If an interrupt lands after live ACPToolCallEvents for pending/in_progress tool calls, those cards remain orphaned; LocalConversation._emit_orphaned_action_errors() only patches ActionEvents, not ACPToolCallEvents. Please catch asyncio.CancelledError, emit terminal failed events for in-flight ACP tool calls on the caller thread, then re-raise so arun() can transition to PAUSED.
Summary
Architectural fix for #3348:
ACPAgentnow overridesastep()natively instead of inheritingAgentBase.astep'sloop.run_in_executor(None, self.step, ...)wrapper.The inherited default moved every post-prompt callback onto an executor worker thread, while
LocalConversation.arunstill held the conversation state's reentrantFIFOLockon the loop thread. Any state-lock acquire insidestep()'s call chain (today:stats_callback; tomorrow: any callback added to LLM telemetry) then deadlocked — the worker waited on a lock owned by the loop thread that was itselfawait-ingastepto return.#3349plugged today's specific manifestation by removing onewith state:; this PR removes the entire class of deadlock by structurally aligning the lock-holder thread with the callback thread.Why this is the structural fix
After this change, for the
arunpath: lock holder thread == step thread == on_event callback thread.Concretely:
self._conn.prompt(...)still runs on the AnyIOBlockingPortal's loop (where the connection lives, where the_OpenHandsACPBridge.session_updatedispatcher lives) — scheduled viaBlockingPortal.start_task_soon(self._do_acp_prompt, prompt_blocks).asyncio.wrap_future(future), withasyncio.wait_for(..., timeout=self.acp_prompt_timeout)enforcing the same per-attempt timeout as the sync path._record_usage→ telemetry callbacks →on_event(ActionEvent)→on_event(ObservationEvent)→state.execution_status = FINISHED) runs on the caller's loop thread.step()andastep()now share extracted helpers so error/timeout/finalize semantics stay symmetric:async def _do_acp_prompt(self, prompt_blocks)— singleawait conn.prompt + usage_sync.wait(); always on portal loop.def _finalize_successful_turn(...)— sync post-prompt bookkeeping + event emission.def _emit_turn_timeout(...),def _emit_turn_error(...)— sync error handlers.def _clear_turn_callbacks(...)—finally-block cleanup.Issue Number
Fixes #3348
Closes #3350
Related: #3349 (hotfix this PR makes redundant — see "End-to-end validation")
How to Test
Unit tests
Three new regression tests in
tests/sdk/agent/test_acp_agent.py::TestACPAgentAstep:test_astep_overrides_default_agentbase_implementation— assertsACPAgent.astep is not AgentBase.astep. Structural guarantee that no future refactor can quietly reintroduce the inherited wrapper.test_astep_runs_post_prompt_callbacks_on_caller_thread— spins up a realAsyncExecutor/BlockingPortal, callsastep, asserts the mockedconn.promptran on the portal thread AND everyon_eventcallback (ActionEvent + ObservationEvent) ran on the caller thread. This is the structural property that defeats the deadlock.test_astep_does_not_deadlock_under_reentrant_state_lock— mirrorsLocalConversation.arun's shape: holdsstate.lockon the loop thread acrossawait astep(...), with anon_eventcallback that also takesstate.lock. Must complete within 10s. Fails (hangs) under the old defaultAgentBase.astep.Suite:
End-to-end validation
Confirmed working. This branch was branched directly from
mainand deliberately does NOT include the#3349hotfix —event_service.py:531still has the redundantwith state:block that triggered the original deadlock. A successful end-to-end ACP turn therefore proves the architectural fix in this PR is sufficient on its own.Reproduction:
Pinned
agent-canvasto commit50453a350(this PR's HEAD, includes only the astep override).Started the dev stack (
OH_AGENT_SERVER_LOCAL_PATH=…feat-acp-native-astep npm run dev).Started a Claude Code ACP conversation via
POST /api/acp/conversationswith prompt: "Reply with exactly: ASTEP VALIDATION OK".Status transitions observed via polling
/api/conversations/{id}:Agent-server log confirmed the
astep(async) path was used — not the legacy syncstep:(Line
1692is unique toACPAgent.astep's success log — the sync path's log is at1566.)FinishActionemitted with the expected payload ("ASTEP VALIDATION OK"),execution_statustransitioned cleanly tofinished.Conclusion: with the buggy
with state:block still in place, the conversation completes end-to-end. The architectural fix removes the deadlock at its source.Type
Bug fix (architectural)
Notes
step()retains its existing shape:self._executor.run_async(_prompt, timeout=...)via a closure over_do_acp_prompt(prompt_blocks)— keeps the call signature stable for existing mock-based tests.astepwithawait asyncio.sleep(delay)instead oftime.sleep(delay)so retries don't block the caller's loop.asyncio.TimeoutErroris aliased to builtinTimeoutErroron Python 3.11+, so a singleexcept TimeoutError:clause catches bothasyncio.wait_fortimeouts and re-raised ones.🤖 Generated with Claude Code
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:50453a3-pythonRun
All tags pushed for this build
About Multi-Architecture Support
50453a3-python) is a multi-arch manifest supporting both amd64 and arm6450453a3-python-amd64) are also available if needed