Skip to content

fix(acp): override astep to keep post-prompt callbacks on caller thread#3358

Open
simonrosenberg wants to merge 1 commit into
mainfrom
feat-acp-native-astep
Open

fix(acp): override astep to keep post-prompt callbacks on caller thread#3358
simonrosenberg wants to merge 1 commit into
mainfrom
feat-acp-native-astep

Conversation

@simonrosenberg
Copy link
Copy Markdown
Collaborator

@simonrosenberg simonrosenberg commented May 21, 2026

Summary

Architectural fix for #3348: ACPAgent now overrides astep() natively instead of inheriting AgentBase.astep's loop.run_in_executor(None, self.step, ...) wrapper.

The inherited default moved every post-prompt callback onto an executor worker thread, while LocalConversation.arun still held the conversation state's reentrant FIFOLock on the loop thread. Any state-lock acquire inside step()'s call chain (today: stats_callback; tomorrow: any callback added to LLM telemetry) then deadlocked — the worker waited on a lock owned by the loop thread that was itself await-ing astep to return. #3349 plugged today's specific manifestation by removing one with state:; this PR removes the entire class of deadlock by structurally aligning the lock-holder thread with the callback thread.

Why this is the structural fix

After this change, for the arun path: lock holder thread == step thread == on_event callback thread.

Concretely:

  • self._conn.prompt(...) still runs on the AnyIO BlockingPortal's loop (where the connection lives, where the _OpenHandsACPBridge.session_update dispatcher lives) — scheduled via BlockingPortal.start_task_soon(self._do_acp_prompt, prompt_blocks).
  • The result is awaited back on the caller's loop via asyncio.wrap_future(future), with asyncio.wait_for(..., timeout=self.acp_prompt_timeout) enforcing the same per-attempt timeout as the sync path.
  • All post-prompt work (_record_usage → telemetry callbacks → on_event(ActionEvent)on_event(ObservationEvent)state.execution_status = FINISHED) runs on the caller's loop thread.

step() and astep() now share extracted helpers so error/timeout/finalize semantics stay symmetric:

  • async def _do_acp_prompt(self, prompt_blocks) — single await conn.prompt + usage_sync.wait(); always on portal loop.
  • def _finalize_successful_turn(...) — sync post-prompt bookkeeping + event emission.
  • def _emit_turn_timeout(...), def _emit_turn_error(...) — sync error handlers.
  • def _clear_turn_callbacks(...)finally-block cleanup.

Issue Number

Fixes #3348
Closes #3350
Related: #3349 (hotfix this PR makes redundant — see "End-to-end validation")

How to Test

Unit tests

Three new regression tests in tests/sdk/agent/test_acp_agent.py::TestACPAgentAstep:

  1. test_astep_overrides_default_agentbase_implementation — asserts ACPAgent.astep is not AgentBase.astep. Structural guarantee that no future refactor can quietly reintroduce the inherited wrapper.
  2. test_astep_runs_post_prompt_callbacks_on_caller_thread — spins up a real AsyncExecutor/BlockingPortal, calls astep, asserts the mocked conn.prompt ran on the portal thread AND every on_event callback (ActionEvent + ObservationEvent) ran on the caller thread. This is the structural property that defeats the deadlock.
  3. test_astep_does_not_deadlock_under_reentrant_state_lock — mirrors LocalConversation.arun's shape: holds state.lock on the loop thread across await astep(...), with an on_event callback that also takes state.lock. Must complete within 10s. Fails (hangs) under the old default AgentBase.astep.

Suite:

uv run --project openhands-sdk pytest tests/sdk/agent/ -x
# 465 passed

End-to-end validation

Confirmed working. This branch was branched directly from main and deliberately does NOT include the #3349 hotfixevent_service.py:531 still has the redundant with state: block that triggered the original deadlock. A successful end-to-end ACP turn therefore proves the architectural fix in this PR is sufficient on its own.

Reproduction:

  • Pinned agent-canvas to commit 50453a350 (this PR's HEAD, includes only the astep override).

  • Started the dev stack (OH_AGENT_SERVER_LOCAL_PATH=…feat-acp-native-astep npm run dev).

  • Started a Claude Code ACP conversation via POST /api/acp/conversations with prompt: "Reply with exactly: ASTEP VALIDATION OK".

  • Status transitions observed via polling /api/conversations/{id}:

    02:01:06  status=running
    02:01:08  status=finished        # 2s poll cycle
    
  • Agent-server log confirmed the astep (async) path was used — not the legacy sync step:

    [02:01:04] INFO  Sending ACP prompt (timeout=1800s, blocks=1, async)  acp_agent.py:1620
    [02:01:08] INFO  ACP prompt returned in 4.1s (async)                   acp_agent.py:1692
    

    (Line 1692 is unique to ACPAgent.astep's success log — the sync path's log is at 1566.)

  • FinishAction emitted with the expected payload ("ASTEP VALIDATION OK"), execution_status transitioned cleanly to finished.

Conclusion: with the buggy with state: block still in place, the conversation completes end-to-end. The architectural fix removes the deadlock at its source.

Type

Bug fix (architectural)

Notes

  • Sync step() retains its existing shape: self._executor.run_async(_prompt, timeout=...) via a closure over _do_acp_prompt(prompt_blocks) — keeps the call signature stable for existing mock-based tests.
  • The retry/backoff loop is mirrored in astep with await asyncio.sleep(delay) instead of time.sleep(delay) so retries don't block the caller's loop.
  • asyncio.TimeoutError is aliased to builtin TimeoutError on Python 3.11+, so a single except TimeoutError: clause catches both asyncio.wait_for timeouts and re-raised ones.

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:50453a3-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-50453a3-python \
  ghcr.io/openhands/agent-server:50453a3-python

All tags pushed for this build

ghcr.io/openhands/agent-server:50453a3-golang-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang-amd64
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:50453a3-golang-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang-arm64
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:50453a3-java-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-java-amd64
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:50453a3-java-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-java-arm64
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:50453a3-python-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-python-amd64
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:50453a3-python-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-python-arm64
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:50453a3-golang
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:50453a3-java
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java
ghcr.io/openhands/agent-server:feat-acp-native-astep-java
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:50453a3-python
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python
ghcr.io/openhands/agent-server:feat-acp-native-astep-python
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 50453a3-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 50453a3-python-amd64) are also available if needed

ACPAgent inherited the default AgentBase.astep, which routes the
synchronous step() through loop.run_in_executor(None, self.step, ...).
That moves every post-prompt callback (including telemetry's
stats_callback, which re-enters state.lock) onto a worker thread —
while LocalConversation.arun holds the conversation state's reentrant
FIFOLock on the loop thread.  The callback then blocks on a lock owned
by the thread that is itself awaiting astep to return, deadlocking
the conversation.  See #3348 for the original symptom and #3350 for
the architectural diagnosis.

The fix overrides astep natively: the ACP conn.prompt(...) round-trip
is scheduled on the portal loop via BlockingPortal.start_task_soon,
and the result is awaited back on the caller's loop via
asyncio.wrap_future.  All post-prompt work — _record_usage,
_finalize_successful_turn, on_event emission, status flips — runs on
the caller thread, where the FIFOLock owner already lives, so
re-entrant acquires succeed.

Both step() and astep() now share extracted helpers
(_do_acp_prompt, _finalize_successful_turn, _emit_turn_timeout,
_emit_turn_error, _clear_turn_callbacks).

Adds three regression tests in TestACPAgentAstep:
- astep overrides AgentBase.astep (structural guarantee)
- post-prompt on_event callbacks fire on caller thread, not portal
- astep completes when invoked while caller holds state.lock

Fixes #3348
Closes #3350

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/agent
   acp_agent.py6776989%341–343, 473–474, 507, 509, 513, 517, 525, 588–589, 594, 661, 814, 817–818, 835–836, 865, 870, 888, 898, 919–922, 1088–1091, 1095–1097, 1100–1104, 1106, 1266, 1612–1614, 1642, 1647, 1650–1652, 1655, 1663–1665, 1667–1669, 1673, 1676, 1685–1687, 1689, 1694–1698, 1790–1791
TOTAL27776826670% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable. I found two concurrency gaps that should be addressed before merge. Because this changes ACP agent execution behavior, I’m leaving a COMMENT for human maintainer/eval follow-up rather than approving.

Risk: 🟡 MEDIUM — ACP async execution/interrupt behavior can affect live conversations.

AI-generated review by OpenHands on behalf of the requester.


Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26260099305


if not response_text:
response_text = "(No response from ACP server)"
self._reset_client_for_turn(on_token, on_event)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: astep() still wires the raw on_event into _OpenHandsACPBridge before scheduling the portal-loop prompt. session_update() emits ACPToolCallEvents synchronously on the portal thread, so an ACP turn that starts/progresses a tool call can still run conversation callbacks from a different thread while arun() owns state.lock on the caller loop. That leaves the same deadlock/data-race class for live tool events, and the new tests don’t exercise it because the fake prompt only appends text. For the async path, please marshal bridge on_event/on_token back onto the caller loop (preserving ordering), or otherwise avoid live portal-thread state callbacks.

self._finalize_successful_turn(response, elapsed, state, on_event)
except TimeoutError:
self._emit_turn_timeout(time.monotonic() - t0, state, on_event)
except Exception as e:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: asyncio.CancelledError bypasses both handlers here and goes straight to finally, which clears callbacks without calling _cancel_inflight_tool_calls(). If an interrupt lands after live ACPToolCallEvents for pending/in_progress tool calls, those cards remain orphaned; LocalConversation._emit_orphaned_action_errors() only patches ActionEvents, not ACPToolCallEvents. Please catch asyncio.CancelledError, emit terminal failed events for in-flight ACP tool calls on the caller thread, then re-raise so arun() can transition to PAUSED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants