fix(acp): override astep to keep post-prompt callbacks on caller thread by simonrosenberg · Pull Request #3358 · OpenHands/software-agent-sdk

simonrosenberg · 2026-05-21T23:55:55Z

Summary

Architectural fix for #3348: ACPAgent now overrides astep() natively instead of inheriting AgentBase.astep's loop.run_in_executor(None, self.step, ...) wrapper.

The inherited default moved every post-prompt callback onto an executor worker thread, while LocalConversation.arun still held the conversation state's reentrant FIFOLock on the loop thread. Any state-lock acquire inside step()'s call chain (today: stats_callback; tomorrow: any callback added to LLM telemetry) then deadlocked — the worker waited on a lock owned by the loop thread that was itself await-ing astep to return. #3349 plugged today's specific manifestation by removing one with state:; this PR removes the entire class of deadlock by structurally aligning the lock-holder thread with the callback thread.

Why this is the structural fix

After this change, for the arun path: lock holder thread == step thread == on_event callback thread.

Concretely:

self._conn.prompt(...) still runs on the AnyIO BlockingPortal's loop (where the connection lives, where the _OpenHandsACPBridge.session_update dispatcher lives) — scheduled via BlockingPortal.start_task_soon(self._do_acp_prompt, prompt_blocks).
The result is awaited back on the caller's loop via asyncio.wrap_future(future), with asyncio.wait_for(..., timeout=self.acp_prompt_timeout) enforcing the same per-attempt timeout as the sync path.
All post-prompt work (_record_usage → telemetry callbacks → on_event(ActionEvent) → on_event(ObservationEvent) → state.execution_status = FINISHED) runs on the caller's loop thread.

step() and astep() now share extracted helpers so error/timeout/finalize semantics stay symmetric:

async def _do_acp_prompt(self, prompt_blocks) — single await conn.prompt + usage_sync.wait(); always on portal loop.
def _finalize_successful_turn(...) — sync post-prompt bookkeeping + event emission.
def _emit_turn_timeout(...), def _emit_turn_error(...) — sync error handlers.
def _clear_turn_callbacks(...) — finally-block cleanup.

Issue Number

Fixes #3348
Closes #3350
Related: #3349 (hotfix this PR makes redundant — see "End-to-end validation")

How to Test

Unit tests

Three new regression tests in tests/sdk/agent/test_acp_agent.py::TestACPAgentAstep:

test_astep_overrides_default_agentbase_implementation — asserts ACPAgent.astep is not AgentBase.astep. Structural guarantee that no future refactor can quietly reintroduce the inherited wrapper.
test_astep_runs_post_prompt_callbacks_on_caller_thread — spins up a real AsyncExecutor/BlockingPortal, calls astep, asserts the mocked conn.prompt ran on the portal thread AND every on_event callback (ActionEvent + ObservationEvent) ran on the caller thread. This is the structural property that defeats the deadlock.
test_astep_does_not_deadlock_under_reentrant_state_lock — mirrors LocalConversation.arun's shape: holds state.lock on the loop thread across await astep(...), with an on_event callback that also takes state.lock. Must complete within 10s. Fails (hangs) under the old default AgentBase.astep.

Suite:

uv run --project openhands-sdk pytest tests/sdk/agent/ -x
# 465 passed

End-to-end validation

Confirmed working. This branch was branched directly from main and deliberately does NOT include the #3349 hotfix — event_service.py:531 still has the redundant with state: block that triggered the original deadlock. A successful end-to-end ACP turn therefore proves the architectural fix in this PR is sufficient on its own.

Reproduction:

Pinned agent-canvas to commit 50453a350 (this PR's HEAD, includes only the astep override).
Started the dev stack (OH_AGENT_SERVER_LOCAL_PATH=…feat-acp-native-astep npm run dev).
Started a Claude Code ACP conversation via POST /api/acp/conversations with prompt: "Reply with exactly: ASTEP VALIDATION OK".

Status transitions observed via polling /api/conversations/{id}:

02:01:06  status=running
02:01:08  status=finished        # 2s poll cycle

Agent-server log confirmed the astep (async) path was used — not the legacy sync step:

[02:01:04] INFO  Sending ACP prompt (timeout=1800s, blocks=1, async)  acp_agent.py:1620
[02:01:08] INFO  ACP prompt returned in 4.1s (async)                   acp_agent.py:1692

(Line 1692 is unique to ACPAgent.astep's success log — the sync path's log is at 1566.)

FinishAction emitted with the expected payload ("ASTEP VALIDATION OK"), execution_status transitioned cleanly to finished.

Conclusion: with the buggy with state: block still in place, the conversation completes end-to-end. The architectural fix removes the deadlock at its source.

Type

Bug fix (architectural)

Notes

Sync step() retains its existing shape: self._executor.run_async(_prompt, timeout=...) via a closure over _do_acp_prompt(prompt_blocks) — keeps the call signature stable for existing mock-based tests.
The retry/backoff loop is mirrored in astep with await asyncio.sleep(delay) instead of time.sleep(delay) so retries don't block the caller's loop.
asyncio.TimeoutError is aliased to builtin TimeoutError on Python 3.11+, so a single except TimeoutError: clause catches both asyncio.wait_for timeouts and re-raised ones.

🤖 Generated with Claude Code

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:50453a3-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-50453a3-python \
  ghcr.io/openhands/agent-server:50453a3-python

All tags pushed for this build

ghcr.io/openhands/agent-server:50453a3-golang-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang-amd64
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:50453a3-golang-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang-arm64
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:50453a3-java-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-java-amd64
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:50453a3-java-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-java-arm64
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:50453a3-python-amd64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python-amd64
ghcr.io/openhands/agent-server:feat-acp-native-astep-python-amd64
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:50453a3-python-arm64
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python-arm64
ghcr.io/openhands/agent-server:feat-acp-native-astep-python-arm64
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:50453a3-golang
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-golang
ghcr.io/openhands/agent-server:feat-acp-native-astep-golang
ghcr.io/openhands/agent-server:50453a3-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:50453a3-java
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-java
ghcr.io/openhands/agent-server:feat-acp-native-astep-java
ghcr.io/openhands/agent-server:50453a3-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:50453a3-python
ghcr.io/openhands/agent-server:50453a350888cfaf6d52201cc9ee01fb34723768-python
ghcr.io/openhands/agent-server:feat-acp-native-astep-python
ghcr.io/openhands/agent-server:50453a3-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 50453a3-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 50453a3-python-amd64) are also available if needed

ACPAgent inherited the default AgentBase.astep, which routes the synchronous step() through loop.run_in_executor(None, self.step, ...). That moves every post-prompt callback (including telemetry's stats_callback, which re-enters state.lock) onto a worker thread — while LocalConversation.arun holds the conversation state's reentrant FIFOLock on the loop thread. The callback then blocks on a lock owned by the thread that is itself awaiting astep to return, deadlocking the conversation. See #3348 for the original symptom and #3350 for the architectural diagnosis. The fix overrides astep natively: the ACP conn.prompt(...) round-trip is scheduled on the portal loop via BlockingPortal.start_task_soon, and the result is awaited back on the caller's loop via asyncio.wrap_future. All post-prompt work — _record_usage, _finalize_successful_turn, on_event emission, status flips — runs on the caller thread, where the FIFOLock owner already lives, so re-entrant acquires succeed. Both step() and astep() now share extracted helpers (_do_acp_prompt, _finalize_successful_turn, _emit_turn_timeout, _emit_turn_error, _clear_turn_callbacks). Adds three regression tests in TestACPAgentAstep: - astep overrides AgentBase.astep (structural guarantee) - post-prompt on_event callbacks fire on caller thread, not portal - astep completes when invoked while caller holds state.lock Fixes #3348 Closes #3350 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T23:56:23Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-21T23:56:31Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-21T23:58:45Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/agent
acp_agent.py	677	69	89%	341–343, 473–474, 507, 509, 513, 517, 525, 588–589, 594, 661, 814, 817–818, 835–836, 865, 870, 888, 898, 919–922, 1088–1091, 1095–1097, 1100–1104, 1106, 1266, 1612–1614, 1642, 1647, 1650–1652, 1655, 1663–1665, 1667–1669, 1673, 1676, 1685–1687, 1689, 1694–1698, 1790–1791
TOTAL	27776	8266	70%

all-hands-bot

Taste Rating: 🟡 Acceptable. I found two concurrency gaps that should be addressed before merge. Because this changes ACP agent execution behavior, I’m leaving a COMMENT for human maintainer/eval follow-up rather than approving.

Risk: 🟡 MEDIUM — ACP async execution/interrupt behavior can affect live conversations.

AI-generated review by OpenHands on behalf of the requester.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26260099305

all-hands-bot · 2026-05-22T00:03:20Z


-            if not response_text:
-                response_text = "(No response from ACP server)"
+        self._reset_client_for_turn(on_token, on_event)


🟠 Important: astep() still wires the raw on_event into _OpenHandsACPBridge before scheduling the portal-loop prompt. session_update() emits ACPToolCallEvents synchronously on the portal thread, so an ACP turn that starts/progresses a tool call can still run conversation callbacks from a different thread while arun() owns state.lock on the caller loop. That leaves the same deadlock/data-race class for live tool events, and the new tests don’t exercise it because the fake prompt only appends text. For the async path, please marshal bridge on_event/on_token back onto the caller loop (preserving ordering), or otherwise avoid live portal-thread state callbacks.

all-hands-bot · 2026-05-22T00:03:20Z

+            self._finalize_successful_turn(response, elapsed, state, on_event)
+        except TimeoutError:
+            self._emit_turn_timeout(time.monotonic() - t0, state, on_event)
        except Exception as e:


🟠 Important: asyncio.CancelledError bypasses both handlers here and goes straight to finally, which clears callbacks without calling _cancel_inflight_tool_calls(). If an interrupt lands after live ACPToolCallEvents for pending/in_progress tool calls, those cards remain orphaned; LocalConversation._emit_orphaned_action_errors() only patches ActionEvents, not ACPToolCallEvents. Please catch asyncio.CancelledError, emit terminal failed events for in-flight ACP tool calls on the caller thread, then re-raise so arun() can transition to PAUSED.

all-hands-bot reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(acp): override astep to keep post-prompt callbacks on caller thread#3358

fix(acp): override astep to keep post-prompt callbacks on caller thread#3358
simonrosenberg wants to merge 1 commit into
mainfrom
feat-acp-native-astep

simonrosenberg commented May 21, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot May 22, 2026

Uh oh!

all-hands-bot May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

simonrosenberg commented May 21, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this is the structural fix

Issue Number

How to Test

Unit tests

End-to-end validation

Type

Notes

Uh oh!

github-actions Bot commented May 21, 2026

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 21, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

all-hands-bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

simonrosenberg commented May 21, 2026 •

edited by github-actions Bot

Loading