fix: handle tool execution timeout/error causing IllegalStateExceptio… by chensk0601 · Pull Request #956 · agentscope-ai/agentscope-java

chensk0601 · 2026-03-14T06:38:08Z

…n (#951)

ReActAgent throws IllegalStateException when tool calls timeout or fail, because no tool result is written to memory, leaving orphaned pending tool call states that crash the agent on subsequent requests.

Root cause:

Tool execution timeout/error propagates without writing results to memory
Pending tool call state remains, blocking subsequent doCall() invocations
validateAndAddToolResults() throws when user message has no tool results

Changes:

doCall(): detect pending tool calls without user-provided results and auto-generate error results to clear the pending state
executeToolCalls(): add onErrorResume to catch tool execution failures and generate error tool results instead of propagating exceptions
Add generateAndAddErrorToolResults() helper to create error results for orphaned pending tool calls

This ensures the agent recovers gracefully from tool failures instead of crashing, and the model receives proper error feedback to continue processing.

Closes #951

AgentScope-Java Version

[The version of AgentScope-Java you are working on, e.g. 1.0.9, check your pom.xml dependency version or run mvn dependency:tree | grep agentscope-parent:pom(only mac/linux)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

Code has been formatted with mvn spotless:apply
All tests are passing (mvn test)
Javadoc comments are complete and follow project conventions
Related documentation has been updated (e.g. links, examples, etc.)
Code is ready for review

cla-assistant · 2026-03-14T06:38:23Z

All committers have signed the CLA.

cla-assistant · 2026-03-14T06:38:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

凡勇 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

codecov · 2026-03-14T07:07:52Z

Codecov Report

❌ Patch coverage is 66.03774% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...e/src/main/java/io/agentscope/core/ReActAgent.java	66.03%	14 Missing and 4 partials ⚠️

📢 Thoughts on this report? Let us know!

agentscope-ai#951) ReActAgent throws IllegalStateException when tool calls timeout or fail, because no tool result is written to memory, leaving orphaned pending tool call states that crash the agent on subsequent requests. Root cause: - Tool execution timeout/error propagates without writing results to memory - Pending tool call state remains, blocking subsequent doCall() invocations - validateAndAddToolResults() throws when user message has no tool results Changes: - doCall(): detect pending tool calls without user-provided results and auto-generate error results to clear the pending state - executeToolCalls(): add onErrorResume to catch tool execution failures and generate error tool results instead of propagating exceptions - Add generateAndAddErrorToolResults() helper to create error results for orphaned pending tool calls This ensures the agent recovers gracefully from tool failures instead of crashing, and the model receives proper error feedback to continue processing. Closes agentscope-ai#951

LearningGp

Handling tool exceptions as ToolResult seems like a solid approach. For pending tool calls where no result is provided, I’m wondering if it might be more appropriate to expose those to the developer for handling instead. Also, perhaps we could consider adding a configurable exception handler mechanism in the future? (Just a thought—this last point definitely doesn't need to block the PR).

agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java

Copilot

Pull request overview

Fixes ReActAgent resiliency when tool execution fails (timeout/error) by ensuring pending tool-call state is cleared via synthetic error tool results, preventing IllegalStateException on subsequent calls.

Changes:

Update ReActAgent#doCall() to detect pending tool calls without user-provided tool results and auto-generate error tool results to clear pending state.
Update ReActAgent#executeToolCalls() to convert tool-execution failures into error tool results via onErrorResume instead of propagating exceptions.
Update HookStopAgentTest expectations to validate the new auto-recovery behavior (no longer expecting an exception).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java	Adds auto-recovery for orphaned pending tool calls and converts tool execution failures into error tool results.
agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java	Updates tests to expect auto-recovery rather than `IllegalStateException` when pending tool calls exist.

agentscope-core/src/test/java/io/agentscope/core/hook/HookStopAgentTest.java

agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java

…dling

- Extract shared buildErrorToolResult() helper to deduplicate ToolResultBlock construction - Route generateAndAddErrorToolResults() through PostActingEvent hook pipeline for consistent tool-result lifecycle (StreamingHook TOOL_RESULT emission, hook-based transforms) - Narrow onErrorResume catch scope to Exception.class, letting critical JVM errors (e.g. OutOfMemoryError) propagate - Use ExceptionUtils.getErrorMessage() for non-null error messages and log the exception object itself for full stack traces - Strengthen HookStopAgentTest auto-recovery assertions: verify error ToolResultBlock in memory, model re-invocation, and response content

LearningGp

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

chensk0601 · 2026-03-27T08:27:06Z

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

LearningGp · 2026-03-27T10:02:14Z

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.

However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.

It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?
In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

chensk0601 · 2026-03-27T13:55:14Z

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.

However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.

It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?

In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

Thanks for the thorough review! I appreciate the concern about HITL workflows — preserving the explicit contract for manual ToolResultBlock provision is indeed important.

Let me address your questions:

Regarding scenarios where `doCall` auto-supplement was triggered

While executeToolCalls's onErrorResume handles failures within tool execution, we've identified cases where pending states persist outside that coverage:

Pre-execution failures: Network timeouts before toolkit.callTools() begins, JVM crashes, or async scheduling failures where tool calls are queued but never executed
Memory recovery: When agent state is persisted and restored (e.g., after restart), orphaned pending IDs may exist without corresponding results

In these cases, executeToolCalls is never reached, so its error handling cannot apply. The doCall auto-recovery acts as a safety net for these edge cases.

On achieving the goal with only `executeToolCalls` modifications

For failures occurring during executeToolCalls, yes — the onErrorResume logic is sufficient. However, for the scenarios above, we need the doCall layer to handle states that originate from outside the tool execution flow.

Proposed compromise to protect HITL workflows

To address your concern about masking improper HITL usage, I suggest adding an explicit mode check:

if (providedResults.isEmpty()) {
    if (isHITLMode()) {  // Check if conversation is in HITL pause state
        // Strict behavior: require manual ToolResultBlock
        throw new IllegalStateException(
            "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
    }
    // Non-HITL: Auto-recover from orphaned states
    log.warn("Orphaned pending tool calls detected, auto-generating error results: {}", pendingIds);
    generateAndAddErrorToolResults(pendingIds);
}

chensk0601 · 2026-03-27T14:38:30Z

The changes to executeToolCalls look solid, as they allow the model to proceed even after a tool call exception. However, I’m not sure about adding automatic recovery to doCall. In my view, it might be better to surface these exceptions to the framework consumers instead. Open to discussion on this point!

This fix not only ensures execution continues after tool call timeouts and exceptions, but also resolves a critical issue where the entire agent would become unusable following a timeout, repeatedly throwing IllegalStateException. I have been running this updated version in our production environment for some time now, and it has been performing flawlessly.

It seems that the modifications to executeToolCalls should be sufficient to ensure a ToolResultBlock is present following a timeout or exception, which would prevent the persistent IllegalStateException.
However, I have some reservations about the auto-supplement logic in doCall. Since HITL (Human-In-The-Loop) workflows require a ToolResultBlock to be manually provided when resuming a conversation, automating this process might lead to unintended execution paths. It could also potentially mask improper HITL usage, making it harder to detect implementation errors.
It’s great to hear that this fix has been verified in production! A couple of follow-up questions:

Have we encountered specific scenarios where the auto-supplement logic in doCall was actually triggered?

In our current context, would it be possible to achieve the same goal by only modifying executeToolCalls?

Thanks for the thorough review! I appreciate the concern about HITL workflows — preserving the explicit contract for manual ToolResultBlock provision is indeed important.

Let me address your questions:

Regarding scenarios where doCall auto-supplement was triggered

While executeToolCalls's onErrorResume handles failures within tool execution, we've identified cases where pending states persist outside that coverage:

Pre-execution failures: Network timeouts before toolkit.callTools() begins, JVM crashes, or async scheduling failures where tool calls are queued but never executed

Memory recovery: When agent state is persisted and restored (e.g., after restart), orphaned pending IDs may exist without corresponding results

In these cases, executeToolCalls is never reached, so its error handling cannot apply. The doCall auto-recovery acts as a safety net for these edge cases.

On achieving the goal with only executeToolCalls modifications

For failures occurring during executeToolCalls, yes — the onErrorResume logic is sufficient. However, for the scenarios above, we need the doCall layer to handle states that originate from outside the tool execution flow.

Proposed compromise to protect HITL workflows

To address your concern about masking improper HITL usage, I suggest adding an explicit mode check:
if (providedResults.isEmpty()) {
    if (isHITLMode()) {  // Check if conversation is in HITL pause state
        // Strict behavior: require manual ToolResultBlock
        throw new IllegalStateException(
            "HITL workflow requires manual ToolResultBlock for pending IDs: " + pendingIds);
    }
    // Non-HITL: Auto-recover from orphaned states
    log.warn("Orphaned pending tool calls detected, auto-generating error results: {}", pendingIds);
    generateAndAddErrorToolResults(pendingIds);
}

Thanks for the thorough review! I've carefully reconsidered this, and I believe this is fundamentally a bug fix rather than an enhancement — adding configuration parameters would not be the right approach.

Why this is a critical bug (not just an enhancement)

The current behavior causes complete conversation failure on any tool execution error:

First failure: Tool timeout/exception → IllegalStateException → Agent crashes
Cascading effect: Pending state persists in memory
All subsequent requests: Fail with same exception, making the agent unusable

This violates basic fault isolation principles — a single tool failure should not crash the entire agent permanently.

Two key reasons why auto-supplement is necessary

First, tool execution failures should not leave the agent in a permanently broken state. Whether it's a timeout, network error, or unexpected exception, the agent must recover and continue functioning. The "auto-supplement" in doCall serves as a critical safety net for orphaned pending states that originate outside executeToolCalls coverage (e.g., JVM crashes, memory recovery, pre-execution failures).

Second, manually generating error results and feeding them to the LLM is essential for proper decision-making. Without this feedback, the model has no visibility into what happened. By providing explicit error messages (e.g., "[ERROR] Tool execution failed: timeout"), we enable the model to:

Understand the failure context
Decide on alternative approaches
Communicate meaningfully with the user

This is not masking errors — it's proper error propagation to the model layer.

Regarding HITL concerns

The HITL workflow concern is valid, but I'd argue:

HITL pause/resume should be handled at a higher orchestration layer, not by leaving the agent in a corrupted state
If HITL requires manual ToolResultBlock, it should explicitly manage the conversation lifecycle (clearing/resuming state) rather than relying on low-level exceptions

Summary

Approach	Result
Only `executeToolCalls` fix	Covers runtime failures, but leaves recovery gaps
Configuration switches	Adds complexity without solving the root problem
Current PR (both fixes)	Complete fault isolation + proper LLM feedback

The combination ensures robust error handling while keeping the implementation clean and deterministic. I'm happy to discuss alternative HITL integration patterns if needed, but I believe the core fix should remain as-is.

Would appreciate your thoughts on this perspective.

…dling

chensk0601 requested a review from a team March 14, 2026 06:38

chensk0601 force-pushed the fix/951-react-agent-tool-execution-error-handling branch from f3080ad to 86c49aa Compare March 14, 2026 07:00

chensk0601 force-pushed the fix/951-react-agent-tool-execution-error-handling branch 2 times, most recently from 0a3e447 to 0684fd6 Compare March 16, 2026 11:51

chensk0601 force-pushed the fix/951-react-agent-tool-execution-error-handling branch from 0684fd6 to a6fd753 Compare March 17, 2026 08:07

LearningGp mentioned this pull request Mar 24, 2026

fix(session): Automatically detect and remove dangling tool calls whe… #873

Closed

LearningGp requested changes Mar 24, 2026

View reviewed changes

agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java Outdated Show resolved Hide resolved

agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java Outdated Show resolved Hide resolved

agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java Show resolved Hide resolved

LearningGp requested a review from Copilot March 24, 2026 03:35

Copilot started reviewing on behalf of LearningGp March 24, 2026 03:36 View session

LearningGp mentioned this pull request Mar 24, 2026

(fix) fix for issue #951 #967

Open

Copilot AI reviewed Mar 24, 2026

View reviewed changes

chensk0601 requested a review from LearningGp March 27, 2026 03:08

chensk0601 and others added 3 commits March 27, 2026 11:17

Merge branch 'main' into fix/951-react-agent-tool-execution-error-han…

7f2a1ed

…dling

Merge branch 'main' into fix/951-react-agent-tool-execution-error-han…

2d7ad5f

…dling

LearningGp reviewed Mar 27, 2026

View reviewed changes

Merge branch 'main' into fix/951-react-agent-tool-execution-error-han…

ebf474e

…dling

Conversation

chensk0601 commented Mar 14, 2026

AgentScope-Java Version

Description

Checklist

Uh oh!

cla-assistant bot commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cla-assistant bot commented Mar 14, 2026

Uh oh!

codecov bot commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

LearningGp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LearningGp left a comment

Choose a reason for hiding this comment

Uh oh!

chensk0601 commented Mar 27, 2026

Uh oh!

LearningGp commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chensk0601 commented Mar 27, 2026

Regarding scenarios where doCall auto-supplement was triggered

On achieving the goal with only executeToolCalls modifications

Proposed compromise to protect HITL workflows

Uh oh!

chensk0601 commented Mar 27, 2026

Regarding scenarios where doCall auto-supplement was triggered

On achieving the goal with only executeToolCalls modifications

Proposed compromise to protect HITL workflows

Why this is a critical bug (not just an enhancement)

Two key reasons why auto-supplement is necessary

Regarding HITL concerns

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cla-assistant bot commented Mar 14, 2026 •

edited

Loading

codecov bot commented Mar 14, 2026 •

edited

Loading

LearningGp commented Mar 27, 2026 •

edited

Loading

Regarding scenarios where `doCall` auto-supplement was triggered

On achieving the goal with only `executeToolCalls` modifications

Regarding scenarios where `doCall` auto-supplement was triggered

On achieving the goal with only `executeToolCalls` modifications