feat: support agent graceful shutdown#909
feat: support agent graceful shutdown#909uuuyuqi wants to merge 4 commits intoagentscope-ai:mainfrom
Conversation
Change-Id: I5958adfad87fdfd64db45cffbb36cb3c85d83850
…hutdownManager Change-Id: I82ffdab04bc0d6c5eed62d23510c350426e97f64
Change-Id: I694c2d4edd2e683b9c1d2aec7ce074af5c2aeef7
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Implements a framework-level graceful shutdown lifecycle for AgentScope-Java, adding request tracking, checkpoint-based interruption, session persistence for resume, and ordered JVM shutdown cleanup (agents first, HTTP transports last), plus a new example module and comprehensive tests.
Changes:
- Add
GracefulShutdownManager+ hook integration to reject new requests during shutdown and interrupt in-flight requests at safe checkpoints / timeout. - Persist shutdown-interrupted state to
Sessionand deduplicate retried input on the next request. - Add a tool-execution shutdown guard and a unified JVM shutdown hook to close HTTP transports after agent termination; include new examples and tests.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| agentscope-examples/pom.xml | Adds the new graceful-shutdown example module to the build. |
| agentscope-examples/graceful-shutdown/pom.xml | New Spring Boot-based example module for graceful shutdown demos/e2e. |
| agentscope-examples/graceful-shutdown/src/main/resources/application.yml | Example config for Spring graceful shutdown + agent shutdown options. |
| agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/smoke/GracefulShutdownExample.java | Standalone smoke scenarios for shutdown + resume behavior. |
| agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/ShutdownE2eApplication.java | Spring Boot app entrypoint for e2e shutdown testing. |
| agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/AgentController.java | REST endpoints for chat + readiness/status + triggering shutdown. |
| agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/AgentService.java | Creates agents per request, configures shutdown config, handles shutdown exceptions. |
| agentscope-examples/graceful-shutdown/src/main/java/io/agentscope/examples/shutdown/e2e/SlowAnalysisTool.java | Slow tool to simulate long-running tool execution during shutdown. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownState.java | New global shutdown state enum. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/PartialReasoningPolicy.java | Policy enum for saving/discarding partial reasoning on shutdown. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownConfig.java | Shutdown timeout + partial reasoning policy configuration record. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/AgentShuttingDownException.java | Exception for rejected/interrupted requests during shutdown. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownInterruptedState.java | Session-persisted flag for shutdown-interrupted sessions. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/ShutdownSessionBinding.java | Session binding record used by shutdown persistence. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/ActiveRequestContext.java | Per-request context: interrupt + save-to-session behavior. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownManager.java | Core singleton: state machine, tracking, timeout monitor, persistence integration. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/GracefulShutdownHook.java | System hook: safe-checkpoint interruption + resume input deduplication. |
| agentscope-core/src/main/java/io/agentscope/core/shutdown/AgentScopeJvmShutdownHook.java | Unified JVM shutdown hook: graceful shutdown → await → transport shutdown. |
| agentscope-core/src/main/java/io/agentscope/core/agent/AgentBase.java | Integrates request registration/unregistration + shutdown request rejection; adds interrupt source. |
| agentscope-core/src/main/java/io/agentscope/core/ReActAgent.java | Binds session for shutdown persistence; saves/discards partial reasoning; system-interrupt handling. |
| agentscope-core/src/main/java/io/agentscope/core/tool/ToolExecutor.java | Races tool execution against global shutdown-timeout signal. |
| agentscope-core/src/main/java/io/agentscope/core/session/Session.java | Adds per-key delete API (default method). |
| agentscope-core/src/main/java/io/agentscope/core/session/JsonSession.java | Implements per-key delete. |
| agentscope-core/src/main/java/io/agentscope/core/session/InMemorySession.java | Implements per-key delete. |
| agentscope-core/src/main/java/io/agentscope/core/model/transport/HttpTransportFactory.java | Removes per-transport shutdown hook in favor of unified shutdown orchestration. |
| agentscope-core/src/test/java/io/agentscope/core/shutdown/GracefulShutdownTest.java | Comprehensive tests covering shutdown lifecycle, hooks, persistence, and tracking. |
| } | ||
|
|
||
| @Override | ||
| public void loadFrom(Session session, SessionKey sessionKey) { |
There was a problem hiding this comment.
loadIfExists() binds the session to GracefulShutdownManager, but loadFrom() does not. If callers use loadFrom(session, key) and then rely on graceful-shutdown persistence/dedup, the manager won't have the session binding and won’t be able to save/clear the shutdown-interrupted flag. Consider also binding in loadFrom() for consistent behavior.
| public void loadFrom(Session session, SessionKey sessionKey) { | |
| public void loadFrom(Session session, SessionKey sessionKey) { | |
| shutdownManager.bindSession(this, session, sessionKey); |
| resetInterruptFlag(); | ||
| GracefulShutdownManager.getInstance().ensureAcceptingRequests(); | ||
| GracefulShutdownManager.getInstance().registerRequest(this); |
There was a problem hiding this comment.
acquireExecution() sets running to true before calling GracefulShutdownManager.ensureAcceptingRequests(). If ensureAcceptingRequests() throws (shutdown in progress), the Mono.using resource supplier fails and releaseExecution() will not run, leaving running=true and effectively bricking the agent instance for future calls. Consider checking shutdown acceptance before flipping running, or wrapping the shutdown check/registration in try/catch to revert running on failure.
| resetInterruptFlag(); | |
| GracefulShutdownManager.getInstance().ensureAcceptingRequests(); | |
| GracefulShutdownManager.getInstance().registerRequest(this); | |
| try { | |
| resetInterruptFlag(); | |
| GracefulShutdownManager.getInstance().ensureAcceptingRequests(); | |
| GracefulShutdownManager.getInstance().registerRequest(this); | |
| } catch (RuntimeException ex) { | |
| if (checkRunning) { | |
| running.set(false); | |
| } | |
| throw ex; | |
| } |
| String requestId = UUID.randomUUID().toString(); | ||
| ActiveRequestContext ctx = | ||
| new ActiveRequestContext(requestId, agentBase, session, sessionKey); | ||
| activeRequestsByAgentId.put(agent.getAgentId(), ctx); | ||
| return requestId; |
There was a problem hiding this comment.
Active requests are tracked in activeRequestsByAgentId keyed only by agentId. If checkRunning is disabled (concurrent call()s on the same agent), a second registerRequest() will overwrite the first context; later unregisterRequest() can drop the remaining in-flight request, causing getActiveRequestCount() to be wrong and potentially transitioning to TERMINATED while work is still running. Tracking should be per-request (e.g., keyed by requestId) or support multiple contexts per agentId.
| return; | ||
| } | ||
| activeRequestsByAgentId.remove(agent.getAgentId()); | ||
| sessionBindings.remove(agent.getAgentId()); | ||
| } |
There was a problem hiding this comment.
unregisterRequest() removes the agent's sessionBindings entry on every request completion. For agent instances that call loadIfExists() once and are then reused across multiple call()s, this drops the session binding and prevents shutdown persistence / interrupted-flag clearing on subsequent requests. Consider keeping the binding for the agent lifecycle (or providing an explicit unbind/close path) rather than removing it on every request cleanup.
| * @param sessionKey the session identifier | ||
| * @param key the state key to delete | ||
| */ | ||
| default void delete(SessionKey sessionKey, String key) { | ||
| // Default no-op; implementations should override if they support per-key deletion |
There was a problem hiding this comment.
The new Session.delete(SessionKey, String) defaults to a no-op. GracefulShutdownManager.checkAndClearShutdownInterrupted() relies on this to clear the shutdown-interrupted flag; any custom Session that doesn't override this will never clear the flag and may keep discarding input on every retry. Consider failing fast (e.g., UnsupportedOperationException) or changing shutdown logic to clear the flag via an overwrite instead of requiring per-key deletion support.
| * @param sessionKey the session identifier | |
| * @param key the state key to delete | |
| */ | |
| default void delete(SessionKey sessionKey, String key) { | |
| // Default no-op; implementations should override if they support per-key deletion | |
| * <p>The default implementation throws {@link UnsupportedOperationException}. Implementations | |
| * that support per-key deletion must override this method. | |
| * | |
| * @param sessionKey the session identifier | |
| * @param key the state key to delete | |
| * @throws UnsupportedOperationException if the session implementation does not support per-key | |
| * deletion | |
| */ | |
| default void delete(SessionKey sessionKey, String key) { | |
| throw new UnsupportedOperationException( | |
| "Per-key deletion is not supported by this Session implementation"); |
| activeRequestsByAgentId.clear(); | ||
| sessionBindings.clear(); | ||
| shutdownStartedAt.set(null); | ||
| monitorStarted.set(false); | ||
| shutdownTimeoutSignal.set(Sinks.empty()); |
There was a problem hiding this comment.
resetForTesting() resets monitorStarted to false but does not cancel/shutdown the already-scheduled monitor.scheduleAtFixedRate(...) task (nor recreate the executor). If performGracefulShutdown() is invoked again after a reset, additional monitor tasks can be scheduled, leading to duplicated timeout enforcement/logging and potential test flakiness. Consider keeping a ScheduledFuture to cancel on reset, or shutting down/recreating the executor in resetForTesting().
| * <p>All registered transports and the default transport will be closed | ||
| * during JVM shutdown, orchestrated by {@code AgentScopeJvmShutdownHook} | ||
| * after all agents have finished processing. |
There was a problem hiding this comment.
With the per-transport shutdown hook removed, HttpTransportFactory now relies on AgentScopeJvmShutdownHook (registered from GracefulShutdownManager) to call HttpTransportFactory.shutdown(). If an application uses models/transports without ever loading/creating an AgentBase (so GracefulShutdownManager is never initialized), transports may no longer be closed on JVM shutdown despite this Javadoc. Consider ensuring the unified hook is registered whenever HttpTransportFactory is used (e.g., trigger GracefulShutdownManager.getInstance() from HttpTransportFactory).
| () -> { | ||
| try { | ||
| manager.performGracefulShutdown(); | ||
| manager.awaitTermination( |
There was a problem hiding this comment.
The awaitTermination in the JVM hook currently uses the same timeout as force-interrupt. Since enforceTimeoutAndInterrupt triggers the interrupt at the exact moment awaitTermination times out, the Agent is left with no buffer to process the interrupt or unregister before the connection is closed by HttpTransportFactory.shutdown().
We should add a grace period (e.g., an additional 5s) to the awaitTermination timeout to ensure the Agent has sufficient time to handle the interrupt and perform a clean teardown.
| throw new IllegalStateException("Agent is still running, please wait for it to finish"); | ||
| } | ||
| resetInterruptFlag(); | ||
| GracefulShutdownManager.getInstance().ensureAcceptingRequests(); |
There was a problem hiding this comment.
The running.compareAndSet(false, true) is executed before ensureAcceptingRequests(). Since the latter is part of the resourceSupplier in Mono.using, if it throws an AgentShuttingDownException, the cleanup callback (releaseExecution) will not be triggered (as the resource was never successfully initialized).
This results in a state leak where the running flag is stuck at true, permanently preventing this Agent instance from accepting any subsequent requests. We should ensure the state is reset if the resource acquisition fails.
AgentScope-Java Version
1.0.10-SNAPSHOT
Description
Closes #907
Checklist
Please check the following items before code is ready to be reviewed.
mvn spotless:applymvn test)