-
Notifications
You must be signed in to change notification settings - Fork 451
feat: Support foreground deletion for Sandbox CRs to wait for pod termination #729
Description
Problem Statement
When deleting a Sandbox, openshell sandbox delete (and the gRPC DeleteSandbox RPC) returns immediately after the Kubernetes API accepts the delete request. The underlying pod may still be running and terminating. This is because:
SandboxClient::delete()usesDeleteParams::default(), which sends nopropagationPolicy— Kubernetes defaults toBackgrounddeletion.- The gRPC handler returns
DeleteSandboxResponseas soon as the K8s API acknowledges the delete, without polling for actual resource removal. - No
ownerReferencesorfinalizersare set by OpenShell on the Sandbox CR.
This means callers that need to know the sandbox (and its pod) are fully gone must implement their own polling. Currently only the Python SDK provides wait_deleted() as a workaround.
Affected code paths:
crates/openshell-server/src/sandbox/mod.rs:257-301—SandboxClient::delete()withDeleteParams::default()crates/openshell-server/src/grpc.rs:587-688—delete_sandboxgRPC handler returns immediatelycrates/openshell-cli/src/run.rs:2825-2879— CLIsandbox_deletereturns immediately after gRPC response
Proposed Design
Option A: Foreground propagation policy
Set propagation_policy: Some(PropagationPolicy::Foreground) in DeleteParams when deleting the Sandbox CR. This tells the Kubernetes API server to block until all dependent resources (pods, services) are garbage-collected via ownerReferences.
Prerequisite: Verify that the agent-sandbox-controller sets ownerReferences on child Pods pointing back to the Sandbox CR. If it does not, foreground propagation alone will not block on pod deletion.
Option B: Server-side poll after delete
After issuing the delete, the gRPC handler polls (or watches) the Sandbox resource until it returns 404 Not Found, then responds to the client. This is similar to what the Python SDK's wait_deleted() does today but moved server-side so all clients benefit.
Option C: Expose a wait parameter on the RPC
Add an optional wait field to DeleteSandboxRequest (or a separate DeleteSandboxAndWait RPC). When set, the server waits for full removal before responding. When unset, behavior remains fire-and-forget for callers that don't need synchronous deletion.
Recommendation
Option C gives the most flexibility — existing behavior is preserved by default, and callers that need synchronous deletion opt in. The server-side implementation can use either foreground propagation (Option A) or polling (Option B) under the hood.
Alternatives Considered
- Client-side polling only (status quo for Python SDK): Works but forces every client to reimplement the same logic. The Rust CLI and TUI currently have no wait support at all.
- Finalizers on the Sandbox CR: OpenShell could add a finalizer and only remove it after confirming the pod is gone. This is more invasive and couples OpenShell more tightly to the controller lifecycle.
Agent Investigation
Investigation of the codebase confirmed:
| Component | Waits for pod deletion? |
|---|---|
CLI (openshell sandbox delete) |
No |
CLI (--no-keep cleanup) |
No |
| TUI delete action | No |
| gRPC server handler | No |
Python SDK SandboxClient.delete() |
No |
Python SDK SandboxClient.wait_deleted() |
Yes — polls every 1s, 60s timeout |
Python SDK Sandbox.__exit__() |
Yes — via wait_deleted() |
DeleteParams::default()sends nopropagationPolicy→ Kubernetes defaults toBackground- No
ownerReferencesorfinalizersare set by OpenShell on the Sandbox CR (sandbox/mod.rs:197-202) - The async watcher (
spawn_sandbox_watcher) eventually picks up theDeletedevent and cleans up the store, but this is decoupled from the delete response - A safety-net reconciler runs every 60s to catch orphaned store records