feat(sandbox): persist /sandbox directory across gateway restarts via init container + PVC #743
Description
Problem Statement
Sandbox pod filesystem data (installed packages, user files, configured tools) is lost whenever the gateway is stopped and restarted. The deterministic node name fix (#738) prevents the server database and existing PVCs from being cascade-deleted, but sandbox pods themselves are ephemeral — k3s reschedules them with fresh container filesystems. Users expect their workspace to survive a gateway restart.
A previous attempt to mount a PVC directly at /sandbox broke the sandbox: the empty PVC shadowed the image's Python venv, pip, dotfiles, and agent skills, so every tool in the sandbox failed with `command not found`.
Technical Context
The default sandbox image (ghcr.io/nvidia/openshell-community/sandboxes/base:latest) sets WORKDIR /sandbox and populates it with:
| Path | Contents |
|---|---|
| `/sandbox/.venv/` | Python 3.13 venv with pip, setuptools, cloudpickle |
| `/sandbox/.uv/python/` | uv-managed Python 3.13 toolchain |
| `/sandbox/.bashrc`, `.profile` | Shell init: exports PATH, VIRTUAL_ENV, sets PS1 |
| `/sandbox/.agents/skills/` | Agent skill definitions |
| `/sandbox/.claude/skills/` | Symlinks to `.agents/skills/*` |
The sandbox PATH is /sandbox/.venv/bin:/usr/local/bin:/usr/bin:/bin. Mounting an empty PVC at /sandbox removes the venv, breaking all Python/pip/uv commands. The image's /sandbox is ~300-400MB.
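The shadowing failure can be illustrated with a minimal model of PATH lookup. This is only a sketch: `resolve_command` is a hypothetical helper, the filesystem is modelled as a plain list of paths, and the assumption that `python` exists only under `/sandbox/.venv/bin` is made for illustration.

```rust
/// Hypothetical helper: return the first `<dir>/<cmd>` on `path_var` that
/// exists in the modelled filesystem `files`, mimicking shell PATH lookup.
fn resolve_command(path_var: &str, files: &[&str], cmd: &str) -> Option<String> {
    path_var
        .split(':')
        .map(|dir| format!("{}/{}", dir, cmd))
        .find(|candidate| files.contains(&candidate.as_str()))
}
```

With the image's files present, `python` resolves to `/sandbox/.venv/bin/python`. Once an empty volume hides everything under `/sandbox`, the same lookup returns `None`, which the shell reports as `command not found`.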
Affected Components
| Component | Key Files | Role |
|---|---|---|
| Sandbox pod spec | `crates/openshell-server/src/sandbox/mod.rs` | Builds the Sandbox CR and podTemplate |
| Supervisor sideload | `crates/openshell-server/src/sandbox/mod.rs:678-734` | Pattern to follow for pod spec modification |
| VCT passthrough | `crates/openshell-server/src/sandbox/mod.rs:786-788` | Where user-provided volumeClaimTemplates are inserted |
| Sandbox CRD | `deploy/kube/manifests/agent-sandbox.yaml:3901-4008` | CRD schema for volumeClaimTemplates |
| Default sandbox image | External: `ghcr.io/nvidia/openshell-community/sandboxes/base` | Defines /sandbox contents |
Technical Investigation
Current Behavior
Sandbox pods have no PersistentVolumeClaims by default. The Sandbox CRD supports volumeClaimTemplates (deploy/kube/manifests/agent-sandbox.yaml:3901), and the server passes them through from the proto definition (sandbox/mod.rs:786-788), but nothing populates this field by default. All pod data lives in the ephemeral container layer.
What Would Need to Change
New function apply_workspace_persistence in sandbox/mod.rs (after apply_supervisor_sideload at ~line 734):
- Add a `workspace` volume referencing the PVC.
- Add a `volumeMount` for `/sandbox` on the agent container.
- Inject an `initContainers` entry using the same sandbox image that:
  - Mounts the PVC at a temporary path (e.g., `/workspace-pvc`)
  - Checks for a sentinel file (`/workspace-pvc/.initialized`)
  - If absent, copies the image's `/sandbox/` contents into the PVC mount and creates the sentinel
  - If present, no-op (fast path)
The init container sees the image's original /sandbox contents (since the PVC isn't mounted there), copies them to the PVC, and the agent container then mounts the PVC at /sandbox — getting both the image's base contents and any prior user data.
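The seeding step could be expressed as a small shell script that the server generates for the init container. The sketch below is illustrative only: `seed_script` and `init_command` are hypothetical names, it assumes `sh` is available in the sandbox image and that the paths contain no spaces, and it ignores edge cases such as an image without `/sandbox`.

```rust
/// Name of the sentinel file written once seeding has finished.
const SENTINEL: &str = ".initialized";

/// Hypothetical helper: build the shell script run by the init container.
/// The sentinel is created last so an interrupted copy is retried.
fn seed_script(source_dir: &str, pvc_mount: &str) -> String {
    let sentinel = format!("{}/{}", pvc_mount, SENTINEL);
    [
        "set -e".to_string(),
        // Fast path: the PVC was already seeded on an earlier start.
        format!("if [ -f {} ]; then exit 0; fi", sentinel),
        // `src/.` copies dotfiles too; -a preserves ownership and modes.
        format!("cp -a {}/. {}/", source_dir, pvc_mount),
        format!("touch {}", sentinel),
    ]
    .join("\n")
}

/// Hypothetical helper: the `command` array for the init container.
fn init_command(source_dir: &str, pvc_mount: &str) -> Vec<String> {
    vec![
        "sh".to_string(),
        "-c".to_string(),
        seed_script(source_dir, pvc_mount),
    ]
}
```

Calling `init_command("/sandbox", "/workspace-pvc")` yields `sh -c` followed by a four-line script that exits early when `/workspace-pvc/.initialized` already exists.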
Default volumeClaimTemplates injection in sandbox_to_k8s_spec (~line 786):
When the user hasn't provided custom volumeClaimTemplates, inject a default 2Gi PVC named workspace.
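The defaulting rule can be sketched as a pure function. `ClaimTemplate` and `resolve_claim_templates` are hypothetical names used for illustration. The real code operates on the proto and JSON representations in `sandbox_to_k8s_spec`, and a real template would also carry fields, such as access modes, that are omitted here.

```rust
/// Simplified stand-in for a volumeClaimTemplates entry.
#[derive(Debug, Clone, PartialEq)]
struct ClaimTemplate {
    name: String,
    storage_request: String,
}

/// Hypothetical helper: return the templates to put on the Sandbox CR and
/// whether the workspace init container should be injected alongside them.
fn resolve_claim_templates(user_provided: Vec<ClaimTemplate>) -> (Vec<ClaimTemplate>, bool) {
    if user_provided.is_empty() {
        // No custom templates: fall back to the default 2Gi `workspace` PVC.
        let default = ClaimTemplate {
            name: "workspace".to_string(),
            storage_request: "2Gi".to_string(),
        };
        (vec![default], true)
    } else {
        // Custom templates win; skip injection to avoid conflicts.
        (user_provided, false)
    }
}
```

Returning the injection flag together with the templates keeps the "skip when the user brings their own" decision in one place.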
Alternative Approaches Considered
| Approach | Verdict | Why |
|---|---|---|
| SubPath mount at `/sandbox/workspace` | Rejected | Only persists a subdirectory. User-installed packages, venv changes, and dotfiles are still lost. Agents work in /sandbox, not a subdirectory. |
| Overlay filesystem (fuse-overlayfs) | Rejected | Requires FUSE device or CAP_SYS_ADMIN mount usage, which weakens the sandbox security posture. Over-engineered for this use case. |
| Init container + PVC at `/sandbox` | Recommended | Standard k8s pattern. Fully persists image contents + user changes. Follows existing apply_supervisor_sideload code pattern. |
Patterns to Follow
The apply_supervisor_sideload function (sandbox/mod.rs:678-734) is the exact pattern:
- Gets `spec` as mutable object
- Adds to `spec.volumes[]`
- Finds the agent container by name
- Modifies container `volumeMounts`
- The new function would additionally add `spec.initContainers[]`
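The same sequence of steps might look roughly like this for workspace persistence. This is a sketch over simplified typed structs, not the JSON manipulation the real function would perform. The struct layout, the `workspace-init` container name, and the `agent_name` parameter are all assumptions made for illustration.

```rust
/// Simplified stand-ins for the pod template; field names are illustrative.
#[derive(Debug, Default, Clone, PartialEq)]
struct Container {
    name: String,
    /// (volume name, mount path) pairs.
    volume_mounts: Vec<(String, String)>,
}

#[derive(Debug, Default)]
struct PodSpec {
    volumes: Vec<String>,
    containers: Vec<Container>,
    init_containers: Vec<Container>,
}

const WORKSPACE_VOLUME: &str = "workspace";

/// Returns false, leaving `spec` unchanged, when no agent container is found.
fn apply_workspace_persistence(spec: &mut PodSpec, agent_name: &str) -> bool {
    // Find the agent container by name before touching anything else.
    let Some(agent) = spec.containers.iter_mut().find(|c| c.name == agent_name) else {
        return false;
    };
    // The agent sees the PVC at /sandbox.
    agent
        .volume_mounts
        .push((WORKSPACE_VOLUME.to_string(), "/sandbox".to_string()));
    // Pod-level volume backed by the PVC.
    spec.volumes.push(WORKSPACE_VOLUME.to_string());
    // The init container sees the PVC at a temporary path instead.
    spec.init_containers.push(Container {
        name: "workspace-init".to_string(),
        volume_mounts: vec![(WORKSPACE_VOLUME.to_string(), "/workspace-pvc".to_string())],
    });
    true
}
```

Looking up the agent container first means a spec without one is left completely untouched.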
Proposed Approach
Add an init container that seeds a PVC with the image's /sandbox contents on first creation. The init container uses the same sandbox image, mounts the PVC at a temp path, and copies /sandbox/* into it if a sentinel file is absent. The agent container then mounts the populated PVC at /sandbox. This preserves both the image's base tools and any user modifications across restarts. Skip injection when the user provides their own volumeClaimTemplates to avoid conflicts.
Scope Assessment
- Complexity: Medium
- Confidence: High (clear path, standard k8s patterns, follows existing code)
- Estimated files to change: 1-2 (`sandbox/mod.rs`, potentially `constants.rs`)
- Issue type: `feat`
Risks & Open Questions
- Image upgrade staleness: When a user updates their sandbox image, the PVC retains the old image's `/sandbox` contents. Start with "delete + recreate sandbox" as the upgrade path; a digest-based sentinel could be a follow-up.
- Opt-out mechanism: Some users may want ephemeral sandboxes. Consider `--no-persist` flag or server env var. Can be a follow-up.
- Init container runs as root: Needed to `cp -a` and preserve ownership. The supervisor sideload already sets `runAsUser: 0` on the agent container, so this is consistent.
- Partial copy on init container kill: If killed mid-copy, PVC is partially initialized. Sentinel is written last, so next attempt re-copies. Acceptable risk.
- PVC cleanup on sandbox delete: The CRD controller RBAC includes PVC delete and likely sets ownerReferences for garbage collection. Verify with actual controller behavior.
- Custom images without `/sandbox`: Init container should be a no-op if `/sandbox` is empty or absent in the image.
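The partial-copy and missing-`/sandbox` cases can be captured in one decision function. The sketch below uses only the standard library. `SeedAction` and `plan_seed` are hypothetical names, and the real init container would more likely make this decision in shell.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
enum SeedAction {
    /// Sentinel present: a previous run finished, so do nothing.
    AlreadySeeded,
    /// Source directory is empty or absent: nothing to seed.
    NothingToCopy,
    /// Copy the source, then write the sentinel as the final step.
    CopyThenMark,
}

/// Hypothetical helper: decide what the init container should do.
fn plan_seed(source: &Path, pvc_mount: &Path) -> io::Result<SeedAction> {
    if pvc_mount.join(".initialized").exists() {
        return Ok(SeedAction::AlreadySeeded);
    }
    // A partially populated PVC without a sentinel still falls through here,
    // so an interrupted copy is simply repeated.
    let source_has_entries = match fs::read_dir(source) {
        Ok(mut entries) => entries.next().is_some(),
        Err(err) if err.kind() == io::ErrorKind::NotFound => false,
        Err(err) => return Err(err),
    };
    Ok(if source_has_entries {
        SeedAction::CopyThenMark
    } else {
        SeedAction::NothingToCopy
    })
}
```

Because only the sentinel is consulted, leftover files from a killed copy never cause seeding to be skipped.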
Test Considerations
- Unit tests: Test `apply_workspace_persistence` function: verify init container injection, volume mount, sentinel command.
- Unit tests: Test that default VCT is injected when user doesn't provide one, and skipped when they do.
- E2E tests: The `gateway-resume` E2E test should verify sandbox pod data survives stop/start cycles.
- E2E tests: Verify `python`, `pip`, `uv` still work inside the sandbox (regression test for the shadowing bug).
- Follow existing test patterns in `sandbox/mod.rs` tests (lines 1495+).
Created by spike investigation. Use build-from-issue to plan and implement.