feat(sandbox): persist /sandbox directory across gateway restarts via init container + PVC #743

@drew

Description

Problem Statement

Sandbox pod filesystem data (installed packages, user files, configured tools) is lost whenever the gateway is stopped and restarted. The deterministic node name fix (#738) prevents the server database and existing PVCs from being cascade-deleted, but sandbox pods themselves are ephemeral — k3s reschedules them with fresh container filesystems. Users expect their workspace to survive a gateway restart.

A previous attempt to mount a PVC directly at /sandbox broke the sandbox: the empty PVC shadowed the image's Python venv, pip, dotfiles, and agent skills, so every tool in the sandbox failed with "command not found".

Technical Context

The default sandbox image (ghcr.io/nvidia/openshell-community/sandboxes/base:latest) sets WORKDIR /sandbox and populates it with:

| Path | Contents |
|---|---|
| /sandbox/.venv/ | Python 3.13 venv with pip, setuptools, cloudpickle |
| /sandbox/.uv/python/ | uv-managed Python 3.13 toolchain |
| /sandbox/.bashrc, .profile | Shell init: exports PATH, VIRTUAL_ENV, sets PS1 |
| /sandbox/.agents/skills/ | Agent skill definitions |
| /sandbox/.claude/skills/ | Symlinks to .agents/skills/* |

The sandbox PATH is /sandbox/.venv/bin:/usr/local/bin:/usr/bin:/bin. Mounting an empty PVC at /sandbox hides the venv behind the empty volume, so /sandbox/.venv/bin no longer resolves and all Python/pip/uv commands break. The image's /sandbox is ~300-400MB.

Affected Components

| Component | Key Files | Role |
|---|---|---|
| Sandbox pod spec | crates/openshell-server/src/sandbox/mod.rs | Builds the Sandbox CR and podTemplate |
| Supervisor sideload | crates/openshell-server/src/sandbox/mod.rs:678-734 | Pattern to follow for pod spec modification |
| VCT passthrough | crates/openshell-server/src/sandbox/mod.rs:786-788 | Where user-provided volumeClaimTemplates are inserted |
| Sandbox CRD | deploy/kube/manifests/agent-sandbox.yaml:3901-4008 | CRD schema for volumeClaimTemplates |
| Default sandbox image | External: ghcr.io/nvidia/openshell-community/sandboxes/base | Defines /sandbox contents |

Technical Investigation

Current Behavior

Sandbox pods have no PersistentVolumeClaims by default. The Sandbox CRD supports volumeClaimTemplates (deploy/kube/manifests/agent-sandbox.yaml:3901), and the server passes them through from the proto definition (sandbox/mod.rs:786-788), but nothing populates this field by default. All pod data lives in the ephemeral container layer.

What Would Need to Change

New function apply_workspace_persistence in sandbox/mod.rs (after apply_supervisor_sideload at ~line 734):

  1. Add a workspace volume referencing the PVC.
  2. Add a volumeMount for /sandbox on the agent container.
  3. Inject an initContainers entry using the same sandbox image that:
    • Mounts the PVC at a temporary path (e.g., /workspace-pvc)
    • Checks for a sentinel file (/workspace-pvc/.initialized)
    • If absent, copies the image's /sandbox/ contents into the PVC mount and creates the sentinel
    • If present, no-op (fast path)

The init container sees the image's original /sandbox contents (since the PVC isn't mounted there), copies them to the PVC, and the agent container then mounts the PVC at /sandbox — getting both the image's base contents and any prior user data.
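
The seeding steps above could be expressed as a small helper that renders the init container's shell script. This is a minimal sketch, not the project's implementation: the function name seed_script is hypothetical, the paths are passed in by the caller, and it assumes the paths contain no single quotes. It copies with cp -a from "<dir>/." so that dotfiles such as .venv and .bashrc are included, and it writes the sentinel last.

```rust
/// Renders the shell script a seeding init container could run.
///
/// `image_dir` is the image's own directory (e.g. /sandbox) and `pvc_mount`
/// is the temporary path where the PVC is mounted (e.g. /workspace-pvc).
/// Hypothetical helper: assumes neither path contains a single quote.
fn seed_script(image_dir: &str, pvc_mount: &str) -> String {
    let sentinel = format!("{pvc_mount}/.initialized");
    format!(
        "set -e\n\
         # Fast path: the PVC was already seeded on an earlier start.\n\
         if [ -f '{sentinel}' ]; then exit 0; fi\n\
         # Copy everything, dotfiles included; skip if the image has no such directory.\n\
         if [ -d '{image_dir}' ]; then cp -a '{image_dir}/.' '{pvc_mount}/'; fi\n\
         # Written last, so an interrupted copy is retried on the next start.\n\
         touch '{sentinel}'\n"
    )
}
```

Calling seed_script("/sandbox", "/workspace-pvc") yields a script that could be passed as the argument of sh -c in the init container's command.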

Default volumeClaimTemplates injection in sandbox_to_k8s_spec (~line 786):
When the user hasn't provided custom volumeClaimTemplates, inject a default 2Gi PVC named workspace.
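
The defaulting rule can be sketched as a pure function. The ClaimTemplate struct and the function name are hypothetical simplifications (the real code works on the Sandbox CR's volumeClaimTemplates); only the name workspace and the 2Gi size come from this issue. The returned flag indicates whether the default was injected, which is also the case in which the init container would be added.

```rust
/// Simplified stand-in for a volumeClaimTemplates entry (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ClaimTemplate {
    name: String,
    storage: String,
}

/// Returns the templates to use and whether the default was injected.
/// User-provided templates are passed through untouched to avoid conflicts.
fn effective_claim_templates(user: Vec<ClaimTemplate>) -> (Vec<ClaimTemplate>, bool) {
    if user.is_empty() {
        let default = ClaimTemplate {
            name: "workspace".to_string(),
            storage: "2Gi".to_string(),
        };
        (vec![default], true)
    } else {
        (user, false)
    }
}
```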

Alternative Approaches Considered

| Approach | Verdict | Why |
|---|---|---|
| SubPath mount at /sandbox/workspace | Rejected | Only persists a subdirectory. User-installed packages, venv changes, and dotfiles are still lost. Agents work in /sandbox, not a subdirectory. |
| Overlay filesystem (fuse-overlayfs) | Rejected | Requires FUSE device or CAP_SYS_ADMIN mount usage, which weakens the sandbox security posture. Over-engineered for this use case. |
| Init container + PVC at /sandbox | Recommended | Standard k8s pattern. Fully persists image contents + user changes. Follows existing apply_supervisor_sideload code pattern. |

Patterns to Follow

The apply_supervisor_sideload function (sandbox/mod.rs:678-734) is the exact pattern:

  • Gets spec as mutable object
  • Adds to spec.volumes[]
  • Finds the agent container by name
  • Modifies container volumeMounts
  • The new function would additionally add spec.initContainers[]
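
The shape of that mutation might look like the sketch below. It uses simplified, hypothetical PodSpec and Container structs rather than the structures the server actually manipulates, and the agent container name, the volume name workspace, and the /workspace-pvc path are assumptions taken from the examples in this issue.

```rust
/// Simplified stand-ins for the pod spec (hypothetical types).
#[derive(Debug, Default)]
struct Container {
    name: String,
    image: String,
    command: Vec<String>,
    /// (volume name, mount path) pairs.
    volume_mounts: Vec<(String, String)>,
}

#[derive(Debug, Default)]
struct PodSpec {
    volumes: Vec<String>,
    containers: Vec<Container>,
    init_containers: Vec<Container>,
}

/// Mounts the workspace volume at /sandbox on the agent container and adds a
/// seeding init container that reuses the agent's image. Returns false and
/// leaves the spec untouched if no container has the given name.
fn apply_workspace_persistence(spec: &mut PodSpec, agent_name: &str) -> bool {
    let Some(agent) = spec.containers.iter_mut().find(|c| c.name == agent_name) else {
        return false;
    };
    agent
        .volume_mounts
        .push(("workspace".to_string(), "/sandbox".to_string()));
    let image = agent.image.clone();

    spec.volumes.push("workspace".to_string());
    spec.init_containers.push(Container {
        name: "workspace-init".to_string(), // assumed name
        image,
        command: vec![
            "sh".to_string(),
            "-c".to_string(),
            // Seed once; the sentinel is created only after a successful copy.
            "[ -f /workspace-pvc/.initialized ] || \
             { cp -a /sandbox/. /workspace-pvc/ && touch /workspace-pvc/.initialized; }"
                .to_string(),
        ],
        volume_mounts: vec![("workspace".to_string(), "/workspace-pvc".to_string())],
    });
    true
}
```

Returning a bool keeps the sketch easy to unit test; the real function would presumably follow whatever error handling apply_supervisor_sideload uses.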

Proposed Approach

Add an init container that seeds a PVC with the image's /sandbox contents on first creation. The init container uses the same sandbox image, mounts the PVC at a temp path, and, if a sentinel file is absent, copies the full contents of /sandbox into it, including dotfiles such as .venv and .bashrc (a plain /sandbox/* glob would skip them). The agent container then mounts the populated PVC at /sandbox. This preserves both the image's base tools and any user modifications across restarts. Skip injection when the user provides their own volumeClaimTemplates to avoid conflicts.

Scope Assessment

  • Complexity: Medium
  • Confidence: High — clear path, standard k8s patterns, follows existing code
  • Estimated files to change: 1-2 (sandbox/mod.rs, potentially constants.rs)
  • Issue type: feat

Risks & Open Questions

  • Image upgrade staleness: When a user updates their sandbox image, the PVC retains the old image's /sandbox contents. Start with "delete + recreate sandbox" as the upgrade path; a digest-based sentinel could be a follow-up.
  • Opt-out mechanism: Some users may want ephemeral sandboxes. Consider --no-persist flag or server env var. Can be a follow-up.
  • Init container runs as root: Needed to cp -a and preserve ownership. The supervisor sideload already sets runAsUser: 0 on the agent container, so this is consistent.
  • Partial copy on init container kill: If killed mid-copy, PVC is partially initialized. Sentinel is written last, so next attempt re-copies. Acceptable risk.
  • PVC cleanup on sandbox delete: The CRD controller RBAC includes PVC delete and likely sets ownerReferences for garbage collection. Verify this against actual controller behavior.
  • Custom images without /sandbox: Init container should be a no-op if /sandbox is empty or absent in the image.
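
For the image upgrade staleness risk, the digest-based sentinel mentioned as a possible follow-up could reduce to a decision like the one sketched here. Everything in it is hypothetical: it assumes the sentinel file stores the digest of the image that seeded the PVC, and it leaves open what the caller does when the image has changed.

```rust
/// What the init container should do for a given sentinel state.
#[derive(Debug, PartialEq)]
enum SeedAction {
    /// No sentinel: first start, copy the image contents.
    Seed,
    /// Sentinel matches the current image: nothing to do.
    Skip,
    /// Sentinel was written by a different image (or by the plain
    /// `.initialized` scheme, which leaves it empty).
    ImageChanged,
}

/// `sentinel` is the sentinel file's contents, or None if the file is absent.
fn seed_action(sentinel: Option<&str>, image_digest: &str) -> SeedAction {
    match sentinel.map(str::trim) {
        None => SeedAction::Seed,
        Some(recorded) if recorded == image_digest => SeedAction::Skip,
        Some(_) => SeedAction::ImageChanged,
    }
}
```

With the initial "delete + recreate sandbox" upgrade path, ImageChanged could simply be treated as Skip, possibly with a logged warning.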

Test Considerations

  • Unit tests: Test apply_workspace_persistence function — verify init container injection, volume mount, sentinel command.
  • Unit tests: Test that default VCT is injected when user doesn't provide one, and skipped when they do.
  • E2E tests: The gateway-resume E2E test should verify sandbox pod data survives stop/start cycles.
  • E2E tests: Verify python, pip, uv still work inside the sandbox (regression test for the shadowing bug).
  • Follow existing test patterns in sandbox/mod.rs tests (lines 1495+).

Created by spike investigation. Use build-from-issue to plan and implement.
