fix(bootstrap,server): persist sandbox state across gateway stop/start cycles by drew · Pull Request #739 · NVIDIA/OpenShell

drew · 2026-04-02T19:21:16Z

Summary

Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: k3s used the container ID as its node name (which changes on container recreation, triggering PVC deletion), and sandbox pods had no persistent storage by default.

Related Issue

Fixes #738

Changes

Deterministic k3s node name: Added node_name() to constants.rs and passed the OPENSHELL_NODE_NAME env var to the gateway container. The entrypoint script passes --node-name to k3s so the node identity survives container recreation. clean_stale_nodes() now compares against the expected node name instead of running hostname inside the container.
Default workspace PVC: Sandbox pods now get a default 1Gi volumeClaimTemplate named "workspace" mounted at /sandbox. This ensures user files, installed packages, etc. survive pod rescheduling across gateway restarts.
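
The node-name change can be illustrated with a minimal sketch. The `openshell-{name}` pattern is taken from the commit message; the function signatures, the `stale_nodes` helper, and the example node names are hypothetical and are not the actual code in constants.rs or clean_stale_nodes(). The hostname fallback for older cluster images is not modeled here.

```rust
/// Hypothetical sketch: derive a stable k3s node name from the gateway name.
/// The real `node_name()` may differ in signature and in how it sanitizes input.
pub fn node_name(gateway_name: &str) -> String {
    format!("openshell-{gateway_name}")
}

/// Hypothetical helper: return every registered node whose name differs from
/// the expected one. Only these would be candidates for cleanup, so a
/// recreated container that keeps the same node name never marks itself stale.
pub fn stale_nodes<'a>(registered: &'a [String], expected: &str) -> Vec<&'a str> {
    registered
        .iter()
        .map(String::as_str)
        .filter(|name| *name != expected)
        .collect()
}
```

With a container-ID node name, every recreation produces a new name, so the previous node (and the PVCs bound to it) looks stale. With a name derived from the gateway name, the expected value is the same before and after recreation, and the filter above returns nothing for that node.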

Testing

mise run pre-commit passes
Unit tests added/updated (2 new tests for workspace mount injection logic)
E2E tests added/updated (if applicable)

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

…t cycles

Two changes to preserve sandbox state across gateway restarts:

1. Deterministic k3s node identity: Set the Docker container hostname to a deterministic name derived from the gateway name (openshell-{name}). Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the cluster entrypoint as belt-and-suspenders. Update clean_stale_nodes() to prefer the deterministic name with a fallback to the container hostname for backward compatibility with older cluster images. This prevents clean_stale_nodes() from deleting PVCs (including the server's SQLite database) when the container is recreated after an image upgrade.

2. Default workspace persistence: Inject a 2Gi PVC and init container into every sandbox pod so the /sandbox directory survives pod rescheduling. The init container uses the same sandbox image, mounts the PVC at a temporary path, and copies the image's /sandbox contents (Python venv, dotfiles, skills) into the PVC on first use, guarded by a sentinel file so subsequent restarts are instant. The agent container then mounts the populated PVC at /sandbox. Users who supply custom volumeClaimTemplates are unaffected; the default workspace is skipped.

Fixes #738
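
The workspace injection described in the commit message could look roughly like the sketch below. It uses simplified stand-in types rather than real Kubernetes API structs, and the staging path, sentinel file name, and function names are assumptions made for illustration. The PVC size is a parameter because the PR description and the commit message give different values.

```rust
/// Simplified, hypothetical stand-ins for the Kubernetes pod and PVC types.
#[derive(Debug, Clone, PartialEq)]
pub struct VolumeClaim {
    pub name: String,
    pub size: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Mount {
    pub claim: String,
    pub path: String,
}

#[derive(Debug, Default)]
pub struct PodSpec {
    pub volume_claims: Vec<VolumeClaim>,
    pub init_commands: Vec<String>,
    pub mounts: Vec<Mount>,
}

const WORKSPACE: &str = "workspace"; // claim name from the PR description
const SANDBOX_DIR: &str = "/sandbox"; // mount path from the PR description
const STAGING_DIR: &str = "/workspace-init"; // hypothetical temporary mount path
const SENTINEL: &str = ".workspace-initialized"; // hypothetical sentinel file name

/// Shell command for the init container: copy the image's /sandbox contents
/// into the PVC once, then create a sentinel so later starts skip the copy.
pub fn init_script() -> String {
    format!(
        "[ -f {STAGING_DIR}/{SENTINEL} ] || {{ cp -a {SANDBOX_DIR}/. {STAGING_DIR}/ && touch {STAGING_DIR}/{SENTINEL}; }}"
    )
}

/// Add the default workspace claim, init command, and mount.
/// Returns false and leaves the pod untouched when the user already
/// supplied their own volume claims.
pub fn inject_default_workspace(pod: &mut PodSpec, size: &str) -> bool {
    if !pod.volume_claims.is_empty() {
        return false;
    }
    pod.volume_claims.push(VolumeClaim {
        name: WORKSPACE.to_string(),
        size: size.to_string(),
    });
    pod.init_commands.push(init_script());
    pod.mounts.push(Mount {
        claim: WORKSPACE.to_string(),
        path: SANDBOX_DIR.to_string(),
    });
    true
}
```

The copy has to happen in an init container because mounting an empty PVC directly at /sandbox would hide the files the image ships there. Staging the PVC at a different path first lets the image contents be copied in before the agent container mounts it.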

drew requested a review from a team as a code owner April 2, 2026 19:21

drew self-assigned this Apr 2, 2026

drew added the test:e2e Requires end-to-end coverage label Apr 2, 2026

pimlock previously approved these changes Apr 2, 2026

drew dismissed pimlock’s stale review via 28695b3 April 2, 2026 20:14

drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch 6 times, most recently from 871adc9 to 6da0e77 Compare April 3, 2026 04:37

drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch from 6da0e77 to 38892c3 Compare April 3, 2026 04:40

drew wants to merge 1 commit into main from 738-sandbox-persistence-across-gateway-restart
