fix(bootstrap,server): persist sandbox state across gateway stop/start cycles#739

Open
drew wants to merge 1 commit into main from 738-sandbox-persistence-across-gateway-restart

Conversation

@drew
Collaborator

@drew drew commented Apr 2, 2026

Summary

Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: k3s used the container ID as its node name (which changes on container recreation, triggering PVC deletion), and sandbox pods had no persistent storage by default.

Related Issue

Fixes #738

Changes

  • Deterministic k3s node name: Added node_name() to constants.rs and pass OPENSHELL_NODE_NAME env var to the gateway container. The entrypoint script uses --node-name so the k3s node identity survives container recreation. clean_stale_nodes() now compares against the expected node name instead of running hostname inside the container.
  • Default workspace PVC: Sandbox pods now get a default 1Gi volumeClaimTemplate named "workspace" mounted at /sandbox. This ensures user files, installed packages, etc. survive pod rescheduling across gateway restarts.
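The deterministic node name described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `openshell-{name}` pattern is taken from the commit message, while the helper names `node_name_env` and `k3s_node_args` are hypothetical and the real `node_name()` in `constants.rs` may have a different signature.

```rust
/// Sketch of a stable k3s node name derived from the gateway name.
/// Assumption: the name follows the `openshell-{name}` pattern.
pub fn node_name(gateway_name: &str) -> String {
    format!("openshell-{}", gateway_name)
}

/// Hypothetical helper: the `KEY=value` environment entry passed to the
/// gateway container so the entrypoint can read the node name.
pub fn node_name_env(gateway_name: &str) -> String {
    format!("OPENSHELL_NODE_NAME={}", node_name(gateway_name))
}

/// Hypothetical helper: arguments the entrypoint could append to the k3s
/// command line so the node identity does not depend on the container ID.
pub fn k3s_node_args(node_name: &str) -> Vec<String> {
    vec!["--node-name".to_string(), node_name.to_string()]
}
```

Because the name depends only on the gateway name, recreating the container yields the same node identity, so k3s sees the same node rather than a new one.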

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (2 new tests for workspace mount injection logic)
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner April 2, 2026 19:21
@drew drew self-assigned this Apr 2, 2026
@drew drew added the test:e2e Requires end-to-end coverage label Apr 2, 2026
pimlock previously approved these changes Apr 2, 2026
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch 6 times, most recently from 871adc9 to 6da0e77 Compare April 3, 2026 04:37
…t cycles

Two changes to preserve sandbox state across gateway restarts:

1. Deterministic k3s node identity: Set the Docker container hostname to
   a deterministic name derived from the gateway name (openshell-{name}).
   Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the
   cluster entrypoint as belt-and-suspenders.  Update clean_stale_nodes()
   to prefer the deterministic name with a fallback to the container
   hostname for backward compatibility with older cluster images.

   This prevents clean_stale_nodes() from deleting PVCs (including the
   server's SQLite database) when the container is recreated after an
   image upgrade.

2. Default workspace persistence: Inject a 2Gi PVC and init container
   into every sandbox pod so the /sandbox directory survives pod
   rescheduling.  The init container uses the same sandbox image, mounts
   the PVC at a temporary path, and copies the image's /sandbox contents
   (Python venv, dotfiles, skills) into the PVC on first use — guarded
   by a sentinel file so subsequent restarts are instant.  The agent
   container then mounts the populated PVC at /sandbox.  Users who
   supply custom volumeClaimTemplates are unaffected — the default
   workspace is skipped.
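The default-workspace behavior in item 2 could look something like the sketch below. It is illustrative only: the sentinel file name, the function names, and the exact shell command are assumptions, not taken from the PR.

```rust
/// Hypothetical sentinel file name marking an already-populated PVC.
const SENTINEL: &str = ".workspace-initialized";

/// The default workspace is injected only when the user supplied no
/// volumeClaimTemplates of their own.
pub fn needs_default_workspace(user_templates: &[String]) -> bool {
    user_templates.is_empty()
}

/// Builds the init container's shell command. `pvc_mount` is the temporary
/// path where the PVC is mounted inside the init container. The image's
/// /sandbox contents are copied once; later runs see the sentinel and skip.
pub fn init_script(pvc_mount: &str) -> String {
    format!(
        "if [ ! -f {m}/{s} ]; then cp -a /sandbox/. {m}/ && touch {m}/{s}; fi",
        m = pvc_mount,
        s = SENTINEL
    )
}
```

Writing the sentinel only after the copy succeeds means an interrupted first run is retried on the next start instead of leaving a half-populated workspace marked as complete.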

Fixes #738
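The node-selection rule from item 1 of the commit message can be sketched as below. The function names and signatures are hypothetical; the actual `clean_stale_nodes()` talks to the cluster and is not reproduced here.

```rust
/// Prefer the deterministic node name; fall back to the container hostname
/// when no deterministic name is available (for example, an older cluster
/// image that does not set one).
pub fn current_node_name<'a>(
    deterministic: Option<&'a str>,
    container_hostname: &'a str,
) -> &'a str {
    deterministic
        .filter(|n| !n.is_empty())
        .unwrap_or(container_hostname)
}

/// Every registered node other than the current one is treated as stale.
pub fn stale_nodes<'a>(registered: &[&'a str], current: &str) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|n| *n != current)
        .collect()
}
```

With a stable name, a recreated container matches the node that is already registered, so that node is not selected for cleanup and its PVCs are left alone.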
@drew drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch from 6da0e77 to 38892c3 Compare April 3, 2026 04:40
Development

Successfully merging this pull request may close these issues.

fix: sandbox pod state lost across gateway stop/start cycles
