fix: sandbox pod state lost across gateway stop/start cycles #738
Description
Problem Statement
When a user runs gateway stop followed by gateway start, sandbox pods are re-provisioned but all filesystem state inside them is lost. Users expect workspace files, installed packages, and other pod-local data to survive a gateway restart cycle.
There are two independent failure modes:
- Sandbox pods are ephemeral by default. Pods have no PersistentVolumeClaims, so even a simple pod reschedule (which happens on every k3s restart) loses the writable layer.
- Container recreation changes k3s node identity. When the gateway image changes between stop and start, the Docker container is recreated with a new container ID. k3s uses the container ID as its node name, so a new node is registered. clean_stale_nodes() then deletes all PVCs with node affinity for the old node, including the server's own StatefulSet PVC (openshell-data), wiping the SQLite database entirely.
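To make the second failure mode concrete, here is a simplified model of the stale-affinity check. It is a sketch, not the actual clean_stale_nodes() code: the Pvc struct, the stale_pvcs function, and the example node names are all hypothetical, and real node affinity lives on Kubernetes objects rather than plain strings.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a PVC pinned to one node.
#[derive(Debug, Clone, PartialEq)]
pub struct Pvc {
    pub name: String,
    /// Name of the node this PVC has affinity for.
    pub node: String,
}

/// Returns the names of PVCs whose affinity points at a node that is
/// no longer registered. In this model, those are the PVCs a cleanup
/// pass would delete.
pub fn stale_pvcs(pvcs: &[Pvc], live_nodes: &HashSet<String>) -> Vec<String> {
    pvcs.iter()
        .filter(|pvc| !live_nodes.contains(&pvc.node))
        .map(|pvc| pvc.name.clone())
        .collect()
}
```

If the node name is the container ID, the set of live nodes after a recreate no longer contains the old name, so every existing PVC (openshell-data included) is reported as stale. If the node name stays the same across recreates, the returned list is empty.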
Proposed Design
1. Stabilize k3s node identity across container recreations
Pass a deterministic --node-name to k3s in the cluster entrypoint script, derived from the gateway name rather than the container ID. This prevents node identity churn and stops clean_stale_nodes() from nuking PVCs when the container is recreated.
Files: deploy/docker/cluster-entrypoint.sh, crates/openshell-bootstrap/src/docker.rs
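One possible shape for the derivation, as a minimal sketch: node_name_for_gateway and k3s_node_name_args are hypothetical helpers, and the "openshell-" prefix and the "default" fallback are assumptions, not existing conventions in the codebase. The sketch restricts the result to a single lowercase DNS label of at most 63 characters, which is a conservative subset of what Kubernetes accepts for node names.

```rust
/// Hypothetical helper: derive a stable k3s node name from the gateway name.
/// The same input always yields the same output, so the name survives
/// container recreation.
pub fn node_name_for_gateway(gateway_name: &str) -> String {
    // Lowercase ASCII letters and digits pass through; everything else
    // (including non-ASCII characters) becomes '-'.
    let mapped: String = gateway_name
        .chars()
        .map(|c| {
            let c = c.to_ascii_lowercase();
            if c.is_ascii_lowercase() || c.is_ascii_digit() {
                c
            } else {
                '-'
            }
        })
        .collect();
    // A DNS label must start and end with an alphanumeric character.
    let trimmed = mapped.trim_matches('-');
    let base = if trimmed.is_empty() { "default" } else { trimmed };
    // Assumed prefix; the string is pure ASCII here, so truncating at a
    // byte index cannot split a character.
    let mut name = format!("openshell-{base}");
    name.truncate(63);
    // Truncation may expose a trailing '-', so trim once more.
    name.trim_end_matches('-').to_string()
}

/// Hypothetical helper: the extra arguments to pass to k3s.
pub fn k3s_node_name_args(gateway_name: &str) -> Vec<String> {
    vec!["--node-name".to_string(), node_name_for_gateway(gateway_name)]
}
```

Because the function is pure and depends only on the gateway name, docker.rs could compute the value and hand it to the entrypoint script (for example through an environment variable), or the script could compute the equivalent itself. Which side owns the derivation is a design choice this sketch does not settle.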
2. Add a default workspace PVC to sandbox pods
Automatically include a volumeClaimTemplate in the sandbox pod spec so that the sandbox's home/workspace directory is backed by persistent storage. The Sandbox CRD already supports volumeClaimTemplates — this just needs to be populated by default during sandbox creation.
Files: crates/openshell-server/src/sandbox/mod.rs
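The defaulting rule could look roughly like the following. This is a sketch under stated assumptions: SandboxSpec and VolumeClaimTemplate are simplified stand-ins rather than the real types, and the template name "workspace", the 1Gi size, and the ReadWriteOnce access mode are placeholder values, not decided defaults.

```rust
/// Simplified stand-in for one volumeClaimTemplate entry (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct VolumeClaimTemplate {
    pub name: String,
    pub storage: String,
    pub access_mode: String,
}

/// Simplified stand-in for the part of the sandbox spec that matters here.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SandboxSpec {
    pub volume_claim_templates: Vec<VolumeClaimTemplate>,
}

/// Adds a default workspace claim only when the caller supplied none,
/// so explicitly provided templates pass through unchanged.
pub fn apply_default_workspace(mut spec: SandboxSpec) -> SandboxSpec {
    if spec.volume_claim_templates.is_empty() {
        spec.volume_claim_templates.push(VolumeClaimTemplate {
            name: "workspace".to_string(),            // assumed name
            storage: "1Gi".to_string(),               // assumed size
            access_mode: "ReadWriteOnce".to_string(), // assumed access mode
        });
    }
    spec
}
```

Defaulting only when the list is empty keeps the existing passthrough behavior intact. The pod spec would also need a matching volume mount for the home/workspace directory, which this sketch leaves out.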
Agent Investigation
Explored the full gateway lifecycle and sandbox provisioning code:
- gateway stop only calls docker stop (docker.rs:878), preserving container + volume + network.
- gateway start calls ensure_network(), which always destroys/recreates the Docker bridge, then ensure_container(), which reuses or recreates the container depending on image match (docker.rs:473-567).
- On container recreate, clean_stale_nodes() (runtime.rs:379-519) deletes NotReady nodes, terminating pods, and PVCs with stale node affinity, including the server's own database PVC.
- Sandbox pods are created with only a hostPath (supervisor binary, read-only) and optional TLS secret volume (sandbox/mod.rs:645-661). No PVCs by default.
- The Sandbox CRD supports volumeClaimTemplates (datamodel.proto:47) but OpenShell doesn't populate this field.
- The k3s node name defaults to the container hostname (= container ID), which changes on container recreation.
Definition of Done
- k3s node name is deterministic and survives container recreation
- clean_stale_nodes() no longer deletes PVCs unnecessarily after image upgrades
- Sandbox pods include a default workspace PVC
- Server SQLite database survives gateway image upgrades
- Existing volumeClaimTemplates passthrough still works