fix: sandbox pod state lost across gateway stop/start cycles #738

@drew

Description

Problem Statement

When a user runs gateway stop followed by gateway start, sandbox pods are re-provisioned but all filesystem state inside them is lost. Users expect workspace files, installed packages, and other pod-local data to survive a gateway restart cycle.

There are two independent failure modes:

  1. Sandbox pods are ephemeral by default. Pods have no PersistentVolumeClaims, so even a simple pod reschedule (which happens on every k3s restart) loses the writable layer.

  2. Container recreation changes k3s node identity. When the gateway image changes between stop and start, the Docker container is recreated with a new container ID. The k3s node name defaults to the container hostname, which is the container ID, so k3s registers a new node. clean_stale_nodes() then deletes all PVCs with node affinity for the old node, including the server's own StatefulSet PVC (openshell-data), which wipes the SQLite database entirely.
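
The second failure mode can be illustrated with a small model. This is a simplified sketch, not the code in runtime.rs: the `Pvc` struct, `stale_pvcs`, `sample_pvcs`, the container IDs, and the node name `openshell-default` are all hypothetical, and the rule "a PVC is stale when it is pinned to a node other than the current one" is an assumption drawn from the description above.

```rust
/// Simplified stand-in for a PVC record; not the real Kubernetes type.
#[derive(Debug, Clone)]
struct Pvc {
    name: String,
    /// Node the claim's storage is pinned to, if any.
    node_affinity: Option<String>,
}

/// Hypothetical model of the cleanup rule: a PVC counts as stale when it is
/// pinned to a node other than the one currently registered.
fn stale_pvcs<'a>(pvcs: &'a [Pvc], current_node: &str) -> Vec<&'a str> {
    pvcs.iter()
        .filter(|p| matches!(&p.node_affinity, Some(n) if n.as_str() != current_node))
        .map(|p| p.name.as_str())
        .collect()
}

/// Example claims: one pinned to `node`, one with no node affinity.
fn sample_pvcs(node: &str) -> Vec<Pvc> {
    vec![
        Pvc { name: "openshell-data".to_string(), node_affinity: Some(node.to_string()) },
        Pvc { name: "unpinned".to_string(), node_affinity: None },
    ]
}

fn main() {
    // Node name follows the container ID, which changes on recreation
    // (both IDs here are made up for illustration).
    let pvcs = sample_pvcs("3f2a9c1b7d10");
    println!("ID-based node name: {:?}", stale_pvcs(&pvcs, "8e41d0a6c2ff"));

    // Deterministic node name: identical before and after recreation.
    let pvcs = sample_pvcs("openshell-default");
    println!("stable node name:   {:?}", stale_pvcs(&pvcs, "openshell-default"));
}
```

In this model the database claim is selected for deletion only when the node name changes between runs. A stable node name leaves nothing to clean up.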

Proposed Design

1. Stabilize k3s node identity across container recreations

Pass a deterministic --node-name to k3s in the cluster entrypoint script, derived from the gateway name rather than the container ID. This prevents node identity churn and stops clean_stale_nodes() from deleting PVCs when the container is recreated.

Files: deploy/docker/cluster-entrypoint.sh, crates/openshell-bootstrap/src/docker.rs
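
One possible way to derive the name is sketched below. The helper names `node_name_for_gateway` and `k3s_node_args`, the `openshell-` prefix, and the 63-character cap are assumptions for illustration. The only fixed constraint is that Kubernetes node names must be lowercase RFC 1123 names. How the value reaches cluster-entrypoint.sh is not shown here.

```rust
/// Hypothetical helper: derive a stable k3s node name from the gateway name.
/// The result depends only on the gateway name, never on the container ID.
fn node_name_for_gateway(gateway_name: &str) -> String {
    const PREFIX: &str = "openshell-"; // assumed prefix, not from the codebase
    const MAX_LEN: usize = 63; // conservative cap (length of one DNS label)

    let mut slug = String::new();
    for c in gateway_name.chars() {
        let c = c.to_ascii_lowercase();
        if c.is_ascii_lowercase() || c.is_ascii_digit() {
            slug.push(c);
        } else if !slug.is_empty() && !slug.ends_with('-') {
            // Collapse any run of other characters into a single '-'.
            slug.push('-');
        }
    }

    let mut name = format!("{PREFIX}{slug}");
    name.truncate(MAX_LEN); // safe: the string is pure ASCII at this point
    name.trim_end_matches('-').to_string()
}

/// Hypothetical helper: the arguments that would be handed to k3s.
fn k3s_node_args(gateway_name: &str) -> Vec<String> {
    vec!["--node-name".to_string(), node_name_for_gateway(gateway_name)]
}

fn main() {
    println!("{:?}", k3s_node_args("My Gateway_01"));
}
```

Truncating long names can make two distinct gateway names collide. A real implementation might append a short hash of the full name to avoid that.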

2. Add a default workspace PVC to sandbox pods

Automatically include a volumeClaimTemplate in the sandbox pod spec so that the sandbox's home/workspace directory is backed by persistent storage. The Sandbox CRD already supports volumeClaimTemplates; the field only needs to be populated by default during sandbox creation.

Files: crates/openshell-server/src/sandbox/mod.rs
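
A minimal sketch of the defaulting rule follows, assuming that caller-supplied templates are passed through unchanged and that a default is added only when none are given. The `VolumeClaimTemplate` struct is a simplified stand-in for the CRD type, and the name `workspace`, the `1Gi` request, and both function names are hypothetical.

```rust
/// Simplified stand-in for a volumeClaimTemplate entry; not the CRD type.
#[derive(Debug, Clone, PartialEq)]
struct VolumeClaimTemplate {
    name: String,
    access_mode: String,
    storage_request: String,
}

/// Hypothetical default: a single workspace claim. The name and size are
/// placeholders chosen for this sketch.
fn default_workspace_template() -> VolumeClaimTemplate {
    VolumeClaimTemplate {
        name: "workspace".to_string(),
        access_mode: "ReadWriteOnce".to_string(),
        storage_request: "1Gi".to_string(),
    }
}

/// Apply the default only when the caller supplied no templates, so that
/// existing passthrough behaviour is unchanged.
fn effective_templates(requested: Vec<VolumeClaimTemplate>) -> Vec<VolumeClaimTemplate> {
    if requested.is_empty() {
        vec![default_workspace_template()]
    } else {
        requested
    }
}

fn main() {
    // No templates requested: the default workspace claim is added.
    println!("{:?}", effective_templates(Vec::new()));

    // Templates requested: they are returned as given.
    let custom = VolumeClaimTemplate {
        name: "scratch".to_string(),
        access_mode: "ReadWriteOnce".to_string(),
        storage_request: "10Gi".to_string(),
    };
    println!("{:?}", effective_templates(vec![custom]));
}
```

Whether a default should also be added alongside user-supplied templates is a design choice this sketch leaves open.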

Agent Investigation

Explored the full gateway lifecycle and sandbox provisioning code:

  • gateway stop only calls docker stop (docker.rs:878), preserving container + volume + network.
  • gateway start calls ensure_network(), which always destroys and recreates the Docker bridge, then ensure_container(), which reuses or recreates the container depending on whether the image matches (docker.rs:473-567).
  • On container recreate, clean_stale_nodes() (runtime.rs:379-519) deletes NotReady nodes, pods in the Terminating state, and PVCs with stale node affinity, including the server's own database PVC.
  • Sandbox pods are created with only a hostPath (supervisor binary, read-only) and optional TLS secret volume (sandbox/mod.rs:645-661). No PVCs by default.
  • The Sandbox CRD supports volumeClaimTemplates (datamodel.proto:47) but OpenShell doesn't populate this field.
  • The k3s node name defaults to the container hostname (= container ID), which changes on container recreation.
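
The start path described in the list above can be summarized in a short sketch. It is a model of the findings, not the code in docker.rs: `ContainerAction`, `plan_container`, and `effective_node_name` are hypothetical names, and the `Create` case for a missing container is an assumption.

```rust
/// What `gateway start` does with the container in this model.
#[derive(Debug, PartialEq)]
enum ContainerAction {
    Create,   // assumed: no container exists yet
    Reuse,    // existing container runs the wanted image
    Recreate, // image changed: new container, new container ID
}

/// Decide based on image match, mirroring the behaviour described above.
fn plan_container(existing_image: Option<&str>, wanted_image: &str) -> ContainerAction {
    match existing_image {
        None => ContainerAction::Create,
        Some(img) if img == wanted_image => ContainerAction::Reuse,
        Some(_) => ContainerAction::Recreate,
    }
}

/// The node name falls back to the container hostname unless an explicit
/// name is supplied.
fn effective_node_name(explicit: Option<&str>, container_hostname: &str) -> String {
    explicit.unwrap_or(container_hostname).to_string()
}

fn main() {
    // Image tags and hostnames below are made up for illustration.
    println!("{:?}", plan_container(Some("gateway:1.0"), "gateway:1.1"));
    println!("{}", effective_node_name(None, "8e41d0a6c2ff"));
    println!("{}", effective_node_name(Some("openshell-default"), "8e41d0a6c2ff"));
}
```

Only the `Recreate` path changes the hostname, which is why the data loss appears after image upgrades and not after a plain stop and start.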

Definition of Done

  • k3s node name is deterministic and survives container recreation
  • clean_stale_nodes() no longer deletes PVCs unnecessarily after image upgrades
  • Sandbox pods include a default workspace PVC
  • Server SQLite database survives gateway image upgrades
  • Existing volumeClaimTemplates passthrough still works
