fix: sandbox pod state lost across gateway stop/start cycles #738

## Problem Statement

When a user runs `gateway stop` followed by `gateway start`, sandbox pods are re-provisioned but all filesystem state inside them is lost. Users expect workspace files, installed packages, and other pod-local data to survive a gateway restart cycle.

There are two independent failure modes:

1. **Sandbox pods are ephemeral by default.** Pods have no PersistentVolumeClaims, so even a simple pod reschedule (which happens on every k3s restart) loses the writable layer.

2. **Container recreation changes k3s node identity.** When the gateway image changes between stop and start, the Docker container is recreated with a new container ID. The k3s node name defaults to the container hostname, which is the container ID, so k3s registers a new node. `clean_stale_nodes()` then deletes all PVCs with node affinity for the old node, including the server's own StatefulSet PVC (`openshell-data`), which wipes the SQLite database entirely.
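
The second failure mode can be illustrated with a simplified model. This is a sketch, not the actual `clean_stale_nodes()` code: the `Pvc` struct and `stale_pvc_names` function are hypothetical, and it assumes the cleanup compares each PVC's node affinity against the current node name.

```rust
/// A PVC pinned to a node through node affinity (simplified, hypothetical model).
#[derive(Debug, Clone, PartialEq)]
struct Pvc {
    name: String,
    node_affinity: String,
}

/// Return the names of PVCs whose node affinity points at a node other than
/// the current one. In this model, these are the PVCs the cleanup would delete.
fn stale_pvc_names(pvcs: &[Pvc], current_node: &str) -> Vec<String> {
    pvcs.iter()
        .filter(|pvc| pvc.node_affinity != current_node)
        .map(|pvc| pvc.name.clone())
        .collect()
}
```

Under this model, if `openshell-data` was bound while the node was named after the old container ID, any new container ID makes it look stale. With a node name that stays the same across recreations, the function returns an empty list and nothing is deleted.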

## Proposed Design

### 1. Stabilize k3s node identity across container recreations

Pass a deterministic `--node-name` to k3s in the cluster entrypoint script, derived from the gateway name rather than the container ID. This prevents node identity churn and stops `clean_stale_nodes()` from nuking PVCs when the container is recreated.

**Files:** `deploy/docker/cluster-entrypoint.sh`, `crates/openshell-bootstrap/src/docker.rs`
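
A minimal sketch of how the node name could be derived, assuming the gateway name is available when the container is configured. The helper names and the `openshell-` prefix are illustrative assumptions, not existing code; `--node-name` is the k3s flag referred to above. The sanitization keeps the result within lowercase alphanumerics and `-`, capped at 63 characters, which is a conservative subset of what Kubernetes accepts for node names.

```rust
/// Derive a stable k3s node name from the gateway name (hypothetical helper).
/// The same gateway name always yields the same node name, regardless of
/// which container ID Docker assigns.
fn node_name_for_gateway(gateway_name: &str) -> String {
    // Lowercase ASCII letters and digits pass through; everything else becomes '-'.
    let sanitized: String = gateway_name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c.to_ascii_lowercase() } else { '-' })
        .collect();
    // The "openshell-" prefix is an assumption made for this sketch.
    let prefixed = format!("openshell-{}", sanitized.trim_matches('-'));
    // Cap the length at 63 characters and avoid ending on a '-'.
    let truncated: String = prefixed.chars().take(63).collect();
    truncated.trim_end_matches('-').to_string()
}

/// Build the k3s server arguments with the deterministic node name (hypothetical helper).
fn k3s_server_args(gateway_name: &str) -> Vec<String> {
    vec![
        "server".to_string(),
        format!("--node-name={}", node_name_for_gateway(gateway_name)),
    ]
}
```

In practice the name would more likely be passed from `docker.rs` into the container (for example as an environment variable) and consumed by `cluster-entrypoint.sh`; the exact handoff is left open here.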

### 2. Add a default workspace PVC to sandbox pods

Automatically include a `volumeClaimTemplate` in the sandbox pod spec so that the sandbox's home/workspace directory is backed by persistent storage. The Sandbox CRD already supports `volumeClaimTemplates` — this just needs to be populated by default during sandbox creation.

**Files:** `crates/openshell-server/src/sandbox/mod.rs`
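
The defaulting logic might look roughly like the sketch below. The `VolumeClaimTemplate` struct is a simplified stand-in for the real CRD type, and the `workspace` name, `ReadWriteOnce` access mode, and `1Gi` size are placeholder assumptions. The default is applied only when the caller supplies no templates, so an explicit `volumeClaimTemplates` passthrough is left untouched.

```rust
/// Simplified stand-in for a volume claim template entry (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct VolumeClaimTemplate {
    name: String,
    access_mode: String,
    storage_request: String,
}

/// Default workspace claim; the name, access mode, and size are placeholder values.
fn default_workspace_claim() -> VolumeClaimTemplate {
    VolumeClaimTemplate {
        name: "workspace".to_string(),
        access_mode: "ReadWriteOnce".to_string(),
        storage_request: "1Gi".to_string(),
    }
}

/// Use the caller's templates when present; otherwise fall back to the default
/// workspace claim so the sandbox always has persistent storage.
fn effective_claim_templates(user: Vec<VolumeClaimTemplate>) -> Vec<VolumeClaimTemplate> {
    if user.is_empty() {
        vec![default_workspace_claim()]
    } else {
        user
    }
}
```

Whether user-supplied templates should replace the default or be merged with it is a design choice this sketch does not settle; replacement is the simplest behavior that keeps the existing passthrough working.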

## Agent Investigation

Explored the full gateway lifecycle and sandbox provisioning code:

- `gateway stop` only calls `docker stop` (`docker.rs:878`), preserving container + volume + network.
- `gateway start` calls `ensure_network()` which always destroys/recreates the Docker bridge, then `ensure_container()` which reuses or recreates the container depending on image match (`docker.rs:473-567`).
- On container recreation, `clean_stale_nodes()` (`runtime.rs:379-519`) deletes NotReady nodes, pods stuck in the Terminating state, and PVCs with stale node affinity, including the server's own database PVC.
- Sandbox pods are created with only a hostPath (supervisor binary, read-only) and optional TLS secret volume (`sandbox/mod.rs:645-661`). No PVCs by default.
- The Sandbox CRD supports `volumeClaimTemplates` (`datamodel.proto:47`) but OpenShell doesn't populate this field.
- The k3s node name defaults to the container hostname (= container ID), which changes on container recreation.

## Definition of Done

- [ ] k3s node name is deterministic and survives container recreation
- [ ] `clean_stale_nodes()` no longer deletes PVCs unnecessarily after image upgrades
- [ ] Sandbox pods include a default workspace PVC
- [ ] Server SQLite database survives gateway image upgrades
- [ ] Existing `volumeClaimTemplates` passthrough still works
