feat(bootstrap): resume gateway from existing state and persist SSH handshake secret #488
Conversation
force-pushed from 46a1bcf to 146ef3c
Thankfully someone already has a PR for this. I came across the exact same problem today after restarting Docker and was lost for a second. Hopefully it gets merged soon; until then I will use my locally patched version for further testing. 👍
I'm hitting this exact issue running OpenShell v0.0.19. Every VM stop/start cycle (including nightly backups that stop and restart VMs) breaks sandbox SSH with "handshake verification failed." The only recovery is deleting and recreating the sandbox. I've worked around it by switching all automation to `kubectl exec` (gateway startup, health checks, etc.), but IDE access via `openshell sandbox connect --editor` is completely blocked until the sandbox is recreated. The persistent SSH secret + gateway resume in this PR would fix both issues. Looking forward to this shipping.
…andshake secret

Add a resume code path to gateway start so existing Docker volume state (k3s, etcd, sandboxes, secrets) is reused instead of requiring a full destroy/recreate cycle. When the container is gone but the volume remains (e.g. after a Docker restart), the CLI automatically creates a new container with the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the cluster entrypoint (regenerated on every container start) to a Kubernetes Secret that persists in etcd on the Docker volume. This ensures sandbox SSH sessions survive container restarts.

Key changes:

- Add DeployOptions.resume flag with a resume branch in the deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to the gateway container
- Reconcile the SSH handshake secret as a K8s Secret alongside the TLS PKI
- Update the Helm chart to read the secret via secretKeyRef
- Add the SSH handshake secret to the cluster health check

Closes #487
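The reuse-or-generate behavior for the handshake secret can be expressed as pure logic. This is an illustrative sketch, not the PR's actual code: the function name is hypothetical and an in-memory map stands in for the Kubernetes Secret store.

```rust
use std::collections::HashMap;

/// Reuse a persisted handshake secret if one exists, otherwise generate and
/// store a new one. Returns the secret and whether it was newly generated.
/// (`store` is a stand-in for the Kubernetes Secret API.)
fn reconcile_handshake_secret(
    store: &mut HashMap<String, String>,
    name: &str,
    generate: impl Fn() -> String,
) -> (String, bool) {
    if let Some(existing) = store.get(name) {
        // Secret survived from a previous boot: reuse it so active SSH
        // sessions keep verifying against the same HMAC key.
        return (existing.clone(), false);
    }
    let fresh = generate();
    store.insert(name.to_string(), fresh.clone());
    (fresh, true)
}
```

The key property is idempotence: running reconciliation on every container start is safe, because only the first run generates anything.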
On resume after a container kill, ensure_network destroys and recreates the Docker network with a new ID, but the stopped container still references the old network ID, causing 'network not found' on start. Fix by reconciling the container's network attachment in ensure_container.

Also, reconcile_pki was attempting to load K8s secrets before k3s had booted, failing transiently and regenerating PKI unnecessarily; this triggered a server rollout restart, causing TLS errors. Fix by waiting for the openshell namespace before attempting to read existing secrets.

Add a gRPC readiness check to gateway_admin_deploy so the CLI waits for the server to accept connections before declaring the gateway ready.

Add an e2e test covering container kill, stale network, sandbox persistence, and sandbox create after resume.
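The "wait for the namespace before reading secrets" fix is, at its core, a poll-until-ready loop with a deadline. A minimal sketch under that assumption (helper name and signature are hypothetical, not the PR's code):

```rust
use std::time::{Duration, Instant};

/// Poll `check` at a fixed interval until it succeeds or the deadline would
/// be exceeded. Mirrors waiting for the openshell namespace to appear before
/// attempting to read existing secrets.
fn wait_until(deadline: Duration, interval: Duration, mut check: impl FnMut() -> bool) -> bool {
    let start = Instant::now();
    loop {
        if check() {
            return true;
        }
        // Give up if the next sleep would push us past the deadline.
        if start.elapsed() + interval > deadline {
            return false;
        }
        std::thread::sleep(interval);
    }
}
```

Returning `false` instead of erroring lets the caller decide whether a missing namespace is fatal or just means "fresh cluster, nothing to reuse yet".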
The wait_for_healthy helper checked for 'healthy', 'running', or '✓', but openshell status outputs 'Connected'. All five gateway_resume tests were failing because the health check never matched.
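A minimal sketch of the fixed check, assuming the helper simply scans the status output for known-good substrings (names here are illustrative, not the test helper's real code):

```rust
/// Treat any of these substrings as a healthy gateway. The original helper
/// missed "Connected", which is what `openshell status` actually prints.
fn looks_healthy(status_output: &str) -> bool {
    ["healthy", "running", "✓", "Connected"]
        .iter()
        .any(|needle| status_output.contains(needle))
}
```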
…ternally

The deploy flow now auto-detects whether to resume by checking for existing gateway state inside deploy_gateway_with_logs. Callers no longer need to compute and pass a resume flag. The explicit gateway start path still short-circuits for already-running gateways to avoid redundant work.
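The auto-detection described above boils down to mapping observed gateway state to an action. A hypothetical sketch of that decision table (enum and function names are illustrative):

```rust
/// Observable gateway state on the host.
#[derive(Debug, PartialEq)]
enum GatewayState {
    Running,    // container up and healthy
    Stopped,    // container exists but is stopped
    VolumeOnly, // container gone, Docker volume remains
    Absent,     // no prior state at all
}

/// What the deploy flow should do for each state.
#[derive(Debug, PartialEq)]
enum DeployAction {
    AlreadyUp,      // short-circuit, nothing to do
    Resume,         // reuse volume state, reconcile PKI and secrets
    FreshBootstrap, // full create from scratch
}

fn plan_deploy(state: GatewayState) -> DeployAction {
    match state {
        GatewayState::Running => DeployAction::AlreadyUp,
        // Any leftover state (stopped container or orphaned volume)
        // means prior sandboxes may exist, so prefer resuming.
        GatewayState::Stopped | GatewayState::VolumeOnly => DeployAction::Resume,
        GatewayState::Absent => DeployAction::FreshBootstrap,
    }
}
```

Centralizing this match inside the deploy flow is what lets callers drop the explicit resume flag.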
The gateway returns HTTP 412 (Precondition Failed) when the sandbox pod exists but hasn't reached Ready phase yet. This is a transient state after allocation. Instead of failing immediately, retry with exponential backoff (1s to 8s) for up to 60 seconds.
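The retry schedule described above (delays doubling from 1s, capped at 8s, within a 60s budget) can be computed up front. An illustrative sketch, not the PR's actual retry code:

```rust
use std::time::Duration;

/// Build the backoff delays for retrying an HTTP 412 from the gateway:
/// each delay doubles from `base` up to `cap`, and the sequence stops once
/// the cumulative wait would exceed `budget`.
fn backoff_schedule(base: Duration, cap: Duration, budget: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut next = base;
    let mut spent = Duration::ZERO;
    while spent + next <= budget {
        delays.push(next);
        spent += next;
        next = cap.min(next * 2); // double, but never past the cap
    }
    delays
}
```

With base 1s, cap 8s, and a 60s budget this yields 1s, 2s, 4s, then repeated 8s waits, which gives the pod ample time to reach Ready without hammering the gateway.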
- Remove duplicate Duration import and use unqualified Duration in ssh.rs
- Prefix unused default_image parameter with underscore in sandbox/mod.rs
- Make SecretResolver pub to match its use in a pub function signature
force-pushed from 146ef3c to e1bea6d
…ation

When a gateway is stopped and restarted with a different container image, ensure_container() removes the old container and creates a new one. The new container gets a different hostname (Docker's default is the container ID prefix), which k3s registers as a new node. Pods on the old node remain stuck in Terminating until the eviction timeout expires, causing the 30s health check to fail with 'connection reset by peer'.

Preserve the old container's hostname before removal and set it on the replacement container so k3s sees the same node identity. For fresh containers, set the hostname to the container name for a stable default that survives future recreations.
force-pushed from e21e78f to 8c38234
force-pushed from f967eb8 to 1b881d8
…deletion

Reverts the hostname preservation approach, which caused k3s node password validation failures. Instead, makes clean_stale_nodes() reliable by:

1. Retrying with 3s backoff (up to ~45s) until kubectl becomes available after a container restart, instead of firing once and silently giving up.
2. Force-deleting pods stuck in Terminating on removed stale nodes so StatefulSets can immediately reschedule replacements.

This fixes gateway resume failures after stop/start when the container image has changed (common in development), where the new container gets a different k3s node identity and pods on the old node never reschedule.
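Force-deleting a pod stuck in Terminating uses kubectl's real escape hatch, `--force --grace-period=0`. A small sketch that builds the invocation (the function name and the way it is composed are illustrative, not the PR's code):

```rust
/// Build the argument list for force-deleting a pod stuck in Terminating
/// on a removed stale node. `--grace-period=0` with `--force` skips the
/// normal graceful shutdown so the StatefulSet can reschedule immediately.
fn force_delete_pod_args(pod: &str, namespace: &str) -> Vec<String> {
    ["delete", "pod", pod, "-n", namespace, "--force", "--grace-period=0"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}
```

In the real flow these args would be passed to a kubectl subprocess (e.g. via `std::process::Command::new("kubectl").args(...)`) inside the retry loop described above.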
force-pushed from 1b881d8 to 4ba1b61
## Summary

Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so `openshell gateway start` recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.

## Related Issue

Closes #487
## Changes

### Gateway Resume

- `DeployOptions.resume` flag with a resume branch in `deploy_gateway_with_logs` that falls through to idempotent `ensure_*` calls instead of erroring or destroying
- `gateway_admin_deploy` auto-resumes for stopped/volume-only states; already-running returns immediately; `--recreate` still destroys
- Auto-bootstrap (e.g. on `sandbox create`) tries resume first, falls back to recreate on failure (logged at `warn`)
- `cleanup_gateway_container` for volume-preserving cleanup on resume failure
- `unless-stopped` Docker restart policy so the container auto-restarts on Docker daemon restart

### SSH Handshake Secret Persistence
- `reconcile_ssh_handshake_secret` in bootstrap: checks if the K8s Secret exists, reuses it if present, generates a new one if missing (same pattern as TLS PKI reconciliation)
- Server reads `OPENSHELL_SSH_HANDSHAKE_SECRET` via `secretKeyRef` instead of a plain value; ephemeral generation removed from `cluster-entrypoint.sh`
- `sshHandshakeSecret` passed from HelmChart CR values; `sshHandshakeSecretName` added to `values.yaml`
- `cluster-deploy-fast.sh` updated to create the K8s Secret directly via kubectl

## Testing
- `mise run pre-commit` passes (format, lint, license headers)
- `cargo test --package openshell-bootstrap --package openshell-cli`: all 163 tests pass
- E2E tests (`mise run e2e`): require a running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway

## Checklist