feat(bootstrap): resume gateway from existing state and persist SSH handshake secret#488

Merged
drew merged 11 commits into main from 487-gateway-resume-ssh-secret/drew
Apr 2, 2026

@drew drew commented Mar 19, 2026

Summary

Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so openshell gateway start recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.

Related Issue

Closes #487

Changes

Gateway Resume

  • Add DeployOptions.resume flag with a resume branch in deploy_gateway_with_logs that falls through to idempotent ensure_* calls instead of erroring or destroying
  • gateway_admin_deploy auto-resumes for stopped/volume-only states; already-running returns immediately; --recreate still destroys
  • Auto-bootstrap (sandbox create) tries resume first, falls back to recreate on failure (logged at warn)
  • Add cleanup_gateway_container for volume-preserving cleanup on resume failure
  • Add unless-stopped Docker restart policy so the container auto-restarts on Docker daemon restart
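
A minimal sketch of the decision logic the bullets above describe, not the code from this PR. The `GatewayState`, `DeployAction`, `plan_gateway_start`, and `resume_or_recreate` names are hypothetical, and treating a missing gateway as a fresh deploy is an assumption the description does not spell out.

```rust
/// Hypothetical view of what the CLI finds on the Docker host.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GatewayState {
    /// Container exists and is running.
    Running,
    /// Container exists but is stopped.
    Stopped,
    /// Container is gone, but the Docker volume remains.
    VolumeOnly,
    /// Neither container nor volume exists (assumed case).
    Absent,
}

/// Hypothetical action chosen for an explicit gateway start.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeployAction {
    ReturnImmediately,
    Resume,
    Recreate,
    FreshDeploy,
}

/// Map the observed state and the `--recreate` flag to an action.
pub fn plan_gateway_start(state: GatewayState, recreate: bool) -> DeployAction {
    match (state, recreate) {
        // Nothing to resume or destroy (assumption: deploy from scratch).
        (GatewayState::Absent, _) => DeployAction::FreshDeploy,
        // --recreate still destroys existing state.
        (_, true) => DeployAction::Recreate,
        // Already running: return immediately.
        (GatewayState::Running, false) => DeployAction::ReturnImmediately,
        // Stopped or volume-only: resume from existing state.
        (GatewayState::Stopped | GatewayState::VolumeOnly, false) => DeployAction::Resume,
    }
}

/// Auto-bootstrap style fallback: try resume first, and if it fails,
/// log a warning and fall back to recreate.
pub fn resume_or_recreate<T, E: std::fmt::Display>(
    resume: impl FnOnce() -> Result<T, E>,
    recreate: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match resume() {
        Ok(value) => Ok(value),
        Err(err) => {
            eprintln!("warn: gateway resume failed, falling back to recreate: {err}");
            recreate()
        }
    }
}
```

Keeping the decision a pure function of state and flag makes it easy to unit test without a Docker daemon.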

SSH Handshake Secret Persistence

  • Add reconcile_ssh_handshake_secret in bootstrap — checks if K8s secret exists, reuses if present, generates new if missing (same pattern as TLS PKI reconciliation)
  • Update Helm chart StatefulSet to read OPENSHELL_SSH_HANDSHAKE_SECRET via secretKeyRef instead of plain value
  • Remove secret generation and sed injection from cluster-entrypoint.sh
  • Remove sshHandshakeSecret from HelmChart CR values; add sshHandshakeSecretName to values.yaml
  • Update cluster-deploy-fast.sh to create K8s secret directly via kubectl
  • Add SSH handshake secret existence to cluster health check
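
The reuse-or-generate pattern behind reconcile_ssh_handshake_secret can be sketched roughly as follows. This is an illustration under assumptions: the `SecretStore` trait and `InMemoryStore` are hypothetical stand-ins for the Kubernetes Secret API, and the generator is passed in as a closure so the sketch stays free of any particular randomness source.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for reading and writing Kubernetes Secrets.
pub trait SecretStore {
    fn get(&self, name: &str) -> Option<String>;
    fn put(&mut self, name: &str, value: String);
}

/// In-memory store used only for illustration and tests.
#[derive(Default)]
pub struct InMemoryStore(pub HashMap<String, String>);

impl SecretStore for InMemoryStore {
    fn get(&self, name: &str) -> Option<String> {
        self.0.get(name).cloned()
    }

    fn put(&mut self, name: &str, value: String) {
        self.0.insert(name.to_string(), value);
    }
}

/// Outcome of a reconcile pass.
#[derive(Debug, PartialEq, Eq)]
pub enum Reconciled {
    /// The secret already existed and was left untouched.
    Reused(String),
    /// The secret was missing, so a new one was generated and stored.
    Generated(String),
}

/// Reuse the secret if it is present; otherwise generate and persist one.
/// `generate` is only called when the secret is missing.
pub fn reconcile_secret<S: SecretStore>(
    store: &mut S,
    name: &str,
    generate: impl FnOnce() -> String,
) -> Reconciled {
    if let Some(existing) = store.get(name) {
        return Reconciled::Reused(existing);
    }
    let fresh = generate();
    store.put(name, fresh.clone());
    Reconciled::Generated(fresh)
}
```

Calling `reconcile_secret` twice against the same store returns `Generated` the first time and `Reused` with the same value the second time, which is the property that keeps SSH sessions valid across container restarts.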

Testing

  • mise run pre-commit passes (format, lint, license headers)
  • cargo test --package openshell-bootstrap --package openshell-cli — all 163 tests pass
  • E2E tests (mise run e2e) — requires running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner March 19, 2026 21:59
@drew drew added the area:gateway (Gateway server and control-plane work) and area:cluster (Related to running OpenShell on k3s/docker) labels Mar 19, 2026
@drew drew self-assigned this Mar 19, 2026
@drew drew added the test:e2e (Requires end-to-end coverage) label Mar 19, 2026
johntmyers previously approved these changes Mar 19, 2026
@ohgeebtw

Thankfully someone already has a PR for this. Came across the exact same problem today after restarting docker and was lost for a second.

Hopefully it gets merged soon. Until then I will use my locally patched version for further testing of nemoclaw 👍

@rossmorey

I'm hitting this exact issue running OpenShell v0.0.19. Every VM stop/start cycle (including nightly backups that stop/restart VMs) breaks sandbox SSH with "handshake verification failed." The only recovery is deleting and recreating the sandbox.

I've worked around it by switching all automation to kubectl exec (gateway startup, health checks, etc.), but IDE access via openshell sandbox connect --editor is completely blocked until the sandbox is recreated. The persistent SSH secret + gateway resume in this PR would fix both issues. Looking forward to this shipping.

drew added 7 commits April 1, 2026 09:45
…andshake secret

Add a resume code path to gateway start so existing Docker volume state
(k3s, etcd, sandboxes, secrets) is reused instead of requiring a full
destroy/recreate cycle. When the container is gone but the volume remains
(e.g. Docker restart), the CLI automatically creates a new container with
the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the
cluster entrypoint (regenerated on every container start) to a Kubernetes
Secret that persists in etcd on the Docker volume. This ensures sandbox
SSH sessions survive container restarts.

Key changes:
- Add DeployOptions.resume flag with resume branch in deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to gateway container
- Reconcile SSH handshake secret as K8s Secret alongside TLS PKI
- Update Helm chart to read secret via secretKeyRef
- Add SSH handshake secret to cluster health check

Closes #487

On resume after container kill, ensure_network destroys and recreates
the Docker network with a new ID. The stopped container still referenced
the old network ID, causing 'network not found' on start. Fix by
reconciling the container's network attachment in ensure_container.
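
The stale-network check described above can be sketched like this. It is a rough illustration only: `NetworkFix` and `plan_network_fix` are hypothetical names, and the real ensure_container works against the Docker API rather than plain strings.

```rust
/// Hypothetical result of comparing a stopped container's network
/// attachment with the network that ensure_network just (re)created.
#[derive(Debug, PartialEq, Eq)]
pub enum NetworkFix {
    /// The container already points at the current network.
    UpToDate,
    /// The container references a network ID that no longer exists and
    /// must be detached from `from` and attached to `to` before start.
    Reattach { from: String, to: String },
}

/// Decide whether the container's network attachment needs reconciling.
pub fn plan_network_fix(attached_network_id: &str, current_network_id: &str) -> NetworkFix {
    if attached_network_id == current_network_id {
        NetworkFix::UpToDate
    } else {
        NetworkFix::Reattach {
            from: attached_network_id.to_string(),
            to: current_network_id.to_string(),
        }
    }
}
```

Running this comparison before starting the container is what avoids the 'network not found' error when the network was recreated with a new ID.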

Also, reconcile_pki was attempting to load K8s secrets before k3s had
booted, failing transiently, and regenerating PKI unnecessarily. This
triggered a server rollout restart causing TLS errors. Fix by waiting
for the openshell namespace before attempting to read existing secrets.

Add gRPC readiness check to gateway_admin_deploy so the CLI waits for
the server to accept connections before declaring the gateway ready.

Add e2e test covering container kill, stale network, sandbox persistence,
and sandbox create after resume.

The wait_for_healthy helper checked for 'healthy', 'running', or '✓'
but openshell status outputs 'Connected'. All five gateway_resume tests
were failing because the health check never matched.

…ternally

The deploy flow now auto-detects whether to resume by checking for
existing gateway state inside deploy_gateway_with_logs. Callers no
longer need to compute and pass a resume flag. The explicit gateway
start path still short-circuits for already-running gateways to avoid
redundant work.

The gateway returns HTTP 412 (Precondition Failed) when the sandbox pod
exists but hasn't reached Ready phase yet. This is a transient state
after allocation. Instead of failing immediately, retry with exponential
backoff (1s to 8s) for up to 60 seconds.
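
A minimal sketch of the retry schedule described above, assuming the delay doubles from 1s, is capped at 8s, and the summed delays must fit in the 60s budget. The `backoff_delays` and `retry_while_precondition_failed` names are hypothetical, errors are modeled as bare HTTP status codes, and the sleep function is injected so the logic can be tested without waiting.

```rust
use std::time::Duration;

/// Build a capped exponential backoff schedule whose total stays within `budget`.
pub fn backoff_delays(initial: Duration, max_delay: Duration, budget: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    // A zero initial delay would never consume the budget, so bail out.
    if initial.is_zero() {
        return delays;
    }
    let mut next = initial;
    let mut total = Duration::ZERO;
    while total + next <= budget {
        delays.push(next);
        total += next;
        // Double the delay, but never exceed the cap.
        next = (next * 2).min(max_delay);
    }
    delays
}

/// Retry `attempt` while it fails with HTTP 412, sleeping between tries.
/// Any other error, or success, is returned immediately. Once the schedule
/// is exhausted, one final attempt is made and its result is returned.
pub fn retry_while_precondition_failed<T>(
    mut attempt: impl FnMut() -> Result<T, u16>,
    delays: &[Duration],
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    for delay in delays {
        match attempt() {
            Err(412) => sleep(*delay),
            other => return other,
        }
    }
    attempt()
}
```

With a 1s initial delay, an 8s cap, and a 60s budget, the schedule is 1, 2, 4, 8, 8, 8, 8, 8, 8 seconds (55s in total). In real use the `sleep` argument would be something like `std::thread::sleep` or an async timer.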
- Remove duplicate Duration import and use unqualified Duration in ssh.rs
- Prefix unused default_image parameter with underscore in sandbox/mod.rs
- Make SecretResolver pub to match its use in pub function signature
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 146ef3c to e1bea6d Compare April 1, 2026 19:49
drew added 2 commits April 1, 2026 13:04
…ation

When a gateway is stopped and restarted with a different container image,
ensure_container() removes the old container and creates a new one. The
new container gets a different hostname (Docker default: container ID
prefix), which k3s registers as a new node. Pods on the old node remain
stuck in Terminating until the eviction timeout expires, causing the 30s
health check to fail with 'connection reset by peer'.

Preserve the old container's hostname before removal and set it on the
replacement container so k3s sees the same node identity. For fresh
containers, set the hostname to the container name for a stable default
that survives future recreations.
johntmyers previously approved these changes Apr 1, 2026
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 6 times, most recently from e21e78f to 8c38234 Compare April 2, 2026 03:01
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch 11 times, most recently from f967eb8 to 1b881d8 Compare April 2, 2026 06:30
…deletion

Reverts the hostname preservation approach which caused k3s node password
validation failures. Instead, makes clean_stale_nodes() reliable by:

1. Retrying with 3s backoff (up to ~45s) until kubectl becomes available
   after a container restart, instead of firing once and silently giving up.
2. Force-deleting pods stuck in Terminating on removed stale nodes so
   StatefulSets can immediately reschedule replacements.

This fixes gateway resume failures after stop/start when the container
image has changed (common in development), where the new container gets a
different k3s node identity and pods on the old node never reschedule.
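
The two numbered steps in this commit message can be sketched as below. This is an illustration under assumptions, not the actual clean_stale_nodes() code: `retry_fixed`, `Pod`, and `pods_to_force_delete` are hypothetical names, and the "3s backoff up to ~45s" is modeled as roughly 15 attempts with a fixed pause supplied by the caller.

```rust
use std::collections::HashSet;

/// Retry `op` up to `max_attempts` times, calling `sleep` between failures.
/// For the commit above, `sleep` would pause about 3s and `max_attempts`
/// would be around 15 (assumed from "3s backoff, up to ~45s").
pub fn retry_fixed<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error instead of giving up silently.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                attempt += 1;
                sleep();
            }
        }
    }
}

/// Hypothetical, simplified view of a pod.
pub struct Pod {
    pub name: String,
    pub node: String,
    pub terminating: bool,
}

/// Select pods that are stuck in Terminating on nodes that were just removed,
/// so they can be force-deleted and rescheduled.
pub fn pods_to_force_delete<'a>(pods: &'a [Pod], removed_nodes: &HashSet<String>) -> Vec<&'a str> {
    pods.iter()
        .filter(|pod| pod.terminating && removed_nodes.contains(&pod.node))
        .map(|pod| pod.name.as_str())
        .collect()
}
```

Pods that are Terminating on a node that still exists are left alone, so only pods orphaned by the stale node are force-deleted.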
@drew drew force-pushed the 487-gateway-resume-ssh-secret/drew branch from 1b881d8 to 4ba1b61 Compare April 2, 2026 06:46
@drew drew merged commit e837849 into main Apr 2, 2026
19 of 21 checks passed
@drew drew deleted the 487-gateway-resume-ssh-secret/drew branch April 2, 2026 16:50
Development

Successfully merging this pull request may close these issues.

feat: gateway resume from existing state and persistent SSH handshake secret

4 participants