Merged
2 changes: 1 addition & 1 deletion architecture/gateway-single-node.md
@@ -185,7 +185,7 @@ For the target daemon (local or remote):

After the container starts:

1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
1. **Clean stale nodes**: `clean_stale_nodes()` finds nodes whose name does not match the deterministic k3s `--node-name` and deletes them. That node name is derived from the gateway name but normalized to a Kubernetes-safe lowercase form so existing gateway names that contain `_`, `.`, or uppercase characters still produce a valid node identity. This cleanup is needed when a container is recreated but reuses the persistent volume -- old node entries can persist in etcd. Non-fatal on error; returns the count of removed nodes.
2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
- Container exits during polling: error includes recent log lines.
2 changes: 1 addition & 1 deletion architecture/gateway.md
@@ -501,7 +501,7 @@ The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`

`SandboxClient` (`crates/openshell-server/src/sandbox/mod.rs`) manages `agents.x-k8s.io/v1alpha1/Sandbox` CRDs.

- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.).
- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.). When callers do not provide custom `volumeClaimTemplates`, the server injects a default `workspace` PVC and mounts it at `/sandbox` so the default sandbox home/workdir survives pod rescheduling.
- **Delete**: Calls the Kubernetes API to delete the CRD by name. Returns `false` if already gone (404).
- **Pod IP resolution**: `agent_pod_ip()` fetches the agent pod and reads `status.podIP`.

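The default-PVC injection described in the `SandboxClient` **Create** bullet above can be sketched as a pure decision function. This is an illustrative sketch, not the actual `SandboxClient` API: the `ClaimTemplate` struct and `effective_claim_templates` helper are hypothetical; only the `workspace` name and `/sandbox` mount path come from the doc.

```rust
// Hypothetical sketch of the default workspace-PVC injection: when the
// caller supplies no volumeClaimTemplates, the server injects a default
// `workspace` claim mounted at /sandbox so sandbox state survives
// pod rescheduling.
#[derive(Debug, Clone, PartialEq)]
struct ClaimTemplate {
    name: String,
    mount_path: String,
}

fn effective_claim_templates(user: Vec<ClaimTemplate>) -> Vec<ClaimTemplate> {
    if !user.is_empty() {
        // Caller provided custom templates: leave them untouched.
        return user;
    }
    vec![ClaimTemplate {
        name: "workspace".to_string(),
        mount_path: "/sandbox".to_string(),
    }]
}

fn main() {
    let defaults = effective_claim_templates(vec![]);
    assert_eq!(defaults[0].name, "workspace");
    assert_eq!(defaults[0].mount_path, "/sandbox");
    println!("ok");
}
```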
85 changes: 85 additions & 0 deletions crates/openshell-bootstrap/src/constants.rs
@@ -13,15 +13,100 @@ pub const SERVER_CLIENT_CA_SECRET_NAME: &str = "openshell-server-client-ca";
pub const CLIENT_TLS_SECRET_NAME: &str = "openshell-client-tls";
/// K8s secret holding the SSH handshake HMAC secret (shared by gateway and sandbox pods).
pub const SSH_HANDSHAKE_SECRET_NAME: &str = "openshell-ssh-handshake";
const NODE_NAME_PREFIX: &str = "openshell-";
const NODE_NAME_FALLBACK_SUFFIX: &str = "gateway";
const KUBERNETES_MAX_NAME_LEN: usize = 253;

pub fn container_name(name: &str) -> String {
format!("openshell-cluster-{name}")
}

/// Deterministic k3s node name derived from the gateway name.
///
/// k3s defaults to using the container hostname (= Docker container ID) as
/// the node name. When the container is recreated (e.g. after an image
/// upgrade), the container ID changes, creating a new k3s node. The
/// `clean_stale_nodes` function then deletes PVCs whose backing PVs have
/// node affinity for the old node — wiping the server database and any
/// sandbox persistent volumes.
///
/// By passing a deterministic `--node-name` to k3s, the node identity
/// survives container recreation, and PVCs are never orphaned.
///
/// Gateway names allow Docker-friendly separators and uppercase characters,
/// but Kubernetes node names must be DNS-safe. Normalize the gateway name into
/// a single lowercase RFC 1123 label so previously accepted names such as
/// `prod_us` or `Prod.US` still deploy successfully.
pub fn node_name(name: &str) -> String {
format!("{NODE_NAME_PREFIX}{}", normalize_node_name_suffix(name))
}

fn normalize_node_name_suffix(name: &str) -> String {
let mut normalized = String::with_capacity(name.len());
let mut last_was_separator = false;

for ch in name.chars() {
if ch.is_ascii_alphanumeric() {
normalized.push(ch.to_ascii_lowercase());
last_was_separator = false;
} else if !last_was_separator {
normalized.push('-');
last_was_separator = true;
}
}

let mut normalized = normalized.trim_matches('-').to_string();
if normalized.is_empty() {
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
}

let max_suffix_len = KUBERNETES_MAX_NAME_LEN.saturating_sub(NODE_NAME_PREFIX.len());
if normalized.len() > max_suffix_len {
normalized.truncate(max_suffix_len);
normalized.truncate(normalized.trim_end_matches('-').len());
}

if normalized.is_empty() {
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
}

normalized
}

pub fn volume_name(name: &str) -> String {
format!("openshell-cluster-{name}")
}

pub fn network_name(name: &str) -> String {
format!("openshell-cluster-{name}")
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn node_name_normalizes_uppercase_and_underscores() {
assert_eq!(node_name("Prod_US"), "openshell-prod-us");
}

#[test]
fn node_name_collapses_and_trims_separator_runs() {
assert_eq!(node_name("._Prod..__-Gateway-."), "openshell-prod-gateway");
}

#[test]
fn node_name_falls_back_when_gateway_name_has_no_alphanumerics() {
assert_eq!(node_name("...___---"), "openshell-gateway");
}

#[test]
fn node_name_truncates_to_kubernetes_name_limit() {
let gateway_name = "A".repeat(400);
let node_name = node_name(&gateway_name);

assert!(node_name.len() <= KUBERNETES_MAX_NAME_LEN);
assert!(node_name.starts_with(NODE_NAME_PREFIX));
assert!(node_name.ends_with('a'));
}
}
59 changes: 41 additions & 18 deletions crates/openshell-bootstrap/src/docker.rs
Expand Up @@ -2,7 +2,7 @@
// SPDX-License-Identifier: Apache-2.0

use crate::RemoteOptions;
use crate::constants::{container_name, network_name, volume_name};
use crate::constants::{container_name, network_name, node_name, volume_name};
use crate::image::{self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, parse_image_ref};
use bollard::API_DEFAULT_VERSION;
use bollard::Docker;
@@ -482,6 +482,7 @@ pub async fn ensure_container(
registry_username: Option<&str>,
registry_token: Option<&str>,
device_ids: &[String],
resume: bool,
) -> Result<u16> {
let container_name = container_name(name);

@@ -491,25 +492,34 @@
.await
{
Ok(info) => {
// Container exists — verify it is using the expected image.
// Resolve the desired image ref to its content-addressable ID so we
// can compare against the container's image field (which Docker
// stores as an ID).
let desired_id = docker
.inspect_image(image_ref)
.await
.ok()
.and_then(|img| img.id);
// On resume we always reuse the existing container — the persistent
// volume holds k3s etcd state, and recreating the container with
// different env vars would cause the entrypoint to rewrite the
// HelmChart manifest, triggering a Helm upgrade that changes the
// StatefulSet image reference while the old pod still runs with the
// previous image. Reusing the container avoids this entirely.
//
// On a non-resume path we check whether the image changed and
// recreate only when necessary.
let reuse = if resume {
true
} else {
let desired_id = docker
.inspect_image(image_ref)
.await
.ok()
.and_then(|img| img.id);

let container_image_id = info.image;
let container_image_id = info.image.clone();

let image_matches = match (&desired_id, &container_image_id) {
(Some(desired), Some(current)) => desired == current,
_ => false,
match (&desired_id, &container_image_id) {
(Some(desired), Some(current)) => desired == current,
_ => false,
}
};

if image_matches {
// The container exists with the correct image, but its network
if reuse {
// The container exists and should be reused. Its network
// attachment may be stale. When the gateway is resumed after a
// container kill, `ensure_network` destroys and recreates the
// Docker network (giving it a new ID). The stopped container
@@ -543,8 +553,8 @@
tracing::info!(
"Container {} exists but uses a different image (container={}, desired={}), recreating",
container_name,
container_image_id.as_deref().map_or("unknown", truncate_id),
desired_id.as_deref().map_or("unknown", truncate_id),
info.image.as_deref().map_or("unknown", truncate_id),
image_ref,
);

let _ = docker.stop_container(&container_name, None).await;
@@ -684,6 +694,11 @@
format!("REGISTRY_HOST={registry_host}"),
format!("REGISTRY_INSECURE={registry_insecure}"),
format!("IMAGE_REPO_BASE={image_repo_base}"),
// Deterministic k3s node name so the node identity survives container
// recreation (e.g. after an image upgrade). Without this, k3s uses
// the container ID as the hostname/node name, which changes on every
// container recreate and triggers stale-node PVC cleanup.
format!("OPENSHELL_NODE_NAME={}", node_name(name)),
];
if let Some(endpoint) = registry_endpoint {
env_vars.push(format!("REGISTRY_ENDPOINT={endpoint}"));
@@ -753,6 +768,14 @@

let config = ContainerCreateBody {
image: Some(image_ref.to_string()),
// Set the container hostname to the deterministic node name.
// k3s uses the container hostname as its default node name. Without
// this, Docker defaults to the container ID (first 12 hex chars),
// which changes on every container recreation and can cause
// `clean_stale_nodes` to delete the wrong node on resume. The
// hostname persists across container stop/start cycles, ensuring a
// stable node identity.
hostname: Some(node_name(name)),
cmd: Some(cmd),
env,
exposed_ports: Some(exposed_ports),
68 changes: 41 additions & 27 deletions crates/openshell-bootstrap/src/lib.rs
@@ -314,46 +314,59 @@ where
// idempotent and will reuse the volume, create a container if needed,
// and start it)
let mut resume = false;
let mut resume_container_exists = false;
if let Some(existing) = check_existing_gateway(&target_docker, &name).await? {
if recreate {
log("[status] Removing existing gateway".to_string());
destroy_gateway_resources(&target_docker, &name).await?;
} else if existing.container_running {
log("[status] Gateway is already running".to_string());
resume = true;
resume_container_exists = true;
} else {
log("[status] Resuming gateway from existing state".to_string());
resume = true;
resume_container_exists = existing.container_exists;
}
}

// Ensure the image is available on the target Docker daemon
if remote_opts.is_some() {
log("[status] Downloading gateway".to_string());
let on_log_clone = Arc::clone(&on_log);
let progress_cb = move |msg: String| {
if let Ok(mut f) = on_log_clone.lock() {
f(msg);
}
};
image::pull_remote_image(
&target_docker,
&image_ref,
registry_username.as_deref(),
registry_token.as_deref(),
progress_cb,
)
.await?;
} else {
// Local deployment: ensure image exists (pull if needed)
log("[status] Downloading gateway".to_string());
ensure_image(
&target_docker,
&image_ref,
registry_username.as_deref(),
registry_token.as_deref(),
)
.await?;
// Ensure the image is available on the target Docker daemon.
// When both the container and volume exist we can skip the pull entirely
// — the container already references a valid local image. This avoids
// failures when the original image tag (e.g. a local-only
// `openshell/cluster:dev`) is not available from the default registry.
//
// When only the volume survives (container was removed), we still need
// the image to recreate the container, so the pull must happen.
let need_image = !resume || !resume_container_exists;
if need_image {
if remote_opts.is_some() {
log("[status] Downloading gateway".to_string());
let on_log_clone = Arc::clone(&on_log);
let progress_cb = move |msg: String| {
if let Ok(mut f) = on_log_clone.lock() {
f(msg);
}
};
image::pull_remote_image(
&target_docker,
&image_ref,
registry_username.as_deref(),
registry_token.as_deref(),
progress_cb,
)
.await?;
} else {
// Local deployment: ensure image exists (pull if needed)
log("[status] Downloading gateway".to_string());
ensure_image(
&target_docker,
&image_ref,
registry_username.as_deref(),
registry_token.as_deref(),
)
.await?;
}
}

// All subsequent operations use the target Docker (remote or local)
@@ -444,6 +457,7 @@
registry_username.as_deref(),
registry_token.as_deref(),
&device_ids,
resume,
)
.await?;
let port = actual_port;
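The `need_image` decision in the `lib.rs` hunk above reduces to a two-flag truth table. A standalone sketch under the same semantics (the free function here is illustrative; in the PR the expression is computed inline):

```rust
// Mirror of the image-pull decision: pull unless we are resuming AND the
// old container still exists (in which case it already references a
// valid local image, which may not be pullable from any registry).
fn need_image(resume: bool, resume_container_exists: bool) -> bool {
    !resume || !resume_container_exists
}

fn main() {
    assert!(need_image(false, false)); // fresh deploy: always pull
    assert!(need_image(true, false));  // volume survived, container gone: pull to recreate
    assert!(!need_image(true, true));  // full resume: skip the pull entirely
    println!("ok");
}
```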
45 changes: 32 additions & 13 deletions crates/openshell-bootstrap/src/runtime.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

use crate::constants::{KUBECONFIG_PATH, container_name};
use crate::constants::{KUBECONFIG_PATH, container_name, node_name};
use bollard::Docker;
use bollard::container::LogOutput;
use bollard::exec::CreateExecOptions;
@@ -385,11 +385,19 @@ pub async fn clean_stale_nodes(docker: &Docker, name: &str) -> Result<usize> {
let container_name = container_name(name);
let mut stale_nodes: Vec<String> = Vec::new();

// Determine the current node name. With the deterministic `--node-name`
// entrypoint change the k3s node is `openshell-{gateway}`. However, older
// cluster images (built before that change) still use the container hostname
// (= Docker container ID) as the node name. We must handle both:
//
// 1. If the expected deterministic name appears in the node list, use it.
// 2. Otherwise fall back to the container hostname (old behaviour).
//
// This ensures backward compatibility during upgrades where the bootstrap
// CLI is newer than the cluster image.
let deterministic_node = node_name(name);

for attempt in 1..=MAX_ATTEMPTS {
// List ALL node names and the container's own hostname. Any node that
// is not the current container is stale — we cannot rely on the Ready
// condition because k3s may not have marked the old node NotReady yet
// when this runs shortly after container start.
let (output, exit_code) = exec_capture_with_exit(
docker,
&container_name,
@@ -406,16 +414,27 @@
.await?;

if exit_code == 0 {
// Determine the current node name (container hostname).
let (hostname_out, _) =
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
.await?;
let current_hostname = hostname_out.trim().to_string();

stale_nodes = output
let all_nodes: Vec<&str> = output
.lines()
.map(str::trim)
.filter(|l| !l.is_empty() && *l != current_hostname)
.filter(|l| !l.is_empty())
.collect();

// Pick the current node identity: prefer the deterministic name,
// fall back to the container hostname for older cluster images.
let current_node = if all_nodes.contains(&deterministic_node.as_str()) {
deterministic_node.clone()
} else {
// Older cluster image without --node-name: read hostname.
let (hostname_out, _) =
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
.await?;
hostname_out.trim().to_string()
};

stale_nodes = all_nodes
.into_iter()
.filter(|n| *n != current_node)
.map(ToString::to_string)
.collect();
break;
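The node-selection logic in `clean_stale_nodes` above (prefer the deterministic `--node-name`, fall back to the container hostname on older cluster images) can be extracted into a pure function. A sketch with illustrative node names; the `stale_nodes` helper does not exist in the PR, which interleaves this logic with Docker exec calls:

```rust
// Given the full node list, pick the current node identity and return
// everything else as stale. Prefers the deterministic name; falls back
// to the container hostname for cluster images built before --node-name.
fn stale_nodes(all_nodes: &[&str], deterministic: &str, hostname: &str) -> Vec<String> {
    let current = if all_nodes.contains(&deterministic) {
        deterministic
    } else {
        hostname
    };
    all_nodes
        .iter()
        .filter(|n| **n != current)
        .map(|n| n.to_string())
        .collect()
}

fn main() {
    // New image: deterministic node present, old container-ID node is stale.
    assert_eq!(
        stale_nodes(&["openshell-prod", "a1b2c3d4e5f6"], "openshell-prod", "a1b2c3d4e5f6"),
        vec!["a1b2c3d4e5f6"]
    );
    // Old image: only the hostname node exists, so nothing is stale.
    assert!(stale_nodes(&["a1b2c3d4e5f6"], "openshell-prod", "a1b2c3d4e5f6").is_empty());
    println!("ok");
}
```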