Why I'm opening this
I hit this while trying to get openshell gateway start working on macOS with Colima.
At first I suspected local networking, because I run Tailscale and another machine had briefly advertised 10.43.0.0/16. I turned that off and retried, but the failure did not change.
What I consistently saw was: the gateway container would start, I would wait for the inner cluster to come up, I would look at the logs, and startup would eventually end with:
Error: gateway container does not expose a health check
What I tried
I went looking for why bootstrap was failing on that exact error.
First, I inspected the published image:
docker image inspect ghcr.io/nvidia/openshell/cluster:0.0.30 --format '{{json .Config.Healthcheck}}'
That returned null for me. I saw the same for ghcr.io/nvidia/openshell/cluster:latest and :0.0.29.
Then I checked the source. crates/openshell-bootstrap/src/runtime.rs treats a missing container healthcheck as a fatal error, and deploy/docker/Dockerfile.images currently defines a HEALTHCHECK for the cluster image.
Because of that, I tried the obvious workaround: build a local image override with a Docker HEALTHCHECK and point OpenShell at it. I tried that because the runtime seemed to require health metadata and the published image did not seem to have it. That did not fully resolve the problem for me; I still ended up on the same error path.
So I am not claiming the whole root cause is definitely "the published image is missing a healthcheck". What I am confident about is that there is at least a mismatch or misleading failure path here:
- bootstrap aborts on missing healthcheck metadata
- the published cluster image I pulled does not expose that metadata
- source currently defines a healthcheck
- adding one locally did not make the error path go away cleanly
Minimal details
Environment:
- macOS
- Colima for Docker and local K3s
openshell 0.0.30
Repro:
openshell gateway start -vv
Observed failure:
Error: gateway container does not expose a health check
Relevant references:
- runtime fatal path:
crates/openshell-bootstrap/src/runtime.rs
- cluster image definition:
deploy/docker/Dockerfile.images
- published image metadata on my machine:
Config.Healthcheck = null
My guess is that one of these is true:
- the published
ghcr.io/nvidia/openshell/cluster image is not carrying the expected healthcheck metadata, or
- bootstrap is hitting this error for a more subtle reason and the message is misleading.
Either way, the current failure path made this much harder to diagnose than it needed to be.
Why I'm opening this
I hit this while trying to get
openshell gateway startworking on macOS with Colima.At first I suspected local networking, because I run Tailscale and another machine had briefly advertised
10.43.0.0/16. I turned that off and retried, but the failure did not change.What I consistently saw was: the gateway container would start, I would wait for the inner cluster to come up, I would look at the logs, and startup would eventually end with:
What I tried
I went looking for why bootstrap was failing on that exact error.
First, I inspected the published image:
docker image inspect ghcr.io/nvidia/openshell/cluster:0.0.30 --format '{{json .Config.Healthcheck}}'That returned
nullfor me. I saw the same forghcr.io/nvidia/openshell/cluster:latestand:0.0.29.Then I checked the source.
crates/openshell-bootstrap/src/runtime.rstreats a missing container healthcheck as a fatal error, anddeploy/docker/Dockerfile.imagescurrently defines aHEALTHCHECKfor the cluster image.Because of that, I tried the obvious workaround: build a local image override with a Docker
HEALTHCHECKand point OpenShell at it. I tried that because the runtime seemed to require health metadata and the published image did not seem to have it. That did not fully resolve the problem for me; I still ended up on the same error path.So I am not claiming the whole root cause is definitely "the published image is missing a healthcheck". What I am confident about is that there is at least a mismatch or misleading failure path here:
Minimal details
Environment:
openshell 0.0.30Repro:
Observed failure:
Relevant references:
crates/openshell-bootstrap/src/runtime.rsdeploy/docker/Dockerfile.imagesConfig.Healthcheck = nullMy guess is that one of these is true:
ghcr.io/nvidia/openshell/clusterimage is not carrying the expected healthcheck metadata, orEither way, the current failure path made this much harder to diagnose than it needed to be.