Skip to content

bug: published cluster image lacks Docker healthcheck, but bootstrap requires one #852

@Eli-Goldberg

Description

@Eli-Goldberg

Why I'm opening this

I hit this while trying to get openshell gateway start working on macOS with Colima.

At first I suspected local networking, because I run Tailscale and another machine had briefly advertised 10.43.0.0/16. I turned that off and retried, but the failure did not change.

What I consistently saw was: the gateway container would start, I would wait for the inner cluster to come up, I would look at the logs, and startup would eventually end with:

Error: gateway container does not expose a health check

What I tried

I went looking for why bootstrap was failing on that exact error.

First, I inspected the published image:

docker image inspect ghcr.io/nvidia/openshell/cluster:0.0.30 --format '{{json .Config.Healthcheck}}'

That returned null for me. I saw the same for ghcr.io/nvidia/openshell/cluster:latest and :0.0.29.

Then I checked the source. crates/openshell-bootstrap/src/runtime.rs treats a missing container healthcheck as a fatal error, and deploy/docker/Dockerfile.images currently defines a HEALTHCHECK for the cluster image.

Because of that, I tried the obvious workaround: build a local image override with a Docker HEALTHCHECK and point OpenShell at it. I tried that because the runtime seemed to require health metadata and the published image did not seem to have it. That did not fully resolve the problem for me; I still ended up on the same error path.

So I am not claiming the whole root cause is definitely "the published image is missing a healthcheck". What I am confident about is that there is at least a mismatch or misleading failure path here:

  • bootstrap aborts on missing healthcheck metadata
  • the published cluster image I pulled does not expose that metadata
  • source currently defines a healthcheck
  • adding one locally did not make the error path go away cleanly

Minimal details

Environment:

  • macOS
  • Colima for Docker and local K3s
  • openshell 0.0.30

Repro:

openshell gateway start -vv

Observed failure:

Error: gateway container does not expose a health check

Relevant references:

  • runtime fatal path: crates/openshell-bootstrap/src/runtime.rs
  • cluster image definition: deploy/docker/Dockerfile.images
  • published image metadata on my machine: Config.Healthcheck = null

My guess is that one of these is true:

  1. the published ghcr.io/nvidia/openshell/cluster image is not carrying the expected healthcheck metadata, or
  2. bootstrap is hitting this error for a more subtle reason and the message is misleading.

Either way, the current failure path made this much harder to diagnose than it needed to be.

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions