Docker image builds performed inside the OpenShell K3s gateway intermittently fail with a containerd layer export error on an arm64 DGX Spark. The failure occurs at different build steps each run (non-deterministic), suggesting a race condition in containerd's content store on this architecture.
The error pattern is:
Error: × Docker build stream error
╰─▶ Docker stream error: failed to export layer: CreateDiff: mount callback
failed on /var/lib/containerd/tmpmounts/containerd-mountXXXXX:
mount callback failed on /var/lib/containerd/tmpmounts/containerd-mountXXXXX:
failed to commit: rename /var/lib/containerd/io.containerd.content.v1.content/
ingest/<hash>/data
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/<hash>:
no such file or directory
Onboard triggers a Docker image build inside the K3s gateway (45-step Dockerfile).
Build fails intermittently at various steps (observed at steps 18, 25, 42 across different attempts) with the CreateDiff: failed to commit error above.
Retrying the same build sometimes succeeds, sometimes fails at a different step.
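Because retrying the same build sometimes succeeds, a bounded retry loop can act as a stopgap while the root cause is investigated. The following is a minimal sketch, not part of NemoClaw or OpenShell: the function name retry_build and the attempt limit are hypothetical, and the command being retried is whatever triggers the build in your setup.

```shell
#!/bin/sh
# Hypothetical helper (not provided by NemoClaw or OpenShell):
# run a command up to MAX_ATTEMPTS times, stopping at the first success.
# Usage: retry_build MAX_ATTEMPTS COMMAND [ARGS...]
retry_build() {
    max_attempts="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            echo "build succeeded on attempt $attempt"
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    # Every attempt failed; report failure to the caller.
    return 1
}
```

For example, retry_build 5 bash install.sh would attempt the onboard up to five times, assuming the install script is safe to re-run. This only masks the failure and does not address the underlying commit error.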
Observations
The failing operation is "failed to commit: rename .../ingest/.../data .../blobs/sha256/...": containerd's content store is failing to atomically commit a layer blob.
The failure step is non-deterministic — it hits different Dockerfile steps across runs, suggesting a race condition rather than a specific layer being problematic.
The same Dockerfile builds reliably on x86_64 GitHub Actions runners (ubuntu-latest).
Three of five consecutive build attempts failed during our testing. The two successful attempts built the full 45-step image without error.
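The commit step named in the error follows a common write-then-rename pattern: data is written under an ingest path and then renamed to a path named after its SHA-256 digest. The sketch below imitates that pattern in a throwaway directory to show one way a rename can end in "no such file or directory", namely when the source file is already gone. It is an illustration only, not containerd's code; the directory layout is simplified, the name "example" is hypothetical, and the report does not establish which path was missing or why.

```shell
#!/bin/sh
# Illustration only: mimics the ingest -> blobs rename from the error message,
# using a temporary directory instead of /var/lib/containerd.
root="$(mktemp -d)"
mkdir -p "$root/ingest/example" "$root/blobs/sha256"

# 1. Write the blob to a temporary ingest location.
printf 'layer-bytes' > "$root/ingest/example/data"

# 2. Name the final blob after the SHA-256 digest of its content.
#    (sha256sum is assumed to be available, as on typical Linux systems.)
digest="$(sha256sum "$root/ingest/example/data" | cut -d ' ' -f 1)"

# 3. Commit with a single rename, so readers see either no blob or the whole blob.
if mv "$root/ingest/example/data" "$root/blobs/sha256/$digest"; then
    first_commit="ok"
else
    first_commit="failed"
fi

# 4. Attempt the same commit again. The source file was moved in step 3,
#    so this rename fails because the source path no longer exists.
if mv "$root/ingest/example/data" "$root/blobs/sha256/$digest" 2>/dev/null; then
    second_commit="ok"
else
    second_commit="failed"
fi

echo "first commit: $first_commit"
echo "second commit: $second_commit"
```

If two actors ever tried to commit or clean up the same ingest entry, the loser would see a similar error. Whether that is what happens here remains a hypothesis.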
Expected Behavior
Docker builds inside the K3s gateway should complete reliably on arm64.
Possible Causes
Containerd's content store may have a race condition on arm64 kernels
The overlay2/overlayfs snapshotter may behave differently on aarch64 with this kernel version
K3s's embedded containerd (v2.2.2-k3s1) may have an arm64-specific issue
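Each of these candidate causes depends on the architecture, kernel, or snapshotter in use, so it may help to record those facts alongside every failed build. The following is a minimal sketch using standard Linux tools; the output labels are arbitrary, and the script describes whichever environment it runs in, so it would need to be run inside the gateway to describe the gateway itself.

```shell
#!/bin/sh
# Collect basic facts relevant to the suspected causes.
arch="$(uname -m)"      # machine architecture, e.g. aarch64 or x86_64
kernel="$(uname -r)"    # running kernel release

# /proc/filesystems lists the filesystem types currently registered
# with the running kernel; overlay appears there once it is available.
if grep -qw overlay /proc/filesystems 2>/dev/null; then
    overlay="available"
else
    overlay="not listed"
fi

echo "arch=$arch"
echo "kernel=$kernel"
echo "overlayfs=$overlay"
```

Comparing this output between the DGX Spark and an x86_64 runner where the build is reliable would show whether the kernel release differs as well as the architecture.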
Impact
This blocks NemoClaw onboard on DGX Spark — users cannot create their first sandbox without retrying multiple times. Combined with #856 (TLS cert mismatch after gateway recreate), the retry cycle can become unrecoverable without manual intervention.
Environment
Reproduction Steps
Set up NemoClaw on a DGX Spark (aarch64):
git clone https://github.com/NVIDIA/NemoClaw.git
cd NemoClaw
NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_SANDBOX_NAME=percy \
NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=nemotron-3-super:120b \
bash install.sh
The onboard step then triggers the Docker image build inside the K3s gateway, which fails intermittently with the CreateDiff: failed to commit error shown earlier.
Additional observation: recreating the gateway (openshell gateway destroy && openshell gateway start) sometimes clears the state and allows a subsequent build to succeed, but this creates TLS cert mismatch issues (see OpenShell #856, "bug(cli): Gateway recreate invalidates CLI TLS trust — sandbox create fails with BadSignature, no auto-recovery").