bug(containerd/arm64): Intermittent layer export failure during Docker build inside K3s on DGX Spark (aarch64) #857

@jyaunches

Summary

Docker image builds performed inside the OpenShell K3s gateway intermittently fail with a containerd layer export error on an arm64 DGX Spark. The failure occurs at different build steps each run (non-deterministic), suggesting a race condition in containerd's content store on this architecture.

The error pattern is:

Error: × Docker build stream error
╰─▶ Docker stream error: failed to export layer: CreateDiff: mount callback
    failed on /var/lib/containerd/tmpmounts/containerd-mountXXXXX:
    mount callback failed on /var/lib/containerd/tmpmounts/containerd-mountXXXXX:
    failed to commit: rename /var/lib/containerd/io.containerd.content.v1.content/
    ingest/<hash>/data
    /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/<hash>:
    no such file or directory

Environment

  • Hardware: NVIDIA DGX Spark (Founders Edition), GB10, 128 GB unified memory
  • Architecture: aarch64
  • OS: Ubuntu 24.04.4 LTS (Noble Numbat)
  • Docker: 29.2.1
  • NVIDIA Container Toolkit: 1.19.0
  • OpenShell version: 0.0.26
  • K3s version (inside gateway): v1.35.3+k3s1
  • Containerd version (inside gateway): v2.2.2-k3s1

Reproduction Steps

  1. Set up NemoClaw on a DGX Spark (aarch64):

    git clone https://github.com/NVIDIA/NemoClaw.git
    cd NemoClaw
    NEMOCLAW_NON_INTERACTIVE=1 NEMOCLAW_SANDBOX_NAME=percy \
      NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=nemotron-3-super:120b \
      bash install.sh

  2. Onboarding triggers a Docker image build inside the K3s gateway (45-step Dockerfile).

  3. The build fails intermittently at varying steps (observed at steps 18, 25, and 42 across different attempts) with the "CreateDiff ... failed to commit" error shown above.

  4. Retrying the same build sometimes succeeds, sometimes fails at a different step.
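
One way to put a number on the failure rate is to repeat the build and count how often the commit error appears. The sketch below is a hypothetical helper (count_commit_failures is not part of OpenShell or NemoClaw). It assumes the build can be re-triggered as a single command and that the error text reaches stdout or stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command N times and tally how many attempts
# fail with the "failed to commit: rename" signature from this report.
# Usage: count_commit_failures <attempts> <command> [args...]
count_commit_failures() {
  local attempts=$1
  shift
  local successes=0 commit_failures=0 other_failures=0
  local log i
  log=$(mktemp)
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >"$log" 2>&1; then
      successes=$((successes + 1))
    elif grep -q 'failed to commit: rename' "$log"; then
      # Non-zero exit and the log contains the containerd commit error.
      commit_failures=$((commit_failures + 1))
    else
      # Non-zero exit for some unrelated reason.
      other_failures=$((other_failures + 1))
    fi
  done
  rm -f "$log"
  echo "attempts=$attempts successes=$successes commit_failures=$commit_failures other_failures=$other_failures"
}
```

For instance, `count_commit_failures 5 bash install.sh`, with the environment variables from step 1 exported, would be one way to repeat the run, assuming install.sh can safely be re-run on the same machine.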

Observations

  • The error is raised at "failed to commit: rename .../ingest/.../data .../blobs/sha256/...": containerd's content store fails to move a finished layer blob from its ingest directory to its final blobs/sha256 location.
  • The failing step is non-deterministic: it hits different Dockerfile steps across runs, which points to a race condition rather than a problem with one specific layer.
  • The same Dockerfile builds reliably on x86_64 GitHub Actions runners (ubuntu-latest).
  • A gateway restart (openshell gateway destroy && openshell gateway start) sometimes clears the state and allows a subsequent build to succeed, but this creates TLS cert mismatch issues (see OpenShell #856, "bug(cli): Gateway recreate invalidates CLI TLS trust: sandbox create fails with BadSignature, no auto-recovery").
  • 3 out of 5 consecutive build attempts failed during our testing. The 2 successes built the full 45-step image without error.
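
The "no such file or directory" message alone does not say which path was missing. rename(2) reports ENOENT both when the source file is gone and when a directory component of the destination does not exist. The sketch below illustrates the two cases in a scratch directory. It does not touch or reproduce containerd's real store, try_commit is a hypothetical name, and the after-the-fact checks are themselves racy, so treat the result as a hint only.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the ingest -> blobs rename in a scratch directory
# and reports which side was missing when the move fails.
# Usage: try_commit <source> <destination>
try_commit() {
  local src=$1 dst=$2
  if mv -- "$src" "$dst" 2>/dev/null; then
    echo "committed"
  elif [ ! -e "$src" ]; then
    # The source vanished (or never existed) before the rename.
    echo "source missing"
  elif [ ! -d "$(dirname -- "$dst")" ]; then
    # The destination's parent directory does not exist.
    echo "destination directory missing"
  else
    echo "other error"
  fi
}
```

Knowing which of the two cases applies on the DGX Spark (ingest data removed early, or the blobs/sha256 directory absent) would narrow down where a race could occur.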

Expected Behavior

Docker builds inside the K3s gateway should complete reliably on arm64.

Possible Causes

  • Containerd's content store may have a race condition on arm64 kernels
  • The overlay storage layer (Docker's overlay2 driver or containerd's overlayfs snapshotter) may behave differently on aarch64 with this kernel version
  • K3s's embedded containerd (v2.2.2-k3s1) may have an arm64-specific issue
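
To compare the failing arm64 host against a working x86_64 one, it may help to record the same few facts in both environments. The sketch below is a hypothetical helper (collect_store_facts is not an existing tool). It assumes GNU coreutils stat and that it is run wherever /var/lib/containerd is visible, for example inside the gateway container.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print architecture, kernel release, and the filesystem
# type backing a directory (defaults to containerd's root directory).
# Usage: collect_store_facts [path]
collect_store_facts() {
  local path=${1:-/var/lib/containerd}
  local fstype
  # GNU stat: -f queries the filesystem, %T prints its type in readable form.
  fstype=$(stat -f -c %T -- "$path" 2>/dev/null) || fstype=unknown
  echo "arch=$(uname -m)"
  echo "kernel=$(uname -r)"
  echo "path=$path"
  echo "fstype=$fstype"
}
```

Diffing the output from both machines would show whether the kernel or the backing filesystem differs, which speaks to the first two possible causes above.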

Impact

This blocks NemoClaw onboarding on DGX Spark: users cannot create their first sandbox without retrying multiple times. Combined with #856 (TLS cert mismatch after gateway recreate), the retry cycle can become unrecoverable without manual intervention.
