
fix(gpu): add Tegra/Jetson GPU support#625

Open
elezar wants to merge 8 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
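For reference, the additional GIDs emitted by the nvcdi API bump land in the CDI spec's `containerEdits.additionalGIDs` field. A minimal illustrative fragment (values and version string assumed, not taken from this PR) might look like:

```json
{
  "cdiVersion": "0.7.0",
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "additionalGIDs": [44]
  },
  "devices": []
}
```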
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
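The snapshot-and-merge dance described in this commit can be sketched in Python roughly as below. This is a simplified illustration, not the PR's implementation; `drop_privileges` and `merge_supplemental_gids` are assumed names, and GID 44/video is the Tegra example from the commit message.

```python
import os

def merge_supplemental_gids(injected, user_gids):
    """Union of CDI-injected GIDs and the user's own groups, minus GID 0
    (root) so no root-group privilege is retained across the drop."""
    return sorted((set(injected) | set(user_gids)) - {0})

def drop_privileges(username, uid, gid):
    injected = os.getgroups()          # snapshot before initgroups wipes them
    os.initgroups(username, gid)       # resets groups from /etc/group
    os.setgroups(merge_supplemental_gids(injected, os.getgroups()))
    os.setgid(gid)                     # order matters: drop gid before uid
    os.setuid(uid)
```

The merge is kept as a pure function so the root-exclusion behavior can be tested without root privileges.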
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
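The PATH fallback from this commit amounts to something like the sketch below (illustrative; the e2e test's real helper is not shown on this page):

```python
import shutil

def find_nvidia_smi():
    """Prefer nvidia-smi from PATH; on Jetson/Tegra it is installed at
    /usr/sbin/nvidia-smi, which is often not on the sandbox's default PATH."""
    return shutil.which("nvidia-smi") or "/usr/sbin/nvidia-smi"
```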
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to merge this as well.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it, and have tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will pull it into this branch and test again. I'm getting a lease on colossus for a Jetson-based system.

It's very likely some policy updates will be required: with #677 now merged, Landlock policies are applied correctly, whereas before that they were not applied in many contexts.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 2, 2026