
fix(gpu): add Tegra/Jetson GPU support#625

Open
elezar wants to merge 8 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
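For reference, the additional GIDs emitted by the nvcdi API bump land in the CDI spec's `containerEdits.additionalGIDs` field. A minimal illustrative fragment (values and version string assumed, not taken from this PR) might look like:

```json
{
  "cdiVersion": "0.7.0",
  "kind": "nvidia.com/gpu",
  "containerEdits": {
    "additionalGIDs": [44]
  },
  "devices": []
}
```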
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
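The snapshot-and-merge dance described in this commit can be sketched in Python roughly as below. This is a simplified illustration, not the PR's implementation; `drop_privileges` and `merge_supplemental_gids` are assumed names, and GID 44/video is the Tegra example from the commit message.

```python
import os

def merge_supplemental_gids(injected, user_gids):
    """Union of CDI-injected GIDs and the user's own groups, minus GID 0
    (root) so no root-group privilege is retained across the drop."""
    return sorted((set(injected) | set(user_gids)) - {0})

def drop_privileges(username, uid, gid):
    injected = os.getgroups()          # snapshot before initgroups wipes them
    os.initgroups(username, gid)       # resets groups from /etc/group
    os.setgroups(merge_supplemental_gids(injected, os.getgroups()))
    os.setgid(gid)                     # order matters: drop gid before uid
    os.setuid(uid)
```

The merge is kept as a pure function so the root-exclusion behavior can be tested without root privileges.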
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
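The PATH fallback from this commit amounts to something like the sketch below (illustrative; the e2e test's real helper is not shown on this page):

```python
import shutil

def find_nvidia_smi():
    """Prefer nvidia-smi from PATH; on Jetson/Tegra it is installed at
    /usr/sbin/nvidia-smi, which is often not on the sandbox's default PATH."""
    return shutil.which("nvidia-smi") or "/usr/sbin/nvidia-smi"
```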
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to merge this as well.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it, and have tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will pull it into this branch and test again. I'm getting a lease on colossus for a Jetson-based system.

It's very likely some policy updates will be required: with #677 now merged, Landlock policies are applied correctly, whereas before that they were not applied in many contexts.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 2, 2026