Conversation
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when it exists, so the nvidia runtime running inside k3s can apply the same host-file injection config as on the host — required for Jetson/Tegra platforms. Signed-off-by: Evan Lezar <elezar@nvidia.com>
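The conditional bind-mount described above can be sketched as follows. This is a minimal illustration, not the actual gateway code; the `gateway_mounts` helper and the mount-dict shape are assumptions for the example:

```python
import os

# Host config dir injected by the nvidia container runtime on Jetson/Tegra.
HOST_FILES_DIR = "/etc/nvidia-container-runtime/host-files-for-container.d"

def gateway_mounts(extra=()):
    """Build the gateway container's bind mounts, appending the nvidia
    host-files dir (read-only) only when it exists on the host."""
    mounts = list(extra)
    if os.path.isdir(HOST_FILES_DIR):
        mounts.append({
            "source": HOST_FILES_DIR,
            "target": HOST_FILES_DIR,  # same path inside the container
            "options": ["bind", "ro"],  # read-only, per the change above
        })
    return mounts
```

On non-Tegra hosts the directory is typically absent, so the mount is simply skipped and behavior is unchanged.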
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for mounting /etc/nvidia-container-runtime/host-files-for-container.d into the device plugin pod, required for correct CDI spec generation on Tegra-based systems. Also included is an nvcdi API bump that ensures that additional GIDs are included in the generated CDI spec. Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries from /etc/group, discarding GIDs injected by the container runtime via CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the container-level GIDs before initgroups runs and merge them back afterwards, excluding GID 0 (root) to avoid privilege retention. Signed-off-by: Evan Lezar <elezar@nvidia.com>
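The snapshot-and-merge logic around `initgroups` can be sketched like this. The merge itself is a pure function; the privileged calls are shown only as comments since they require root. The helper name `merged_groups` is an assumption for illustration:

```python
def merged_groups(snapshot, after_initgroups):
    """Merge container-injected GIDs (snapshotted before initgroups)
    back into the post-initgroups set, excluding GID 0 so root group
    membership is never retained across the privilege drop."""
    injected = set(snapshot) - {0}
    return sorted(set(after_initgroups) | injected)

# Around the actual privilege drop (illustrative, needs root):
#   before = os.getgroups()        # includes CDI-injected GIDs, e.g. 44 (video)
#   os.initgroups(user, gid)       # resets groups to /etc/group entries only
#   os.setgroups(merged_groups(before, os.getgroups()))
```

For example, if CDI injected GID 44 (`video`) and `initgroups` reset the set to the user's `/etc/group` entries, the merge restores 44 while still dropping GID 0.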
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox. Fall back to the full path when the bare command is not found. Signed-off-by: Evan Lezar <elezar@nvidia.com>
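The `PATH`-then-fallback lookup can be sketched as below; the function name is hypothetical and the real test may implement this differently:

```python
import shutil

def nvidia_smi_path():
    """Resolve nvidia-smi: prefer a PATH lookup, falling back to the
    /usr/sbin location used on Jetson/Tegra systems."""
    found = shutil.which("nvidia-smi")
    return found if found else "/usr/sbin/nvidia-smi"
```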
cc @johnnynunez

LGTM @elezar

Yes, I know. I was tracking it, and I tested it.
I dug into the GID-preservation change here and I think PR #710 may make it unnecessary. What I verified locally:
If that holds generally, then once #710 adds the needed GPU device paths, I think we should re-check whether the GID-preservation change is still necessary.
Follow-up: I removed the checked-in custom image. If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

```shell
openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n repository: ghcr.io/nvidia/k8s-device-plugin\n tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n enabled: false\nnfd:\n enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.
Once #710 is reviewed and merged, I will add it here and test again. I'm getting a lease on colossus for a Jetson-based system. It's very likely the policy will need some updates: before #677 (now merged), landlock policies were not correctly applied in many contexts.
Summary
Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.
Related Issue
Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.
Upstream PRs:
Changes
- Bind-mount `/etc/nvidia-container-runtime/host-files-for-container.d` (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host, required for Jetson/Tegra CDI spec generation
- Update the device plugin image to `ghcr.io/nvidia/k8s-device-plugin:2ab68c16`, whose nvcdi API bump ensures `additionalGids` in the CDI spec (GID 44, `video`, required for `/dev/nvmap` access on Tegra)
- Preserve CDI-injected GIDs across `initgroups()` during privilege drop, so exec'd processes retain access to GPU devices
- Fall back to `/usr/sbin/nvidia-smi` in the GPU e2e test for Tegra systems where `nvidia-smi` is not on the default `PATH`

Testing

- `mise run pre-commit` passes

Checklist