chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 #965
Walkthrough
This PR upgrades the GPU Operator Helm chart to v26.3.1 across overlays, pins the default driver version to 580.126.20, disables RDMA by default in component values and adds ccManager.enabled: false, adjusts GB200 overlay driver pins (inference and training differ), introduces AKS-specific RDMA host-MOFED settings, updates a test assertion to disable RDMA for EKS training, and syncs the container-images documentation.
Estimated code review effort: 3 (Moderate), ~20 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 59-80: Add an automated CI validation that fails the PR if the
aicr.nvidia.com/gpu-operator-chart-version annotation in the values.yaml (under
controller and kubeletPlugin) is not equal to the current gpu-operator chart
version; implement the check as a simple script or GitHub Action step that reads
the gpu-operator chart version (from the chart's Chart.yaml or a central source
of truth) and compares it to the annotation value, returning a non-zero exit
code with a clear message when mismatched so authors must update the annotation
(track this automation work and reference issue `#973` in the CI job description).
📒 Files selected for processing (11)
- docs/user/container-images.md
- recipes/components/gpu-operator/values-aks-training.yaml
- recipes/components/gpu-operator/values-aks.yaml
- recipes/components/gpu-operator/values.yaml
- recipes/components/nvidia-dra-driver-gpu/values.yaml
- recipes/overlays/aks.yaml
- recipes/overlays/base.yaml
- recipes/overlays/gb200-eks-inference.yaml
- recipes/overlays/gb200-eks-training.yaml
- recipes/overlays/oke.yaml
- tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
💤 Files with no reviewable changes (2)
- recipes/overlays/gb200-eks-training.yaml
- recipes/overlays/gb200-eks-inference.yaml
Deployed and validated end-to-end on three clusters:
Rebased onto latest |
…126.20 (NVIDIA#894)

Lift every leaf recipe to the current upstream stable gpu-operator chart (v26.3.1, published 2026-04-18). The registry already defaulted to v26.3.1, but base.yaml shadowed it with v25.10.1 and the aks/oke overlays held v26.3.0, so leaves actually shipped v25.10.1 / v26.3.0 images even though the BOM advertised v26.3.1.

- recipes/overlays/base.yaml: v25.10.1 -> v26.3.1
- recipes/overlays/aks.yaml: v26.3.0 -> v26.3.1
- recipes/overlays/oke.yaml: v26.3.0 -> v26.3.1

Driver pin moves to NVIDIA's v26.3.1-qualified version (580.126.20). This also matches the GB200+EFA floor, so the per-overlay driver.version: 580.126.20 in gb200-eks-{training, inference}.yaml becomes redundant and is dropped; kernelModuleConfig stays.

Upstream v26.3.x flipped ccManager to enabled=true,defaultMode=on by default. Pin ccManager.enabled: false in the global values.yaml so we do not silently turn on Confidential Compute Manager on clusters without CC-capable hardware. Revisit when AICR adds explicit CC support.

Flip driver.rdma.enabled global default true -> false. v26.3.1's driver-validation init container now hard-gates on lsmod | grep nvidia_peermem; the module only loads against Mellanox MOFED symbols, so AWS EFA (EKS p4d/p5/p5e) and Linode hosts trap the validator forever even though NCCL on EFA uses aws-ofi-nccl / libfabric and does not need nvidia_peermem.

Re-enable rdma explicitly in components/gpu-operator/values-aks.yaml and values-aks-training.yaml: AKS overlays deploy network-operator, which installs MOFED on ND-series InfiniBand nodes, so peermem has Mellanox symbols to bind against. Set driver.rdma.useHostMofed: true alongside, so the driver container binds nvidia_peermem against the network-operator-installed host MOFED instead of building its own bundled MOFED inside the container. GKE-cos / OKE / kind keep driver.enabled: false (host-managed driver) so the flag is moot; EKS / LKE inherit the new safe default. cuj1-training chainsaw assertion updated to match.

Add gpu-operator-chart-version pod annotation to the DRA driver in recipes/components/nvidia-dra-driver-gpu/values.yaml (kubeletPlugin and controller). gpu-operator's k8s-driver-manager reloads the host NVIDIA kernel modules during a driver bump but does NOT restart the sibling nvidia-dra-driver-gpu DaemonSet because its chart template is unchanged. The DRA kubelet plugin loads libnvidia-ml.so at pod start and pins to the running driver version, so a kernel-module reload leaves the pod with a stale NVML handle; CDI spec generation then fails with Driver/library version mismatch and DRA-allocated workloads stay in ContainerCreating. Bumping the annotation value on every gpu-operator chart bump forces a DaemonSet re-roll across every deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against the now-running driver. The deploy.sh Helm-deployer template already restarts the kubelet plugin post-install; the annotation closes the gap for GitOps-deployer artifacts. Manual coupling of the annotation value to the chart version is a known maintenance gap tracked in issue NVIDIA#973 (bundler-derived annotation as the durable fix). Verified live on aicr2: applying the new bundle rolled both DRA pods cleanly, and aicr validate --phase performance passed end-to-end (inference-perf 39.7k tok/s, TTFT p99 122ms).

Gate the existing post-install DRA kubelet-plugin restart on gpu-operator's per-node driver migration completing. The annotation-based re-roll above and the existing kubectl rollout restart in deploy.sh both fire at helm-upgrade time, but k8s-driver-manager runs the per-node module reload asynchronously after `helm upgrade gpu-operator` returns. On a multi-GPU-node cluster, the DRA plugin can re-roll on a node whose driver migration has not yet started, get its NVML handle stuck to the pre-migration state, and produce "invalid CDI Spec: empty device edits" once the modules reload underneath it.

Reproduced live on a GB200 EKS cluster (yljtrxpmzu) during PR validation: the chart-version annotation re-rolled the DRA pods correctly, but the second GB200 node's driver migration ran after that, leaving its kubelet-plugin pod with a stale NVML view. Adding a kubectl wait for nvidia.com/gpu-driver-upgrade-state=upgrade-done on every GPU node before the existing post-install rollout-restart closes this timing race for the default Helm-deployer flow. See NVIDIA#973 for broader cross-deployer coverage (a Helm post-install hook on gpu-operator-post would cover helmfile/Flux/Argo CD too).

BOM (docs/user/container-images.md) regenerated; only the driver image moves, since the registry-driven BOM was already on v26.3.1. The BOM-vs-leaf drift class itself is tracked separately in NVIDIA#966.
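The gate described above reduces to one predicate: restart the DRA kubelet plugin only once no GPU node is still migrating. Below is a minimal Go sketch of that predicate over node labels that have already been fetched. `pendingNodes` is a hypothetical helper, the node names and the non-done state value are placeholders, and treating an unlabeled node as pending is an assumption of this sketch rather than confirmed deploy.sh behavior.

```go
package main

import (
	"fmt"
	"sort"
)

// Label key and terminal value taken from the commit message above.
const (
	upgradeStateLabel = "nvidia.com/gpu-driver-upgrade-state"
	upgradeDone       = "upgrade-done"
)

// pendingNodes (hypothetical helper) returns the sorted names of nodes whose
// upgrade-state label is not yet "upgrade-done". A node without the label is
// treated as pending; that choice is an assumption of this sketch.
func pendingNodes(nodeLabels map[string]map[string]string) []string {
	var pending []string
	for node, labels := range nodeLabels {
		if labels[upgradeStateLabel] != upgradeDone {
			pending = append(pending, node)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Placeholder data; a real gate would read labels from the cluster.
	nodes := map[string]map[string]string{
		"gpu-node-a": {upgradeStateLabel: upgradeDone},
		"gpu-node-b": {upgradeStateLabel: "upgrade-required"},
	}
	if p := pendingNodes(nodes); len(p) > 0 {
		fmt.Println("waiting on:", p)
		return
	}
	fmt.Println("all nodes migrated; safe to restart the DRA kubelet plugin")
}
```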
…ote chart name (NVIDIA#1034)

Two argocd-helm OCI publishing bugs surfaced by Codex review against the post-NVIDIA#1032 contract.

P1: Path-based child Applications resolve to a non-existent artifact

NVIDIA#1032 changed the --set repoURL contract so callers pass only the parent namespace (e.g., oci://reg/org); the parent Application appends .Chart.Name via its separate source.chart field. The parent-App template at parentAppTemplate / line 407 implements this correctly, but the path-based child-App template in injectValuesIntoSingleSource at line 703 was left emitting only .Values.repoURL. Argo CD's generic OCI source (used by path-based children since they have no source.chart) treats spec.source.repoURL as the full artifact reference and resolves it directly, so under the new contract a child source pointed at oci://reg/org:tag (an artifact that doesn't exist) and the child Application failed to sync.

Fix: append /{{ .Chart.Name }} to the rendered repoURL value so the assembled URL matches the artifact the parent App's repoURL/chart:tag triple resolves to. The error-message text in the required directive is updated to say "this template appends .Chart.Name" (the path-based template is doing the appending, not the parent App).

P2: Unquoted name and version in generated Chart.yaml

writeChartYAML emitted "name: %s" / "version: %s" with raw fmt.Fprintf. (*Reference).ChartName() returns path.Base(Repository), so a valid OCI artifact path whose last segment is a YAML reserved scalar ("null", "true", "false", "yes", "no", numeric strings) produced an unquoted YAML reserved word as the chart's name. Helm's loader then parsed name: null as YAML null, chart.Metadata.Name became empty, and helm package / helm push rejected the chart with "chart.metadata.name is required". Same trap for the version field ("1.0" parses as float).

Fix: emit both via %q so the values round-trip as YAML strings. OCI artifact path segments are constrained to printable ASCII by the docker reference grammar, which is the safe charset for Go's %q.

Tests

- TestInjectValuesIntoSingleSource_AppendsChartName: asserts the rendered path-based child source.repoURL appends /{{ .Chart.Name }} after the user-supplied .Values.repoURL.
- TestWriteChartYAML_QuotesYAMLReservedScalarsAsName: table-driven case across null / true / false / numeric / yes / normal name; for each, writes Chart.yaml and verifies yaml.Unmarshal round-trips name back as the expected string.
- TestHelmTemplate_RendersWithSetRepoURL: updated to exercise the new contract: --set repoURL=oci://reg/org (parent namespace only). Asserts the parent App's repoURL equals the parent namespace and the child App's repoURL equals parent-namespace + /aicr-bundle.
- TestGenerate_CustomChartName and the existing Chart.yaml-version assertion updated to expect quoted scalars.
- Golden templates and Chart.yaml fixtures regenerated.

Closes NVIDIA#1034
Related: NVIDIA#1032 (the contract change this PR completes), NVIDIA#1019, NVIDIA#965
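The P2 fix above can be illustrated with a small Go sketch that contrasts `%s` and `%q` for a chart whose name is a YAML reserved scalar. `renderChartYAML` is a hypothetical stand-in, not the repository's writeChartYAML, and the emitted field set (apiVersion, name, version) is an assumption made to keep the example short.

```go
package main

import (
	"fmt"
	"strings"
)

// renderChartYAML (hypothetical) emits a minimal Chart.yaml. With quote=true
// it uses %q, so "null" or "1.0" stay YAML strings; with quote=false it
// reproduces the unquoted output the commit message describes as the bug.
func renderChartYAML(name, version string, quote bool) string {
	verb := "%s"
	if quote {
		verb = "%q"
	}
	var b strings.Builder
	fmt.Fprintf(&b, "apiVersion: v2\nname: "+verb+"\nversion: "+verb+"\n", name, version)
	return b.String()
}

func main() {
	fmt.Print("unquoted:\n", renderChartYAML("null", "1.0", false))
	fmt.Print("quoted:\n", renderChartYAML("null", "1.0", true))
}
```

In the unquoted form a YAML parser reads `name: null` as a null value and `version: 1.0` as a float; the quoted form keeps both as strings.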
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the default Helm deployer (./deploy.sh) to every other deployer (helmfile, Flux, Argo CD, argocd-helm) by encoding the same logic as a Helm hook inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle, so users who consume the bundle via GitOps reconcilers gain the same protection without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:

- Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by absence of nvidia-driver-daemonset and skips the migration wait.
- Waits up to 15m for every managed GPU node to reach the nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
- Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet and waits up to 5m for the new generation to be ready.
- Exits 0 if the DRA DaemonSet is not yet present (fresh install before nvidia-dra-driver-gpu's chart has applied), since the rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:

- ServiceAccount in the gpu-operator namespace.
- ClusterRole for cluster-wide DaemonSet list/get and Node list/watch (needed to discover nvidia-driver-daemonset cross-namespace and wait on per-node upgrade-state labels).
- Role+RoleBinding in the nvidia-dra-driver namespace scoped to patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once the hook has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output for at least two deployers), full make qualify, and cluster verification on a fresh gpu-operator chart bump under a GitOps deployer all still TBD before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
…pe (NVIDIA#973)

Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more: a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER extractComponentValues (so the values map is populated and user --set overrides have already landed) and BEFORE buildDeployer (so every deployer, Helm, helmfile, Flux, Argo CD, argocd-helm, receives the same final componentValues map). Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%.
- Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
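The lazy nested-map write described under Implementation can be sketched as follows. This is an illustration under assumptions, not the repository's injectDRAChartVersionAnnotation: `injectAnnotation` and `childMap` are hypothetical names, the values map is modeled as `map[string]any`, and the gating on enabled components is left out.

```go
package main

import "fmt"

// Annotation key taken from the commit message above.
const chartVersionAnnotation = "aicr.nvidia.com/gpu-operator-chart-version"

// childMap (hypothetical) returns parent[key] as a map, creating it when it
// is absent, nil, or not a map. Existing sibling keys are left untouched.
func childMap(parent map[string]any, key string) map[string]any {
	if m, ok := parent[key].(map[string]any); ok && m != nil {
		return m
	}
	m := map[string]any{}
	parent[key] = m
	return m
}

// injectAnnotation (hypothetical) writes the chart version onto
// controller.podAnnotations and kubeletPlugin.podAnnotations, always
// overwriting any value already stored under that key.
func injectAnnotation(values map[string]any, chartVersion string) {
	for _, section := range []string{"controller", "kubeletPlugin"} {
		annotations := childMap(childMap(values, section), "podAnnotations")
		annotations[chartVersionAnnotation] = chartVersion
	}
}

func main() {
	// Illustrative DRA values: one section pre-populated, one absent.
	values := map[string]any{
		"controller": map[string]any{"priorityClassName": "system-node-critical"},
	}
	injectAnnotation(values, "v26.3.1")
	fmt.Println(values)
}
```

Overwriting unconditionally matches the stated intent that a user-supplied value for this key is not honored.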
…oss all deployers (NVIDIA#980) Add a Helm post-install,post-upgrade Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. This extends the kubectl wait + rollout-restart block that pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the default Helm deployer (./deploy.sh) to every other deployer — helmfile, Flux, Argo CD, argocd-helm — by encoding the same logic as a Helm hook inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle, so users who consume the bundle via GitOps reconcilers gain the same protection without needing to run the post-deploy script. The hook mirrors deploy.sh's logic: - Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by absence of nvidia-driver-daemonset and skips the migration wait. - Waits up to 15m for every managed GPU node to reach the nvidia.com/gpu-driver-upgrade-state=upgrade-done label. - Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet and waits up to 5m for the new generation to be ready. - Exits 0 if the DRA DaemonSet is not yet present (fresh install before nvidia-dra-driver-gpu's chart has applied), since the rollout-restart is only meaningful in the upgrade path. RBAC is shipped alongside the Job: - ServiceAccount in the gpu-operator namespace. - ClusterRole for cluster-wide DaemonSet list/get and Node list/watch (needed to discover nvidia-driver-daemonset cross-namespace and wait on per-node upgrade-state labels). - Role+RoleBinding in the nvidia-dra-driver namespace scoped to patch on DaemonSets and list/watch on Pods (rollout-status). 
The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once the hook has shipped in a release and proven reliable. Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge. WIP / draft: parity test (assert the hook Job appears in rendered output for at least two deployers), full make qualify, and cluster verification on a fresh gpu-operator chart bump under a GitOps deployer all still TBD before flipping to ready-for-review. Refs NVIDIA#980 Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
…pe (NVIDIA#973) Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more — a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state. Why this matters PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins. Implementation injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved. 
Injection happens in DefaultBundler.Make AFTER extractComponentValues — so the values map is populated and user --set overrides have already landed — and BEFORE buildDeployer — so every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map. Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free. Tests - Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%. - Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out. Closes NVIDIA#973 Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
…pe (NVIDIA#973) Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more — a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state. Why this matters PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins. Implementation injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved. 
Injection happens in DefaultBundler.Make AFTER extractComponentValues — so the values map is populated and user --set overrides have already landed — and BEFORE buildDeployer — so every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map. Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free. Tests - Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%. - Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out. Closes NVIDIA#973 Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
…pe (NVIDIA#973) Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more — a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state. Why this matters PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins. Implementation injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved. 
Injection happens in DefaultBundler.Make AFTER extractComponentValues — so the values map is populated and user --set overrides have already landed — and BEFORE buildDeployer — so every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map. Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free. Tests - Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%. - Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out. Closes NVIDIA#973 Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
…pe (NVIDIA#973)

Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more: a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER extractComponentValues, so the values map is populated and user --set overrides have already landed, and BEFORE buildDeployer, so every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map. Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%.
- Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
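The lazy nested-map write described under Implementation can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the helper names ensureMap and injectChartVersion, the map[string]any values shape, and the sample priorityClassName value are assumptions, and unlike the real helper this sketch expects a non-nil values map.

```go
package main

import "fmt"

// annotationKey is the annotation key named in the commit message.
const annotationKey = "aicr.nvidia.com/gpu-operator-chart-version"

// ensureMap (hypothetical helper) returns parent[key] as a map, creating it
// when it is missing or not a map. Sibling keys in parent are left untouched.
func ensureMap(parent map[string]any, key string) map[string]any {
	if m, ok := parent[key].(map[string]any); ok {
		return m
	}
	m := map[string]any{}
	parent[key] = m
	return m
}

// injectChartVersion (hypothetical name) writes the chart version onto the
// podAnnotations of both pod paths, overwriting any value already present.
// Assumption: draValues is non-nil.
func injectChartVersion(draValues map[string]any, chartVersion string) {
	for _, path := range []string{"controller", "kubeletPlugin"} {
		annotations := ensureMap(ensureMap(draValues, path), "podAnnotations")
		annotations[annotationKey] = chartVersion
	}
}

func main() {
	// Illustrative input: an existing value that must survive the injection.
	values := map[string]any{
		"controller": map[string]any{"priorityClassName": "example-priority"},
	}
	injectChartVersion(values, "v26.3.1")
	fmt.Println(values["controller"])
	fmt.Println(values["kubeletPlugin"])
}
```

Because the write happens on the merged values map, running it after user overrides is what makes the injected value win over a user-supplied annotation with the same key.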
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the default Helm deployer (./deploy.sh) to every other deployer (helmfile, Flux, Argo CD, argocd-helm) by encoding the same logic as a Helm hook inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle, so users who consume the bundle via GitOps reconcilers gain the same protection without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:

- Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by absence of nvidia-driver-daemonset and skips the migration wait.
- Waits up to 15m for every managed GPU node to reach the nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
- Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet and waits up to 5m for the new generation to be ready.
- Exits 0 if the DRA DaemonSet is not yet present (fresh install before nvidia-dra-driver-gpu's chart has applied), since the rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:

- ServiceAccount in the gpu-operator namespace.
- ClusterRole for cluster-wide DaemonSet list/get and Node list/watch (needed to discover nvidia-driver-daemonset cross-namespace and wait on per-node upgrade-state labels).
- Role+RoleBinding in the nvidia-dra-driver namespace scoped to patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once the hook has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output for at least two deployers), full make qualify, and cluster verification on a fresh gpu-operator chart bump under a GitOps deployer all still TBD before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism (dual approach across bundler paths):

1. Non-vendored localformat (default for Helm/helmfile/Argo CD/argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values).

2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223). Without an author-declared hook, the auto-injected post-install-only annotation means the Job runs on first install and is silently skipped on every subsequent helm upgrade, exactly when the stale-NVML race fires. The Job therefore author-declares

       helm.sh/hook: post-install,post-upgrade
       helm.sh/hook-delete-policy: before-hook-creation

   so vendor_wrapper honors the value and the Job fires on both phases. stripHelmHooks on the non-vendored path strips both phases together (both are in syncPhaseHooks), so this declaration is inert there and the version-keyed name from path 1 remains the re-fire mechanism.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation, contradicting the safety story.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism (dual approach across bundler paths):

1. Non-vendored localformat (default for Helm/helmfile/Argo CD/argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values).

2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223), and auto-assigns sequential hook-weights starting at postInstallHookWeightBase=100. Two consequences for this manifest:

   (a) auto-injected post-install hooks fire only on first install and are silently skipped on every helm upgrade, exactly when the stale-NVML race occurs.
   (b) if only the Job author-declares a hook, the SA / ClusterRole / ClusterRoleBinding still get auto-injected weights 100/101/102 while the Job takes Helm's default weight 0, putting the Job BEFORE its ServiceAccount/RBAC in install order and failing with "ServiceAccount not found".

   Both are addressed by author-declaring helm.sh/hook: post-install,post-upgrade plus before-hook-creation on ALL FOUR resources with ascending hook-weights:

       ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10

   vendor_wrapper honors the author-declared hooks; the explicit weights guarantee the Job's prerequisites exist before it runs; post-upgrade re-fires the Job (and refreshes the RBAC) on every helm upgrade. stripHelmHooks on the non-vendored path strips ALL of these annotations (every phase here is in syncPhaseHooks at hooks.go:109-120), so this declaration is inert there and Helm's default kind-based ordering (SA → ClusterRole → ClusterRoleBinding → Job) plus the chart-version-keyed Job name (path 1) provide the same install order and re-fire mechanism without hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout. It is a no-op for the vendored hook path (Helm does not apply resource-policy to hook resources); retention there is governed by helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook resource ONLY when a new hook with the same name is created. SA / CR / CRB names are stable so they cycle on every install/upgrade (idempotent delete-and-recreate); Job names are chart-version-keyed so prior-version Jobs survive cross-version upgrades as Succeeded records.

ttlSecondsAfterFinished is intentionally NOT set: the Job is part of GitOps desired state, so a Kubernetes TTL delete would make the resource missing relative to the rendered manifest and trigger an immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
Cross-review iteration 3 (cross-deployer gating + values-only re-fire):

- [P1] Cross-deployer install-time gating wired across every bundler path. The prior revision relied on the Job's eventual rollout-restart to self-correct the transient race; without install-time gating, downstream releases (notably nvidia-dra-driver-gpu) could install against stale-NVML driver state for the 15-20m window between gpu-operator-post applying its chart resources and the Job's kubectl wait + rollout restart returning. Fix:
  - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs + COMPONENT_FORCE_WAIT=true (overrides --no-wait; the flag is a readiness-convenience knob, not a correctness-bypass knob; components that exist specifically to close a known race must always wait). Vendored gpu-operator case lifted to 25m for the --vendor-charts path where the Job ships as a hook on the primary chart.
  - pkg/bundler/deployer/helmfile/releases.go: new releaseOverrides map keyed by synthesized folder name; gpu-operator-post -> {timeout: 25m, waitForJobs: true}; gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors the deploy.sh case switch.
  - pkg/bundler/deployer/flux/helm.go: new Timeout field on HelmReleaseData / ChartRefHelmReleaseData + releaseTimeoutOverrides map (gpu-operator-post -> 25m). helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl emit spec.timeout conditionally. helm-controller already waits for hook Jobs by default, so no waitForJobs knob is needed on this path, only the outer timeout.
  - Argo CD path is naturally gated by sync-wave + Job-Healthy default: app-of-apps waits for each child Application to reach Healthy, which requires the Job's status.succeeded. No explicit wiring needed.
- [medium] Driver-version-only overrides now re-fire the Job on every deployer. The prior key was _aicrParentChartVersion (= gpu-operator chart version) only. On non-hook deployers (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes the upgrade-hook annotations), the Job is a regular chart resource: kubectl apply on the same name is a no-op, so --set gpuoperator:driver.version=<X> did NOT re-run the migration mitigation when the gpu-operator chart pin was unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion is injected alongside the parent-chart version (reading the merged componentValues at driver.version, so any source, whether registry default, overlay, or user --set, flows through). The Job-name slug becomes "<chartSlug>-d<driverSlug>" when driver.version is set; host-managed-driver overlays (driver.enabled=false, version unset) keep the original 1-component shape. The chart-version label (aicr.nvidia.com/gpu-operator-chart-version) stays semantically tied to the chart pin only.
- [low] Empty gpu-operator.Version no longer produces an unbundleable manifest. The prior helper early-returned on empty Version, but the hook ships unconditionally on the gpu-operator componentRef, so the template would render "<nil>" into the Job name and chart-version label, both invalid DNS-1123 / label values. Fix: fall through with NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so operators get a debuggable signal of the underlying recipe gap.
- User-facing doc: --no-wait correctness-gate exception documented in docs/user/cli-reference.md (deploy-script section) and pkg/bundler/deployer/helm/templates/README.md.tmpl so operators reading the doc know the flag does not bypass gpu-operator-post.
- Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey, _DriverVersionAbsent, and _EmptyVersionFallback added to pin the new behavior. All existing bundler / manifest / recipes tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe: gpu-operator componentValues carries

    _aicrParentChartVersion: 26.3.1
    _aicrDriverVersion: 580.126.20

and the rendered Job name is aicr-dra-rollout-hook-26-3-1-d580-126-20-r1, so a driver.version-only override now produces a uniquely-named Job and triggers re-fire on every deployer.
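The "<chartSlug>-d<driverSlug>" shape can be illustrated with a short sketch. The slug rules below (lowercase, drop a leading "v", collapse every other character run to a single "-") are an assumption chosen so that the two versions quoted above map to 26-3-1 and 580-126-20; the function names are hypothetical, and the name prefix and trailing "-r1" seen in the rendered Job name are not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// slug (hypothetical) turns a version string into a DNS-1123-friendly token:
// lowercase, leading "v" dropped, runs of other characters become one "-".
func slug(version string) string {
	v := strings.TrimPrefix(strings.ToLower(version), "v")
	var b strings.Builder
	lastDash := false
	for _, r := range v {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
			lastDash = false
		} else if !lastDash {
			b.WriteByte('-')
			lastDash = true
		}
	}
	return strings.Trim(b.String(), "-")
}

// jobNameSlug (hypothetical) appends "-d<driverSlug>" only when a driver
// version is set, so host-managed-driver overlays keep the one-part shape.
func jobNameSlug(chartVersion, driverVersion string) string {
	s := slug(chartVersion)
	if driverVersion != "" {
		s += "-d" + slug(driverVersion)
	}
	return s
}

func main() {
	fmt.Println(jobNameSlug("v26.3.1", "580.126.20")) // prints 26-3-1-d580-126-20
	fmt.Println(jobNameSlug("v26.3.1", ""))           // prints 26-3-1
}
```

Because the driver version is part of the name, changing only driver.version yields a different Job name, which is what makes a plain apply create and run a new Job instead of treating the existing one as unchanged.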
…ers (NVIDIA#980) Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh. Re-fire mechanism — dual approach across bundler paths: 1. Non-vendored localformat (default for Helm/helmfile/Argo CD/ argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values). 2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223), and auto-assigns sequential hook-weights starting at postInstallHookWeightBase=100. 
Two consequences for this manifest: (a) auto-injected post-install hooks fire only on first install and are silently skipped on every helm upgrade — exactly when the stale-NVML race occurs. (b) if only the Job author-declares a hook, the SA / Cluster Role / ClusterRoleBinding still get auto-injected weights 100/101/102 while the Job takes Helm's default weight 0, putting the Job BEFORE its ServiceAccount/RBAC in install order and failing with "ServiceAccount not found". Both are addressed by author-declaring helm.sh/hook: post-install,post-upgrade plus before-hook-creation on ALL FOUR resources with ascending hook-weights: ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10 vendor_wrapper honors the author-declared hooks; the explicit weights guarantee the Job's prerequisites exist before it runs; post-upgrade re-fires the Job (and refreshes the RBAC) on every helm upgrade. stripHelmHooks on the non-vendored path strips ALL of these annotations (every phase here is in syncPhaseHooks at hooks.go:109-120), so this declaration is inert there and Helm's default kind-based ordering (SA → ClusterRole → ClusterRole Binding → Job) plus the chart-version-keyed Job name (path 1) provide the same install order and re-fire mechanism without hooks. helm.sh/resource-policy: keep applies to path 1 only and prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout. It is a no-op for the vendored hook path (Helm does not apply resource-policy to hook resources); retention there is governed by helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook resource ONLY when a new hook with the same name is created. SA / CR / CRB names are stable so they cycle on every install/upgrade (idempotent delete-and-recreate); Job names are chart-version-keyed so prior- version Jobs survive cross-version upgrades as Succeeded records. 
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of GitOps desired state, so a Kubernetes TTL delete would make the resource missing relative to the rendered manifest and trigger an immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):
- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):
- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).
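The capture-then-check pattern named above can be sketched in POSIX sh as follows. This is an illustration only, not the Job's actual script: the function name `classify_daemonset` and its arguments are hypothetical, and the sketch assumes `kubectl get --ignore-not-found`, which prints nothing and exits 0 when the object does not exist.

```sh
#!/bin/sh
# Sketch only. classify_daemonset and its arguments are hypothetical names,
# not taken from the Job's script.

# classify_daemonset NAMESPACE NAME
# Prints "present" or "absent". Returns 1 when the API call itself failed,
# so the caller can exit 1 (fail closed) instead of guessing.
classify_daemonset() {
    ns="$1"
    name="$2"
    # Capture the output and test kubectl's own exit status. Piping into
    # grep or head would report the status of the last pipeline stage
    # instead, because POSIX sh has no pipefail.
    out=$(kubectl get daemonset "$name" -n "$ns" --ignore-not-found -o name 2>&1) || {
        echo "ERROR: kubectl get daemonset failed: $out" >&2
        return 1
    }
    # The call succeeded, so empty output really means "not there".
    if [ -z "$out" ]; then
        echo "absent"
    else
        echo "present"
    fi
}
```

A caller would treat `absent` as a positively determined no-op and any non-zero return as a hard failure, which keeps "result is empty" and "API call failed" on separate code paths.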
The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable. Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge. Refs NVIDIA#980 Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug). Cross-review iteration 3 — cross-deployer gating + values-only re-fire: - [P1] Cross-deployer install-time gating wired across every bundler path. The prior revision relied on the Job's eventual rollout-restart to self-correct the transient race; without install-time gating, downstream releases (notably nvidia-dra-driver-gpu) could install against stale-NVML driver state for the 15-20m window between gpu-operator-post applying its chart resources and the Job's kubectl wait + rollout restart returning. Fix: - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs + COMPONENT_FORCE_WAIT=true (overrides --no-wait — the flag is a readiness-convenience knob, not a correctness-bypass knob; components that exist specifically to close a known race must always wait). Vendored gpu-operator case lifted to 25m for the --vendor-charts path where the Job ships as a hook on the primary chart. - pkg/bundler/deployer/helmfile/releases.go: new releaseOverrides map keyed by synthesized folder name; gpu-operator-post -> {timeout: 25m, waitForJobs: true}; gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors the deploy.sh case switch. 
- pkg/bundler/deployer/flux/helm.go: new Timeout field on HelmReleaseData / ChartRefHelmReleaseData + releaseTimeoutOverrides map (gpu-operator-post -> 25m). helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl emit spec.timeout conditionally. helm-controller already waits for hook Jobs by default, so no waitForJobs knob is needed on this path — only the outer timeout. - Argo CD path is naturally gated by sync-wave + Job-Healthy default: app-of-apps waits for each child Application to reach Healthy, which requires the Job's status.succeeded. No explicit wiring needed. - [medium] Driver-version-only overrides now re-fire the Job on every deployer. The prior key was _aicrParentChartVersion (= gpu-operator chart version) only. On non-hook deployers (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes the upgrade-hook annotations), the Job is a regular chart resource — kubectl apply on the same name is a no-op, so --set gpuoperator:driver.version=<X> did NOT re-run the migration mitigation when the gpu-operator chart pin was unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion is injected alongside the parent-chart version (reading the merged componentValues at driver.version, so any source — registry default / overlay / user --set — flows through). The Job-name slug becomes "<chartSlug>-d<driverSlug>" when driver.version is set; host-managed-driver overlays (driver.enabled=false, version unset) keep the original 1-component shape. The chart-version label (aicr.nvidia.com/gpu-operator-chart-version) stays semantically tied to the chart pin only. - [low] Empty gpu-operator.Version no longer produces an unbundleable manifest. The prior helper early-returned on empty Version, but the hook ships unconditionally on the gpu-operator componentRef, so the template would render "<nil>" into the Job name and chart-version label — invalid DNS-1123 / label values. 
Fix: fall through with NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so operators get a debuggable signal of the underlying recipe gap. - User-facing doc: --no-wait correctness-gate exception documented in docs/user/cli-reference.md (deploy-script section) and pkg/bundler/deployer/helm/templates/README.md.tmpl so operators reading the doc know the flag does not bypass gpu-operator-post. - Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey, _DriverVersionAbsent, and _EmptyVersionFallback added to pin the new behavior. All existing bundler / manifest / recipes tests pass with -race; golangci-lint clean. End-to-end render verified on a real EKS H100 training recipe: gpu-operator componentValues carries _aicrParentChartVersion: 26.3.1 _aicrDriverVersion: 580.126.20 and the rendered Job name is aicr-dra-rollout-hook-26-3-1-d580-126-20-r1 — a driver.version-only override now produces a uniquely-named Job and triggers re-fire on every deployer. Document known limits of the cross-deployer gate: - Argo CD app-of-apps. Sync-waves give ordering but not gating after Argo CD 1.8 removed default Health assessment for argoproj.io/Application (https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/#health-assessment-of-argoprojioapplication-crd-has-been-removed). Restoring gating needs a resource.customizations Lua block on argocd-cm — out of scope for the bundle (cluster-wide config). Without it, Argo CD users fall back to the Job's fail-closed semantics: exit 1 surfaces gpu-operator-post Application as unHealthy, but nvidia-dra-driver-gpu may still sync in parallel. - Post-bundle driver.version edits. The Job-name slug is baked at bundle time by pkg/manifest.Render via pkg/bundler/deployer/localformat/local_helm.go:155, so a driver.version change applied after `aicr bundle` does not update the Job name (re-bundle picks it up; install-time `--set` / `--dynamic` does not). 
Closing this fully needs deferring this manifest's render to helm install time — a structural change to the localformat render pipeline tracked as a follow-up. PR description updated to reflect both limitations and the already-wired install-time gating across helm / helmfile / Flux.
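The `<chartSlug>-d<driverSlug>` Job-name shape described in the commit message can be illustrated with a short POSIX sh sketch. The real derivation lives in the Go bundler; the function names here (`slugify`, `job_name_slug`) and the exact slug rule (lowercase, then every character outside `[a-z0-9]` becomes `-`) are assumptions made for illustration. The sketch does not model the `-r1` suffix seen in the rendered name or DNS-1123 edge cases such as leading or trailing dashes.

```sh
#!/bin/sh
# Sketch only. slugify and job_name_slug are hypothetical helpers; the
# bundler's real logic is written in Go and may differ in detail.

# slugify VERSION: lowercase, then map each character outside [a-z0-9] to "-".
slugify() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g'
}

# job_name_slug CHART_VERSION [DRIVER_VERSION]
# With a driver version the slug has two components; without one
# (host-managed driver overlays) it keeps the one-component shape.
job_name_slug() {
    chart_slug=$(slugify "$1")
    if [ -n "${2:-}" ]; then
        echo "${chart_slug}-d$(slugify "$2")"
    else
        echo "$chart_slug"
    fi
}

# Example: job_name_slug 26.3.1 580.126.20
# prints 26-3-1-d580-126-20
```

Because the driver version is part of the name, a driver-only override yields a new Job name, which is what makes a plain `kubectl apply` on non-hook deployers create and run a fresh Job.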
Summary
Lift every leaf recipe to the current upstream stable gpu-operator chart (v26.3.1, published 2026-04-18), align the driver pin with NVIDIA's qualified version (580.126.20), and mitigate three v26.3.1-introduced regressions: the stricter `driver-validation` init container (peermem gate), stale DRA NVML handles after a kernel-module reload, and the timing race between gpu-operator's async per-node driver migration and the DRA kubelet plugin's post-install restart.

Motivation / Context
The registry already defaulted to `v26.3.1`, but `base.yaml` shadowed it with `v25.10.1` and the AKS/OKE overlays held `v26.3.0`. As a result, leaves actually shipped `v25.10.1` / `v26.3.0` images even though `docs/user/container-images.md` advertised `v26.3.1`: BOM-vs-leaf drift introduced by chart-pin duplication between the registry and `base.yaml`. This PR resolves that drift by single-sourcing the version pin in `base.yaml` at the upstream-stable level. The broader resolver/BOM cleanup is tracked separately in #966.

Coupled value updates so the bump matches NVIDIA's qualified stack and avoids v26.3.1 regressions:
- `driver.version` advances from `580.105.08` to `580.126.20`, the v26.3.1 chart default and the GB200+EFA floor. The per-overlay GB200 driver pin (previously `580.126.20`) becomes redundant and is dropped; `kernelModuleConfig` stays.
- `ccManager.enabled: false` pinned in global values. Upstream v26.3.x flipped defaults from `{enabled: false, defaultMode: "off"}` to `{enabled: true, defaultMode: "on"}`. AICR keeps it off until we have explicit CC-capable hardware support.
- `driver.rdma.enabled` global default flipped `true → false`. v26.3.1's `driver-validation` init container now hard-gates on `lsmod | grep nvidia_peermem`. That module only loads against Mellanox MOFED symbols. On AWS EFA (EKS p4d/p5/p5e) and Linode there is no MOFED, so peermem fails to load (`modprobe ... Invalid argument`) and the rest of the GPU stack stays stuck in `Init` forever. NCCL on EFA uses `aws-ofi-nccl`/libfabric, which does not consume `nvidia_peermem`, so there is no functional path lost.
- `driver.rdma.useHostMofed: true` explicitly set on AKS (`values-aks.yaml`, `values-aks-training.yaml`) alongside `enabled: true`. AKS overlays deploy `network-operator`, which installs MOFED kernel modules on the host via a privileged DaemonSet. With `useHostMofed: true`, peermem binds against host symbols.
- `recipes/components/nvidia-dra-driver-gpu/values.yaml` on both `controller.podAnnotations` and `kubeletPlugin.podAnnotations`: `aicr.nvidia.com/gpu-operator-chart-version: v26.3.1`. Bumping the annotation value on every gpu-operator chart change forces a DaemonSet re-roll across every deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against the now-running driver. Manual coupling of the annotation value to the chart version is a known maintenance gap, tracked in chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 with a bundler-derived annotation as the durable fix.
- `pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`). gpu-operator's `k8s-driver-manager` runs its per-node module reload asynchronously after `helm upgrade gpu-operator` returns. On a multi-GPU-node cluster, the previously-existing DRA rollout-restart in `deploy.sh` (and the annotation re-roll above) could fire before some nodes had finished migration, leaving the freshly-rolled DRA kubelet plugin pinned to a now-stale NVML handle. Reproduced live on a GB200 EKS cluster during this PR's validation: the chart-version annotation re-rolled the DRA pods, then the second GB200 node's driver migration ran after the re-roll, leaving its kubelet-plugin pod with stale NVML and triggering the same "invalid CDI Spec: empty device edits" failure mode the annotation was meant to prevent. The fix is ~10 lines in the deploy.sh template: `kubectl wait --for=jsonpath` until every `nvidia.com/gpu.present=true` node carries `nvidia.com/gpu-driver-upgrade-state=upgrade-done` (15-min timeout, best-effort fall-through), then run the existing rollout-restart. Helm-deployer only; chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 will deliver a Helm post-install hook on `gpu-operator-post` to extend the protection to GitOps deployers.

Effective resolved values per leaf, post-PR (verified via
`aicr bundle` + values inspection):

`driver.enabled` | `driver.rdma.enabled` | `useHostMofed`

true | false | true | true (explicit override) | true (explicit override; binds peermem to host MOFED from network-operator) | false (host driver) | false (host driver) | false (host driver) | false | true | false

Chart version on every supported leaf resolves to v26.3.1 (verified via
`aicr query ... --selector components.gpu-operator.version`).

Fixes: #894
Related: #698, #966, #973
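The wait-then-restart guard described under Motivation / Context can be sketched as below. This is a hedged approximation, not the contents of `deploy.sh.tmpl`: the function name `wait_then_restart_dra`, the `DRA_NAMESPACE` / `DRA_DAEMONSET` variables, the DaemonSet name, and the exact JSONPath expression are assumptions; only the labels, the 15-minute timeout, the node-count guard, and the warn-and-proceed behavior come from the description above.

```sh
#!/bin/sh
# Sketch only. Names below are hypothetical placeholders, not the values
# used by the real deploy.sh template.
DRA_NAMESPACE="${DRA_NAMESPACE:-nvidia-dra-driver}"
DRA_DAEMONSET="${DRA_DAEMONSET:-nvidia-dra-driver-gpu-kubelet-plugin}"

wait_then_restart_dra() {
    selector="nvidia.com/gpu.present=true"

    # Count GPU nodes first: kubectl wait fails with "no matching resources"
    # when the selector matches nothing (for example on initial install).
    # A failed lookup is treated as "no nodes" in this best-effort sketch.
    gpu_nodes=$(kubectl get nodes -l "$selector" -o name 2>/dev/null) || gpu_nodes=""
    gpu_node_count=$(printf '%s' "$gpu_nodes" | grep -c .)

    if [ "$gpu_node_count" -gt 0 ]; then
        # Block until every GPU node reports upgrade-done, at most 15 minutes.
        # On timeout, warn and fall through rather than abort.
        kubectl wait nodes -l "$selector" \
            --for=jsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}'=upgrade-done \
            --timeout=15m ||
            echo "WARNING: driver migration not confirmed; proceeding anyway" >&2
    fi

    # Restart the DRA kubelet plugin so it rebinds NVML to the current driver.
    kubectl rollout restart "daemonset/$DRA_DAEMONSET" -n "$DRA_NAMESPACE"
}
```

Calling `wait_then_restart_dra` after `helm upgrade gpu-operator` returns orders the restart behind the per-node migration. The warn-and-proceed branch mirrors the best-effort semantics of this PR; the in-cluster Job described in the later commit is fail-closed instead.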
Type of Change
Component(s) Affected
- `pkg/recipe`)
- `pkg/bundler/...`, deploy.sh template)
- `docs/`, `examples/`)

Implementation Notes
- `Deployment.gpu-operator.version >= v24.6.0` (8 overlays) is still satisfied.
- `kubeVersion` in Chart.yaml is unchanged (`>= 1.16.0-0`); AICR's `>= 1.32.4` floor in `base.yaml` remains binding.
- `h100-eks-training`, `gb200-eks-training`, and AKS leaves: rendered `gpu-operator/values.yaml` shows `version: 580.126.20`, `ccManager.enabled: false`, the correct per-cloud `rdma.enabled` + `useHostMofed`, and (for gb200) `kernelModuleConfig.name: nvidia-kernel-module-params`.
- `controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version` and `kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version` set to `v26.3.1`. Confirmed on aicr2 and the yljtrxpmzu GB200 cluster that `helm upgrade` on the new bundle force-rolls both DRA pods.
- `kubectl wait` block in `deploy.sh.tmpl` uses a 15-min timeout and `WARNING; proceeding anyway` semantics rather than a hard fail, so a misconfigured cluster (e.g. nodes never reach `upgrade-done`) still gets the rollout-restart attempt. The guard around `GPU_NODE_COUNT > 0` avoids `kubectl wait`'s "no matching resources" error on clusters with no GPU nodes yet (initial install ordering).
- `nvcr.io/nvidia/driver:580.105.08` → `:580.126.20` moves, because the registry-driven BOM was already rendering against v26.3.1 (the BOM-vs-leaf drift this PR also resolves at the leaf layer; the BOM tool itself is rewritten in chore(recipes): make resolved recipes the single source of truth for chart versions #966).
- `tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml` updated to match the new EKS expectation (`rdma.enabled: false`), with a comment explaining the EFA / `aws-ofi-nccl` rationale.
- `--set gpuoperator:driver.rdma.enabled=true` and `--set gpuoperator:driver.rdma.useHostMofed=true`. Documented inline in `recipes/components/gpu-operator/values.yaml`.

Testing
Live cluster validation:
aicr2 (EKS H100 `p5.48xlarge`, ad-hoc dev cluster):
- `580.105.08`. Post-upgrade: `580.126.20`. All GPU pods Running, `nvidia-cuda-validator` Completed.
- `aicr validate`: deployment passed (4/4, 28.8s), performance passed (inference-perf: 39,776.81 tok/s, TTFT p99 122.74 ms, 3m 15.7s), conformance passed (1m 49.5s).

yljtrxpmzu (EKS GB200 `p6e-gb200.36xlarge` × 2, DGX Cloud dev cluster):
- `kubectl wait` block: the chart-version annotation re-rolled DRA pods correctly, but a subsequent driver-module reload on the second GB200 node left the kubelet-plugin pod with stale NVML, producing `invalid CDI Spec: empty device edits` on the next `aicr validate --phase performance` run. Adding the wait closes that gap.

GPU CI follow-up: H100 nvkind run is required per #894 DoD; will trigger via CI label after maintainer review.
Risk Assessment
- `nvidia.com/gpu.present=true` nodes. Drivers move 580.105.08 → 580.126.20 (node-side change; cordon/reboot on upgrade).
- `ccManager` and `driver.rdma.enabled` explicitly disabled to preserve behavior or to mitigate v26.3.1 regressions. AKS preserves the prior `rdma.enabled: true` behavior via explicit per-cloud overrides plus the `useHostMofed: true` correctness fix. DRA annotation closes the cross-deployer GPU-allocation regression introduced by the driver bump; the new `kubectl wait` closes the residual same-deployer timing race.

Rollout notes:
- `k8s-driver-manager` handles driver replacement with `maxParallelUpgrades = 5`). No CRD migrations are introduced.
- `rdma.enabled` flip from `true → false` is a behavior change. They can opt back in with `--set gpuoperator:driver.rdma.enabled=true` (and add `useHostMofed=true` if MOFED is on the host). Documented in `recipes/components/gpu-operator/values.yaml`.
- `kubectl wait` adds at most 15 min to `deploy.sh` on clusters whose nodes never reach `upgrade-done` (best-effort; logs WARNING then proceeds). On a healthy cluster the wait is sub-second once migration completes.

Known follow-ups
- `tools/bom` and recipe resolver both read `helm.defaultVersion` today, so chart-pin drift between registry and overlays can happen silently. This PR fixes the immediate `gpu-operator` instance; chore(recipes): make resolved recipes the single source of truth for chart versions #966 fixes the resolver and BOM layers so the class can't recur.
- `kubectl wait` in this PR cover the default Helm-deployer flow. chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 tracks the durable fix that works for helmfile / Flux / Argo CD too: a Helm `post-install,post-upgrade` hook on `gpu-operator-post` that runs the same wait-and-restart logic from inside the cluster, so non-Helm-deployer pipelines get the same protection.

Checklist
- `make qualify`)
- `make lint`, included in qualify)
- `docs/user/container-images.md` regenerated; rationale comments in `values.yaml`, `values-aks*.yaml`, `nvidia-dra-driver-gpu/values.yaml`, and `deploy.sh.tmpl`)
- `git commit -S`)