Skip to content

chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20#965

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:chore/gpu-operator-v26.3.1
May 21, 2026
Merged

chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20#965
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:chore/gpu-operator-v26.3.1

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 19, 2026

Summary

Lift every leaf recipe to the current upstream stable gpu-operator chart (v26.3.1, published 2026-04-18), align the driver pin with NVIDIA's qualified version (580.126.20), and mitigate three v26.3.1-introduced regressions: stricter driver-validation init container (peermem gate), stale DRA NVML handles after a kernel-module reload, and the timing race between gpu-operator's async per-node driver migration and the DRA kubelet plugin's post-install restart.

Motivation / Context

The registry already defaulted to v26.3.1, but base.yaml shadowed it with v25.10.1 and the AKS/OKE overlays held v26.3.0. As a result, leaves actually shipped v25.10.1 / v26.3.0 images even though docs/user/container-images.md advertised v26.3.1 — BOM-vs-leaf drift introduced by chart-pin duplication between the registry and base.yaml. This PR resolves that drift by single-sourcing the version pin in base.yaml at the upstream-stable level. The broader resolver/BOM cleanup is tracked separately in #966.

Coupled value updates so the bump matches NVIDIA's qualified stack and avoids v26.3.1 regressions:

  • driver.version advances from 580.105.08 to 580.126.20 — the v26.3.1 chart default and the GB200+EFA floor. The per-overlay GB200 driver pin (previously 580.126.20) becomes redundant and is dropped; kernelModuleConfig stays.
  • ccManager.enabled: false pinned in global values. Upstream v26.3.x flipped defaults from {enabled: false, defaultMode: "off"} to {enabled: true, defaultMode: "on"}. AICR keeps it off until we have explicit CC-capable hardware support.
  • driver.rdma.enabled global default flipped true → false. v26.3.1's driver-validation init container now hard-gates on lsmod | grep nvidia_peermem. That module only loads against Mellanox MOFED symbols. On AWS EFA (EKS p4d/p5/p5e) and Linode there is no MOFED, so peermem fails to load (modprobe ... Invalid argument) and the rest of the GPU stack stays stuck in Init forever. NCCL on EFA uses aws-ofi-nccl/libfabric, which does not consume nvidia_peermem, so there is no functional path lost.
  • driver.rdma.useHostMofed: true explicitly set on AKS (values-aks.yaml, values-aks-training.yaml) alongside enabled: true. AKS overlays deploy network-operator, which installs MOFED kernel modules on the host via a privileged DaemonSet. With useHostMofed: true, peermem binds against host symbols.
  • DRA pod-template annotation added to recipes/components/nvidia-dra-driver-gpu/values.yaml on both controller.podAnnotations and kubeletPlugin.podAnnotationsaicr.nvidia.com/gpu-operator-chart-version: v26.3.1. Bumping the annotation value on every gpu-operator chart change forces a DaemonSet re-roll across every deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against the now-running driver. Manual coupling of the annotation value to the chart version is a known maintenance gap — tracked in chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 with a bundler-derived annotation as the durable fix.
  • Wait for per-node driver migration before the DRA post-install rollout-restart (pkg/bundler/deployer/helm/templates/deploy.sh.tmpl). gpu-operator's k8s-driver-manager runs its per-node module reload asynchronously after helm upgrade gpu-operator returns. On a multi-GPU-node cluster, the previously-existing DRA rollout-restart in deploy.sh (and the annotation re-roll above) could fire before some nodes had finished migration — leaving the freshly-rolled DRA kubelet plugin pinned to a now-stale NVML handle. Reproduced live on a GB200 EKS cluster during this PR's validation: the chart-version annotation re-rolled the DRA pods, then the second GB200 node's driver migration ran after the re-roll, leaving its kubelet-plugin pod with stale NVML and triggering the same "invalid CDI Spec: empty device edits" failure mode the annotation was meant to prevent. The fix is ~10 lines in the deploy.sh template: kubectl wait --for=jsonpath until every nvidia.com/gpu.present=true node carries nvidia.com/gpu-driver-upgrade-state=upgrade-done (15-min timeout, best-effort fall-through), then run the existing rollout-restart. Helm-deployer only; chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 will deliver a Helm post-install hook on gpu-operator-post to extend the protection to GitOps deployers.

Effective resolved values per leaf, post-PR (verified via aicr bundle + values inspection):

Service driver.enabled driver.rdma.enabled useHostMofed
EKS h100 / gb200 (training, inference, dynamo) true false n/a
AKS h100 (training, inference-dynamo) true true (explicit override) true (explicit override; binds peermem to host MOFED from network-operator)
GKE COS h100 (training, inference) false (host driver) n/a n/a
OKE gb200 (training, inference) false (host driver) n/a n/a
Kind false (host driver) false n/a
LKE rtx-pro-6000 (training, inference) true false n/a

Chart version on every supported leaf resolves to v26.3.1 (verified via aicr query ... --selector components.gpu-operator.version).

Fixes: #894
Related: #698, #966, #973

Type of Change

  • Bug fix (non-breaking change that fixes an issue — v26.3.1 validator trap on EKS / LKE + stale DRA NVML after driver migration)
  • Build/CI/tooling (component version maintenance)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler/... — deploy.sh template)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • Constraint floor Deployment.gpu-operator.version >= v24.6.0 (8 overlays) is still satisfied.
  • kubeVersion in Chart.yaml is unchanged (>= 1.16.0-0); AICR's >= 1.32.4 floor in base.yaml remains binding.
  • Bundle render verified for h100-eks-training, gb200-eks-training, and AKS leaves: rendered gpu-operator/values.yaml shows version: 580.126.20, ccManager.enabled: false, the correct per-cloud rdma.enabled + useHostMofed, and (for gb200) kernelModuleConfig.name: nvidia-kernel-module-params.
  • DRA values render with both controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version and kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version set to v26.3.1. Confirmed on aicr2 and the yljtrxpmzu GB200 cluster that helm upgrade on the new bundle force-rolls both DRA pods.
  • The new kubectl wait block in deploy.sh.tmpl uses a 15-min timeout and WARNING; proceeding anyway semantics rather than a hard fail, so a misconfigured cluster (e.g. nodes never reach upgrade-done) still gets the rollout-restart attempt. The guard around GPU_NODE_COUNT > 0 avoids kubectl wait's "no matching resources" error on clusters with no GPU nodes yet (initial install ordering).
  • BOM diff is narrow: only nvcr.io/nvidia/driver:580.105.08:580.126.20 moves, because the registry-driven BOM was already rendering against v26.3.1 (the BOM-vs-leaf drift this PR also resolves at the leaf layer; the BOM tool itself is rewritten in chore(recipes): make resolved recipes the single source of truth for chart versions #966).
  • tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml updated to match the new EKS expectation (rdma.enabled: false), with a comment explaining the EFA / aws-ofi-nccl rationale.
  • Off-overlay implication: anyone running AICR on bare-metal MOFED hosts without using the AKS overlay (which ships network-operator) would need to opt in via --set gpuoperator:driver.rdma.enabled=true and --set gpuoperator:driver.rdma.useHostMofed=true. Documented inline in recipes/components/gpu-operator/values.yaml.

Testing

make qualify   # PASS — tests + lint + e2e + scan + repo-specific checks
make bom-docs  # regenerated docs/user/container-images.md

Live cluster validation:

aicr2 (EKS H100 p5.48xlarge, ad-hoc dev cluster):

  • Pre-upgrade driver: 580.105.08. Post-upgrade: 580.126.20. All GPU pods Running, nvidia-cuda-validator Completed.
  • Full aicr validate: deployment passed (4/4, 28.8s), performance passed (inference-perf: 39,776.81 tok/s, TTFT p99 122.74 ms, 3m 15.7s), conformance passed (1m 49.5s).

yljtrxpmzu (EKS GB200 p6e-gb200.36xlarge × 2, DGX Cloud dev cluster):

  • Validated the upgrade-time timing race that motivated the new kubectl wait block: the chart-version annotation re-rolled DRA pods correctly, but a subsequent driver-module reload on the second GB200 node left the kubelet-plugin pod with stale NVML, producing invalid CDI Spec: empty device edits on the next aicr validate --phase performance run. Adding the wait closes that gap.
  • After the new wait block: deployment passed (4/4, 26.5s), performance passed (16,705.82 tok/s, TTFT p99 233.31 ms, 2m 21.8s — against the placeholder thresholds added in feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo #977).

GPU CI follow-up: H100 nvkind run is required per #894 DoD; will trigger via CI label after maintainer review.

Risk Assessment

  • Medium — Single coherent chart bump touching every leaf, plus a localized deploy.sh template change that only fires for clusters that have any nvidia.com/gpu.present=true nodes. Drivers move 580.105.08 → 580.126.20 (node-side change; cordon/reboot on upgrade). ccManager and driver.rdma.enabled explicitly disabled to preserve behavior or to mitigate v26.3.1 regressions. AKS preserves the prior rdma.enabled: true behavior via explicit per-cloud overrides plus the useHostMofed: true correctness fix. DRA annotation closes the cross-deployer GPU-allocation regression introduced by the driver bump; the new kubectl wait closes the residual same-deployer timing race.

Rollout notes:

  • Existing deployments using the prior chart should follow the standard gpu-operator upgrade path (the chart's k8s-driver-manager handles driver replacement with maxParallelUpgrades = 5). No CRD migrations are introduced.
  • For any user running AICR on bare-metal MOFED hosts off-overlay, the global rdma.enabled flip from true → false is a behavior change. They can opt back in with --set gpuoperator:driver.rdma.enabled=true (and add useHostMofed=true if MOFED is on the host). Documented in recipes/components/gpu-operator/values.yaml.
  • The new kubectl wait adds at most 15 min to deploy.sh on clusters whose nodes never reach upgrade-done (best-effort; logs WARNING then proceeds). On a healthy cluster the wait is sub-second once migration completes.

Known follow-ups

Checklist

  • Tests pass locally (make qualify)
  • Linter passes (make lint — included in qualify)
  • I did not skip/disable tests to make CI green
  • I updated docs if user-facing behavior changed (docs/user/container-images.md regenerated; rationale comments in values.yaml, values-aks*.yaml, nvidia-dra-driver-gpu/values.yaml, and deploy.sh.tmpl)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested review from a team as code owners May 19, 2026 14:49
@yuanchen8911 yuanchen8911 added enhancement New feature or request area/recipes labels May 19, 2026
@yuanchen8911 yuanchen8911 changed the title chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 WIP: chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 May 19, 2026
@yuanchen8911 yuanchen8911 marked this pull request as draft May 19, 2026 14:49
@github-actions
Copy link
Copy Markdown
Contributor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR upgrades the GPU Operator Helm chart to v26.3.1 across overlays, pins the default driver version to 580.126.20, disables RDMA by default in component values and adds ccManager.enabled: false, adjusts GB200 overlay driver pins (inference and training differ), introduces AKS-specific RDMA host-MOFED settings, updates a test assertion to disable RDMA for EKS training, and syncs the container-images documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested reviewers

  • mchmarny
  • lalitadithya
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and directly related to the changeset, detailing the gpu-operator chart bump from v26.3.0/v25.10.1 to v26.3.1, driver version update to 580.126.20, and regression mitigations.
Title check ✅ Passed The title directly and accurately summarizes the main changes: bumping gpu-operator chart to v26.3.1 and driver to 580.126.20.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 59-80: Add an automated CI validation that fails the PR if the
aicr.nvidia.com/gpu-operator-chart-version annotation in the values.yaml (under
controller and kubeletPlugin) is not equal to the current gpu-operator chart
version; implement the check as a simple script or GitHub Action step that reads
the gpu-operator chart version (from the chart's Chart.yaml or a central source
of truth) and compares it to the annotation value, returning a non-zero exit
code with a clear message when mismatched so authors must update the annotation
(track this automation work and reference issue `#973` in the CI job description).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 14cad807-a6e1-4952-af30-e6ca238c6b82

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2e66e and 1cefd48.

📒 Files selected for processing (11)
  • docs/user/container-images.md
  • recipes/components/gpu-operator/values-aks-training.yaml
  • recipes/components/gpu-operator/values-aks.yaml
  • recipes/components/gpu-operator/values.yaml
  • recipes/components/nvidia-dra-driver-gpu/values.yaml
  • recipes/overlays/aks.yaml
  • recipes/overlays/base.yaml
  • recipes/overlays/gb200-eks-inference.yaml
  • recipes/overlays/gb200-eks-training.yaml
  • recipes/overlays/oke.yaml
  • tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml
💤 Files with no reviewable changes (2)
  • recipes/overlays/gb200-eks-training.yaml
  • recipes/overlays/gb200-eks-inference.yaml

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 1cefd48 to 21778df Compare May 19, 2026 18:33
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 21778df to 9dc9fb9 Compare May 19, 2026 20:32
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 19, 2026 22:08
@yuanchen8911 yuanchen8911 marked this pull request as draft May 19, 2026 22:09
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 9dc9fb9 to bbb6ced Compare May 19, 2026 22:13
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from bbb6ced to bccefa7 Compare May 19, 2026 22:40
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 97c0698 to 9cae76f Compare May 20, 2026 18:35
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Deployed and validated end-to-end on three clusters:

  • EKS H100 (aicr2)
  • EKS GB200
  • GKE H100 (COS) — both inference (Dynamo) and training (Kubeflow) stacks

aicr validate --phase deployment passes on all three; --phase performance passes where applicable (inference-perf for Dynamo, NCCL all-reduce-bw for training).

Rebased onto latest main and removing WIP — ready for review.

@yuanchen8911 yuanchen8911 changed the title WIP: chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 May 20, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 20, 2026 18:35
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch 2 times, most recently from 7ec5dc9 to 570bf0a Compare May 20, 2026 20:26
@mchmarny mchmarny enabled auto-merge (squash) May 21, 2026 01:42
…126.20 (NVIDIA#894)

Lift every leaf recipe to the current upstream stable gpu-operator
chart (v26.3.1, published 2026-04-18). The registry already defaulted
to v26.3.1, but base.yaml shadowed it with v25.10.1 and the aks/oke
overlays held v26.3.0 — so leaves actually shipped v25.10.1 / v26.3.0
images even though the BOM advertised v26.3.1.

- recipes/overlays/base.yaml:    v25.10.1 -> v26.3.1
- recipes/overlays/aks.yaml:     v26.3.0  -> v26.3.1
- recipes/overlays/oke.yaml:     v26.3.0  -> v26.3.1

Driver pin moves to NVIDIA's v26.3.1-qualified version
(580.126.20). This also matches the GB200+EFA floor, so the
per-overlay driver.version: 580.126.20 in gb200-eks-{training,
inference}.yaml becomes redundant and is dropped; kernelModuleConfig
stays.

Upstream v26.3.x flipped ccManager to enabled=true,defaultMode=on by
default. Pin ccManager.enabled: false in the global values.yaml so we
do not silently turn on Confidential Compute Manager on clusters
without CC-capable hardware. Revisit when AICR adds explicit CC
support.

Flip driver.rdma.enabled global default true -> false. v26.3.1's
driver-validation init container now hard-gates on lsmod | grep
nvidia_peermem; the module only loads against Mellanox MOFED
symbols, so AWS EFA (EKS p4d/p5/p5e) and Linode hosts trap the
validator forever even though NCCL on EFA uses aws-ofi-nccl /
libfabric and does not need nvidia_peermem. Re-enable rdma
explicitly in components/gpu-operator/values-aks.yaml and
values-aks-training.yaml — AKS overlays deploy network-operator
which installs MOFED on ND-series InfiniBand nodes, so peermem
has Mellanox symbols to bind against. Set driver.rdma.useHostMofed:
true alongside, so the driver container binds nvidia_peermem
against the network-operator-installed host MOFED instead of
building its own bundled MOFED inside the container. GKE-cos /
OKE / kind keep driver.enabled: false (host-managed driver) so
the flag is moot; EKS / LKE inherit the new safe default.
cuj1-training chainsaw assertion updated to match.

Add gpu-operator-chart-version pod annotation to the DRA driver
in recipes/components/nvidia-dra-driver-gpu/values.yaml
(kubeletPlugin and controller). gpu-operator's k8s-driver-manager
reloads the host NVIDIA kernel modules during a driver bump but
does NOT restart the sibling nvidia-dra-driver-gpu DaemonSet
because its chart template is unchanged. The DRA kubelet plugin
loads libnvidia-ml.so at pod start and pins to the running driver
version, so a kernel-module reload leaves the pod with a stale
NVML handle; CDI spec generation then fails with Driver/library
version mismatch and DRA-allocated workloads stay in
ContainerCreating. Bumping the annotation value on every
gpu-operator chart bump forces a DaemonSet re-roll across every
deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against
the now-running driver. The deploy.sh Helm-deployer template
already restarts the kubelet plugin post-install; the annotation
closes the gap for GitOps-deployer artifacts. Manual coupling of
the annotation value to the chart version is a known maintenance
gap tracked in issue NVIDIA#973 (bundler-derived annotation as the
durable fix). Verified live on aicr2: applying the new bundle
rolled both DRA pods cleanly, and aicr validate --phase
performance passed end-to-end (inference-perf 39.7k tok/s,
TTFT p99 122ms).

Gate the existing post-install DRA kubelet-plugin restart on
gpu-operator's per-node driver migration completing. The
annotation-based re-roll above and the existing kubectl rollout
restart in deploy.sh both fire at helm-upgrade time, but
k8s-driver-manager runs the per-node module reload asynchronously
after `helm upgrade gpu-operator` returns. On a multi-GPU-node
cluster, the DRA plugin can re-roll on a node whose driver
migration has not yet started, get its NVML handle stuck to the
pre-migration state, and produce "invalid CDI Spec: empty device
edits" once the modules reload underneath it. Reproduced live on
a GB200 EKS cluster (yljtrxpmzu) during PR validation: the
chart-version annotation re-rolled the DRA pods correctly, but
the second GB200 node's driver migration ran after that, leaving
its kubelet-plugin pod with a stale NVML view. Adding a kubectl
wait for nvidia.com/gpu-driver-upgrade-state=upgrade-done on every
GPU node before the existing post-install rollout-restart closes
this timing race for the default Helm-deployer flow. See NVIDIA#973 for
broader cross-deployer coverage (a Helm post-install hook on
gpu-operator-post would cover helmfile/Flux/Argo CD too).

BOM (docs/user/container-images.md) regenerated; only the driver
image moves, since the registry-driven BOM was already on v26.3.1.
The BOM-vs-leaf drift class itself is tracked separately in NVIDIA#966.
@yuanchen8911 yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 8e73cc2 to 7c920d0 Compare May 21, 2026 13:57
@mchmarny mchmarny merged commit 496e2db into NVIDIA:main May 21, 2026
142 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
…ote chart name (NVIDIA#1034)

Two argocd-helm OCI publishing bugs surfaced by Codex review against
the post-NVIDIA#1032 contract.

P1 — Path-based child Applications resolve to a non-existent artifact

NVIDIA#1032 changed the --set repoURL contract so callers pass only the
parent namespace (e.g., oci://reg/org); the parent Application
appends .Chart.Name via its separate source.chart field. The
parent-App template at parentAppTemplate / line 407 implements this
correctly, but the path-based child-App template in
injectValuesIntoSingleSource at line 703 was left emitting only
.Values.repoURL. Argo CD's generic OCI source (used by path-based
children since they have no source.chart) treats spec.source.repoURL
as the full artifact reference and resolves it directly, so under
the new contract a child source pointed at oci://reg/org:tag — an
artifact that doesn't exist — and the child Application failed to
sync.

Fix: append /{{ .Chart.Name }} to the rendered repoURL value so the
assembled URL matches the artifact the parent App's repoURL/chart:tag
triple resolves to. The error-message text in the required directive
is updated to say "this template appends .Chart.Name" (the path-based
template is doing the appending, not the parent App).

P2 — Unquoted name and version in generated Chart.yaml

writeChartYAML emitted "name: %s" / "version: %s" with raw
fmt.Fprintf. (*Reference).ChartName() returns path.Base(Repository),
so a valid OCI artifact path whose last segment is a YAML reserved
scalar — "null", "true", "false", "yes", "no", numeric strings —
produced an unquoted YAML reserved word as the chart's name. Helm's
loader then parsed name: null as YAML null, chart.Metadata.Name
became empty, and helm package / helm push rejected the chart with
"chart.metadata.name is required". Same trap for the version field
("1.0" parses as float).

Fix: emit both via %q so the values round-trip as YAML strings. OCI
artifact path segments are constrained to printable ASCII by the
docker reference grammar, which is the safe charset for Go's %q.

Tests

- TestInjectValuesIntoSingleSource_AppendsChartName — asserts the
  rendered path-based child source.repoURL appends /{{ .Chart.Name }}
  after the user-supplied .Values.repoURL.
- TestWriteChartYAML_QuotesYAMLReservedScalarsAsName — table-driven
  case across null / true / false / numeric / yes / normal name; for
  each, writes Chart.yaml and verifies yaml.Unmarshal round-trips
  name back as the expected string.
- TestHelmTemplate_RendersWithSetRepoURL — updated to exercise the
  new contract: --set repoURL=oci://reg/org (parent namespace only).
  Asserts the parent App's repoURL equals the parent namespace and
  the child App's repoURL equals parent-namespace + /aicr-bundle.
- TestGenerate_CustomChartName and the existing Chart.yaml-version
  assertion updated to expect quoted scalars.
- Golden templates and Chart.yaml fixtures regenerated.

Closes NVIDIA#1034
Related: NVIDIA#1032 (the contract change this PR completes), NVIDIA#1019, NVIDIA#965
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized
gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager
to finish migrating drivers on every managed GPU node, then restarts
the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle
rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the
default Helm deployer (./deploy.sh) to every other deployer — helmfile,
Flux, Argo CD, argocd-helm — by encoding the same logic as a Helm hook
inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute
Helm-defined hooks as part of the release lifecycle, so users who
consume the bundle via GitOps reconcilers gain the same protection
without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:
  - Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos)
    by absence of nvidia-driver-daemonset and skips the migration wait.
  - Waits up to 15m for every managed GPU node to reach the
    nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
  - Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet
    and waits up to 5m for the new generation to be ready.
  - Exits 0 if the DRA DaemonSet is not yet present (fresh install
    before nvidia-dra-driver-gpu's chart has applied), since the
    rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:
  - ServiceAccount in the gpu-operator namespace.
  - ClusterRole for cluster-wide DaemonSet list/get and Node
    list/watch (needed to discover nvidia-driver-daemonset
    cross-namespace and wait on per-node upgrade-state labels).
  - Role+RoleBinding in the nvidia-dra-driver namespace scoped to
    patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once the hook has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output
for at least two deployers), full make qualify, and cluster verification
on a fresh gpu-operator chart bump under a GitOps deployer all still TBD
before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized
gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager
to finish migrating drivers on every managed GPU node, then restarts
the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle
rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the
default Helm deployer (./deploy.sh) to every other deployer — helmfile,
Flux, Argo CD, argocd-helm — by encoding the same logic as a Helm hook
inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute
Helm-defined hooks as part of the release lifecycle, so users who
consume the bundle via GitOps reconcilers gain the same protection
without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:
  - Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos)
    by absence of nvidia-driver-daemonset and skips the migration wait.
  - Waits up to 15m for every managed GPU node to reach the
    nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
  - Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet
    and waits up to 5m for the new generation to be ready.
  - Exits 0 if the DRA DaemonSet is not yet present (fresh install
    before nvidia-dra-driver-gpu's chart has applied), since the
    rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:
  - ServiceAccount in the gpu-operator namespace.
  - ClusterRole for cluster-wide DaemonSet list/get and Node
    list/watch (needed to discover nvidia-driver-daemonset
    cross-namespace and wait on per-node upgrade-state labels).
  - Role+RoleBinding in the nvidia-dra-driver namespace scoped to
    patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once the hook has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output
for at least two deployers), full make qualify, and cluster verification
on a fresh gpu-operator chart bump under a GitOps deployer all still TBD
before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized
gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager
to finish migrating drivers on every managed GPU node, then restarts
the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle
rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the
default Helm deployer (./deploy.sh) to every other deployer — helmfile,
Flux, Argo CD, argocd-helm — by encoding the same logic as a Helm hook
inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute
Helm-defined hooks as part of the release lifecycle, so users who
consume the bundle via GitOps reconcilers gain the same protection
without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:
  - Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos)
    by absence of nvidia-driver-daemonset and skips the migration wait.
  - Waits up to 15m for every managed GPU node to reach the
    nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
  - Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet
    and waits up to 5m for the new generation to be ready.
  - Exits 0 if the DRA DaemonSet is not yet present (fresh install
    before nvidia-dra-driver-gpu's chart has applied), since the
    rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:
  - ServiceAccount in the gpu-operator namespace.
  - ClusterRole for cluster-wide DaemonSet list/get and Node
    list/watch (needed to discover nvidia-driver-daemonset
    cross-namespace and wait on per-node upgrade-state labels).
  - Role+RoleBinding in the nvidia-dra-driver namespace scoped to
    patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once the hook has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output
for at least two deployers), full make qualify, and cluster verification
on a fresh gpu-operator chart bump under a GitOps deployer all still TBD
before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized
gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager
to finish migrating drivers on every managed GPU node, then restarts
the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle
rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the
default Helm deployer (./deploy.sh) to every other deployer — helmfile,
Flux, Argo CD, argocd-helm — by encoding the same logic as a Helm hook
inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute
Helm-defined hooks as part of the release lifecycle, so users who
consume the bundle via GitOps reconcilers gain the same protection
without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:
  - Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos)
    by absence of nvidia-driver-daemonset and skips the migration wait.
  - Waits up to 15m for every managed GPU node to reach the
    nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
  - Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet
    and waits up to 5m for the new generation to be ready.
  - Exits 0 if the DRA DaemonSet is not yet present (fresh install
    before nvidia-dra-driver-gpu's chart has applied), since the
    rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:
  - ServiceAccount in the gpu-operator namespace.
  - ClusterRole for cluster-wide DaemonSet list/get and Node
    list/watch (needed to discover nvidia-driver-daemonset
    cross-namespace and wait on per-node upgrade-state labels).
  - Role+RoleBinding in the nvidia-dra-driver namespace scoped to
    patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once the hook has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output
for at least two deployers), full make qualify, and cluster verification
on a fresh gpu-operator chart bump under a GitOps deployer all still TBD
before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits
for gpu-operator's k8s-driver-manager to finish migrating drivers on
every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet
plugin DaemonSet so its NVML handle rebinds to the post-migration
driver state. Extends the kubectl wait + rollout-restart block already
in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer
the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not
just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism — dual approach across bundler paths:

  1. Non-vendored localformat (default for Helm/helmfile/Argo CD/
     argocd-helm). local_helm.stripHelmHooks unconditionally removes
     sync-phase helm.sh/hook annotations because Argo CD's
     syncPolicy.automated would otherwise treat the resource as a
     PostSync hook that never fires under path-based sources. After
     stripping, the Job ships as a regular chart resource; re-fire on
     upgrade is achieved by keying the Job name on the parent
     gpu-operator chart version, which the bundler injects into the
     synthesized chart's values map at
     gpu-operator._aicrParentChartVersion (see
     DefaultBundler.injectDRAParentChartVersionValue). The values-map
     path is used in preference to .Chart.Version because the
     synthesized chart's Chart.yaml hardcodes 0.1.0 (which would
     defeat re-fire) and because flux-oci's source-controller
     suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s
     label values).

  2. Vendored localformat (--vendor-charts). vendor_folder.go calls
     injectPostInstallHooks on each mixed recipe-side manifest, which
     auto-tags resources with helm.sh/hook: post-install only UNLESS
     the document already declares helm.sh/hook (vendor_wrapper.go
     injectHookOnDoc:223). Without an author-declared hook, the
     auto-injected post-install-only annotation means the Job runs on
     first install and is silently skipped on every subsequent
     helm upgrade — exactly when the stale-NVML race fires. The Job
     therefore author-declares
       helm.sh/hook: post-install,post-upgrade
       helm.sh/hook-delete-policy: before-hook-creation
     so vendor_wrapper honors the value and the Job fires on both
     phases. stripHelmHooks on the non-vendored path strips both
     phases together (both are in syncPhaseHooks), so this declaration
     is inert there and the version-keyed name from path 1 remains
     the re-fire mechanism.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

  - Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install), driver DaemonSet absent
    (host-managed driver mode), or zero managed nodes labeled
    nvidia.com/gpu.deploy.driver=true.
  - Exits 1 on every other failure: any classification kubectl-get
    fails, kubectl-wait for upgrade-done times out, rollout-restart
    fails, or rollout-status times out within 5m.
  - Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1)
    || { echo "..."; exit 1; }) to distinguish "API call failed" from
    "result is empty" — POSIX sh lacks `set -o pipefail`, so the
    previous `kubectl ... | grep ... | head` patterns silently masked
    kubectl exit codes and the warning-only timeouts could exit 0
    after an incomplete mitigation, contradicting the safety story.
  - backoffLimit: 2 on the Job is load-bearing: a transient apiserver
    hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

  - daemonsets.apps — get/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  - nodes — get/list/watch (for the upgrade-done label wait).
  - pods  — get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs
before nvidia-dra-driver-gpu's chart, so a Role inside the
nvidia-dra-driver namespace would fail on a fresh cluster (namespace
doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list
digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once this Job has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 27, 2026
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits
for gpu-operator's k8s-driver-manager to finish migrating drivers on
every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet
plugin DaemonSet so its NVML handle rebinds to the post-migration
driver state. Extends the kubectl wait + rollout-restart block already
in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer
the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not
just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism — dual approach across bundler paths:

  1. Non-vendored localformat (default for Helm/helmfile/Argo CD/
     argocd-helm). local_helm.stripHelmHooks unconditionally removes
     sync-phase helm.sh/hook annotations because Argo CD's
     syncPolicy.automated would otherwise treat the resource as a
     PostSync hook that never fires under path-based sources. After
     stripping, the Job ships as a regular chart resource; re-fire on
     upgrade is achieved by keying the Job name on the parent
     gpu-operator chart version, which the bundler injects into the
     synthesized chart's values map at
     gpu-operator._aicrParentChartVersion (see
     DefaultBundler.injectDRAParentChartVersionValue). The values-map
     path is used in preference to .Chart.Version because the
     synthesized chart's Chart.yaml hardcodes 0.1.0 (which would
     defeat re-fire) and because flux-oci's source-controller
     suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s
     label values).

  2. Vendored localformat (--vendor-charts). vendor_folder.go calls
     injectPostInstallHooks on each mixed recipe-side manifest, which
     auto-tags resources with helm.sh/hook: post-install only UNLESS
     the document already declares helm.sh/hook (vendor_wrapper.go
     injectHookOnDoc:223), and auto-assigns sequential hook-weights
     starting at postInstallHookWeightBase=100. Two consequences for
     this manifest:
       (a) auto-injected post-install hooks fire only on first install
           and are silently skipped on every helm upgrade — exactly
           when the stale-NVML race occurs.
       (b) if only the Job author-declares a hook, the SA / Cluster
           Role / ClusterRoleBinding still get auto-injected weights
           100/101/102 while the Job takes Helm's default weight 0,
           putting the Job BEFORE its ServiceAccount/RBAC in install
           order and failing with "ServiceAccount not found".
     Both are addressed by author-declaring helm.sh/hook:
     post-install,post-upgrade plus before-hook-creation on ALL FOUR
     resources with ascending hook-weights:
         ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10
     vendor_wrapper honors the author-declared hooks; the explicit
     weights guarantee the Job's prerequisites exist before it runs;
     post-upgrade re-fires the Job (and refreshes the RBAC) on every
     helm upgrade. stripHelmHooks on the non-vendored path strips ALL
     of these annotations (every phase here is in syncPhaseHooks at
     hooks.go:109-120), so this declaration is inert there and Helm's
     default kind-based ordering (SA → ClusterRole → ClusterRole
     Binding → Job) plus the chart-version-keyed Job name (path 1)
     provide the same install order and re-fire mechanism without
     hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents
Helm's three-way merge from pruning prior-version Jobs mid-rollout.
It is a no-op for the vendored hook path (Helm does not apply
resource-policy to hook resources); retention there is governed by
helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook
resource ONLY when a new hook with the same name is created. SA / CR /
CRB names are stable so they cycle on every install/upgrade (idempotent
delete-and-recreate); Job names are chart-version-keyed so prior-
version Jobs survive cross-version upgrades as Succeeded records.
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of
GitOps desired state, so a Kubernetes TTL delete would make the
resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

  - Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install), driver DaemonSet absent
    (host-managed driver mode), or zero managed nodes labeled
    nvidia.com/gpu.deploy.driver=true.
  - Exits 1 on every other failure: any classification kubectl-get
    fails, kubectl-wait for upgrade-done times out, rollout-restart
    fails, or rollout-status times out within 5m.
  - Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1)
    || { echo "..."; exit 1; }) to distinguish "API call failed" from
    "result is empty" — POSIX sh lacks `set -o pipefail`, so the
    previous `kubectl ... | grep ... | head` patterns silently masked
    kubectl exit codes and the warning-only timeouts could exit 0
    after an incomplete mitigation.
  - backoffLimit: 2 on the Job is load-bearing: a transient apiserver
    hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

  - daemonsets.apps — get/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  - nodes — get/list/watch (for the upgrade-done label wait).
  - pods  — get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs
before nvidia-dra-driver-gpu's chart, so a Role inside the
nvidia-dra-driver namespace would fail on a fresh cluster (namespace
doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list
digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once this Job has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits
for gpu-operator's k8s-driver-manager to finish migrating drivers on
every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet
plugin DaemonSet so its NVML handle rebinds to the post-migration
driver state. Extends the kubectl wait + rollout-restart block already
in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer
the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not
just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism — dual approach across bundler paths:

  1. Non-vendored localformat (default for Helm/helmfile/Argo CD/
     argocd-helm). local_helm.stripHelmHooks unconditionally removes
     sync-phase helm.sh/hook annotations because Argo CD's
     syncPolicy.automated would otherwise treat the resource as a
     PostSync hook that never fires under path-based sources. After
     stripping, the Job ships as a regular chart resource; re-fire on
     upgrade is achieved by keying the Job name on the parent
     gpu-operator chart version, which the bundler injects into the
     synthesized chart's values map at
     gpu-operator._aicrParentChartVersion (see
     DefaultBundler.injectDRAParentChartVersionValue). The values-map
     path is used in preference to .Chart.Version because the
     synthesized chart's Chart.yaml hardcodes 0.1.0 (which would
     defeat re-fire) and because flux-oci's source-controller
     suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s
     label values).

  2. Vendored localformat (--vendor-charts). vendor_folder.go calls
     injectPostInstallHooks on each mixed recipe-side manifest, which
     auto-tags resources with helm.sh/hook: post-install only UNLESS
     the document already declares helm.sh/hook (vendor_wrapper.go
     injectHookOnDoc:223), and auto-assigns sequential hook-weights
     starting at postInstallHookWeightBase=100. Two consequences for
     this manifest:
       (a) auto-injected post-install hooks fire only on first install
           and are silently skipped on every helm upgrade — exactly
           when the stale-NVML race occurs.
       (b) if only the Job author-declares a hook, the SA / Cluster
           Role / ClusterRoleBinding still get auto-injected weights
           100/101/102 while the Job takes Helm's default weight 0,
           putting the Job BEFORE its ServiceAccount/RBAC in install
           order and failing with "ServiceAccount not found".
     Both are addressed by author-declaring helm.sh/hook:
     post-install,post-upgrade plus before-hook-creation on ALL FOUR
     resources with ascending hook-weights:
         ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10
     vendor_wrapper honors the author-declared hooks; the explicit
     weights guarantee the Job's prerequisites exist before it runs;
     post-upgrade re-fires the Job (and refreshes the RBAC) on every
     helm upgrade. stripHelmHooks on the non-vendored path strips ALL
     of these annotations (every phase here is in syncPhaseHooks at
     hooks.go:109-120), so this declaration is inert there and Helm's
     default kind-based ordering (SA → ClusterRole → ClusterRole
     Binding → Job) plus the chart-version-keyed Job name (path 1)
     provide the same install order and re-fire mechanism without
     hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents
Helm's three-way merge from pruning prior-version Jobs mid-rollout.
It is a no-op for the vendored hook path (Helm does not apply
resource-policy to hook resources); retention there is governed by
helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook
resource ONLY when a new hook with the same name is created. SA / CR /
CRB names are stable so they cycle on every install/upgrade (idempotent
delete-and-recreate); Job names are chart-version-keyed so prior-
version Jobs survive cross-version upgrades as Succeeded records.
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of
GitOps desired state, so a Kubernetes TTL delete would make the
resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

  - Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install), driver DaemonSet absent
    (host-managed driver mode), or zero managed nodes labeled
    nvidia.com/gpu.deploy.driver=true.
  - Exits 1 on every other failure: any classification kubectl-get
    fails, kubectl-wait for upgrade-done times out, rollout-restart
    fails, or rollout-status times out within 5m.
  - Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1)
    || { echo "..."; exit 1; }) to distinguish "API call failed" from
    "result is empty" — POSIX sh lacks `set -o pipefail`, so the
    previous `kubectl ... | grep ... | head` patterns silently masked
    kubectl exit codes and the warning-only timeouts could exit 0
    after an incomplete mitigation.
  - backoffLimit: 2 on the Job is load-bearing: a transient apiserver
    hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

  - daemonsets.apps — get/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  - nodes — get/list/watch (for the upgrade-done label wait).
  - pods  — get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs
before nvidia-dra-driver-gpu's chart, so a Role inside the
nvidia-dra-driver namespace would fail on a fresh cluster (namespace
doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list
digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once this Job has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).

Cross-review iteration 3 — cross-deployer gating + values-only re-fire:

  - [P1] Cross-deployer install-time gating wired across every
    bundler path. The prior revision relied on the Job's eventual
    rollout-restart to self-correct the transient race; without
    install-time gating, downstream releases (notably
    nvidia-dra-driver-gpu) could install against stale-NVML driver
    state for the 15-20m window between gpu-operator-post applying
    its chart resources and the Job's kubectl wait + rollout
    restart returning. Fix:
      - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name
        case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m
        + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs +
        COMPONENT_FORCE_WAIT=true (overrides --no-wait — the flag
        is a readiness-convenience knob, not a correctness-bypass
        knob; components that exist specifically to close a known
        race must always wait). Vendored gpu-operator case lifted
        to 25m for the --vendor-charts path where the Job ships as
        a hook on the primary chart.
      - pkg/bundler/deployer/helmfile/releases.go: new
        releaseOverrides map keyed by synthesized folder name;
        gpu-operator-post -> {timeout: 25m, waitForJobs: true};
        gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors
        the deploy.sh case switch.
      - pkg/bundler/deployer/flux/helm.go: new Timeout field on
        HelmReleaseData / ChartRefHelmReleaseData +
        releaseTimeoutOverrides map (gpu-operator-post -> 25m).
        helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl
        emit spec.timeout conditionally. helm-controller already
        waits for hook Jobs by default, so no waitForJobs knob is
        needed on this path — only the outer timeout.
      - Argo CD path is naturally gated by sync-wave + Job-Healthy
        default: app-of-apps waits for each child Application to
        reach Healthy, which requires the Job's status.succeeded.
        No explicit wiring needed.

  - [medium] Driver-version-only overrides now re-fire the Job on
    every deployer. The prior key was _aicrParentChartVersion (=
    gpu-operator chart version) only. On non-hook deployers
    (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes
    the upgrade-hook annotations), the Job is a regular chart
    resource — kubectl apply on the same name is a no-op, so
    --set gpuoperator:driver.version=<X> did NOT re-run the
    migration mitigation when the gpu-operator chart pin was
    unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion
    is injected alongside the parent-chart version (reading the
    merged componentValues at driver.version, so any source —
    registry default / overlay / user --set — flows through).
    The Job-name slug becomes "<chartSlug>-d<driverSlug>" when
    driver.version is set; host-managed-driver overlays
    (driver.enabled=false, version unset) keep the original
    1-component shape. The chart-version label
    (aicr.nvidia.com/gpu-operator-chart-version) stays semantically
    tied to the chart pin only.

  - [low] Empty gpu-operator.Version no longer produces an
    unbundleable manifest. The prior helper early-returned on
    empty Version, but the hook ships unconditionally on the
    gpu-operator componentRef, so the template would render
    "<nil>" into the Job name and chart-version label — invalid
    DNS-1123 / label values. Fix: fall through with
    NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so
    operators get a debuggable signal of the underlying recipe
    gap.

  - User-facing doc: --no-wait correctness-gate exception
    documented in docs/user/cli-reference.md (deploy-script
    section) and pkg/bundler/deployer/helm/templates/README.md.tmpl
    so operators reading the doc know the flag does not bypass
    gpu-operator-post.

  - Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey,
    _DriverVersionAbsent, and _EmptyVersionFallback added to pin
    the new behavior. All existing bundler / manifest / recipes
    tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe:
gpu-operator componentValues carries
  _aicrParentChartVersion: 26.3.1
  _aicrDriverVersion: 580.126.20
and the rendered Job name is
  aicr-dra-rollout-hook-26-3-1-d580-126-20-r1
— a driver.version-only override now produces a uniquely-named
Job and triggers re-fire on every deployer.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits
for gpu-operator's k8s-driver-manager to finish migrating drivers on
every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet
plugin DaemonSet so its NVML handle rebinds to the post-migration
driver state. Extends the kubectl wait + rollout-restart block already
in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer
the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not
just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism — dual approach across bundler paths:

  1. Non-vendored localformat (default for Helm/helmfile/Argo CD/
     argocd-helm). local_helm.stripHelmHooks unconditionally removes
     sync-phase helm.sh/hook annotations because Argo CD's
     syncPolicy.automated would otherwise treat the resource as a
     PostSync hook that never fires under path-based sources. After
     stripping, the Job ships as a regular chart resource; re-fire on
     upgrade is achieved by keying the Job name on the parent
     gpu-operator chart version, which the bundler injects into the
     synthesized chart's values map at
     gpu-operator._aicrParentChartVersion (see
     DefaultBundler.injectDRAParentChartVersionValue). The values-map
     path is used in preference to .Chart.Version because the
     synthesized chart's Chart.yaml hardcodes 0.1.0 (which would
     defeat re-fire) and because flux-oci's source-controller
     suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s
     label values).

  2. Vendored localformat (--vendor-charts). vendor_folder.go calls
     injectPostInstallHooks on each mixed recipe-side manifest, which
     auto-tags resources with helm.sh/hook: post-install only UNLESS
     the document already declares helm.sh/hook (vendor_wrapper.go
     injectHookOnDoc:223), and auto-assigns sequential hook-weights
     starting at postInstallHookWeightBase=100. Two consequences for
     this manifest:
       (a) auto-injected post-install hooks fire only on first install
           and are silently skipped on every helm upgrade — exactly
           when the stale-NVML race occurs.
       (b) if only the Job author-declares a hook, the SA / Cluster
           Role / ClusterRoleBinding still get auto-injected weights
           100/101/102 while the Job takes Helm's default weight 0,
           putting the Job BEFORE its ServiceAccount/RBAC in install
           order and failing with "ServiceAccount not found".
     Both are addressed by author-declaring helm.sh/hook:
     post-install,post-upgrade plus before-hook-creation on ALL FOUR
     resources with ascending hook-weights:
         ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10
     vendor_wrapper honors the author-declared hooks; the explicit
     weights guarantee the Job's prerequisites exist before it runs;
     post-upgrade re-fires the Job (and refreshes the RBAC) on every
     helm upgrade. stripHelmHooks on the non-vendored path strips ALL
     of these annotations (every phase here is in syncPhaseHooks at
     hooks.go:109-120), so this declaration is inert there and Helm's
     default kind-based ordering (SA → ClusterRole → ClusterRole
     Binding → Job) plus the chart-version-keyed Job name (path 1)
     provide the same install order and re-fire mechanism without
     hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents
Helm's three-way merge from pruning prior-version Jobs mid-rollout.
It is a no-op for the vendored hook path (Helm does not apply
resource-policy to hook resources); retention there is governed by
helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook
resource ONLY when a new hook with the same name is created. SA / CR /
CRB names are stable so they cycle on every install/upgrade (idempotent
delete-and-recreate); Job names are chart-version-keyed so prior-
version Jobs survive cross-version upgrades as Succeeded records.
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of
GitOps desired state, so a Kubernetes TTL delete would make the
resource missing relative to the rendered manifest and trigger an
immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

  - Exits 0 only when state is positively determined as a no-op:
    DRA DaemonSet absent (fresh install), driver DaemonSet absent
    (host-managed driver mode), or zero managed nodes labeled
    nvidia.com/gpu.deploy.driver=true.
  - Exits 1 on every other failure: any classification kubectl-get
    fails, kubectl-wait for upgrade-done times out, rollout-restart
    fails, or rollout-status times out within 5m.
  - Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1)
    || { echo "..."; exit 1; }) to distinguish "API call failed" from
    "result is empty" — POSIX sh lacks `set -o pipefail`, so the
    previous `kubectl ... | grep ... | head` patterns silently masked
    kubectl exit codes and the warning-only timeouts could exit 0
    after an incomplete mitigation.
  - backoffLimit: 2 on the Job is load-bearing: a transient apiserver
    hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

  - daemonsets.apps — get/list/watch/patch (cluster-wide; watch
    required by kubectl rollout status).
  - nodes — get/list/watch (for the upgrade-done label wait).
  - pods  — get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs
before nvidia-dra-driver-gpu's chart, so a Role inside the
nvidia-dra-driver namespace would fail on a fresh cluster (namespace
doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list
digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be,
matching the repo's digest-pinning standard. alpine/k8s ships kubectl
plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no
pipefail).

The existing deploy.sh.tmpl block is intentionally kept as
defense-in-depth for Helm-deployer users; the extra rollout-restart on
an already-rolled DaemonSet is a no-op. A follow-up will evaluate
removing the script-side block once this Job has shipped in a release
and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml
so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via
the additive manifestFiles merge.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal
durability for the chart-version annotation), NVIDIA#894 (chart bump that
first surfaced the stale-NVML class of bug).

Cross-review iteration 3 — cross-deployer gating + values-only re-fire:

  - [P1] Cross-deployer install-time gating wired across every
    bundler path. The prior revision relied on the Job's eventual
    rollout-restart to self-correct the transient race; without
    install-time gating, downstream releases (notably
    nvidia-dra-driver-gpu) could install against stale-NVML driver
    state for the 15-20m window between gpu-operator-post applying
    its chart resources and the Job's kubectl wait + rollout
    restart returning. Fix:
      - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name
        case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m
        + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs +
        COMPONENT_FORCE_WAIT=true (overrides --no-wait — the flag
        is a readiness-convenience knob, not a correctness-bypass
        knob; components that exist specifically to close a known
        race must always wait). Vendored gpu-operator case lifted
        to 25m for the --vendor-charts path where the Job ships as
        a hook on the primary chart.
      - pkg/bundler/deployer/helmfile/releases.go: new
        releaseOverrides map keyed by synthesized folder name;
        gpu-operator-post -> {timeout: 25m, waitForJobs: true};
        gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors
        the deploy.sh case switch.
      - pkg/bundler/deployer/flux/helm.go: new Timeout field on
        HelmReleaseData / ChartRefHelmReleaseData +
        releaseTimeoutOverrides map (gpu-operator-post -> 25m).
        helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl
        emit spec.timeout conditionally. helm-controller already
        waits for hook Jobs by default, so no waitForJobs knob is
        needed on this path — only the outer timeout.
      - Argo CD path is naturally gated by sync-wave + Job-Healthy
        default: app-of-apps waits for each child Application to
        reach Healthy, which requires the Job's status.succeeded.
        No explicit wiring needed.

  - [medium] Driver-version-only overrides now re-fire the Job on
    every deployer. The prior key was _aicrParentChartVersion (=
    gpu-operator chart version) only. On non-hook deployers
    (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes
    the upgrade-hook annotations), the Job is a regular chart
    resource — kubectl apply on the same name is a no-op, so
    --set gpuoperator:driver.version=<X> did NOT re-run the
    migration mitigation when the gpu-operator chart pin was
    unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion
    is injected alongside the parent-chart version (reading the
    merged componentValues at driver.version, so any source —
    registry default / overlay / user --set — flows through).
    The Job-name slug becomes "<chartSlug>-d<driverSlug>" when
    driver.version is set; host-managed-driver overlays
    (driver.enabled=false, version unset) keep the original
    1-component shape. The chart-version label
    (aicr.nvidia.com/gpu-operator-chart-version) stays semantically
    tied to the chart pin only.

  - [low] Empty gpu-operator.Version no longer produces an
    unbundleable manifest. The prior helper early-returned on
    empty Version, but the hook ships unconditionally on the
    gpu-operator componentRef, so the template would render
    "<nil>" into the Job name and chart-version label — invalid
    DNS-1123 / label values. Fix: fall through with
    NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so
    operators get a debuggable signal of the underlying recipe
    gap.

  - User-facing doc: --no-wait correctness-gate exception
    documented in docs/user/cli-reference.md (deploy-script
    section) and pkg/bundler/deployer/helm/templates/README.md.tmpl
    so operators reading the doc know the flag does not bypass
    gpu-operator-post.

  - Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey,
    _DriverVersionAbsent, and _EmptyVersionFallback added to pin
    the new behavior. All existing bundler / manifest / recipes
    tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe:
gpu-operator componentValues carries
  _aicrParentChartVersion: 26.3.1
  _aicrDriverVersion: 580.126.20
and the rendered Job name is
  aicr-dra-rollout-hook-26-3-1-d580-126-20-r1
— a driver.version-only override now produces a uniquely-named
Job and triggers re-fire on every deployer.

Document known limits of the cross-deployer gate:

  - Argo CD app-of-apps. Sync-waves give ordering but not gating
    after Argo CD 1.8 removed default Health assessment for
    argoproj.io/Application
    (https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/#health-assessment-of-argoprojioapplication-crd-has-been-removed).
    Restoring gating needs a resource.customizations Lua block on
    argocd-cm — out of scope for the bundle (cluster-wide config).
    Without it, Argo CD users fall back to the Job's fail-closed
    semantics: exit 1 surfaces gpu-operator-post Application as
    unHealthy, but nvidia-dra-driver-gpu may still sync in
    parallel.

  - Post-bundle driver.version edits. The Job-name slug is baked
    at bundle time by pkg/manifest.Render via
    pkg/bundler/deployer/localformat/local_helm.go:155, so a
    driver.version change applied after `aicr bundle` does not
    update the Job name (re-bundle picks it up; install-time
    `--set` / `--dynamic` does not). Closing this fully needs
    deferring this manifest's render to helm install time — a
    structural change to the localformat render pipeline tracked
    as a follow-up.

PR description updated to reflect both limitations and the
already-wired install-time gating across helm / helmfile / Flux.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(recipes): upgrade gpu-operator to latest stable

2 participants