chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 by yuanchen8911 · Pull Request #965 · NVIDIA/aicr

yuanchen8911 · 2026-05-19T14:49:12Z

Summary

Lift every leaf recipe to the current upstream stable gpu-operator chart (v26.3.1, published 2026-04-18), align the driver pin with NVIDIA's qualified version (580.126.20), and mitigate three v26.3.1-introduced regressions: stricter driver-validation init container (peermem gate), stale DRA NVML handles after a kernel-module reload, and the timing race between gpu-operator's async per-node driver migration and the DRA kubelet plugin's post-install restart.

Motivation / Context

The registry already defaulted to v26.3.1, but base.yaml shadowed it with v25.10.1 and the AKS/OKE overlays held v26.3.0. As a result, leaves actually shipped v25.10.1 / v26.3.0 images even though docs/user/container-images.md advertised v26.3.1 — BOM-vs-leaf drift introduced by chart-pin duplication between the registry and base.yaml. This PR resolves that drift by single-sourcing the version pin in base.yaml at the upstream-stable level. The broader resolver/BOM cleanup is tracked separately in #966.

Coupled value updates so the bump matches NVIDIA's qualified stack and avoids v26.3.1 regressions:

driver.version advances from 580.105.08 to 580.126.20 — the v26.3.1 chart default and the GB200+EFA floor. The per-overlay GB200 driver pin (previously 580.126.20) becomes redundant and is dropped; kernelModuleConfig stays.
ccManager.enabled: false pinned in global values. Upstream v26.3.x flipped defaults from {enabled: false, defaultMode: "off"} to {enabled: true, defaultMode: "on"}. AICR keeps it off until we have explicit CC-capable hardware support.
driver.rdma.enabled global default flipped true → false. v26.3.1's driver-validation init container now hard-gates on lsmod | grep nvidia_peermem. That module only loads against Mellanox MOFED symbols. On AWS EFA (EKS p4d/p5/p5e) and Linode there is no MOFED, so peermem fails to load (modprobe ... Invalid argument) and the rest of the GPU stack stays stuck in Init forever. NCCL on EFA uses aws-ofi-nccl/libfabric, which does not consume nvidia_peermem, so there is no functional path lost.
driver.rdma.useHostMofed: true explicitly set on AKS (values-aks.yaml, values-aks-training.yaml) alongside enabled: true. AKS overlays deploy network-operator, which installs MOFED kernel modules on the host via a privileged DaemonSet. With useHostMofed: true, peermem binds against host symbols.
DRA pod-template annotation added to recipes/components/nvidia-dra-driver-gpu/values.yaml on both controller.podAnnotations and kubeletPlugin.podAnnotations — aicr.nvidia.com/gpu-operator-chart-version: v26.3.1. Bumping the annotation value on every gpu-operator chart change forces a DaemonSet re-roll across every deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against the now-running driver. Manual coupling of the annotation value to the chart version is a known maintenance gap — tracked in chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 with a bundler-derived annotation as the durable fix.
Wait for per-node driver migration before the DRA post-install rollout-restart (pkg/bundler/deployer/helm/templates/deploy.sh.tmpl). gpu-operator's k8s-driver-manager runs its per-node module reload asynchronously after helm upgrade gpu-operator returns. On a multi-GPU-node cluster, the previously-existing DRA rollout-restart in deploy.sh (and the annotation re-roll above) could fire before some nodes had finished migration — leaving the freshly-rolled DRA kubelet plugin pinned to a now-stale NVML handle. Reproduced live on a GB200 EKS cluster during this PR's validation: the chart-version annotation re-rolled the DRA pods, then the second GB200 node's driver migration ran after the re-roll, leaving its kubelet-plugin pod with stale NVML and triggering the same "invalid CDI Spec: empty device edits" failure mode the annotation was meant to prevent. The fix is ~10 lines in the deploy.sh template: kubectl wait --for=jsonpath until every nvidia.com/gpu.present=true node carries nvidia.com/gpu-driver-upgrade-state=upgrade-done (15-min timeout, best-effort fall-through), then run the existing rollout-restart. Helm-deployer only; chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973 will deliver a Helm post-install hook on gpu-operator-post to extend the protection to GitOps deployers.
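The gate described in the last item can be sketched as a pure predicate over node labels. This is a minimal illustration, not the deploy.sh.tmpl implementation: the `Node` type and both function names are hypothetical, only the label keys and the `upgrade-done` value come from the description above, and the polling loop with its 15-min best-effort timeout is deliberately left out.

```go
package main

import "fmt"

// Label keys and value quoted in the PR description.
const (
	gpuPresentLabel   = "nvidia.com/gpu.present"
	upgradeStateLabel = "nvidia.com/gpu-driver-upgrade-state"
	upgradeDone       = "upgrade-done"
)

// Node is a hypothetical stand-in for a Kubernetes node: a name plus its labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// pendingGPUNodes counts the nodes labelled gpu.present=true and returns the
// names of those whose driver upgrade state is not yet upgrade-done.
func pendingGPUNodes(nodes []Node) (gpuCount int, pending []string) {
	for _, n := range nodes {
		if n.Labels[gpuPresentLabel] != "true" {
			continue // not a GPU node, so it never blocks the restart
		}
		gpuCount++
		if n.Labels[upgradeStateLabel] != upgradeDone {
			pending = append(pending, n.Name)
		}
	}
	return gpuCount, pending
}

// readyForDRARestart reports whether the DRA rollout-restart may proceed.
// With zero GPU nodes there is nothing to wait for, which mirrors the
// GPU_NODE_COUNT guard mentioned in the implementation notes.
func readyForDRARestart(nodes []Node) bool {
	_, pending := pendingGPUNodes(nodes)
	return len(pending) == 0
}

func main() {
	nodes := []Node{
		{Name: "gb200-a", Labels: map[string]string{gpuPresentLabel: "true", upgradeStateLabel: upgradeDone}},
		{Name: "gb200-b", Labels: map[string]string{gpuPresentLabel: "true"}},
		{Name: "cpu-a", Labels: map[string]string{}},
	}
	count, pending := pendingGPUNodes(nodes)
	fmt.Println("GPU nodes:", count, "pending:", pending, "ready:", readyForDRARestart(nodes))
}
```

In this sketch the second GB200 node keeps the restart blocked, which is the situation reproduced on the two-node cluster: restarting the DRA pods before that node finishes migrating would pin its kubelet plugin to a stale NVML handle.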

Effective resolved values per leaf, post-PR (verified via aicr bundle + values inspection):

Service	`driver.enabled`	`driver.rdma.enabled`	`useHostMofed`
EKS h100 / gb200 (training, inference, dynamo)	`true`	`false`	n/a
AKS h100 (training, inference-dynamo)	`true`	`true` (explicit override)	`true` (explicit override; binds peermem to host MOFED from network-operator)
GKE COS h100 (training, inference)	`false` (host driver)	n/a	n/a
OKE gb200 (training, inference)	`false` (host driver)	n/a	n/a
Kind	`false` (host driver)	`false`	n/a
LKE rtx-pro-6000 (training, inference)	`true`	`false`	n/a

Chart version on every supported leaf resolves to v26.3.1 (verified via aicr query ... --selector components.gpu-operator.version).

Fixes: #894
Related: #698, #966, #973

Type of Change

Bug fix (non-breaking change that fixes an issue — v26.3.1 validator trap on EKS / LKE + stale DRA NVML after driver migration)
Build/CI/tooling (component version maintenance)

Component(s) Affected

Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler/... — deploy.sh template)
Docs/examples (docs/, examples/)

Implementation Notes

Constraint floor Deployment.gpu-operator.version >= v24.6.0 (8 overlays) is still satisfied.
kubeVersion in Chart.yaml is unchanged (>= 1.16.0-0); AICR's >= 1.32.4 floor in base.yaml remains binding.
Bundle render verified for h100-eks-training, gb200-eks-training, and AKS leaves: rendered gpu-operator/values.yaml shows version: 580.126.20, ccManager.enabled: false, the correct per-cloud rdma.enabled + useHostMofed, and (for gb200) kernelModuleConfig.name: nvidia-kernel-module-params.
DRA values render with both controller.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version and kubeletPlugin.podAnnotations.aicr.nvidia.com/gpu-operator-chart-version set to v26.3.1. Confirmed on aicr2 and the yljtrxpmzu GB200 cluster that helm upgrade on the new bundle force-rolls both DRA pods.
The new kubectl wait block in deploy.sh.tmpl uses a 15-min timeout and "WARNING; proceeding anyway" semantics rather than a hard fail, so a misconfigured cluster (e.g. nodes never reach upgrade-done) still gets the rollout-restart attempt. The guard around GPU_NODE_COUNT > 0 avoids kubectl wait's "no matching resources" error on clusters with no GPU nodes yet (initial install ordering).
BOM diff is narrow: only nvcr.io/nvidia/driver:580.105.08 → :580.126.20 moves, because the registry-driven BOM was already rendering against v26.3.1 (the BOM-vs-leaf drift this PR also resolves at the leaf layer; the BOM tool itself is rewritten in #966).
tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml updated to match the new EKS expectation (rdma.enabled: false), with a comment explaining the EFA / aws-ofi-nccl rationale.
Off-overlay implication: anyone running AICR on bare-metal MOFED hosts without using the AKS overlay (which ships network-operator) would need to opt in via --set gpuoperator:driver.rdma.enabled=true and --set gpuoperator:driver.rdma.useHostMofed=true. Documented inline in recipes/components/gpu-operator/values.yaml.

Testing

make qualify   # PASS — tests + lint + e2e + scan + repo-specific checks
make bom-docs  # regenerated docs/user/container-images.md

Live cluster validation:

aicr2 (EKS H100 p5.48xlarge, ad-hoc dev cluster):

Pre-upgrade driver: 580.105.08. Post-upgrade: 580.126.20. All GPU pods Running, nvidia-cuda-validator Completed.
Full aicr validate: deployment passed (4/4, 28.8s), performance passed (inference-perf: 39,776.81 tok/s, TTFT p99 122.74 ms, 3m 15.7s), conformance passed (1m 49.5s).

yljtrxpmzu (EKS GB200 p6e-gb200.36xlarge × 2, DGX Cloud dev cluster):

Validated the upgrade-time timing race that motivated the new kubectl wait block: the chart-version annotation re-rolled DRA pods correctly, but a subsequent driver-module reload on the second GB200 node left the kubelet-plugin pod with stale NVML, producing invalid CDI Spec: empty device edits on the next aicr validate --phase performance run. Adding the wait closes that gap.
After the new wait block: deployment passed (4/4, 26.5s), performance passed (16,705.82 tok/s, TTFT p99 233.31 ms, 2m 21.8s — against the placeholder thresholds added in feat(recipes): add inference-perf to gb200-eks-ubuntu-inference-dynamo #977).

GPU CI follow-up: H100 nvkind run is required per #894 DoD; will trigger via CI label after maintainer review.

Risk Assessment

Medium: a single coherent chart bump touching every leaf, plus a localized deploy.sh template change that only fires for clusters that have any nvidia.com/gpu.present=true nodes. Drivers move 580.105.08 → 580.126.20 (node-side change; cordon/reboot on upgrade). ccManager is explicitly disabled to preserve prior behavior, and driver.rdma.enabled is disabled by default to mitigate the v26.3.1 driver-validation regression. AKS preserves the prior rdma.enabled: true behavior via explicit per-cloud overrides plus the useHostMofed: true correctness fix. The DRA annotation closes the cross-deployer GPU-allocation regression introduced by the driver bump; the new kubectl wait closes the residual same-deployer timing race.

Rollout notes:

Existing deployments using the prior chart should follow the standard gpu-operator upgrade path (the chart's k8s-driver-manager handles driver replacement with maxParallelUpgrades = 5). No CRD migrations are introduced.
For any user running AICR on bare-metal MOFED hosts off-overlay, the global rdma.enabled flip from true → false is a behavior change. They can opt back in with --set gpuoperator:driver.rdma.enabled=true (and add useHostMofed=true if MOFED is on the host). Documented in recipes/components/gpu-operator/values.yaml.
The new kubectl wait adds at most 15 min to deploy.sh on clusters whose nodes never reach upgrade-done (best-effort; logs WARNING then proceeds). On a healthy cluster the wait is sub-second once migration completes.

Known follow-ups

chore(recipes): make resolved recipes the single source of truth for chart versions #966: tools/bom and the recipe resolver both read helm.defaultVersion today, so chart-pin drift between registry and overlays can happen silently. This PR fixes the immediate gpu-operator instance; #966 fixes the resolver and BOM layers so the class can't recur.
chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973: durable cross-deployer DRA-rollout trigger. The annotation + kubectl wait in this PR cover the default Helm-deployer flow. #973 tracks the durable fix that works for helmfile / Flux / Argo CD too: a Helm post-install,post-upgrade hook on gpu-operator-post that runs the same wait-and-restart logic from inside the cluster, so non-Helm-deployer pipelines get the same protection.

Checklist

Tests pass locally (make qualify)
Linter passes (make lint — included in qualify)
I did not skip/disable tests to make CI green
I updated docs if user-facing behavior changed (docs/user/container-images.md regenerated; rationale comments in values.yaml, values-aks*.yaml, nvidia-dra-driver-gpu/values.yaml, and deploy.sh.tmpl)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-05-19T14:50:14Z

🌿 Preview your docs: https://nvidia-preview-chore-gpu-operator-v26-3-1.docs.buildwithfern.com/aicr

coderabbitai · 2026-05-19T14:52:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

This PR upgrades the GPU Operator Helm chart to v26.3.1 across overlays, pins the default driver version to 580.126.20, disables RDMA by default in component values and adds ccManager.enabled: false, adjusts GB200 overlay driver pins (inference and training differ), introduces AKS-specific RDMA host-MOFED settings, updates a test assertion to disable RDMA for EKS training, and syncs the container-images documentation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973: The chart bump and driver pin (v26.3.1 / 580.126.20) are directly related to adding the aicr.nvidia.com/gpu-operator-chart-version annotation for nvidia-dra-driver kubelet-plugin restarts.

Suggested reviewers

mchmarny
lalitadithya
njhensley

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is comprehensive and directly related to the changeset, detailing the gpu-operator chart bump from v26.3.0/v25.10.1 to v26.3.1, driver version update to 580.126.20, and regression mitigations.
Title check	✅ Passed	The title directly and accurately summarizes the main changes: bumping gpu-operator chart to v26.3.1 and driver to 580.126.20.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 59-80: Add an automated CI validation that fails the PR if the
aicr.nvidia.com/gpu-operator-chart-version annotation in the values.yaml (under
controller and kubeletPlugin) is not equal to the current gpu-operator chart
version; implement the check as a simple script or GitHub Action step that reads
the gpu-operator chart version (from the chart's Chart.yaml or a central source
of truth) and compares it to the annotation value, returning a non-zero exit
code with a clear message when mismatched so authors must update the annotation
(track this automation work and reference issue `#973` in the CI job description).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 14cad807-a6e1-4952-af30-e6ca238c6b82

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2e66e and 1cefd48.

📒 Files selected for processing (11)

docs/user/container-images.md
recipes/components/gpu-operator/values-aks-training.yaml
recipes/components/gpu-operator/values-aks.yaml
recipes/components/gpu-operator/values.yaml
recipes/components/nvidia-dra-driver-gpu/values.yaml
recipes/overlays/aks.yaml
recipes/overlays/base.yaml
recipes/overlays/gb200-eks-inference.yaml
recipes/overlays/gb200-eks-training.yaml
recipes/overlays/oke.yaml
tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling.yaml

💤 Files with no reviewable changes (2)

recipes/overlays/gb200-eks-training.yaml
recipes/overlays/gb200-eks-inference.yaml

yuanchen8911 · 2026-05-19T18:33:32Z

@coderabbitai review

coderabbitai · 2026-05-19T18:33:38Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

yuanchen8911 · 2026-05-20T18:35:17Z

Deployed and validated end-to-end on three clusters:

EKS H100 (aicr2)
EKS GB200
GKE H100 (COS) — both inference (Dynamo) and training (Kubeflow) stacks

aicr validate --phase deployment passes on all three; --phase performance passes where applicable (inference-perf for Dynamo, NCCL all-reduce-bw for training).

Rebased onto latest main and removing WIP — ready for review.

…126.20 (NVIDIA#894)

Lift every leaf recipe to the current upstream stable gpu-operator chart (v26.3.1, published 2026-04-18). The registry already defaulted to v26.3.1, but base.yaml shadowed it with v25.10.1 and the aks/oke overlays held v26.3.0, so leaves actually shipped v25.10.1 / v26.3.0 images even though the BOM advertised v26.3.1.

- recipes/overlays/base.yaml: v25.10.1 -> v26.3.1
- recipes/overlays/aks.yaml: v26.3.0 -> v26.3.1
- recipes/overlays/oke.yaml: v26.3.0 -> v26.3.1

Driver pin moves to NVIDIA's v26.3.1-qualified version (580.126.20). This also matches the GB200+EFA floor, so the per-overlay driver.version: 580.126.20 in gb200-eks-{training, inference}.yaml becomes redundant and is dropped; kernelModuleConfig stays.

Upstream v26.3.x flipped ccManager to enabled=true,defaultMode=on by default. Pin ccManager.enabled: false in the global values.yaml so we do not silently turn on Confidential Compute Manager on clusters without CC-capable hardware. Revisit when AICR adds explicit CC support.

Flip driver.rdma.enabled global default true -> false. v26.3.1's driver-validation init container now hard-gates on lsmod | grep nvidia_peermem; the module only loads against Mellanox MOFED symbols, so AWS EFA (EKS p4d/p5/p5e) and Linode hosts trap the validator forever even though NCCL on EFA uses aws-ofi-nccl / libfabric and does not need nvidia_peermem.

Re-enable rdma explicitly in components/gpu-operator/values-aks.yaml and values-aks-training.yaml: AKS overlays deploy network-operator which installs MOFED on ND-series InfiniBand nodes, so peermem has Mellanox symbols to bind against. Set driver.rdma.useHostMofed: true alongside, so the driver container binds nvidia_peermem against the network-operator-installed host MOFED instead of building its own bundled MOFED inside the container. GKE-cos / OKE / kind keep driver.enabled: false (host-managed driver) so the flag is moot; EKS / LKE inherit the new safe default. cuj1-training chainsaw assertion updated to match.

Add gpu-operator-chart-version pod annotation to the DRA driver in recipes/components/nvidia-dra-driver-gpu/values.yaml (kubeletPlugin and controller). gpu-operator's k8s-driver-manager reloads the host NVIDIA kernel modules during a driver bump but does NOT restart the sibling nvidia-dra-driver-gpu DaemonSet because its chart template is unchanged. The DRA kubelet plugin loads libnvidia-ml.so at pod start and pins to the running driver version, so a kernel-module reload leaves the pod with a stale NVML handle; CDI spec generation then fails with Driver/library version mismatch and DRA-allocated workloads stay in ContainerCreating. Bumping the annotation value on every gpu-operator chart bump forces a DaemonSet re-roll across every deployer (Helm, helmfile, Flux, Argo CD), refreshing NVML against the now-running driver. The deploy.sh Helm-deployer template already restarts the kubelet plugin post-install; the annotation closes the gap for GitOps-deployer artifacts. Manual coupling of the annotation value to the chart version is a known maintenance gap tracked in issue NVIDIA#973 (bundler-derived annotation as the durable fix). Verified live on aicr2: applying the new bundle rolled both DRA pods cleanly, and aicr validate --phase performance passed end-to-end (inference-perf 39.7k tok/s, TTFT p99 122ms).

Gate the existing post-install DRA kubelet-plugin restart on gpu-operator's per-node driver migration completing. The annotation-based re-roll above and the existing kubectl rollout restart in deploy.sh both fire at helm-upgrade time, but k8s-driver-manager runs the per-node module reload asynchronously after `helm upgrade gpu-operator` returns. On a multi-GPU-node cluster, the DRA plugin can re-roll on a node whose driver migration has not yet started, get its NVML handle stuck to the pre-migration state, and produce "invalid CDI Spec: empty device edits" once the modules reload underneath it. Reproduced live on a GB200 EKS cluster (yljtrxpmzu) during PR validation: the chart-version annotation re-rolled the DRA pods correctly, but the second GB200 node's driver migration ran after that, leaving its kubelet-plugin pod with a stale NVML view. Adding a kubectl wait for nvidia.com/gpu-driver-upgrade-state=upgrade-done on every GPU node before the existing post-install rollout-restart closes this timing race for the default Helm-deployer flow. See NVIDIA#973 for broader cross-deployer coverage (a Helm post-install hook on gpu-operator-post would cover helmfile/Flux/Argo CD too).

BOM (docs/user/container-images.md) regenerated; only the driver image moves, since the registry-driven BOM was already on v26.3.1. The BOM-vs-leaf drift class itself is tracked separately in NVIDIA#966.

…ote chart name (NVIDIA#1034)

Two argocd-helm OCI publishing bugs surfaced by Codex review against the post-NVIDIA#1032 contract.

P1: Path-based child Applications resolve to a non-existent artifact

NVIDIA#1032 changed the --set repoURL contract so callers pass only the parent namespace (e.g., oci://reg/org); the parent Application appends .Chart.Name via its separate source.chart field. The parent-App template at parentAppTemplate / line 407 implements this correctly, but the path-based child-App template in injectValuesIntoSingleSource at line 703 was left emitting only .Values.repoURL. Argo CD's generic OCI source (used by path-based children since they have no source.chart) treats spec.source.repoURL as the full artifact reference and resolves it directly, so under the new contract a child source pointed at oci://reg/org:tag (an artifact that doesn't exist) and the child Application failed to sync.

Fix: append /{{ .Chart.Name }} to the rendered repoURL value so the assembled URL matches the artifact the parent App's repoURL/chart:tag triple resolves to. The error-message text in the required directive is updated to say "this template appends .Chart.Name" (the path-based template is doing the appending, not the parent App).

P2: Unquoted name and version in generated Chart.yaml

writeChartYAML emitted "name: %s" / "version: %s" with raw fmt.Fprintf. (*Reference).ChartName() returns path.Base(Repository), so a valid OCI artifact path whose last segment is a YAML reserved scalar ("null", "true", "false", "yes", "no", numeric strings) produced an unquoted YAML reserved word as the chart's name. Helm's loader then parsed name: null as YAML null, chart.Metadata.Name became empty, and helm package / helm push rejected the chart with "chart.metadata.name is required". Same trap for the version field ("1.0" parses as float).

Fix: emit both via %q so the values round-trip as YAML strings. OCI artifact path segments are constrained to printable ASCII by the docker reference grammar, which is the safe charset for Go's %q.

Tests

- TestInjectValuesIntoSingleSource_AppendsChartName: asserts the rendered path-based child source.repoURL appends /{{ .Chart.Name }} after the user-supplied .Values.repoURL.
- TestWriteChartYAML_QuotesYAMLReservedScalarsAsName: table-driven case across null / true / false / numeric / yes / normal name; for each, writes Chart.yaml and verifies yaml.Unmarshal round-trips name back as the expected string.
- TestHelmTemplate_RendersWithSetRepoURL: updated to exercise the new contract: --set repoURL=oci://reg/org (parent namespace only). Asserts the parent App's repoURL equals the parent namespace and the child App's repoURL equals parent-namespace + /aicr-bundle.
- TestGenerate_CustomChartName and the existing Chart.yaml-version assertion updated to expect quoted scalars.
- Golden templates and Chart.yaml fixtures regenerated.

Closes NVIDIA#1034
Related: NVIDIA#1032 (the contract change this PR completes), NVIDIA#1019, NVIDIA#965
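The P2 quoting fix described above can be illustrated with a small sketch. `renderChartYAML` is a hypothetical stand-in, not the repository's writeChartYAML, and the `apiVersion: v2` line is an assumption added only to make the output resemble a Helm chart file.

```go
package main

import (
	"fmt"
	"strings"
)

// renderChartYAML emits a minimal Chart.yaml body. Formatting name and
// version with %q wraps them in double quotes, so values such as "null" or
// "1.0" stay YAML strings instead of parsing as null or a float.
func renderChartYAML(name, version string) string {
	var b strings.Builder
	b.WriteString("apiVersion: v2\n") // assumed header line, for illustration only
	fmt.Fprintf(&b, "name: %q\n", name)
	fmt.Fprintf(&b, "version: %q\n", version)
	return b.String()
}

func main() {
	fmt.Print(renderChartYAML("null", "1.0"))
	// Prints:
	// apiVersion: v2
	// name: "null"
	// version: "1.0"
}
```

With `%s` in place of `%q` the same call would emit `name: null`, which is the case Helm's loader read as an empty chart name.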

…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the default Helm deployer (./deploy.sh) to every other deployer (helmfile, Flux, Argo CD, argocd-helm) by encoding the same logic as a Helm hook inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle, so users who consume the bundle via GitOps reconcilers gain the same protection without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:

- Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by absence of nvidia-driver-daemonset and skips the migration wait.
- Waits up to 15m for every managed GPU node to reach the nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
- Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet and waits up to 5m for the new generation to be ready.
- Exits 0 if the DRA DaemonSet is not yet present (fresh install before nvidia-dra-driver-gpu's chart has applied), since the rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:

- ServiceAccount in the gpu-operator namespace.
- ClusterRole for cluster-wide DaemonSet list/get and Node list/watch (needed to discover nvidia-driver-daemonset cross-namespace and wait on per-node upgrade-state labels).
- Role+RoleBinding in the nvidia-dra-driver namespace scoped to patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once the hook has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output for at least two deployers), full make qualify, and cluster verification on a fresh gpu-operator chart bump under a GitOps deployer all still TBD before flipping to ready-for-review.

Refs NVIDIA#980
Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).

…pe (NVIDIA#973)

Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more: a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER extractComponentValues (so the values map is populated and user --set overrides have already landed) and BEFORE buildDeployer (so every deployer, i.e. Helm, helmfile, Flux, Argo CD, argocd-helm, receives the same final componentValues map). Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%.
- Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
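The lazy nested-map write described under Implementation might look like the sketch below. It is an illustration under assumptions, not the repository's injectDRAChartVersionAnnotation: `ensureMap` and `injectChartVersion` are hypothetical names, the values are modelled as plain `map[string]any`, the example priorityClassName value is made up, and the gating on enabled components is omitted.

```go
package main

import "fmt"

// Annotation key quoted in the commit message.
const annotationKey = "aicr.nvidia.com/gpu-operator-chart-version"

// ensureMap returns parent[key] as a map, creating and attaching an empty
// map when the key is absent, nil, or holds a non-map value.
func ensureMap(parent map[string]any, key string) map[string]any {
	if m, ok := parent[key].(map[string]any); ok && m != nil {
		return m
	}
	m := map[string]any{}
	parent[key] = m
	return m
}

// injectChartVersion writes the chart version onto controller and
// kubeletPlugin podAnnotations, leaving sibling keys untouched. It overwrites
// any existing value for the key, matching the stated intent that a user
// override of this annotation is not honored.
func injectChartVersion(draValues map[string]any, chartVersion string) {
	if draValues == nil || chartVersion == "" {
		return // nothing to write to, or nothing to write
	}
	for _, section := range []string{"controller", "kubeletPlugin"} {
		annotations := ensureMap(ensureMap(draValues, section), "podAnnotations")
		annotations[annotationKey] = chartVersion
	}
}

func main() {
	values := map[string]any{
		// Hypothetical pre-existing value that must survive the injection.
		"controller": map[string]any{"priorityClassName": "example-priority"},
	}
	injectChartVersion(values, "v26.3.1")
	fmt.Println(values)
}
```

Because the pod-template annotation changes whenever the resolved chart version changes, any deployer that diffs rendered manifests sees a new DaemonSet generation and re-rolls the pods.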

…oss all deployers (NVIDIA#980)

Add a Helm post-install,post-upgrade Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state.

This extends the kubectl wait + rollout-restart block that pkg/bundler/deployer/helm/templates/deploy.sh.tmpl already runs for the default Helm deployer (./deploy.sh) to every other deployer (helmfile, Flux, Argo CD, argocd-helm) by encoding the same logic as a Helm hook inside the rendered chart. Helm/helmfile/Flux/Argo CD all execute Helm-defined hooks as part of the release lifecycle, so users who consume the bundle via GitOps reconcilers gain the same protection without needing to run the post-deploy script.

The hook mirrors deploy.sh's logic:

- Detects host-managed-driver overlays (GKE COS, OKE, Kind, Talos) by absence of nvidia-driver-daemonset and skips the migration wait.
- Waits up to 15m for every managed GPU node to reach the nvidia.com/gpu-driver-upgrade-state=upgrade-done label.
- Issues kubectl rollout restart on the DRA kubelet plugin DaemonSet and waits up to 5m for the new generation to be ready.
- Exits 0 if the DRA DaemonSet is not yet present (fresh install before nvidia-dra-driver-gpu's chart has applied), since the rollout-restart is only meaningful in the upgrade path.

RBAC is shipped alongside the Job:

- ServiceAccount in the gpu-operator namespace.
- ClusterRole for cluster-wide DaemonSet list/get and Node list/watch (needed to discover nvidia-driver-daemonset cross-namespace and wait on per-node upgrade-state labels).
- Role+RoleBinding in the nvidia-dra-driver namespace scoped to patch on DaemonSets and list/watch on Pods (rollout-status).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once the hook has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

WIP / draft: parity test (assert the hook Job appears in rendered output for at least two deployers), full make qualify, and cluster verification on a fresh gpu-operator chart bump under a GitOps deployer all still TBD before flipping to ready-for-review.

Refs NVIDIA#980

Related: NVIDIA#965 (deploy.sh block this hook generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).

…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism (dual approach across bundler paths):

1. Non-vendored localformat (default for Helm/helmfile/Argo CD/argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values).

2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223). Without an author-declared hook, the auto-injected post-install-only annotation means the Job runs on first install and is silently skipped on every subsequent helm upgrade, which is exactly when the stale-NVML race fires. The Job therefore author-declares

       helm.sh/hook: post-install,post-upgrade
       helm.sh/hook-delete-policy: before-hook-creation

   so vendor_wrapper honors the value and the Job fires on both phases. stripHelmHooks on the non-vendored path strips both phases together (both are in syncPhaseHooks), so this declaration is inert there and the version-keyed name from path 1 remains the re-fire mechanism.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation, contradicting the safety story.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980

Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).

…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism (dual approach across bundler paths):

1. Non-vendored localformat (default for Helm/helmfile/Argo CD/argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values).

2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223), and auto-assigns sequential hook-weights starting at postInstallHookWeightBase=100. Two consequences for this manifest:

   (a) auto-injected post-install hooks fire only on first install and are silently skipped on every helm upgrade, which is exactly when the stale-NVML race occurs.
   (b) if only the Job author-declares a hook, the SA / ClusterRole / ClusterRoleBinding still get auto-injected weights 100/101/102 while the Job takes Helm's default weight 0, putting the Job BEFORE its ServiceAccount/RBAC in install order and failing with "ServiceAccount not found".

   Both are addressed by author-declaring helm.sh/hook: post-install,post-upgrade plus before-hook-creation on ALL FOUR resources with ascending hook-weights:

       ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10

   vendor_wrapper honors the author-declared hooks; the explicit weights guarantee the Job's prerequisites exist before it runs; post-upgrade re-fires the Job (and refreshes the RBAC) on every helm upgrade. stripHelmHooks on the non-vendored path strips ALL of these annotations (every phase here is in syncPhaseHooks at hooks.go:109-120), so this declaration is inert there and Helm's default kind-based ordering (SA → ClusterRole → ClusterRoleBinding → Job) plus the chart-version-keyed Job name (path 1) provide the same install order and re-fire mechanism without hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout. It is a no-op for the vendored hook path (Helm does not apply resource-policy to hook resources); retention there is governed by helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook resource ONLY when a new hook with the same name is created. SA / CR / CRB names are stable so they cycle on every install/upgrade (idempotent delete-and-recreate); Job names are chart-version-keyed so prior-version Jobs survive cross-version upgrades as Succeeded records.

ttlSecondsAfterFinished is intentionally NOT set: the Job is part of GitOps desired state, so a Kubernetes TTL delete would make the resource missing relative to the rendered manifest and trigger an immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980

Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).
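
The hook-weight ordering problem in consequence (b) can be reproduced with a simplified model. The Go sketch below assumes only that hooks run in ascending weight order; the `hookResource` type and `executionOrder` function are hypothetical and do not reflect Helm's or the bundler's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// hookResource is a hypothetical, minimal view of a hook-annotated resource.
type hookResource struct {
	kind   string
	weight int // helm.sh/hook-weight; 0 when not declared
}

// executionOrder returns resource kinds in ascending hook-weight order,
// keeping the input order for equal weights. The input slice is not modified.
func executionOrder(hooks []hookResource) []string {
	sorted := make([]hookResource, len(hooks))
	copy(sorted, hooks)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].weight < sorted[j].weight
	})
	kinds := make([]string, 0, len(sorted))
	for _, h := range sorted {
		kinds = append(kinds, h.kind)
	}
	return kinds
}

func main() {
	// Only the Job declares a hook: RBAC gets auto-injected 100/101/102,
	// the Job falls back to weight 0 and runs first.
	autoInjected := []hookResource{
		{"ServiceAccount", 100}, {"ClusterRole", 101},
		{"ClusterRoleBinding", 102}, {"Job", 0},
	}
	// All four resources declare explicit ascending weights.
	authorDeclared := []hookResource{
		{"ServiceAccount", 1}, {"ClusterRole", 2},
		{"ClusterRoleBinding", 3}, {"Job", 10},
	}
	fmt.Println(executionOrder(autoInjected))
	fmt.Println(executionOrder(authorDeclared))
}
```

In the first case the Job sorts ahead of its ServiceAccount, matching the "ServiceAccount not found" failure described above; with explicit weights the Job sorts last.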

…ers (NVIDIA#980)

Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh.

Re-fire mechanism (dual approach across bundler paths):

1. Non-vendored localformat (default for Helm/helmfile/Argo CD/argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values).

2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223), and auto-assigns sequential hook-weights starting at postInstallHookWeightBase=100. Two consequences for this manifest:

   (a) auto-injected post-install hooks fire only on first install and are silently skipped on every helm upgrade, which is exactly when the stale-NVML race occurs.
   (b) if only the Job author-declares a hook, the SA / ClusterRole / ClusterRoleBinding still get auto-injected weights 100/101/102 while the Job takes Helm's default weight 0, putting the Job BEFORE its ServiceAccount/RBAC in install order and failing with "ServiceAccount not found".

   Both are addressed by author-declaring helm.sh/hook: post-install,post-upgrade plus before-hook-creation on ALL FOUR resources with ascending hook-weights:

       ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10

   vendor_wrapper honors the author-declared hooks; the explicit weights guarantee the Job's prerequisites exist before it runs; post-upgrade re-fires the Job (and refreshes the RBAC) on every helm upgrade. stripHelmHooks on the non-vendored path strips ALL of these annotations (every phase here is in syncPhaseHooks at hooks.go:109-120), so this declaration is inert there and Helm's default kind-based ordering (SA → ClusterRole → ClusterRoleBinding → Job) plus the chart-version-keyed Job name (path 1) provide the same install order and re-fire mechanism without hooks.

helm.sh/resource-policy: keep applies to path 1 only and prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout. It is a no-op for the vendored hook path (Helm does not apply resource-policy to hook resources); retention there is governed by helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook resource ONLY when a new hook with the same name is created. SA / CR / CRB names are stable so they cycle on every install/upgrade (idempotent delete-and-recreate); Job names are chart-version-keyed so prior-version Jobs survive cross-version upgrades as Succeeded records.

ttlSecondsAfterFinished is intentionally NOT set: the Job is part of GitOps desired state, so a Kubernetes TTL delete would make the resource missing relative to the rendered manifest and trigger an immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980

Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).

Cross-review iteration 3 (cross-deployer gating + values-only re-fire):

- [P1] Cross-deployer install-time gating wired across every bundler path. The prior revision relied on the Job's eventual rollout-restart to self-correct the transient race; without install-time gating, downstream releases (notably nvidia-dra-driver-gpu) could install against stale-NVML driver state for the 15-20m window between gpu-operator-post applying its chart resources and the Job's kubectl wait + rollout restart returning. Fix:
  - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs + COMPONENT_FORCE_WAIT=true (overrides --no-wait; the flag is a readiness-convenience knob, not a correctness-bypass knob, and components that exist specifically to close a known race must always wait). Vendored gpu-operator case lifted to 25m for the --vendor-charts path where the Job ships as a hook on the primary chart.
  - pkg/bundler/deployer/helmfile/releases.go: new releaseOverrides map keyed by synthesized folder name; gpu-operator-post -> {timeout: 25m, waitForJobs: true}; gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors the deploy.sh case switch.
  - pkg/bundler/deployer/flux/helm.go: new Timeout field on HelmReleaseData / ChartRefHelmReleaseData + releaseTimeoutOverrides map (gpu-operator-post -> 25m). helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl emit spec.timeout conditionally. helm-controller already waits for hook Jobs by default, so no waitForJobs knob is needed on this path, only the outer timeout.
  - Argo CD path is naturally gated by sync-wave + Job-Healthy default: app-of-apps waits for each child Application to reach Healthy, which requires the Job's status.succeeded. No explicit wiring needed.
- [medium] Driver-version-only overrides now re-fire the Job on every deployer. The prior key was _aicrParentChartVersion (= gpu-operator chart version) only. On non-hook deployers (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes the upgrade-hook annotations), the Job is a regular chart resource and kubectl apply on the same name is a no-op, so --set gpuoperator:driver.version=<X> did NOT re-run the migration mitigation when the gpu-operator chart pin was unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion is injected alongside the parent-chart version (reading the merged componentValues at driver.version, so any source, whether registry default, overlay, or user --set, flows through). The Job-name slug becomes "<chartSlug>-d<driverSlug>" when driver.version is set; host-managed-driver overlays (driver.enabled=false, version unset) keep the original 1-component shape. The chart-version label (aicr.nvidia.com/gpu-operator-chart-version) stays semantically tied to the chart pin only.
- [low] Empty gpu-operator.Version no longer produces an unbundleable manifest. The prior helper early-returned on empty Version, but the hook ships unconditionally on the gpu-operator componentRef, so the template would render "<nil>" into the Job name and chart-version label, both invalid as DNS-1123 / label values. Fix: fall through with NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so operators get a debuggable signal of the underlying recipe gap.
- User-facing doc: --no-wait correctness-gate exception documented in docs/user/cli-reference.md (deploy-script section) and pkg/bundler/deployer/helm/templates/README.md.tmpl so operators reading the doc know the flag does not bypass gpu-operator-post.
- Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey, _DriverVersionAbsent, and _EmptyVersionFallback added to pin the new behavior. All existing bundler / manifest / recipes tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe: gpu-operator componentValues carries

    _aicrParentChartVersion: 26.3.1
    _aicrDriverVersion: 580.126.20

and the rendered Job name is aicr-dra-rollout-hook-26-3-1-d580-126-20-r1, so a driver.version-only override now produces a uniquely-named Job and triggers re-fire on every deployer.
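
The "<chartSlug>-d<driverSlug>" naming rule can be sketched in Go as follows. This is an illustration under stated assumptions, not the chart's template logic: the functions `slug` and `jobNameStem`, the rule of stripping a leading "v", and the mapping of every non-alphanumeric character to "-" are all hypothetical. The trailing "-r1" seen in the rendered name above is not explained in the commit message and is deliberately not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// slug (hypothetical) lower-cases a version string, drops a leading "v", and
// replaces every character outside [a-z0-9] with "-" so the result is safe
// to embed in a DNS-1123 style resource name.
func slug(version string) string {
	v := strings.ToLower(strings.TrimPrefix(version, "v"))
	var b strings.Builder
	for _, r := range v {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	return strings.Trim(b.String(), "-")
}

// jobNameStem (hypothetical) builds the version-keyed part of the Job name.
// An empty driverVersion models host-managed-driver overlays, which keep the
// one-component shape.
func jobNameStem(chartVersion, driverVersion string) string {
	name := "aicr-dra-rollout-hook-" + slug(chartVersion)
	if driverVersion != "" {
		name += "-d" + slug(driverVersion)
	}
	return name
}

func main() {
	fmt.Println(jobNameStem("26.3.1", "580.126.20"))
	fmt.Println(jobNameStem("26.3.1", "580.105.08")) // driver-only change yields a new name
	fmt.Println(jobNameStem("v26.3.1", ""))          // host-managed driver
}
```

Because the name changes whenever either version changes, deployers that apply the Job as a plain resource create a new Job instead of treating the apply as a no-op.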

…ers (NVIDIA#980) Add a Job to the bundler-synthesized gpu-operator-post chart that waits for gpu-operator's k8s-driver-manager to finish migrating drivers on every managed GPU node, then restarts the nvidia-dra-driver-gpu kubelet plugin DaemonSet so its NVML handle rebinds to the post-migration driver state. Extends the kubectl wait + rollout-restart block already in pkg/bundler/deployer/helm/templates/deploy.sh.tmpl to every deployer the bundler emits (Helm, helmfile, Flux, Argo CD, argocd-helm), not just the default Helm deployer that runs ./deploy.sh. Re-fire mechanism — dual approach across bundler paths: 1. Non-vendored localformat (default for Helm/helmfile/Argo CD/ argocd-helm). local_helm.stripHelmHooks unconditionally removes sync-phase helm.sh/hook annotations because Argo CD's syncPolicy.automated would otherwise treat the resource as a PostSync hook that never fires under path-based sources. After stripping, the Job ships as a regular chart resource; re-fire on upgrade is achieved by keying the Job name on the parent gpu-operator chart version, which the bundler injects into the synthesized chart's values map at gpu-operator._aicrParentChartVersion (see DefaultBundler.injectDRAParentChartVersionValue). The values-map path is used in preference to .Chart.Version because the synthesized chart's Chart.yaml hardcodes 0.1.0 (which would defeat re-fire) and because flux-oci's source-controller suffixes .Chart.Version with "+<artifact-sha>" (invalid in K8s label values). 2. Vendored localformat (--vendor-charts). vendor_folder.go calls injectPostInstallHooks on each mixed recipe-side manifest, which auto-tags resources with helm.sh/hook: post-install only UNLESS the document already declares helm.sh/hook (vendor_wrapper.go injectHookOnDoc:223), and auto-assigns sequential hook-weights starting at postInstallHookWeightBase=100. 
Two consequences for this manifest: (a) auto-injected post-install hooks fire only on first install and are silently skipped on every helm upgrade — exactly when the stale-NVML race occurs. (b) if only the Job author-declares a hook, the SA / Cluster Role / ClusterRoleBinding still get auto-injected weights 100/101/102 while the Job takes Helm's default weight 0, putting the Job BEFORE its ServiceAccount/RBAC in install order and failing with "ServiceAccount not found". Both are addressed by author-declaring helm.sh/hook: post-install,post-upgrade plus before-hook-creation on ALL FOUR resources with ascending hook-weights: ServiceAccount=1, ClusterRole=2, ClusterRoleBinding=3, Job=10 vendor_wrapper honors the author-declared hooks; the explicit weights guarantee the Job's prerequisites exist before it runs; post-upgrade re-fires the Job (and refreshes the RBAC) on every helm upgrade. stripHelmHooks on the non-vendored path strips ALL of these annotations (every phase here is in syncPhaseHooks at hooks.go:109-120), so this declaration is inert there and Helm's default kind-based ordering (SA → ClusterRole → ClusterRole Binding → Job) plus the chart-version-keyed Job name (path 1) provide the same install order and re-fire mechanism without hooks. helm.sh/resource-policy: keep applies to path 1 only and prevents Helm's three-way merge from pruning prior-version Jobs mid-rollout. It is a no-op for the vendored hook path (Helm does not apply resource-policy to hook resources); retention there is governed by helm.sh/hook-delete-policy=before-hook-creation, which deletes a hook resource ONLY when a new hook with the same name is created. SA / CR / CRB names are stable so they cycle on every install/upgrade (idempotent delete-and-recreate); Job names are chart-version-keyed so prior- version Jobs survive cross-version upgrades as Succeeded records. 
ttlSecondsAfterFinished is intentionally NOT set: the Job is part of GitOps desired state, so a Kubernetes TTL delete would make the resource missing relative to the rendered manifest and trigger an immediate recreate on the next reconcile.

Shell script is fail-closed (POSIX sh, alpine/k8s busybox):

- Exits 0 only when state is positively determined as a no-op: DRA DaemonSet absent (fresh install), driver DaemonSet absent (host-managed driver mode), or zero managed nodes labeled nvidia.com/gpu.deploy.driver=true.
- Exits 1 on every other failure: any classification kubectl-get fails, kubectl-wait for upgrade-done times out, rollout-restart fails, or rollout-status times out within 5m.
- Uses a capture-then-check pattern (KUBECTL_OUT=$(kubectl ... 2>&1) || { echo "..."; exit 1; }) to distinguish "API call failed" from "result is empty". POSIX sh lacks `set -o pipefail`, so the previous `kubectl ... | grep ... | head` patterns silently masked kubectl exit codes and the warning-only timeouts could exit 0 after an incomplete mitigation.
- backoffLimit: 2 on the Job is load-bearing: a transient apiserver hiccup auto-retries twice before surfacing as a failed release.

RBAC (ClusterRole + ClusterRoleBinding only):

- daemonsets.apps: get/list/watch/patch (cluster-wide; watch required by kubectl rollout status).
- nodes: get/list/watch (for the upgrade-done label wait).
- pods: get/list/watch (rollout-status pod readiness).

A namespaced Role was rejected because gpu-operator-post installs before nvidia-dra-driver-gpu's chart, so a Role inside the nvidia-dra-driver namespace would fail on a fresh cluster (namespace doesn't exist yet).

Image: docker.io/alpine/k8s:1.34.8 pinned by multi-arch manifest list digest sha256:4d352ba7479706876a62566c4a8eaaf44d8182d39ee456dbd884830df5e493be, matching the repo's digest-pinning standard. alpine/k8s ships kubectl plus BusyBox sh; the script targets POSIX /bin/sh (no [[…]], no pipefail).

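The capture-then-check pattern named in the fail-closed notes above can be sketched in plain POSIX sh. This is a minimal illustration and not the script shipped in the Job: the helper names `capture` and `classify` and the `CAPTURED` variable are hypothetical, and the sketch uses stand-in commands so it runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helpers illustrating capture-then-check; names are assumptions.

# capture runs a command, stores its combined stdout+stderr in CAPTURED, and
# returns 1 if the command itself failed. The exit status of an
# assignment-only command is the status of its command substitution, so the
# failure is not masked the way it would be at the head of a pipeline.
capture() {
  CAPTURED=$("$@" 2>&1) || {
    echo "command failed: $*: ${CAPTURED}" >&2
    return 1
  }
  return 0
}

# classify prints "failed", "empty", or "found" so the caller can tell
# "the call failed" apart from "the call succeeded with no results".
classify() {
  if ! capture "$@"; then
    echo "failed"
  elif [ -z "${CAPTURED}" ]; then
    echo "empty"
  else
    echo "found"
  fi
}

# Intended use inside a fail-closed script (illustrative, needs a cluster):
#   capture kubectl get nodes -l nvidia.com/gpu.deploy.driver=true -o name || exit 1
#   [ -z "${CAPTURED}" ] && exit 0   # positively determined no-op
```

By contrast, a pipeline such as `false | head -n 1` reports the status of `head`, which is why the piped form could let a failed API call pass as an empty result.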
The existing deploy.sh.tmpl block is intentionally kept as defense-in-depth for Helm-deployer users; the extra rollout-restart on an already-rolled DaemonSet is a no-op. A follow-up will evaluate removing the script-side block once this Job has shipped in a release and proven reliable.

Wiring lives on the gpu-operator componentRef in recipes/overlays/base.yaml so it cascades to every service overlay (EKS, GKE, AKS, OKE, Kind) via the additive manifestFiles merge.

Refs NVIDIA#980

Related: NVIDIA#965 (deploy.sh block this Job generalizes), NVIDIA#973 (orthogonal durability for the chart-version annotation), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug).

Cross-review iteration 3 (cross-deployer gating + values-only re-fire):

- [P1] Cross-deployer install-time gating wired across every bundler path. The prior revision relied on the Job's eventual rollout-restart to self-correct the transient race; without install-time gating, downstream releases (notably nvidia-dra-driver-gpu) could install against stale-NVML driver state for the 15-20m window between gpu-operator-post applying its chart resources and the Job's kubectl wait + rollout restart returning. Fix:
  - pkg/bundler/deployer/helm/templates/deploy.sh.tmpl: per-name case for gpu-operator-post sets COMPONENT_HELM_TIMEOUT=25m + COMPONENT_EXTRA_WAIT_ARGS=--wait-for-jobs + COMPONENT_FORCE_WAIT=true (overrides --no-wait: the flag is a readiness-convenience knob, not a correctness-bypass knob; components that exist specifically to close a known race must always wait). Vendored gpu-operator case lifted to 25m for the --vendor-charts path where the Job ships as a hook on the primary chart.
  - pkg/bundler/deployer/helmfile/releases.go: new releaseOverrides map keyed by synthesized folder name; gpu-operator-post -> {timeout: 25m, waitForJobs: true}; gpu-operator (vendored coverage) -> {timeout: 25m}. Mirrors the deploy.sh case switch.
  - pkg/bundler/deployer/flux/helm.go: new Timeout field on HelmReleaseData / ChartRefHelmReleaseData + releaseTimeoutOverrides map (gpu-operator-post -> 25m). helmrelease.yaml.tmpl and helmrelease-chartref.yaml.tmpl emit spec.timeout conditionally. helm-controller already waits for hook Jobs by default, so no waitForJobs knob is needed on this path, only the outer timeout.
  - Argo CD path is naturally gated by sync-wave + Job-Healthy default: app-of-apps waits for each child Application to reach Healthy, which requires the Job's status.succeeded. No explicit wiring needed.
- [medium] Driver-version-only overrides now re-fire the Job on every deployer. The prior key was _aicrParentChartVersion (= gpu-operator chart version) only. On non-hook deployers (Helmfile / Argo CD / argocd-helm where stripHelmHooks removes the upgrade-hook annotations), the Job is a regular chart resource: kubectl apply on the same name is a no-op, so --set gpuoperator:driver.version=<X> did NOT re-run the migration mitigation when the gpu-operator chart pin was unchanged. Fix: new draDriverVersionValueKey = _aicrDriverVersion is injected alongside the parent-chart version (reading the merged componentValues at driver.version, so any source, whether registry default, overlay, or user --set, flows through). The Job-name slug becomes "<chartSlug>-d<driverSlug>" when driver.version is set; host-managed-driver overlays (driver.enabled=false, version unset) keep the original 1-component shape. The chart-version label (aicr.nvidia.com/gpu-operator-chart-version) stays semantically tied to the chart pin only.
- [low] Empty gpu-operator.Version no longer produces an unbundleable manifest. The prior helper early-returned on empty Version, but the hook ships unconditionally on the gpu-operator componentRef, so the template would render "<nil>" into the Job name and chart-version label, both invalid as DNS-1123 / label values.
  Fix: fall through with NormalizeVersionWithDefault's "0.1.0" default + slog.Warn so operators get a debuggable signal of the underlying recipe gap.
- User-facing doc: --no-wait correctness-gate exception documented in docs/user/cli-reference.md (deploy-script section) and pkg/bundler/deployer/helm/templates/README.md.tmpl so operators reading the doc know the flag does not bypass gpu-operator-post.
- Tests: TestInjectDRAParentChartVersionValue_DriverVersionKey, _DriverVersionAbsent, and _EmptyVersionFallback added to pin the new behavior. All existing bundler / manifest / recipes tests pass with -race; golangci-lint clean.

End-to-end render verified on a real EKS H100 training recipe: gpu-operator componentValues carries

  _aicrParentChartVersion: 26.3.1
  _aicrDriverVersion: 580.126.20

and the rendered Job name is aicr-dra-rollout-hook-26-3-1-d580-126-20-r1, so a driver.version-only override now produces a uniquely-named Job and triggers re-fire on every deployer.

Document known limits of the cross-deployer gate:

- Argo CD app-of-apps. Sync-waves give ordering but not gating after Argo CD 1.8 removed default Health assessment for argoproj.io/Application (https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/#health-assessment-of-argoprojioapplication-crd-has-been-removed). Restoring gating needs a resource.customizations Lua block on argocd-cm, which is out of scope for the bundle (cluster-wide config). Without it, Argo CD users fall back to the Job's fail-closed semantics: exit 1 surfaces gpu-operator-post Application as unHealthy, but nvidia-dra-driver-gpu may still sync in parallel.
- Post-bundle driver.version edits. The Job-name slug is baked at bundle time by pkg/manifest.Render via pkg/bundler/deployer/localformat/local_helm.go:155, so a driver.version change applied after `aicr bundle` does not update the Job name (re-bundle picks it up; install-time `--set` / `--dynamic` does not).
  Closing this fully needs deferring this manifest's render to helm install time, a structural change to the localformat render pipeline tracked as a follow-up.

PR description updated to reflect both limitations and the already-wired install-time gating across helm / helmfile / Flux.
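The "<chartSlug>-d<driverSlug>" Job-name shape described in the commit message can be illustrated with a short POSIX sh sketch. The real slug is computed by the bundler at bundle time, so this is only an approximation: the function names `slugify` and `job_slug` are hypothetical, replacing dots with hyphens is inferred from the rendered name quoted above, and the trailing "-r1" in that name is not explained in the message, so the sketch does not attempt to reproduce it.

```shell
#!/bin/sh
# Hypothetical sketch of the Job-name slug; not the bundler's implementation.

# slugify turns a version into a DNS-1123-friendly fragment by replacing
# dots with hyphens (assumed from the rendered name 26-3-1-d580-126-20).
slugify() {
  printf '%s' "$1" | tr '.' '-'
}

# job_slug prints "<chartSlug>-d<driverSlug>" when a driver version is given,
# and the one-component "<chartSlug>" shape when it is unset or empty
# (the host-managed-driver case).
job_slug() {
  chart_version=$1
  driver_version=${2:-}
  slug=$(slugify "${chart_version}")
  if [ -n "${driver_version}" ]; then
    slug="${slug}-d$(slugify "${driver_version}")"
  fi
  printf '%s\n' "${slug}"
}
```

Because the slug changes whenever either input changes, a driver-only override yields a new Job name, which is what makes a plain apply create and run a fresh Job on deployers where the hook annotations are stripped.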

yuanchen8911 requested review from a team as code owners May 19, 2026 14:49

yuanchen8911 added the enhancement and area/recipes labels May 19, 2026

yuanchen8911 changed the title ~~chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20~~ WIP: chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 May 19, 2026

yuanchen8911 marked this pull request as draft May 19, 2026 14:49

github-actions Bot added area/docs size/S labels May 19, 2026

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 24ed5c8 to 1d586e9 Compare May 19, 2026 16:57

github-actions Bot added area/tests size/M and removed size/S labels May 19, 2026

This was referenced May 19, 2026

chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump) #973

Closed

chore(recipes): make resolved recipes the single source of truth for chart versions #966

Open

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch 3 times, most recently from 8f2e66e to 1cefd48 Compare May 19, 2026 18:19

coderabbitai Bot reviewed May 19, 2026

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 1cefd48 to 21778df Compare May 19, 2026 18:33

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 21778df to 9dc9fb9 Compare May 19, 2026 20:32

yuanchen8911 marked this pull request as ready for review May 19, 2026 22:08

yuanchen8911 marked this pull request as draft May 19, 2026 22:09

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 9dc9fb9 to bbb6ced Compare May 19, 2026 22:13

mchmarny assigned yuanchen8911 May 19, 2026

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from bbb6ced to bccefa7 Compare May 19, 2026 22:40

github-actions Bot added the area/bundler label May 19, 2026

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 97c0698 to 9cae76f Compare May 20, 2026 18:35

yuanchen8911 changed the title ~~WIP: chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20~~ chore(recipes): bump gpu-operator chart to v26.3.1 and driver to 580.126.20 May 20, 2026

yuanchen8911 marked this pull request as ready for review May 20, 2026 18:35

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch 2 times, most recently from 7ec5dc9 to 570bf0a Compare May 20, 2026 20:26

mchmarny approved these changes May 21, 2026

mchmarny enabled auto-merge (squash) May 21, 2026 01:42

yuanchen8911 force-pushed the chore/gpu-operator-v26.3.1 branch from 8e73cc2 to 7c920d0 Compare May 21, 2026 13:57

mchmarny merged commit 496e2db into NVIDIA:main May 21, 2026
142 checks passed

mchmarny mentioned this pull request May 23, 2026

[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane #982

Closed

3 tasks

coderabbitai Bot mentioned this pull request May 28, 2026

fix(recipes): address BCM overlay gaps from H200 NVL validation #1089

Merged

11 tasks
