feat(bundler): derive DRA chart-version annotation from resolved recipe (#973)#1033
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR injects the resolved gpu-operator chart version into nvidia-dra-driver-gpu componentValues (controller.podAnnotations and kubeletPlugin.podAnnotations) during bundler execution. The injection runs in DefaultBundler.Make immediately after extractComponentValues and only when both gpu-operator and nvidia-dra-driver-gpu are enabled. Tests (unit and cross-deployer parity) validate positive and negative cases. The manual annotation was removed from the DRA recipe values.yaml. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
5e6615a to
74e4f1f
Compare
74e4f1f to
6b7da4b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_dra_annotation_parity_test.go`:
- Line 103: Replace the use of context.Background() passed to b.Make with a
timeout-bounded context: create a context with timeout (e.g.,
context.WithTimeout(ctx, 30*time.Second)), defer the cancel, and pass that
context to b.Make (replace occurrences where b.Make(context.Background(), rr,
tmpDir) appears, including the second occurrence around line 161); ensure you
import time and use the new ctx variable so the test’s I/O call is bounded and
the cancel is deferred.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7791223e-994c-4da9-a089-6d4efc469582
📒 Files selected for processing (4)
pkg/bundler/bundler.gopkg/bundler/bundler_dra_annotation_parity_test.gopkg/bundler/bundler_dra_annotation_test.gorecipes/components/nvidia-dra-driver-gpu/values.yaml
6b7da4b to
8b5ef74
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Clean refactor — moves the chart-version annotation from a hand-maintained values entry to a bundler-derived injection, killing the drift risk that #973 called out. Helper mirrors injectGKECriticalPriorityQuotas for shape consistency, and the function header doc is unusually clear about why the annotation exists, when it fires, and where the injection point sits in Make. Test coverage is thorough (positive, both gating directions, preserves-existing, overrides --set, version-string variants, nil-tolerance, lazy-map creation).
Two low-severity suggestions and two nits inline; no blockers:
- Edge case worth pinning: gpu-operator present in refs but with empty
Versionis currently a silent skip — consider aslog.Warn+ a test row to distinguish "disabled" from "enabled-with-missing-version". - Parity test covers helm + helmfile; the structural argument for the other three deployers is sound, but extending the table to all five would also catch bundle-layout drift (argocd/argocd-helm in particular emit a different shape).
All 102 CI checks green. Mergeable state shows "blocked" — looks like branch protection / required reviewer, not a CI failure.
8b5ef74 to
229946b
Compare
…pe (NVIDIA#973) Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that reads the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations of the rendered DRA values. The annotation cannot drift from the actual chart pin any more — a gpu-operator chart bump produces a rendered pod-template diff automatically, helm upgrade (and every other deployer) re-rolls the DaemonSet, and the kubelet plugin's NVML handle rebinds to the post-migration driver state. Why this matters PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with "Driver/library version mismatch" → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation worked, but only as long as a maintainer remembered to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins. Implementation injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (Make has already filtered out disabled components by this point), extracts the gpu-operator chart version, and gates on both gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved. Injection happens in DefaultBundler.Make AFTER extractComponentValues — so the values map is populated and user --set overrides have already landed — and BEFORE buildDeployer — so every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives the same final componentValues map. Placing the call after --set means a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. The injection mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers familiar with that pattern read this one for free. Tests - Unit tests (bundler_dra_annotation_test.go): positive case both pod paths, negative case DRA disabled, negative case gpu-operator disabled, preserves-existing-values, overrides-user-set, version-variants (v-prefixed / unprefixed / pre-release / build metadata), nil-tolerant inputs, lazy-DRA-values-map creation. Helper coverage: 100%. - Generated-artifact parity test (bundler_dra_annotation_parity_test.go): bundles the same recipe through the default Helm deployer AND the helmfile deployer; asserts the annotation lands in the rendered DRA values.yaml for BOTH outputs. Catches the cross-deployer parity risk if a future refactor moves the injection into a deployer-specific render path. Plus a disabled-recipe negative case that asserts the gpu-operator values do not leak any DRA-related annotation when DRA is filtered out. Closes NVIDIA#973 Related: NVIDIA#965 (the chart-version annotation this durabilizes), NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894 (chart bump that first surfaced the stale-NVML class of bug)
229946b to
0c7b7f9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_dra_annotation_test.go`:
- Around line 29-335: Consolidate the multiple
TestInjectDRAChartVersionAnnotation_* functions (PositiveCase,
DRAComponentDisabled, GPUOperatorDisabled, GPUOperatorEmptyVersion,
PreservesExistingValues, OverridesUserSet, NilInputs, LazyDRAValuesMap) into a
single table-driven test that iterates over a slice of cases and runs t.Run for
each, keeping TestInjectDRAChartVersionAnnotation_VersionVariants as-is; for
each case define the input componentValues and recipe.RecipeResult, the name,
and the expected assertions (e.g. expected created keys, preserved fields,
overridden annotation, or no-op), then call
b.injectDRAChartVersionAnnotation(componentValues, rr) and perform the same
assertions currently in the individual tests using the case data so the
arrange/call/assert pattern is unified and duplication removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3dc5210c-0b0f-4df2-b228-4edf493ca07e
📒 Files selected for processing (4)
pkg/bundler/bundler.gopkg/bundler/bundler_dra_annotation_parity_test.gopkg/bundler/bundler_dra_annotation_test.gorecipes/components/nvidia-dra-driver-gpu/values.yaml
| func TestInjectDRAChartVersionAnnotation_PositiveCase(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| draComponentName: {}, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: "v26.4.0"}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| for _, podPath := range []string{"controller", "kubeletPlugin"} { | ||
| got := dig(componentValues[draComponentName], podPath, "podAnnotations", draChartVersionAnnotation) | ||
| if got != "v26.4.0" { | ||
| t.Errorf("podAnnotations[%s][%s] = %v, want v26.4.0", | ||
| podPath, draChartVersionAnnotation, got) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_DRAComponentDisabled pins the | ||
| // gating: when nvidia-dra-driver-gpu is absent from the filtered | ||
| // ComponentRefs (already filtered out as disabled by the caller), the | ||
| // helper is a no-op even if gpu-operator is enabled. This is the | ||
| // "negative case" the issue's acceptance criteria call out — proves | ||
| // the gating is correct and no warning fires. | ||
| func TestInjectDRAChartVersionAnnotation_DRAComponentDisabled(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: "v26.4.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| if _, ok := componentValues[draComponentName]; ok { | ||
| t.Errorf("expected no DRA entry in componentValues when DRA component is disabled, got %v", | ||
| componentValues[draComponentName]) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_GPUOperatorDisabled pins the | ||
| // mirror gating: when gpu-operator is absent (disabled or filtered | ||
| // out), no annotation is written because there's no chart version to | ||
| // mirror. | ||
| func TestInjectDRAChartVersionAnnotation_GPUOperatorDisabled(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| draComponentName: {}, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| if got := componentValues[draComponentName]; len(got) != 0 { | ||
| t.Errorf("expected unchanged DRA values when gpu-operator is disabled, got %v", got) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_GPUOperatorEmptyVersion pins the | ||
| // defensive branch: when gpu-operator IS enabled but its Version field | ||
| // is empty (the resolver shouldn't produce this in practice — see the | ||
| // helper docstring for why), the helper logs a warning and skips | ||
| // injection rather than writing an empty annotation value that would | ||
| // itself lock the DaemonSet to a meaningless pin and erase the | ||
| // "annotation never drifts from the chart pin" guarantee. | ||
| func TestInjectDRAChartVersionAnnotation_GPUOperatorEmptyVersion(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| draComponentName: {}, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: ""}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| if got := componentValues[draComponentName]; len(got) != 0 { | ||
| t.Errorf("expected unchanged DRA values when gpu-operator Version is empty, got %v", got) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_PreservesExistingValues | ||
| // documents the additive contract: priorityClassName and any | ||
| // pre-existing podAnnotations from the values file (or from earlier | ||
| // override layers) survive injection. The helper only writes the one | ||
| // annotation key and leaves the rest of the controller / kubeletPlugin | ||
| // sections untouched. | ||
| func TestInjectDRAChartVersionAnnotation_PreservesExistingValues(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| draComponentName: { | ||
| "controller": map[string]any{ | ||
| "priorityClassName": "system-cluster-critical", | ||
| "podAnnotations": map[string]any{ | ||
| "operator.example.com/other-key": "preserved", | ||
| }, | ||
| }, | ||
| "kubeletPlugin": map[string]any{ | ||
| "priorityClassName": "system-node-critical", | ||
| }, | ||
| }, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: "v26.4.0"}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| dra := componentValues[draComponentName] | ||
| if got := dig(dra, "controller", "priorityClassName"); got != "system-cluster-critical" { | ||
| t.Errorf("controller.priorityClassName = %v, want system-cluster-critical", got) | ||
| } | ||
| if got := dig(dra, "controller", "podAnnotations", "operator.example.com/other-key"); got != "preserved" { | ||
| t.Errorf("controller.podAnnotations[operator.example.com/other-key] = %v, want preserved", got) | ||
| } | ||
| if got := dig(dra, "controller", "podAnnotations", draChartVersionAnnotation); got != "v26.4.0" { | ||
| t.Errorf("controller.podAnnotations[%s] = %v, want v26.4.0", draChartVersionAnnotation, got) | ||
| } | ||
| if got := dig(dra, "kubeletPlugin", "priorityClassName"); got != "system-node-critical" { | ||
| t.Errorf("kubeletPlugin.priorityClassName = %v, want system-node-critical", got) | ||
| } | ||
| if got := dig(dra, "kubeletPlugin", "podAnnotations", draChartVersionAnnotation); got != "v26.4.0" { | ||
| t.Errorf("kubeletPlugin.podAnnotations[%s] = %v, want v26.4.0", draChartVersionAnnotation, got) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_OverridesUserSet pins the | ||
| // "internal annotation always reflects the actual chart version" | ||
| // invariant. A user --set that wrote a stale value into the | ||
| // annotation must be overwritten by the bundler-derived value; | ||
| // otherwise the rollout-trigger semantics break and the durable fix | ||
| // degrades into the same drift the manual annotation had. Injection | ||
| // happens AFTER extractComponentValues for exactly this reason. | ||
| func TestInjectDRAChartVersionAnnotation_OverridesUserSet(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| draComponentName: { | ||
| "controller": map[string]any{ | ||
| "podAnnotations": map[string]any{ | ||
| draChartVersionAnnotation: "v25.10.1-stale-from-user-set", | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: "v26.4.0"}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation) | ||
| if got != "v26.4.0" { | ||
| t.Errorf("expected user --set value to be overridden by injection: got %v, want v26.4.0", got) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_VersionVariants documents that | ||
| // the helper mirrors whatever string the resolver wrote into the | ||
| // gpu-operator ComponentRef. Helm chart pins ship with and without a | ||
| // leading "v"; the annotation tracks the exact string so the rendered | ||
| // pod-template diff reliably fires on a chart-pin change of any | ||
| // shape. | ||
| func TestInjectDRAChartVersionAnnotation_VersionVariants(t *testing.T) { | ||
| tests := []struct { | ||
| name string | ||
| version string | ||
| }{ | ||
| {"v-prefixed", "v26.4.0"}, | ||
| {"unprefixed", "26.4.0"}, | ||
| {"pre-release", "v26.5.0-beta.1"}, | ||
| {"build metadata", "v26.4.0+nvidia.1"}, | ||
| } | ||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| draComponentName: {}, | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: tt.version}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
| got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation) | ||
| if got != tt.version { | ||
| t.Errorf("controller annotation = %v, want %v", got, tt.version) | ||
| } | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_NilInputs documents the | ||
| // nil-tolerant contract. The bundler's Make method only calls this | ||
| // after extractComponentValues returns a non-nil map and pretty much | ||
| // never with a nil recipe, but defensive nil-handling keeps the | ||
| // helper safe in unit tests and future callers. | ||
| func TestInjectDRAChartVersionAnnotation_NilInputs(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
| tests := []struct { | ||
| name string | ||
| cv map[string]map[string]any | ||
| rr *recipe.RecipeResult | ||
| }{ | ||
| {"nil componentValues", nil, &recipe.RecipeResult{}}, | ||
| {"nil recipe result", map[string]map[string]any{}, nil}, | ||
| {"both nil", nil, nil}, | ||
| } | ||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| // Should not panic. | ||
| b.injectDRAChartVersionAnnotation(tt.cv, tt.rr) | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| // TestInjectDRAChartVersionAnnotation_LazyDRAValuesMap pins one | ||
| // implementation detail worth fixing in place: when the DRA component | ||
| // is enabled but its componentValues entry hasn't been initialized | ||
| // yet (no values file, no overrides), the helper still injects by | ||
| // creating the nested map. This matters because every-deployer parity | ||
| // requires the annotation to appear in the rendered chart even when | ||
| // the DRA chart has no other values configured. | ||
| func TestInjectDRAChartVersionAnnotation_LazyDRAValuesMap(t *testing.T) { | ||
| b, err := New() | ||
| if err != nil { | ||
| t.Fatalf("New() error = %v", err) | ||
| } | ||
|
|
||
| componentValues := map[string]map[string]any{ | ||
| gpuOperatorComponentName: {}, | ||
| // nvidia-dra-driver-gpu key intentionally absent. | ||
| } | ||
| rr := &recipe.RecipeResult{ | ||
| ComponentRefs: []recipe.ComponentRef{ | ||
| {Name: gpuOperatorComponentName, Version: "v26.4.0"}, | ||
| {Name: draComponentName, Version: "25.12.0"}, | ||
| }, | ||
| } | ||
|
|
||
| b.injectDRAChartVersionAnnotation(componentValues, rr) | ||
|
|
||
| if _, ok := componentValues[draComponentName]; !ok { | ||
| t.Fatalf("expected helper to create DRA values entry on demand") | ||
| } | ||
| got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation) | ||
| if got != "v26.4.0" { | ||
| t.Errorf("controller annotation = %v, want v26.4.0", got) | ||
| } | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
f="pkg/bundler/bundler_dra_annotation_test.go"
if [ ! -f "$f" ]; then
echo "File not found: $f" >&2
exit 1
fi
# Print file with line numbers (trimmed to reasonable sizes by focusing around the test functions)
echo "== File stats =="
wc -l "$f"
echo "== Functions in file (names) =="
rg -n "func TestInjectDRAChartVersionAnnotation_" "$f" || true
rg -n "func dig\(" "$f" || true
rg -n "injectDRAChartVersionAnnotation" "$f" || true
echo "== Show relevant sections around test functions =="
# Show from start through a bit past the last test in the snippet range
sed -n '1,220p' "$f"
echo "-----"
sed -n '220,520p' "$f"Repository: NVIDIA/aicr
Length of output: 13804
Consolidate the TestInjectDRAChartVersionAnnotation_* scenarios into a single table-driven suite.
Most of these tests (PositiveCase, DRAComponentDisabled, GPUOperatorDisabled, GPUOperatorEmptyVersion, PreservesExistingValues, OverridesUserSet, NilInputs, LazyDRAValuesMap) follow the same arrange/call/assert shape with only input maps and expected outcomes changing; converting them into one tests := []struct{...} with subtests would reduce duplication and keep future scenarios consistent. (TestInjectDRAChartVersionAnnotation_VersionVariants already uses this pattern and can serve as the template.)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/bundler/bundler_dra_annotation_test.go` around lines 29 - 335,
Consolidate the multiple TestInjectDRAChartVersionAnnotation_* functions
(PositiveCase, DRAComponentDisabled, GPUOperatorDisabled,
GPUOperatorEmptyVersion, PreservesExistingValues, OverridesUserSet, NilInputs,
LazyDRAValuesMap) into a single table-driven test that iterates over a slice of
cases and runs t.Run for each, keeping
TestInjectDRAChartVersionAnnotation_VersionVariants as-is; for each case define
the input componentValues and recipe.RecipeResult, the name, and the expected
assertions (e.g. expected created keys, preserved fields, overridden annotation,
or no-op), then call b.injectDRAChartVersionAnnotation(componentValues, rr) and
perform the same assertions currently in the individual tests using the case
data so the arrange/call/assert pattern is unified and duplication removed.
Summary
Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-versionannotation inrecipes/components/nvidia-dra-driver-gpu/values.yamlwith a bundler-side injection that derives the value from the resolvedgpu-operatorcomponentRef and writes it onto bothcontrollerandkubeletPluginpodAnnotationsin the rendered DRA values. The annotation can no longer drift from the actual chart pin.Motivation / Context
PR #965 closed the stale-NVML class of bug (gpu-operator chart bump →
k8s-driver-managerreloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails withDriver/library version mismatch→ DRA-allocated pods stuck inContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation works, but only as long as a maintainer remembers to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently.make qualifydid not catch the drift because no static check verified the two pins.Fixes: #973
Related: #965 (the chart-version annotation this durabilizes), #980 (orthogonal cross-deployer corrective restart), #894 (chart bump that first surfaced the stale-NVML class of bug)
Type of Change
Component(s) Affected
pkg/bundler,pkg/component/*) — newinjectDRAChartVersionAnnotationhelper + call site inDefaultBundler.Makepkg/recipe) — no Go change, onlyrecipes/components/nvidia-dra-driver-gpu/values.yaml(stripped hand-maintained chart-version string)Implementation Notes
Helper.
injectDRAChartVersionAnnotationiterates the filteredrecipeResult.ComponentRefs(disabled components are removed upstream inDefaultBundler.Makeatpkg/bundler/bundler.go:219-257), extracts the gpu-operator chart version, and gates on BOTHgpu-operatorandnvidia-dra-driver-gpubeing enabled. When both are present, it writesaicr.nvidia.com/gpu-operator-chart-versionontocontrollerandkubeletPluginpodAnnotationsof the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved.Injection point. In
DefaultBundler.MakeAFTERextractComponentValues— so the values map is populated and user--setoverrides have already landed — and BEFOREbuildDeployer— so every deployer receives the same finalcomponentValuesmap. Placing the call AFTER--setis deliberate: a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break.TestInjectDRAChartVersionAnnotation_OverridesUserSetpins this invariant.Cross-deployer parity (structural). All five supported deployers consume
Generator.ComponentValues[ref.Name]from the same post-injection map:pkg/bundler/deployer/helm/helm.go:144pkg/bundler/deployer/helmfile/helmfile.go:136pkg/bundler/deployer/flux/helm.go:130pkg/bundler/deployer/argocd/argocd.go:552pkg/bundler/deployer/argocdhelm/argocdhelm.go:387The parity test (see Testing) directly exercises both internal consumption shapes — inline values marshal (helmfile) and
localformat.Component.Valueswrite (Helm, reused by argocd / argocd-helm) — and asserts the annotation lands in the renderedvalues.yaml. The remaining deployers are guaranteed by the single-map design; renamingGenerator.ComponentValueswould break compilation everywhere simultaneously.Pattern. Mirrors the GKE critical-priority quota helper (
pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so readers familiar with that pattern can pick this one up without context-switching.Recipe values change.
recipes/components/nvidia-dra-driver-gpu/values.yamlno longer carries the chart-version string at all. The old comment block is replaced with a short pointer to this helper.controller.priorityClassName: ""andkubeletPlugin.priorityClassName: ""remain unchanged.Behavior Delta
Standalone-DRA recipes (DRA enabled, gpu-operator disabled) no longer render the annotation, by design. The previously hand-maintained annotation lived inside
recipes/components/nvidia-dra-driver-gpu/values.yaml, so it was written unconditionally whenever the DRA component was bundled — including the (currently unused) case where DRA is deployed without gpu-operator (for example, a host-installed driver via--set gpuoperator:enabled=false). The new helper gates on BOTH components being enabled, so a standalone-DRA bundle renders without the annotation.This is the intended semantic: the annotation exists solely to force a DaemonSet re-roll when the gpu-operator chart pin bumps. With gpu-operator disabled, there is no chart pin to mirror and the rollout trigger has nothing to fire on. No in-tree overlay currently exercises this configuration, so the change is theoretical for today's recipes; it is the right shape going forward.
Testing
Tests added:
pkg/bundler/bundler_dra_annotation_test.go):priorityClassName+ unrelated annotations under both pod paths--setof the annotation key (proves the "must reflect actual chart version" invariant)v-prefixed, unprefixed, pre-release, build metadatacomponentValues[nvidia-dra-driver-gpu]when the entry isn't initializedpkg/bundler/bundler_dra_annotation_parity_test.go):values.yamlfor both outputs. These two deployers exercise both internal consumption shapes; Flux, Argo CD, and argocd-helm are guaranteed by the single-map design described in Implementation Notes.Risk Assessment
gpu-operator v26.3.1 / nvidia-dra-driver-gpu v25.12.0pair is byte-identical to the prior hand-maintained annotation. Easy revert (single function + call site + values comment).Rollout notes: No flag, no opt-in. The next gpu-operator chart bump after this PR lands gets the annotation derived automatically; no values-file edit needed. Existing bundles regenerated from the same gpu-operator pin produce the same output as today. See Behavior Delta for the only semantic change (annotation absent when gpu-operator is disabled).
Checklist
make testwith-race) — viamake qualifymake lint) — viamake qualifyinjectGKECriticalPriorityQuotasgit commit -S)