Skip to content

feat(bundler): derive DRA chart-version annotation from resolved recipe (#973)#1033

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/973-bundler-derived-dra-annotation
May 27, 2026
Merged

feat(bundler): derive DRA chart-version annotation from resolved recipe (#973)#1033
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/973-bundler-derived-dra-annotation

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented May 26, 2026

Summary

Replace the hand-maintained aicr.nvidia.com/gpu-operator-chart-version annotation in recipes/components/nvidia-dra-driver-gpu/values.yaml with a bundler-side injection that derives the value from the resolved gpu-operator componentRef and writes it onto both controller and kubeletPlugin podAnnotations in the rendered DRA values. The annotation can no longer drift from the actual chart pin.

Motivation / Context

PR #965 closed the stale-NVML class of bug (gpu-operator chart bump → k8s-driver-manager reloads kernel modules async → DRA kubelet plugin's NVML handle goes stale → CDI spec generation fails with Driver/library version mismatch → DRA-allocated pods stuck in ContainerCreating) by hard-coding the gpu-operator chart version into the DRA pod-template annotation. The annotation works, but only as long as a maintainer remembers to bump both pins in lockstep. A future PR that bumped gpu-operator and forgot the annotation would produce identical rendered DaemonSet manifests, helm upgrade would skip the re-roll, and stale NVML would return silently. make qualify did not catch the drift because no static check verified the two pins.

Fixes: #973
Related: #965 (the chart-version annotation this durabilizes), #980 (orthogonal cross-deployer corrective restart), #894 (chart bump that first surfaced the stale-NVML class of bug)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Refactoring (no functional changes) — net behavior is identical when gpu-operator chart pin matches the value previously hand-edited; durability change makes future drift impossible

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*) — new injectDRAChartVersionAnnotation helper + call site in DefaultBundler.Make
  • Recipe engine / data (pkg/recipe) — no Go change, only recipes/components/nvidia-dra-driver-gpu/values.yaml (stripped hand-maintained chart-version string)

Implementation Notes

Helper. injectDRAChartVersionAnnotation iterates the filtered recipeResult.ComponentRefs (disabled components are removed upstream in DefaultBundler.Make at pkg/bundler/bundler.go:219-257), extracts the gpu-operator chart version, and gates on BOTH gpu-operator and nvidia-dra-driver-gpu being enabled. When both are present, it writes aicr.nvidia.com/gpu-operator-chart-version onto controller and kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map, creating nested maps lazily so existing values (priorityClassName, other annotations) are preserved.

Injection point. In DefaultBundler.Make AFTER extractComponentValues — so the values map is populated and user --set overrides have already landed — and BEFORE buildDeployer — so every deployer receives the same final componentValues map. Placing the call AFTER --set is deliberate: a user override of this specific annotation key is intentionally NOT honored; the annotation must always reflect the actual resolved gpu-operator chart version, or the rollout-trigger semantics break. TestInjectDRAChartVersionAnnotation_OverridesUserSet pins this invariant.

Cross-deployer parity (structural). All five supported deployers consume Generator.ComponentValues[ref.Name] from the same post-injection map:

Deployer Consumption site
Helm pkg/bundler/deployer/helm/helm.go:144
helmfile pkg/bundler/deployer/helmfile/helmfile.go:136
Flux pkg/bundler/deployer/flux/helm.go:130
Argo CD pkg/bundler/deployer/argocd/argocd.go:552
argocd-helm pkg/bundler/deployer/argocdhelm/argocdhelm.go:387

The parity test (see Testing) directly exercises both internal consumption shapes — inline values marshal (helmfile) and localformat.Component.Values write (Helm, reused by argocd / argocd-helm) — and asserts the annotation lands in the rendered values.yaml. The remaining deployers are guaranteed by the single-map design; renaming Generator.ComponentValues would break compilation everywhere simultaneously.

Pattern. Mirrors the GKE critical-priority quota helper (pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so readers familiar with that pattern can pick this one up without context-switching.

Recipe values change. recipes/components/nvidia-dra-driver-gpu/values.yaml no longer carries the chart-version string at all. The old comment block is replaced with a short pointer to this helper. controller.priorityClassName: "" and kubeletPlugin.priorityClassName: "" remain unchanged.

Behavior Delta

Standalone-DRA recipes (DRA enabled, gpu-operator disabled) no longer render the annotation, by design. The previously hand-maintained annotation lived inside recipes/components/nvidia-dra-driver-gpu/values.yaml, so it was written unconditionally whenever the DRA component was bundled — including the (currently unused) case where DRA is deployed without gpu-operator (for example, a host-installed driver via --set gpuoperator:enabled=false). The new helper gates on BOTH components being enabled, so a standalone-DRA bundle renders without the annotation.

This is the intended semantic: the annotation exists solely to force a DaemonSet re-roll when the gpu-operator chart pin bumps. With gpu-operator disabled, there is no chart pin to mirror and the rollout trigger has nothing to fire on. No in-tree overlay currently exercises this configuration, so the change is theoretical for today's recipes; it is the right shape going forward.

Testing

unset GITLAB_TOKEN ; make qualify
# All 22 chainsaw tests pass.
# Coverage: 76.9% (threshold: 75%) — passes the project floor.
# Vulnerability scan: clean.

GOFLAGS="-mod=vendor" go test -coverprofile=/tmp/cover.out ./pkg/bundler
# pkg/bundler coverage 67.0% → 69.0% (+2.0% on this PR)
# injectDRAChartVersionAnnotation: 100% statement coverage

Tests added:

  • Unit tests (pkg/bundler/bundler_dra_annotation_test.go):
    • positive case both pod paths, annotation matches resolved gpu-operator version
    • negative case nvidia-dra-driver-gpu disabled → no-op, no DRA entry materialized
    • negative case gpu-operator disabled → no-op (pins the Behavior Delta above)
    • preserves existing priorityClassName + unrelated annotations under both pod paths
    • overrides user --set of the annotation key (proves the "must reflect actual chart version" invariant)
    • version variants: v-prefixed, unprefixed, pre-release, build metadata
    • nil-tolerant inputs (nil componentValues, nil recipe, both nil)
    • lazy-creation of componentValues[nvidia-dra-driver-gpu] when the entry isn't initialized
  • Generated-artifact parity test (pkg/bundler/bundler_dra_annotation_parity_test.go):
    • bundles the same recipe through the Helm deployer AND the helmfile deployer, asserts the annotation lands in the rendered DRA values.yaml for both outputs. These two deployers exercise both internal consumption shapes; Flux, Argo CD, and argocd-helm are guaranteed by the single-map design described in Implementation Notes.
    • disabled-recipe negative case: when nvidia-dra-driver-gpu is filtered out, the gpu-operator values must not leak any DRA-related annotation key

Risk Assessment

  • Low — Isolated bundler-side helper with 100% unit coverage and a cross-deployer parity test. Net rendered output for the current gpu-operator v26.3.1 / nvidia-dra-driver-gpu v25.12.0 pair is byte-identical to the prior hand-maintained annotation. Easy revert (single function + call site + values comment).

Rollout notes: No flag, no opt-in. The next gpu-operator chart bump after this PR lands gets the annotation derived automatically; no values-file edit needed. Existing bundles regenerated from the same gpu-operator pin produce the same output as today. See Behavior Delta for the only semantic change (annotation absent when gpu-operator is disabled).

Checklist

  • Tests pass locally (make test with -race) — via make qualify
  • Linter passes (make lint) — via make qualify
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed — no user-facing surface; recipe values comment refreshed
  • Changes follow existing patterns in the codebase — mirrors injectGKECriticalPriorityQuotas
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested review from a team as code owners May 26, 2026 18:35
@yuanchen8911 yuanchen8911 marked this pull request as draft May 26, 2026 18:37
@yuanchen8911 yuanchen8911 changed the title feat(bundler): derive DRA chart-version annotation from resolved recipe (#973) WIP: feat(bundler): derive DRA chart-version annotation from resolved recipe (#973) May 26, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR injects the resolved gpu-operator chart version into nvidia-dra-driver-gpu componentValues (controller.podAnnotations and kubeletPlugin.podAnnotations) during bundler execution. The injection runs in DefaultBundler.Make immediately after extractComponentValues and only when both gpu-operator and nvidia-dra-driver-gpu are enabled. Tests (unit and cross-deployer parity) validate positive and negative cases. The manual annotation was removed from the DRA recipe values.yaml.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

area/tests, size/L

Suggested reviewers

  • mchmarny
  • lockwobr
  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: deriving the DRA chart-version annotation from the resolved recipe rather than hand-maintaining it.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering summary, motivation, implementation, testing, and behavior changes.
Linked Issues check ✅ Passed The PR fully implements all primary objectives from issue #973: bundler-side injection of the annotation from resolved gpu-operator version, gating on both components being enabled, placement after extractComponentValues and before buildDeployer, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the objectives: new bundler helper, test files for unit and parity coverage, and recipe values cleanup. No extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the feat/973-bundler-derived-dra-annotation branch from 5e6615a to 74e4f1f Compare May 27, 2026 00:18
@yuanchen8911 yuanchen8911 force-pushed the feat/973-bundler-derived-dra-annotation branch from 74e4f1f to 6b7da4b Compare May 27, 2026 15:42
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler_dra_annotation_parity_test.go`:
- Line 103: Replace the use of context.Background() passed to b.Make with a
timeout-bounded context: create a context with timeout (e.g.,
context.WithTimeout(ctx, 30*time.Second)), defer the cancel, and pass that
context to b.Make (replace occurrences where b.Make(context.Background(), rr,
tmpDir) appears, including the second occurrence around line 161); ensure you
import time and use the new ctx variable so the test’s I/O call is bounded and
the cancel is deferred.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7791223e-994c-4da9-a089-6d4efc469582

📥 Commits

Reviewing files that changed from the base of the PR and between 74e4f1f and 6b7da4b.

📒 Files selected for processing (4)
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_dra_annotation_parity_test.go
  • pkg/bundler/bundler_dra_annotation_test.go
  • recipes/components/nvidia-dra-driver-gpu/values.yaml

Comment thread pkg/bundler/bundler_dra_annotation_parity_test.go Outdated
@yuanchen8911 yuanchen8911 changed the title WIP: feat(bundler): derive DRA chart-version annotation from resolved recipe (#973) feat(bundler): derive DRA chart-version annotation from resolved recipe (#973) May 27, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 27, 2026 16:09
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 27, 2026 16:09
@yuanchen8911 yuanchen8911 force-pushed the feat/973-bundler-derived-dra-annotation branch from 6b7da4b to 8b5ef74 Compare May 27, 2026 16:11
@mchmarny mchmarny self-assigned this May 27, 2026
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean refactor — moves the chart-version annotation from a hand-maintained values entry to a bundler-derived injection, killing the drift risk that #973 called out. Helper mirrors injectGKECriticalPriorityQuotas for shape consistency, and the function header doc is unusually clear about why the annotation exists, when it fires, and where the injection point sits in Make. Test coverage is thorough (positive, both gating directions, preserves-existing, overrides --set, version-string variants, nil-tolerance, lazy-map creation).

Two low-severity suggestions and two nits inline; no blockers:

  • Edge case worth pinning: gpu-operator present in refs but with empty Version is currently a silent skip — consider a slog.Warn + a test row to distinguish "disabled" from "enabled-with-missing-version".
  • Parity test covers helm + helmfile; the structural argument for the other three deployers is sound, but extending the table to all five would also catch bundle-layout drift (argocd/argocd-helm in particular emit a different shape).

All 102 CI checks green. Mergeable state shows "blocked" — looks like branch protection / required reviewer, not a CI failure.

Comment thread pkg/bundler/bundler.go
Comment thread pkg/bundler/bundler_dra_annotation_parity_test.go Outdated
Comment thread pkg/bundler/bundler.go
Comment thread pkg/bundler/bundler.go
@yuanchen8911 yuanchen8911 force-pushed the feat/973-bundler-derived-dra-annotation branch from 8b5ef74 to 229946b Compare May 27, 2026 18:45
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 27, 2026 18:47
…pe (NVIDIA#973)

Replace the hand-maintained
aicr.nvidia.com/gpu-operator-chart-version annotation in
recipes/components/nvidia-dra-driver-gpu/values.yaml with a
bundler-side injection that reads the value from the resolved
gpu-operator componentRef and writes it onto both controller and
kubeletPlugin podAnnotations of the rendered DRA values. The
annotation cannot drift from the actual chart pin any more — a
gpu-operator chart bump produces a rendered pod-template diff
automatically, helm upgrade (and every other deployer) re-rolls the
DaemonSet, and the kubelet plugin's NVML handle rebinds to the
post-migration driver state.

Why this matters

PR NVIDIA#965 mitigated the stale-NVML class of bug (gpu-operator chart
bump → k8s-driver-manager reloads kernel modules async → DRA kubelet
plugin's NVML handle goes stale → CDI spec generation fails with
"Driver/library version mismatch" → DRA-allocated pods stuck in
ContainerCreating) by hard-coding the gpu-operator chart version into
the DRA pod-template annotation. The annotation worked, but only as
long as a maintainer remembered to bump both pins in lockstep. A
future PR that bumped gpu-operator and forgot the annotation would
produce identical rendered DaemonSet manifests, helm upgrade would
skip the re-roll, and stale NVML would return silently. make qualify
did not catch the drift because no static check verified the two
pins.

Implementation

injectDRAChartVersionAnnotation iterates the filtered
recipeResult.ComponentRefs (Make has already filtered out disabled
components by this point), extracts the gpu-operator chart version,
and gates on both gpu-operator and nvidia-dra-driver-gpu being
enabled. When both are present, it writes
aicr.nvidia.com/gpu-operator-chart-version onto controller and
kubeletPlugin podAnnotations of the nvidia-dra-driver-gpu values map,
creating nested maps lazily so existing values
(priorityClassName, other annotations) are preserved.

Injection happens in DefaultBundler.Make AFTER
extractComponentValues — so the values map is populated and user
--set overrides have already landed — and BEFORE buildDeployer — so
every deployer (Helm, helmfile, Flux, Argo CD, argocd-helm) receives
the same final componentValues map. Placing the call after --set
means a user override of this specific annotation key is
intentionally NOT honored; the annotation must always reflect the
actual resolved gpu-operator chart version, or the rollout-trigger
semantics break.

The injection mirrors the GKE critical-priority quota helper
(pkg/bundler/bundler.go:injectGKECriticalPriorityQuotas) so callers
familiar with that pattern read this one for free.

Tests

- Unit tests (bundler_dra_annotation_test.go): positive case both
  pod paths, negative case DRA disabled, negative case gpu-operator
  disabled, preserves-existing-values, overrides-user-set,
  version-variants (v-prefixed / unprefixed / pre-release / build
  metadata), nil-tolerant inputs, lazy-DRA-values-map creation.
  Helper coverage: 100%.

- Generated-artifact parity test
  (bundler_dra_annotation_parity_test.go): bundles the same recipe
  through the default Helm deployer AND the helmfile deployer;
  asserts the annotation lands in the rendered DRA values.yaml for
  BOTH outputs. Catches the cross-deployer parity risk if a future
  refactor moves the injection into a deployer-specific render path.
  Plus a disabled-recipe negative case that asserts the gpu-operator
  values do not leak any DRA-related annotation when DRA is filtered
  out.

Closes NVIDIA#973
Related: NVIDIA#965 (the chart-version annotation this durabilizes),
NVIDIA#980 (orthogonal cross-deployer corrective restart), NVIDIA#894
(chart bump that first surfaced the stale-NVML class of bug)
@yuanchen8911 yuanchen8911 force-pushed the feat/973-bundler-derived-dra-annotation branch from 229946b to 0c7b7f9 Compare May 27, 2026 18:48
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler_dra_annotation_test.go`:
- Around line 29-335: Consolidate the multiple
TestInjectDRAChartVersionAnnotation_* functions (PositiveCase,
DRAComponentDisabled, GPUOperatorDisabled, GPUOperatorEmptyVersion,
PreservesExistingValues, OverridesUserSet, NilInputs, LazyDRAValuesMap) into a
single table-driven test that iterates over a slice of cases and runs t.Run for
each, keeping TestInjectDRAChartVersionAnnotation_VersionVariants as-is; for
each case define the input componentValues and recipe.RecipeResult, the name,
and the expected assertions (e.g. expected created keys, preserved fields,
overridden annotation, or no-op), then call
b.injectDRAChartVersionAnnotation(componentValues, rr) and perform the same
assertions currently in the individual tests using the case data so the
arrange/call/assert pattern is unified and duplication removed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3dc5210c-0b0f-4df2-b228-4edf493ca07e

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5ef74 and 0c7b7f9.

📒 Files selected for processing (4)
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_dra_annotation_parity_test.go
  • pkg/bundler/bundler_dra_annotation_test.go
  • recipes/components/nvidia-dra-driver-gpu/values.yaml

Comment on lines +29 to +335
func TestInjectDRAChartVersionAnnotation_PositiveCase(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
draComponentName: {},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: "v26.4.0"},
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

for _, podPath := range []string{"controller", "kubeletPlugin"} {
got := dig(componentValues[draComponentName], podPath, "podAnnotations", draChartVersionAnnotation)
if got != "v26.4.0" {
t.Errorf("podAnnotations[%s][%s] = %v, want v26.4.0",
podPath, draChartVersionAnnotation, got)
}
}
}

// TestInjectDRAChartVersionAnnotation_DRAComponentDisabled pins the
// gating: when nvidia-dra-driver-gpu is absent from the filtered
// ComponentRefs (already filtered out as disabled by the caller), the
// helper is a no-op even if gpu-operator is enabled. This is the
// "negative case" the issue's acceptance criteria call out — proves
// the gating is correct and no warning fires.
func TestInjectDRAChartVersionAnnotation_DRAComponentDisabled(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: "v26.4.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

if _, ok := componentValues[draComponentName]; ok {
t.Errorf("expected no DRA entry in componentValues when DRA component is disabled, got %v",
componentValues[draComponentName])
}
}

// TestInjectDRAChartVersionAnnotation_GPUOperatorDisabled pins the
// mirror gating: when gpu-operator is absent (disabled or filtered
// out), no annotation is written because there's no chart version to
// mirror.
func TestInjectDRAChartVersionAnnotation_GPUOperatorDisabled(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
draComponentName: {},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

if got := componentValues[draComponentName]; len(got) != 0 {
t.Errorf("expected unchanged DRA values when gpu-operator is disabled, got %v", got)
}
}

// TestInjectDRAChartVersionAnnotation_GPUOperatorEmptyVersion pins the
// defensive branch: when gpu-operator IS enabled but its Version field
// is empty (the resolver shouldn't produce this in practice — see the
// helper docstring for why), the helper logs a warning and skips
// injection rather than writing an empty annotation value that would
// itself lock the DaemonSet to a meaningless pin and erase the
// "annotation never drifts from the chart pin" guarantee.
func TestInjectDRAChartVersionAnnotation_GPUOperatorEmptyVersion(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
draComponentName: {},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: ""},
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

if got := componentValues[draComponentName]; len(got) != 0 {
t.Errorf("expected unchanged DRA values when gpu-operator Version is empty, got %v", got)
}
}

// TestInjectDRAChartVersionAnnotation_PreservesExistingValues
// documents the additive contract: priorityClassName and any
// pre-existing podAnnotations from the values file (or from earlier
// override layers) survive injection. The helper only writes the one
// annotation key and leaves the rest of the controller / kubeletPlugin
// sections untouched.
func TestInjectDRAChartVersionAnnotation_PreservesExistingValues(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
draComponentName: {
"controller": map[string]any{
"priorityClassName": "system-cluster-critical",
"podAnnotations": map[string]any{
"operator.example.com/other-key": "preserved",
},
},
"kubeletPlugin": map[string]any{
"priorityClassName": "system-node-critical",
},
},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: "v26.4.0"},
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

dra := componentValues[draComponentName]
if got := dig(dra, "controller", "priorityClassName"); got != "system-cluster-critical" {
t.Errorf("controller.priorityClassName = %v, want system-cluster-critical", got)
}
if got := dig(dra, "controller", "podAnnotations", "operator.example.com/other-key"); got != "preserved" {
t.Errorf("controller.podAnnotations[operator.example.com/other-key] = %v, want preserved", got)
}
if got := dig(dra, "controller", "podAnnotations", draChartVersionAnnotation); got != "v26.4.0" {
t.Errorf("controller.podAnnotations[%s] = %v, want v26.4.0", draChartVersionAnnotation, got)
}
if got := dig(dra, "kubeletPlugin", "priorityClassName"); got != "system-node-critical" {
t.Errorf("kubeletPlugin.priorityClassName = %v, want system-node-critical", got)
}
if got := dig(dra, "kubeletPlugin", "podAnnotations", draChartVersionAnnotation); got != "v26.4.0" {
t.Errorf("kubeletPlugin.podAnnotations[%s] = %v, want v26.4.0", draChartVersionAnnotation, got)
}
}

// TestInjectDRAChartVersionAnnotation_OverridesUserSet pins the
// "internal annotation always reflects the actual chart version"
// invariant. A user --set that wrote a stale value into the
// annotation must be overwritten by the bundler-derived value;
// otherwise the rollout-trigger semantics break and the durable fix
// degrades into the same drift the manual annotation had. Injection
// happens AFTER extractComponentValues for exactly this reason.
func TestInjectDRAChartVersionAnnotation_OverridesUserSet(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
draComponentName: {
"controller": map[string]any{
"podAnnotations": map[string]any{
draChartVersionAnnotation: "v25.10.1-stale-from-user-set",
},
},
},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: "v26.4.0"},
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation)
if got != "v26.4.0" {
t.Errorf("expected user --set value to be overridden by injection: got %v, want v26.4.0", got)
}
}

// TestInjectDRAChartVersionAnnotation_VersionVariants documents that
// the helper mirrors whatever string the resolver wrote into the
// gpu-operator ComponentRef. Helm chart pins ship with and without a
// leading "v"; the annotation tracks the exact string so the rendered
// pod-template diff reliably fires on a chart-pin change of any
// shape.
func TestInjectDRAChartVersionAnnotation_VersionVariants(t *testing.T) {
tests := []struct {
name string
version string
}{
{"v-prefixed", "v26.4.0"},
{"unprefixed", "26.4.0"},
{"pre-release", "v26.5.0-beta.1"},
{"build metadata", "v26.4.0+nvidia.1"},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}
componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
draComponentName: {},
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: tt.version},
{Name: draComponentName, Version: "25.12.0"},
},
}
b.injectDRAChartVersionAnnotation(componentValues, rr)
got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation)
if got != tt.version {
t.Errorf("controller annotation = %v, want %v", got, tt.version)
}
})
}
}

// TestInjectDRAChartVersionAnnotation_NilInputs documents the
// nil-tolerant contract. The bundler's Make method only calls this
// after extractComponentValues returns a non-nil map and pretty much
// never with a nil recipe, but defensive nil-handling keeps the
// helper safe in unit tests and future callers.
func TestInjectDRAChartVersionAnnotation_NilInputs(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}
tests := []struct {
name string
cv map[string]map[string]any
rr *recipe.RecipeResult
}{
{"nil componentValues", nil, &recipe.RecipeResult{}},
{"nil recipe result", map[string]map[string]any{}, nil},
{"both nil", nil, nil},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Should not panic.
b.injectDRAChartVersionAnnotation(tt.cv, tt.rr)
})
}
}

// TestInjectDRAChartVersionAnnotation_LazyDRAValuesMap pins one
// implementation detail worth fixing in place: when the DRA component
// is enabled but its componentValues entry hasn't been initialized
// yet (no values file, no overrides), the helper still injects by
// creating the nested map. This matters because every-deployer parity
// requires the annotation to appear in the rendered chart even when
// the DRA chart has no other values configured.
func TestInjectDRAChartVersionAnnotation_LazyDRAValuesMap(t *testing.T) {
b, err := New()
if err != nil {
t.Fatalf("New() error = %v", err)
}

componentValues := map[string]map[string]any{
gpuOperatorComponentName: {},
// nvidia-dra-driver-gpu key intentionally absent.
}
rr := &recipe.RecipeResult{
ComponentRefs: []recipe.ComponentRef{
{Name: gpuOperatorComponentName, Version: "v26.4.0"},
{Name: draComponentName, Version: "25.12.0"},
},
}

b.injectDRAChartVersionAnnotation(componentValues, rr)

if _, ok := componentValues[draComponentName]; !ok {
t.Fatalf("expected helper to create DRA values entry on demand")
}
got := dig(componentValues[draComponentName], "controller", "podAnnotations", draChartVersionAnnotation)
if got != "v26.4.0" {
t.Errorf("controller annotation = %v, want v26.4.0", got)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

f="pkg/bundler/bundler_dra_annotation_test.go"
if [ ! -f "$f" ]; then
  echo "File not found: $f" >&2
  exit 1
fi

# Print file with line numbers (trimmed to reasonable sizes by focusing around the test functions)
echo "== File stats ==" 
wc -l "$f"

echo "== Functions in file (names) =="
rg -n "func TestInjectDRAChartVersionAnnotation_" "$f" || true
rg -n "func dig\(" "$f" || true
rg -n "injectDRAChartVersionAnnotation" "$f" || true

echo "== Show relevant sections around test functions =="
# Show from start through a bit past the last test in the snippet range
sed -n '1,220p' "$f"
echo "-----"
sed -n '220,520p' "$f"

Repository: NVIDIA/aicr

Length of output: 13804


Consolidate the TestInjectDRAChartVersionAnnotation_* scenarios into a single table-driven suite.

Most of these tests (PositiveCase, DRAComponentDisabled, GPUOperatorDisabled, GPUOperatorEmptyVersion, PreservesExistingValues, OverridesUserSet, NilInputs, LazyDRAValuesMap) follow the same arrange/call/assert shape with only input maps and expected outcomes changing; converting them into one tests := []struct{...} with subtests would reduce duplication and keep future scenarios consistent. (TestInjectDRAChartVersionAnnotation_VersionVariants already uses this pattern and can serve as the template.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/bundler/bundler_dra_annotation_test.go` around lines 29 - 335,
Consolidate the multiple TestInjectDRAChartVersionAnnotation_* functions
(PositiveCase, DRAComponentDisabled, GPUOperatorDisabled,
GPUOperatorEmptyVersion, PreservesExistingValues, OverridesUserSet, NilInputs,
LazyDRAValuesMap) into a single table-driven test that iterates over a slice of
cases and runs t.Run for each, keeping
TestInjectDRAChartVersionAnnotation_VersionVariants as-is; for each case define
the input componentValues and recipe.RecipeResult, the name, and the expected
assertions (e.g. expected created keys, preserved fields, overridden annotation,
or no-op), then call b.injectDRAChartVersionAnnotation(componentValues, rr) and
perform the same assertions currently in the individual tests using the case
data so the arrange/call/assert pattern is unified and duplication removed.

@yuanchen8911 yuanchen8911 merged commit 7adc586 into NVIDIA:main May 27, 2026
116 of 117 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore(recipes): make DRA-rollout trigger durable (don't rely on manual annotation bump)

2 participants