fix(recipes): address BCM overlay gaps from H200 NVL validation #1089
📝 Walkthrough
This PR updates Kubernetes recipe configurations: (1) adds documentation in nvidia-dra-driver-gpu values clarifying why
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 69-78: This change updates component values (see
recipes/components/nvidia-dra-driver-gpu/values.yaml) but did not include the
regenerated BOM docs; run the make target "make bom-docs" locally to regenerate
docs/user/container-images.md and add/commit the updated
docs/user/container-images.md to this PR so the BOM reflects the values.yaml
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1fda92e1-e76d-445a-9a1e-e5e981fd3338
📒 Files selected for processing (3)
- recipes/components/nvidia-dra-driver-gpu/values.yaml
- recipes/overlays/bcm-training.yaml
- recipes/overlays/bcm.yaml
fd65476 to 7753de2
Actionable comments posted: 0
Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. Resolves to the same hydrated values as h100 today via
shared base components (both are Hopper, same R570/R580 driver line,
NVML auto-detects everything); no new overlay file is needed.
Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:
- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
contributor/api-server.md,contributor/cli.md,contributor/data.md,
contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
server/doc.go}: godoc enumerations
Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0;
H100 overlay chain produces correct hydrated values. End-to-end:
`aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training`
succeeds and produces a recipe with `criteria.accelerator: h200`.
Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
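A minimal Go sketch of what registering a new accelerator value involves, assuming a string-backed enum with a parse switch and an all-values slice. The names `AcceleratorType`, `parseAccelerator`, and `allAcceleratorTypes` are hypothetical stand-ins rather than the actual identifiers in `pkg/recipe/criteria.go`, and the case-insensitive matching is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// AcceleratorType is a hypothetical stand-in for the criteria accelerator enum.
type AcceleratorType string

const (
	AcceleratorH100 AcceleratorType = "h100"
	AcceleratorH200 AcceleratorType = "h200" // the newly registered value
)

// allAcceleratorTypes lists every known value. A new constant has to be added
// here as well as in the parse switch below, or the two drift apart.
var allAcceleratorTypes = []AcceleratorType{AcceleratorH100, AcceleratorH200}

// parseAccelerator normalizes user input and maps it to a known enum value.
// Unknown inputs return an error that names the valid values.
func parseAccelerator(s string) (AcceleratorType, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "h100":
		return AcceleratorH100, nil
	case "h200":
		return AcceleratorH200, nil
	default:
		return "", fmt.Errorf("unknown accelerator %q (valid: %v)", s, allAcceleratorTypes)
	}
}

func main() {
	for _, in := range []string{"h200", " H100 ", "mi300x"} {
		got, err := parseAccelerator(in)
		fmt.Printf("%q -> %q, err=%v\n", in, got, err)
	}
}
```

The sketch shows why the commit touches several surfaces at once: the constant, the parse case, and the all-values list each have to learn about `h200` separately.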
7753de2 to bccf6ad
Actionable comments posted: 0
- Document why nvidia-dra-driver-gpu controller and kubeletPlugin priorityClassName are explicitly neutralized (PSA / PriorityClass admission constraints AICR cannot assume cluster-wide). Notes the eviction-under-node-pressure trade-off so operators can re-pin via their own overlay if needed.
- Mirror controller.tolerations onto kubeletPlugin.tolerations in bcm.yaml so DRA's kubelet plugin DaemonSet schedules on small BCM deployments that combine control-plane and worker roles on the same node.
- Enable GPUDirect Storage (gds.enabled: true) in bcm-training.yaml. BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS delivers a meaningful training I/O perf win (most pronounced on H200 NVL given its 141GB HBM3e per device). Benign on nodes without compatible hardware.
Surfaced during cluster-side validation of the recently-merged feat/bcm-service-type work on a real H200 NVL test cluster. Addresses 3 of 4 checkboxes in NVIDIA#1086; H200 criteria registration is the larger 4th item and will land in a separate PR per the umbrella issue.
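The scheduling effect of mirroring the tolerations can be sketched with the Kubernetes taint/toleration matching rule. This is a simplified Go illustration, not the scheduler's code: the `Taint` and `Toleration` structs are trimmed-down stand-ins for the Kubernetes API types, the `node-role.kubernetes.io/...` keys are assumed to be the ones the overlay targets, and `PreferNoSchedule` handling is ignored.

```go
package main

import "fmt"

// Taint and Toleration are reduced stand-ins for the Kubernetes API types.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates applies the standard matching rule: an empty Effect matches any
// effect, an empty Key (with operator Exists) matches any key, and operator
// Exists ignores the value.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	default:
		return false
	}
}

// toleratesAll reports whether every taint is matched by at least one
// toleration. Simplification: all taints are treated as blocking.
func toleratesAll(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Assumed taint on a node that combines control-plane and worker roles.
	nodeTaints := []Taint{{Key: "node-role.kubernetes.io/control-plane", Effect: "NoSchedule"}}

	mirrored := []Toleration{
		{Key: "node-role.kubernetes.io/master", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node-role.kubernetes.io/control-plane", Operator: "Exists", Effect: "NoSchedule"},
	}
	blanket := []Toleration{{Operator: "Exists"}} // key-less Exists tolerates every taint

	fmt.Println("no tolerations:", toleratesAll(nil, nodeTaints))
	fmt.Println("mirrored:", toleratesAll(mirrored, nodeTaints))
	fmt.Println("blanket only:", toleratesAll(blanket, nodeTaints))
}
```

Under these assumptions a kubelet plugin pod without tolerations is rejected by the tainted node, while either the mirrored entries or a key-less `Exists` toleration admits it.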
bccf6ad to c93c5d4
Actionable comments posted: 0
Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. H200 is the same Hopper generation as H100 (same R570/R580
driver line, same gpu-operator support floor), so its deployment-phase
floor mirrors H100's.
Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:
- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
contributor/api-server.md,contributor/cli.md,contributor/data.md,
contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
server/doc.go}: godoc enumerations
Wires up the two consumer paths the enum alone left incomplete:
- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the
snapshot -> fingerprint -> recipe path resolves real H200 hardware
("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring
h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0)
so an `--accelerator h200` recipe inherits the same deployment-phase
floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the
model->accelerator mapping and valid-values tables
Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.
End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu
--intent training` resolves the identical deployment floor as the h100
equivalent and produces a recipe with criteria.accelerator: h200.
Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
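The ProductName-to-accelerator step described above can be sketched as an ordered pattern table. This is an illustrative Go sketch only: `skuPattern`, `skuPatterns`, and `acceleratorFromProductName` are hypothetical names, the regular expressions are assumptions rather than the patterns in `pkg/fingerprint/gpu_sku.go`, and the H100 product string is an example input.

```go
package main

import (
	"fmt"
	"regexp"
)

// skuPattern pairs a ProductName pattern with the accelerator value it implies.
type skuPattern struct {
	re          *regexp.Regexp
	accelerator string
}

// skuPatterns is checked in order; the first match wins.
// The patterns are assumptions made for this sketch.
var skuPatterns = []skuPattern{
	{regexp.MustCompile(`(?i)\bH200\b`), "h200"},
	{regexp.MustCompile(`(?i)\bH100\b`), "h100"},
}

// acceleratorFromProductName maps a reported GPU product name to an
// accelerator value, falling back to "unknown-sku" when nothing matches.
func acceleratorFromProductName(name string) string {
	for _, p := range skuPatterns {
		if p.re.MatchString(name) {
			return p.accelerator
		}
	}
	return "unknown-sku"
}

func main() {
	for _, name := range []string{"NVIDIA H200 NVL", "NVIDIA H100 80GB HBM3", "Some Other GPU"} {
		fmt.Printf("%q -> %s\n", name, acceleratorFromProductName(name))
	}
}
```

Without an H200 entry in the table, the first input would fall through to `unknown-sku`, which is the gap the commit message describes.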
Summary
Three small recipe-layer edits surfaced during cluster-side validation of the recently merged `feat/bcm-service-type` work on a real H200 NVL test cluster: document why DRA `priorityClassName` is neutralized, mirror BCM control-plane tolerations onto DRA's `kubeletPlugin`, and enable GPUDirect Storage for BCM training.

Motivation / Context
Fixes: part of #1086 (3 of 4 checkboxes; H200 criteria registration deferred to a separate PR per the umbrella issue)
Related: #1087 (integration test for `nvidiaDriverRoot`/`hostPaths.driverInstallDir` lockstep)
Cluster-side validation context:
- `node007`
- `aicr validate` deployment+conformance phases pass end-to-end

Type of Change

Component(s) Affected
- `pkg/recipe`)

Implementation Notes
- `recipes/components/nvidia-dra-driver-gpu/values.yaml`: Document the existing `priorityClassName: ""` neutralization rather than restoring the upstream chart default (`system-node-critical`). The override has been in place since `chore: init repo` (Jan 2026) with no historical PR rationale; the most plausible reason is PriorityClass admission restrictions (PSA, ResourceQuota, PriorityClassPolicy) that AICR cannot assume across all supported services. Restoring the chart default would be a behavior change without a clear test signal, so the conservative path is documentation plus a `TODO(#1086)` for revisit. Operators who need DRA pods to survive node-pressure eviction can re-pin via their own overlay.
- `recipes/overlays/bcm.yaml`: Mirror `controller.tolerations` onto `kubeletPlugin.tolerations`. Follow-up investigation revealed that the AICR bundler appends a blanket `{operator: Exists}` toleration to both paths via the `registry.yaml` `nodeScheduling.{system,accelerated}.tolerationPaths` entries (defaults sourced from `pkg/snapshotter/agent.go DefaultTolerations`). The blanket subsumes the specific master + control-plane entries, so the mirror is functionally a no-op in default mode; its purpose is symmetry with the existing controller block and override-resilience when a user passes `--system-node-toleration` to drop the blanket. An inline comment documents this relationship so future overlay editors aren't misled. Whether the tolerate-all default should be tightened per-service has been filed separately as an AICR design question.
- `recipes/overlays/bcm-training.yaml`: Enable `gds.enabled: true` for the `gpu-operator` componentRef. BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS is a meaningful training I/O performance win (most pronounced on H200 NVL given 141GB HBM3e per device). On nodes without compatible hardware the `nvidia-fs` DaemonSet is benign: it logs a warning and stays inert.

Testing
YAML-only changes; no Go source modified. Per the `CLAUDE.local.md` scoped verification policy, ran the checks that match the change surface:

    yamllint -c .yamllint.yaml recipes/components/nvidia-dra-driver-gpu/values.yaml \
      recipes/overlays/bcm.yaml \
      recipes/overlays/bcm-training.yaml
    # Clean.

    go test -count=1 -timeout 180s ./pkg/recipe/...
    # ok github.com/NVIDIA/aicr/pkg/recipe 0.864s
    # ok github.com/NVIDIA/aicr/pkg/recipe/oskind 0.713s

    go test -count=1 -timeout 240s ./pkg/bundler/...
    # 15 packages, all ok (recipe values flow through the bundler)

    # End-to-end recipe + bundle against the BCM overlay chain:
    aicr recipe --service bcm --accelerator h100 --os ubuntu --intent training -o /tmp/recipe.yaml
    aicr bundle -r /tmp/recipe.yaml --deployer helmfile -o /tmp/bundle
    # Generated DRA values show:
    # - controller.tolerations: [master, control-plane, {operator: Exists}]
    # - kubeletPlugin.tolerations: [master, control-plane, {operator: Exists}]
    # - gds.enabled: true on gpu-operator

Skipping full `make qualify` because the change is YAML-only: no Go test, e2e, lint, or scan target can regress from these edits. The scoped recipe + bundler tests cover the consumers that parse these files.

Risk Assessment
- `priorityClassName` (documentation only); the kubeletPlugin tolerations mirror is a no-op in default mode (bundler blanket subsumes it); the BCM GDS enable is additive and benign on hardware that doesn't use it.
Rollout notes: None. The GDS daemon is benign on nodes without compatible NVMe/NIC hardware (it logs a warning and stays inert).

Checklist
- `./pkg/recipe/...` and `./pkg/bundler/...` with `-count=1`)
- `yamllint` on changed files)
- `git commit -S`)
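The `gds.enabled: true` change in the Implementation Notes relies on overlay values taking precedence over base component values. A minimal Go sketch of that kind of layering follows, assuming a recursive map merge in which the overlay wins on conflicts. `mergeValues` is a hypothetical helper, the base values shown are invented for illustration, and the actual AICR hydration logic may differ.

```go
package main

import "fmt"

// mergeValues returns a new map holding base with overlay applied on top.
// Nested maps are merged recursively; any other overlay value replaces the
// base value. Nested base maps that the overlay does not touch are shared
// with the result, not copied.
func mergeValues(base, overlay map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range overlay {
		baseMap, baseIsMap := out[k].(map[string]any)
		overlayMap, overlayIsMap := v.(map[string]any)
		if baseIsMap && overlayIsMap {
			out[k] = mergeValues(baseMap, overlayMap)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical base values for the gpu-operator component.
	base := map[string]any{
		"gds":    map[string]any{"enabled": false},
		"driver": map[string]any{"enabled": true},
	}
	// Overlay fragment comparable to the bcm-training.yaml change.
	overlay := map[string]any{
		"gds": map[string]any{"enabled": true},
	}
	fmt.Println(mergeValues(base, overlay))
}
```

In this sketch the overlay flips only `gds.enabled` and leaves sibling keys such as `driver` untouched, which matches the additive behavior the Risk Assessment describes.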