fix(recipes): address BCM overlay gaps from H200 NVL validation by yuanchen8911 · Pull Request #1089 · NVIDIA/aicr

yuanchen8911 · 2026-05-28T16:35:00Z

Summary

Three small recipe-layer edits surfaced during cluster-side validation of the recently-merged feat/bcm-service-type work on a real H200 NVL test cluster: document why DRA priorityClassName is neutralized, mirror BCM control-plane tolerations onto DRA's kubeletPlugin, and enable GPUDirect Storage for BCM training.

Motivation / Context

Fixes: part of #1086 (3 of 4 checkboxes; H200 criteria registration deferred to a separate PR per the umbrella issue)
Related: #1087 (integration test for nvidiaDriverRoot / hostPaths.driverInstallDir lockstep)

Cluster-side validation context:

BCM-provisioned k8s 1.34.8, Ubuntu 24.04, kernel 6.8.0-71
2× NVIDIA H200 NVL (141GB HBM3e each) on node007
gpu-operator v26.3.1, nvidia-dra-driver-gpu 25.12.0
aicr validate deployment+conformance phases pass end-to-end

Type of Change

Bug fix (non-breaking change that fixes an issue)
Documentation update

Component(s) Affected

Recipe engine / data (pkg/recipe)

Implementation Notes

recipes/components/nvidia-dra-driver-gpu/values.yaml — Document the existing priorityClassName: "" neutralization rather than restoring the upstream chart default (system-node-critical). The override has been in place since chore: init repo (Jan 2026) with no historical PR rationale; the most plausible reason is PriorityClass admission restrictions (PSA, ResourceQuota, PriorityClassPolicy) that AICR cannot assume across all supported services. Restoring the chart default would be a behavior change without a clear test signal, so the conservative path is documentation + a TODO(#1086) for revisit. Operators who need DRA pods to survive node-pressure eviction can re-pin via their own overlay.

recipes/overlays/bcm.yaml — Mirror controller.tolerations onto kubeletPlugin.tolerations. Follow-up investigation revealed that the AICR bundler appends a blanket {operator: Exists} toleration to both paths via the registry.yaml nodeScheduling.{system,accelerated}.tolerationPaths entries (defaults sourced from DefaultTolerations in pkg/snapshotter/agent.go). The blanket toleration subsumes the specific master and control-plane entries, so the mirror is functionally a no-op in default mode. Its purpose is symmetry with the existing controller block and resilience when a user passes --system-node-toleration to drop the blanket. An inline comment documents this relationship so future overlay editors aren't misled. Whether the tolerate-all default should be tightened per service has been filed separately as an AICR design question.
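To make the subsumption claim concrete, here is a minimal Python sketch of Kubernetes toleration matching. It is illustrative only and is not AICR or bundler code. The taint keys node-role.kubernetes.io/master and node-role.kubernetes.io/control-plane are assumed from the "master+control-plane" wording above, and the Taint and Toleration classes and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes treats an unset operator as Equal
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # An empty key is only meaningful with Exists and then matches every key.
    if not tol.key:
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def all_tolerated(tolerations, taints) -> bool:
    """Return True if every taint is matched by at least one toleration."""
    return all(any(tolerates(t, taint) for t in tolerations) for taint in taints)


# Assumed control-plane taints (well-known Kubernetes keys, not quoted in the PR).
CONTROL_PLANE_TAINTS = [
    Taint("node-role.kubernetes.io/master", "NoSchedule"),
    Taint("node-role.kubernetes.io/control-plane", "NoSchedule"),
]

# The specific entries mirrored onto kubeletPlugin.tolerations.
SPECIFIC = [
    Toleration(key=t.key, operator="Exists", effect="NoSchedule")
    for t in CONTROL_PLANE_TAINTS
]

# The blanket toleration the bundler appends in default mode.
BLANKET = [Toleration(operator="Exists")]

# Some unrelated taint, purely for contrast (hypothetical key).
OTHER_TAINT = [Taint("example.com/dedicated", "NoExecute")]
```

With the blanket entry present, the specific entries change nothing. Once the blanket is dropped, the specific entries are what keep the kubelet plugin schedulable on control-plane nodes, and they tolerate nothing else.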

recipes/overlays/bcm-training.yaml — Enable gds.enabled: true for the gpu-operator componentRef. BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS is a meaningful training I/O perf win (most pronounced on H200 NVL given 141GB HBM3e per device). On nodes without compatible hardware the nvidia-fs DaemonSet is benign — it logs a warning and stays inert.
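The two value overrides discussed in these notes (an operator re-pinning priorityClassName, and bcm-training enabling GDS) can be pictured as a deep merge of overlay values onto component values. The Python sketch below assumes a plain recursive dictionary merge; the actual AICR merge rules are not described in this PR, and the base value gds.enabled: false and all variable names are assumptions for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with override applied on top of base, key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale in this sketch.
            merged[key] = deepcopy(value)
    return merged


# Component defaults as documented: priorityClassName neutralized to "".
dra_defaults = {
    "controller": {"priorityClassName": ""},
    "kubeletPlugin": {"priorityClassName": ""},
}

# A hypothetical operator-owned overlay that re-pins the upstream default.
operator_overlay = {
    "controller": {"priorityClassName": "system-node-critical"},
    "kubeletPlugin": {"priorityClassName": "system-node-critical"},
}

# Assumed gpu-operator base value, plus the bcm-training override.
gpu_operator_base = {"gds": {"enabled": False}}
bcm_training_override = {"gds": {"enabled": True}}

dra_values = deep_merge(dra_defaults, operator_overlay)
gpu_operator_values = deep_merge(gpu_operator_base, bcm_training_override)

print(dra_values)
print(gpu_operator_values)
```

The merge leaves the component defaults untouched, so clusters that do not apply the operator overlay keep the neutralized priority class.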

Testing

YAML-only changes; no Go source modified. Per CLAUDE.local.md scoped verification policy, ran the checks that match the change surface:

yamllint -c .yamllint.yaml recipes/components/nvidia-dra-driver-gpu/values.yaml \
                            recipes/overlays/bcm.yaml \
                            recipes/overlays/bcm-training.yaml
# Clean.

go test -count=1 -timeout 180s ./pkg/recipe/...
# ok  github.com/NVIDIA/aicr/pkg/recipe         0.864s
# ok  github.com/NVIDIA/aicr/pkg/recipe/oskind  0.713s

go test -count=1 -timeout 240s ./pkg/bundler/...
# 15 packages, all ok (recipe values flow through the bundler)

# End-to-end recipe + bundle against the BCM overlay chain:
aicr recipe --service bcm --accelerator h100 --os ubuntu --intent training -o /tmp/recipe.yaml
aicr bundle -r /tmp/recipe.yaml --deployer helmfile -o /tmp/bundle
# Generated DRA values show:
#   - controller.tolerations: [master, control-plane, {operator: Exists}]
#   - kubeletPlugin.tolerations: [master, control-plane, {operator: Exists}]
#   - gds.enabled: true on gpu-operator

Skipping full make qualify because the change is YAML-only — no Go test, e2e, lint, or scan target can regress from these edits. The scoped recipe + bundler tests cover the consumers that parse these files.

Risk Assessment

Low — Isolated YAML-only changes; no behavior change to priorityClassName (documentation only); the kubeletPlugin tolerations mirror is a no-op in default mode (the bundler's blanket toleration subsumes it); the BCM GDS enable is additive and benign on hardware that cannot use it.

Rollout notes: None. The GDS daemon is benign on nodes without compatible NVMe/NIC hardware (it logs a warning and stays inert).

Checklist

Tests pass locally (scoped: ./pkg/recipe/... and ./pkg/bundler/... with -count=1)
Linter passes (yamllint on changed files)
I did not skip/disable tests to make CI green
N/A — no new functionality requiring new tests; documentation + additive overlay edits
N/A — no user-facing behavior change beyond the GDS enable, which is BCM-service-internal
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-28T16:37:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d2ec6529-f236-471e-a89a-b9085f9fc30a

📥 Commits

Reviewing files that changed from the base of the PR and between bccf6ad and c93c5d4.

📒 Files selected for processing (3)

recipes/components/nvidia-dra-driver-gpu/values.yaml
recipes/overlays/bcm-training.yaml
recipes/overlays/bcm.yaml

📝 Walkthrough

This PR updates Kubernetes recipe configurations: (1) adds documentation in nvidia-dra-driver-gpu values clarifying why controller.priorityClassName and kubeletPlugin.priorityClassName are set to "" and notes eviction implications; (2) enables GPUDirect Storage by adding a gpu-operator componentRef and overrides.gds.enabled: true in the bcm-training overlay with hardware-dependent comments; and (3) mirrors BCM master/control-plane tolerations onto nvidia-dra-driver-gpu.kubeletPlugin by adding NoSchedule operator: Exists tolerations so DRA kubelet plugin scheduling matches BCM.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

Address recipe gaps surfaced during BCM / H200 NVL validation #1086: Addresses the same configuration items (priorityClassName handling, kubeletPlugin tolerations, enabling GDS) noted in the issue.

Possibly related PRs

NVIDIA/aicr#1082: Related changes around preserving scheduling-related Helm values and overlay precedence that interact with these overrides.

Suggested labels

area/docs

Suggested reviewers

mchmarny
lockwobr

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(recipes): address BCM overlay gaps from H200 NVL validation' directly and clearly describes the main change: addressing specific gaps in BCM recipe overlays discovered during H200 NVL validation.
Description check	✅ Passed	The description provides comprehensive context for all three YAML changes, including the specific validation scenario (H200 NVL cluster), rationale for each modification, testing performed, and risk assessment—all directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 69-78: This change updates component values (see
recipes/components/nvidia-dra-driver-gpu/values.yaml) but did not include the
regenerated BOM docs; run the make target "make bom-docs" locally to regenerate
docs/user/container-images.md and add/commit the updated
docs/user/container-images.md to this PR so the BOM reflects the values.yaml
change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1fda92e1-e76d-445a-9a1e-e5e981fd3338

📥 Commits

Reviewing files that changed from the base of the PR and between ae6c948 and fd65476.

📒 Files selected for processing (3)

recipes/components/nvidia-dra-driver-gpu/values.yaml
recipes/overlays/bcm-training.yaml
recipes/overlays/bcm.yaml

coderabbitai · 2026-05-28T17:14:53Z

Actionable comments posted: 0

Adds CriteriaAcceleratorH200 to the criteria registry so users can pass `--accelerator h200` and have the recipe metadata reflect the hardware. Resolves to the same hydrated values as h100 today via shared base components (both are Hopper, same R570/R580 driver line, NVML auto-detects everything); no new overlay file is needed.

Updates every surface that enumerates accelerator types per the "Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in) for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,contributor/api-server.md,contributor/cli.md,contributor/data.md,contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,server/doc.go}: godoc enumerations

Validated against a real H200 NVL cluster: GFD / DRA correctly identify the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0; H100 overlay chain produces correct hydrated values.

End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training` succeeds and produces a recipe with `criteria.accelerator: h200`.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out from PR NVIDIA#1089).

coderabbitai · 2026-05-28T19:20:00Z

Actionable comments posted: 0

- Document why nvidia-dra-driver-gpu controller and kubeletPlugin priorityClassName are explicitly neutralized (PSA / PriorityClass admission constraints AICR cannot assume cluster-wide). Notes the eviction-under-node-pressure trade-off so operators can re-pin via their own overlay if needed.
- Mirror controller.tolerations onto kubeletPlugin.tolerations in bcm.yaml so DRA's kubelet plugin DaemonSet schedules on small BCM deployments that combine control-plane and worker roles on the same node.
- Enable GPUDirect Storage (gds.enabled: true) in bcm-training.yaml. BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS delivers a meaningful training I/O perf win (most pronounced on H200 NVL given its 141GB HBM3e per device). Benign on nodes without compatible hardware.

Surfaced during cluster-side validation of the recently-merged feat/bcm-service-type work on a real H200 NVL test cluster.

Addresses 3 of 4 checkboxes in NVIDIA#1086; H200 criteria registration is the larger 4th item and will land in a separate PR per the umbrella issue.

coderabbitai · 2026-05-28T19:41:12Z

Actionable comments posted: 0

Adds CriteriaAcceleratorH200 to the criteria registry so users can pass `--accelerator h200` and have the recipe metadata reflect the hardware. H200 is the same Hopper generation as H100 (same R570/R580 driver line, same gpu-operator support floor), so its deployment-phase floor mirrors H100's.

Updates every surface that enumerates accelerator types per the "Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in) for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,contributor/api-server.md,contributor/cli.md,contributor/data.md,contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,server/doc.go}: godoc enumerations

Wires up the two consumer paths the enum alone left incomplete:

- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the snapshot -> fingerprint -> recipe path resolves real H200 hardware ("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0) so an `--accelerator h200` recipe inherits the same deployment-phase floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the model->accelerator mapping and valid-values tables

Validated against a real H200 NVL cluster: GFD / DRA correctly identify the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.

End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training` resolves the identical deployment floor as the h100 equivalent and produces a recipe with criteria.accelerator: h200.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out from PR NVIDIA#1089).
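As a rough picture of the ProductName-to-accelerator resolution described above, the Python sketch below maps a reported GPU product name to an accelerator value and falls back to unknown-sku. It is not the Go code in pkg/fingerprint/gpu_sku.go; the regular expressions, the pattern table, and the function name are hypothetical, and only the h100, h200, and unknown-sku values come from this thread.

```python
import re

# Hypothetical pattern table: the first matching pattern wins.
SKU_PATTERNS = [
    (re.compile(r"\bH200\b", re.IGNORECASE), "h200"),
    (re.compile(r"\bH100\b", re.IGNORECASE), "h100"),
]


def accelerator_from_product_name(product_name: str) -> str:
    """Resolve a GPU product name to an accelerator value, or unknown-sku."""
    for pattern, accelerator in SKU_PATTERNS:
        if pattern.search(product_name):
            return accelerator
    return "unknown-sku"


# Product name reported on the validation cluster.
print(accelerator_from_product_name("NVIDIA H200 NVL"))
```

Without an H200 entry in the table, the same product name would fall through to unknown-sku, which is the gap the gpu_sku.go change closes.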

yuanchen8911 requested a review from a team as a code owner May 28, 2026 16:35

yuanchen8911 added area/recipes needs-triage bug labels May 28, 2026

github-actions Bot added the size/S label May 28, 2026

coderabbitai Bot reviewed May 28, 2026

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml

yuanchen8911 force-pushed the fix/recipe-gaps-1086 branch from fd65476 to 7753de2 Compare May 28, 2026 17:10

yuanchen8911 mentioned this pull request May 28, 2026

Address recipe gaps surfaced during BCM / H200 NVL validation #1086

Closed

4 tasks

yuanchen8911 marked this pull request as draft May 28, 2026 17:22

yuanchen8911 changed the title ~~fix(recipes): address BCM overlay gaps from H200 NVL validation~~ WIP: fix(recipes): address BCM overlay gaps from H200 NVL validation May 28, 2026

yuanchen8911 mentioned this pull request May 28, 2026

feat(recipe): register h200 as first-class accelerator type #1091

Merged

15 tasks

yuanchen8911 marked this pull request as ready for review May 28, 2026 18:21

yuanchen8911 changed the title ~~WIP: fix(recipes): address BCM overlay gaps from H200 NVL validation~~ fix(recipes): address BCM overlay gaps from H200 NVL validation May 28, 2026

yuanchen8911 requested a review from mchmarny May 28, 2026 18:22

yuanchen8911 marked this pull request as draft May 28, 2026 18:53

yuanchen8911 force-pushed the fix/recipe-gaps-1086 branch from 7753de2 to bccf6ad Compare May 28, 2026 19:16

yuanchen8911 force-pushed the fix/recipe-gaps-1086 branch from bccf6ad to c93c5d4 Compare May 28, 2026 19:36

yuanchen8911 marked this pull request as ready for review May 28, 2026 19:39

mchmarny approved these changes May 28, 2026

mchmarny enabled auto-merge (squash) May 28, 2026 19:43

mchmarny self-assigned this May 28, 2026

mchmarny merged commit 66b852e into NVIDIA:main May 28, 2026
115 checks passed

mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/recipe-gaps-1086
