fix(recipes): address BCM overlay gaps from H200 NVL validation #1089

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/recipe-gaps-1086
May 28, 2026

@yuanchen8911 yuanchen8911 commented May 28, 2026

Summary

Three small recipe-layer edits surfaced during cluster-side validation of the recently-merged feat/bcm-service-type work on a real H200 NVL test cluster: document why DRA priorityClassName is neutralized, mirror BCM control-plane tolerations onto DRA's kubeletPlugin, and enable GPUDirect Storage for BCM training.

Motivation / Context

Fixes: part of #1086 (3 of 4 checkboxes; H200 criteria registration deferred to a separate PR per the umbrella issue)
Related: #1087 (integration test for nvidiaDriverRoot / hostPaths.driverInstallDir lockstep)

Cluster-side validation context:

  • BCM-provisioned k8s 1.34.8, Ubuntu 24.04, kernel 6.8.0-71
  • 2× NVIDIA H200 NVL (141GB HBM3e each) on node007
  • gpu-operator v26.3.1, nvidia-dra-driver-gpu 25.12.0
  • aicr validate deployment+conformance phases pass end-to-end

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

recipes/components/nvidia-dra-driver-gpu/values.yaml — Document the existing priorityClassName: "" neutralization rather than restoring the upstream chart default (system-node-critical). The override has been in place since chore: init repo (Jan 2026) with no historical PR rationale; the most plausible reason is PriorityClass admission restrictions (PSA, ResourceQuota, PriorityClassPolicy) that AICR cannot assume across all supported services. Restoring the chart default would be a behavior change without a clear test signal, so the conservative path is documentation + a TODO(#1086) for revisit. Operators who need DRA pods to survive node-pressure eviction can re-pin via their own overlay.
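As a rough illustration of the documented override, the excerpt below assumes the chart exposes `controller.priorityClassName` and `kubeletPlugin.priorityClassName` (the two keys named in the review walkthrough). The comment wording is a paraphrase of the rationale above, not the text committed to values.yaml.

```yaml
# recipes/components/nvidia-dra-driver-gpu/values.yaml (illustrative excerpt)
controller:
  # Intentionally neutralized. The upstream chart default is
  # system-node-critical, but PriorityClass admission restrictions
  # cannot be assumed across all supported services.
  # Trade-off: DRA pods may be evicted under node pressure.
  # TODO(#1086): revisit whether the chart default can be restored.
  priorityClassName: ""
kubeletPlugin:
  # Same rationale as controller.priorityClassName above.
  priorityClassName: ""
```

An operator who needs eviction protection could set both keys back to `system-node-critical` in their own overlay.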

recipes/overlays/bcm.yaml: Mirror controller.tolerations onto kubeletPlugin.tolerations. Follow-up investigation revealed that the AICR bundler appends a blanket {operator: Exists} toleration to both paths via the registry.yaml nodeScheduling.{system,accelerated}.tolerationPaths entries (defaults sourced from pkg/snapshotter/agent.go DefaultTolerations). The blanket subsumes the specific master and control-plane entries, so the mirror is functionally a no-op in default mode. Its purpose is symmetry with the existing controller block and override resilience when a user passes --system-node-toleration to drop the blanket. An inline comment documents this relationship so future overlay editors aren't misled. Whether the tolerate-all default should be tightened per service has been filed separately as an AICR design question.
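The mirror can be pictured roughly as follows. This is a sketch, not the committed overlay: the taint keys are assumed to be the standard Kubernetes `node-role.kubernetes.io/master` and `node-role.kubernetes.io/control-plane` keys, the `NoSchedule` effect is taken from the review walkthrough, and the surrounding overlay structure is omitted.

```yaml
# recipes/overlays/bcm.yaml (illustrative excerpt, nvidia-dra-driver-gpu values only)
controller:
  tolerations:
    - key: node-role.kubernetes.io/master          # assumed key
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane   # assumed key
      operator: Exists
      effect: NoSchedule
kubeletPlugin:
  # Mirrors controller.tolerations. In default mode the bundler also appends
  # a blanket `operator: Exists` toleration, which tolerates every taint, so
  # these entries only take effect when --system-node-toleration is passed
  # and the blanket is dropped.
  tolerations:
    - key: node-role.kubernetes.io/master          # assumed key
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/control-plane   # assumed key
      operator: Exists
      effect: NoSchedule
```

A toleration with `operator: Exists` and no `key` matches all taints in Kubernetes, which is why the specific entries add nothing while the blanket is present.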

recipes/overlays/bcm-training.yaml — Enable gds.enabled: true for the gpu-operator componentRef. BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX hardware where GDS is a meaningful training I/O perf win (most pronounced on H200 NVL given 141GB HBM3e per device). On nodes without compatible hardware the nvidia-fs DaemonSet is benign — it logs a warning and stays inert.
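A minimal sketch of the overlay change, assuming the overlay lists components under a `componentRefs` key with `name` and `overrides` fields. Only the gpu-operator componentRef and `overrides.gds.enabled: true` are confirmed by this PR; the other field names are hypothetical.

```yaml
# recipes/overlays/bcm-training.yaml (illustrative excerpt)
componentRefs:              # list key name is an assumption
  - name: gpu-operator      # `name` field is an assumption
    overrides:
      gds:
        # GPUDirect Storage. Expected to stay inert (logging a warning)
        # on nodes without compatible NVMe/NIC hardware.
        enabled: true
```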

Testing

YAML-only changes; no Go source modified. Per CLAUDE.local.md scoped verification policy, ran the checks that match the change surface:

yamllint -c .yamllint.yaml recipes/components/nvidia-dra-driver-gpu/values.yaml \
                            recipes/overlays/bcm.yaml \
                            recipes/overlays/bcm-training.yaml
# Clean.

go test -count=1 -timeout 180s ./pkg/recipe/...
# ok  github.com/NVIDIA/aicr/pkg/recipe         0.864s
# ok  github.com/NVIDIA/aicr/pkg/recipe/oskind  0.713s

go test -count=1 -timeout 240s ./pkg/bundler/...
# 15 packages, all ok (recipe values flow through the bundler)

# End-to-end recipe + bundle against the BCM overlay chain:
aicr recipe --service bcm --accelerator h100 --os ubuntu --intent training -o /tmp/recipe.yaml
aicr bundle -r /tmp/recipe.yaml --deployer helmfile -o /tmp/bundle
# Generated DRA values show:
#   - controller.tolerations: [master, control-plane, {operator: Exists}]
#   - kubeletPlugin.tolerations: [master, control-plane, {operator: Exists}]
#   - gds.enabled: true on gpu-operator

Skipping full make qualify because the change is YAML-only — no Go test, e2e, lint, or scan target can regress from these edits. The scoped recipe + bundler tests cover the consumers that parse these files.

Risk Assessment

  • Low: isolated YAML-only changes; no behavior change to priorityClassName (documentation only); the kubeletPlugin tolerations mirror is a no-op in default mode (the bundler blanket subsumes it); the BCM GDS enable is additive and benign on hardware that doesn't use it.

Rollout notes: None. The GDS daemon is benign on nodes without compatible NVMe/NIC hardware (it logs a warning and stays inert).

Checklist

  • Tests pass locally (scoped: ./pkg/recipe/... and ./pkg/bundler/... with -count=1)
  • Linter passes (yamllint on changed files)
  • I did not skip/disable tests to make CI green
  • N/A — no new functionality requiring new tests; documentation + additive overlay edits
  • N/A — no user-facing behavior change beyond the GDS enable, which is BCM-service-internal
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d2ec6529-f236-471e-a89a-b9085f9fc30a

📥 Commits

Reviewing files that changed from the base of the PR and between bccf6ad and c93c5d4.

📒 Files selected for processing (3)
  • recipes/components/nvidia-dra-driver-gpu/values.yaml
  • recipes/overlays/bcm-training.yaml
  • recipes/overlays/bcm.yaml

📝 Walkthrough

This PR updates Kubernetes recipe configurations: (1) adds documentation in nvidia-dra-driver-gpu values clarifying why controller.priorityClassName and kubeletPlugin.priorityClassName are set to "" and notes eviction implications; (2) enables GPUDirect Storage by adding a gpu-operator componentRef and overrides.gds.enabled: true in the bcm-training overlay with hardware-dependent comments; and (3) mirrors BCM master/control-plane tolerations onto nvidia-dra-driver-gpu.kubeletPlugin by adding NoSchedule operator: Exists tolerations so DRA kubelet plugin scheduling matches BCM.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/aicr#1082: Related changes around preserving scheduling-related Helm values and overlay precedence that interact with these overrides.

Suggested labels

area/docs

Suggested reviewers

  • mchmarny
  • lockwobr
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(recipes): address BCM overlay gaps from H200 NVL validation' directly and clearly describes the main change: addressing specific gaps in BCM recipe overlays discovered during H200 NVL validation. |
| Description check | ✅ Passed | The description provides comprehensive context for all three YAML changes, including the specific validation scenario (H200 NVL cluster), rationale for each modification, testing performed, and risk assessment, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 69-78: This change updates component values (see
recipes/components/nvidia-dra-driver-gpu/values.yaml) but did not include the
regenerated BOM docs; run the make target "make bom-docs" locally to regenerate
docs/user/container-images.md and add/commit the updated
docs/user/container-images.md to this PR so the BOM reflects the values.yaml
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1fda92e1-e76d-445a-9a1e-e5e981fd3338

📥 Commits

Reviewing files that changed from the base of the PR and between ae6c948 and fd65476.

📒 Files selected for processing (3)
  • recipes/components/nvidia-dra-driver-gpu/values.yaml
  • recipes/overlays/bcm-training.yaml
  • recipes/overlays/bcm.yaml

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 17:22
@yuanchen8911 yuanchen8911 changed the title fix(recipes): address BCM overlay gaps from H200 NVL validation WIP: fix(recipes): address BCM overlay gaps from H200 NVL validation May 28, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 18:21
@yuanchen8911 yuanchen8911 changed the title WIP: fix(recipes): address BCM overlay gaps from H200 NVL validation fix(recipes): address BCM overlay gaps from H200 NVL validation May 28, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 18:22
@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 18:53
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. Resolves to the same hydrated values as h100 today via
shared base components (both are Hopper, same R570/R580 driver line,
NVML auto-detects everything); no new overlay file is needed.

Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
  for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
  contributor/api-server.md,contributor/cli.md,contributor/data.md,
  contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
  server/doc.go}: godoc enumerations

Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0;
H100 overlay chain produces correct hydrated values. End-to-end:
`aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training`
succeeds and produces a recipe with `criteria.accelerator: h200`.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-gaps-1086 branch from 7753de2 to bccf6ad Compare May 28, 2026 19:16

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

- Document why nvidia-dra-driver-gpu controller and kubeletPlugin
  priorityClassName are explicitly neutralized (PSA / PriorityClass
  admission constraints AICR cannot assume cluster-wide). Notes the
  eviction-under-node-pressure trade-off so operators can re-pin via
  their own overlay if needed.

- Mirror controller.tolerations onto kubeletPlugin.tolerations in
  bcm.yaml so DRA's kubelet plugin DaemonSet schedules on small BCM
  deployments that combine control-plane and worker roles on the
  same node.

- Enable GPUDirect Storage (gds.enabled: true) in bcm-training.yaml.
  BCM-provisioned nodes typically ship NVIDIA-validated NVMe + ConnectX
  hardware where GDS delivers a meaningful training I/O perf win
  (most pronounced on H200 NVL given its 141GB HBM3e per device).
  Benign on nodes without compatible hardware.

Surfaced during cluster-side validation of the recently-merged
feat/bcm-service-type work on a real H200 NVL test cluster.

Addresses 3 of 4 checkboxes in NVIDIA#1086; H200 criteria registration is
the larger 4th item and will land in a separate PR per the umbrella
issue.
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-gaps-1086 branch from bccf6ad to c93c5d4 Compare May 28, 2026 19:36
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 19:39

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@mchmarny mchmarny enabled auto-merge (squash) May 28, 2026 19:43
@mchmarny mchmarny self-assigned this May 28, 2026
@mchmarny mchmarny merged commit 66b852e into NVIDIA:main May 28, 2026
115 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. H200 is the same Hopper generation as H100 (same R570/R580
driver line, same gpu-operator support floor), so its deployment-phase
floor mirrors H100's.

Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
  for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
  contributor/api-server.md,contributor/cli.md,contributor/data.md,
  contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
  server/doc.go}: godoc enumerations

Wires up the two consumer paths the enum alone left incomplete:

- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the
  snapshot -> fingerprint -> recipe path resolves real H200 hardware
  ("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring
  h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0)
  so an `--accelerator h200` recipe inherits the same deployment-phase
  floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the
  model->accelerator mapping and valid-values tables

Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.
End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu
--intent training` resolves the identical deployment floor as the h100
equivalent and produces a recipe with criteria.accelerator: h200.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
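
For orientation, the criteria block of the resulting recipe might look like the sketch below. Only `criteria.accelerator: h200` is confirmed by the commit message; the other field names are assumptions derived from the CLI flags, and all other recipe content is omitted.

```yaml
# Recipe output (illustrative excerpt)
criteria:
  service: bcm        # from --service (field name assumed)
  accelerator: h200   # confirmed by the commit message above
  os: ubuntu          # from --os (field name assumed)
  intent: training    # from --intent (field name assumed)
```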