feat(recipe): register h200 as first-class accelerator type by yuanchen8911 · Pull Request #1091 · NVIDIA/aicr

yuanchen8911 · 2026-05-28T17:32:02Z

Summary

Add h200 as a built-in accelerator type so users with H200 hardware can pass --accelerator h200 and have the recipe metadata reflect what's actually deployed. H200 is the same Hopper generation as H100 (same R570/R580 driver line, same gpu-operator support floor, NVML auto-detects everything), so it shares H100's deployment-phase floor via a parallel h200-any.yaml criteria-wildcard overlay.

Motivation / Context

Fixes: checkbox 4 of #1086
Related: #1089 (the BCM overlay-gaps PR that landed checkboxes 1–3 of #1086)

Validated on a real H200 NVL cluster during the BCM service overlay work:

2× NVIDIA H200 NVL (141GB HBM3e each), Hopper, compute 9.0
GFD labels and DRA ResourceSlice correctly identify the device as NVIDIA H200 NVL
The H100 overlay chain produces correct hydrated values when applied to H200 hardware — no functional divergence at the component-values layer

Before this PR:

Recipe metadata claimed accelerator: h100 for an H200 cluster — wrong on inspection, breaks reproducibility
If we ever want H200-specific tuning later (MIG profiles, gds.enabled gated on H200's larger HBM, or NVL-form-factor topology), there's no enum key to hang it on

Test infrastructure was already partially in place: pkg/recipe/criteria_registry_parse_test.go registered h200 via OriginExternal as the extensibility test value. This PR promotes h200 to first-class and swaps that test value to mi300x (a clearly non-built-in accelerator that retains the extensibility-path semantics).

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Snapshot / fingerprint (pkg/fingerprint)
Docs/examples (docs/)

Implementation Notes

Surfaces touched per the .claude/CLAUDE.md "Adding a new enum value" audit rule:

Core Go:

pkg/recipe/criteria.go — new const CriteriaAcceleratorH200, parse case, GetCriteriaAcceleratorTypes() list, godoc enumeration
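
The three touch points named above (const, parse case, Get list) can be sketched as follows. This is a minimal illustration, not the code in pkg/recipe/criteria.go: only CriteriaAcceleratorH200 and GetCriteriaAcceleratorTypes() are named in this PR, while the type definition, the other constant names, the return type, and the parseAccelerator helper are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAcceleratorType is an assumed string-backed enum type; the real
// definition in pkg/recipe/criteria.go may differ.
type CriteriaAcceleratorType string

// Illustrative subset of built-in values. Only CriteriaAcceleratorH200 is
// named in the PR; the other constant names are assumptions.
const (
	CriteriaAcceleratorH100  CriteriaAcceleratorType = "h100"
	CriteriaAcceleratorH200  CriteriaAcceleratorType = "h200"
	CriteriaAcceleratorB200  CriteriaAcceleratorType = "b200"
	CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200"
)

// GetCriteriaAcceleratorTypes lists the built-in values (signature assumed).
func GetCriteriaAcceleratorTypes() []CriteriaAcceleratorType {
	return []CriteriaAcceleratorType{
		CriteriaAcceleratorH100,
		CriteriaAcceleratorH200,
		CriteriaAcceleratorB200,
		CriteriaAcceleratorGB200,
	}
}

// parseAccelerator is a hypothetical parser with one case per built-in value.
// Input is trimmed and lower-cased first, so "H200" and "h200" are equivalent.
func parseAccelerator(s string) (CriteriaAcceleratorType, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "h100":
		return CriteriaAcceleratorH100, nil
	case "h200":
		return CriteriaAcceleratorH200, nil
	case "b200":
		return CriteriaAcceleratorB200, nil
	case "gb200":
		return CriteriaAcceleratorGB200, nil
	}
	return "", fmt.Errorf("unknown accelerator type %q", s)
}

func main() {
	for _, in := range []string{"h200", "H200", "mi300x"} {
		got, err := parseAccelerator(in)
		fmt.Printf("input=%q parsed=%q err=%v\n", in, got, err)
	}
}
```

Because the constant, the parse case, and the list are three separate edits, missing any one of them leaves the value half-registered, which is presumably why the audit rule enumerates them individually.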

Snapshot detection (the auto-detection half of the enum):

pkg/fingerprint/gpu_sku.go — add the H200 ProductName pattern to gpuSKURegistry so the snapshot → fingerprint → recipe path resolves real H200 hardware (NVIDIA H200 NVL, …141GB HBM3e) to h200 instead of unknown-sku. Without this, only the manual --accelerator h200 path worked; snapshot-driven recipe generation (the path the wrong-metadata problem originates from) never produced h200.
pkg/fingerprint/gpu_sku_test.go — H200 NVL + 141GB HBM3e cases

Deployment-phase floor (so h200 truly mirrors h100):

recipes/overlays/h200-any.yaml — new criteria-wildcard overlay mirroring h100-any.yaml (4 standard checks: operator-health, expected-resources, gpu-operator-version, check-nvidia-smi; plus gpu-operator >= v24.6.0, the Hopper floor). Without it, an --accelerator h200 recipe matched no h100-* overlay and landed on bare base, silently dropping the deployment validation floor every h100 recipe gets. Mirrors the per-accelerator wildcard pattern from feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001 (h100-any / b200-any / gb200-any / rtx-pro-6000-any).
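
To make the "landed on bare base" failure mode concrete, here is a rough sketch of criteria-wildcard matching. Everything in it is hypothetical: the criteria struct, the "any" wildcard spelling, and the matching rule are assumptions for illustration and are not taken from the aicr recipe engine or its overlay schema.

```go
package main

import "fmt"

// criteria is a hypothetical, simplified view of recipe criteria.
type criteria struct {
	Service, Accelerator, OS, Intent string
}

// wildcard is an assumed spelling for "matches every value".
const wildcard = "any"

// overlay pairs an overlay name with the criteria it applies to.
type overlay struct {
	Name     string
	Criteria criteria
}

// fieldMatches treats an empty or wildcard overlay field as matching anything.
func fieldMatches(overlayValue, queryValue string) bool {
	return overlayValue == "" || overlayValue == wildcard || overlayValue == queryValue
}

// matches reports whether every overlay field accepts the query.
func (o criteria) matches(q criteria) bool {
	return fieldMatches(o.Service, q.Service) &&
		fieldMatches(o.Accelerator, q.Accelerator) &&
		fieldMatches(o.OS, q.OS) &&
		fieldMatches(o.Intent, q.Intent)
}

// matchingOverlays returns the names of all overlays that apply to the query.
func matchingOverlays(overlays []overlay, q criteria) []string {
	var names []string
	for _, o := range overlays {
		if o.Criteria.matches(q) {
			names = append(names, o.Name)
		}
	}
	return names
}

func main() {
	query := criteria{Service: "bcm", Accelerator: "h200", OS: "ubuntu", Intent: "training"}

	// Before: only the H100 wildcard exists, so an h200 query matches nothing.
	before := []overlay{{Name: "h100-any", Criteria: criteria{Accelerator: "h100"}}}
	fmt.Println("before:", matchingOverlays(before, query))

	// After: a parallel h200-any overlay supplies the same floor for h200.
	after := append(before, overlay{Name: "h200-any", Criteria: criteria{Accelerator: "h200"}})
	fmt.Println("after:", matchingOverlays(after, query))
}
```

Under these assumptions the wildcard applies to every field except the accelerator, so an overlay keyed to h100 can never serve an h200 query. A separate h200-any overlay is the smallest change that restores the floor.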

Tests:

pkg/recipe/criteria_test.go — parse table covers h200 and H200 (uppercase); TestGetCriteriaAcceleratorTypes expected list updated
pkg/recipe/criteria_registry_parse_test.go and criteria_registry_test.go — h200 was the OriginExternal extensibility example; swapped to mi300x (AMD, clearly non-built-in) so the tests retain their original intent

API surface:

api/aicr/v1/server.yaml — all 5 enum blocks (accelerator + gpu alias, on snapshot/recipe endpoints)

Docs:

docs/README.md (glossary)
docs/user/cli-reference.md, docs/user/api-reference.md
docs/contributor/api-server.md, docs/contributor/cli.md, docs/contributor/data.md, docs/contributor/validations.md, docs/contributor/api-server-extending.md
.claude/skills/analyzing-snapshots/SKILL.md — model→accelerator mapping table + valid-values list

Godoc:

pkg/recipe/doc.go, pkg/cli/recipe.go, pkg/api/doc.go, pkg/server/doc.go, pkg/fingerprint/doc.go, pkg/fingerprint/types.go

Issue templates:

.github/ISSUE_TEMPLATE/bug_report.yml (GPU type dropdown — 2 occurrences)

Intentionally not changed:

types.go top-level "e.g., h100, b200" example comment — e.g. is illustrative, not enumerative
docs/integrator/components/nodewright.md — that page lists NodeWright's current component support (h100, gb200), which is component-specific scope and would be a false claim to extend
Various e.g., h100, gb200 example references in docs/user/cli-reference.md and elsewhere — illustrative, not enumerative
Concrete service-bound h200-* leaf overlays (e.g. h200-bcm-training) — deferred until there's a real per-service tuning delta; the h200-any.yaml floor plus the accelerator-agnostic service overlays cover today's needs

Testing

yamllint -c .yamllint.yaml recipes/overlays/h200-any.yaml api/aicr/v1/server.yaml .github/ISSUE_TEMPLATE/bug_report.yml
# Clean.

go test -count=1 ./pkg/recipe/... ./pkg/fingerprint/...
# ok (pkg/recipe 90.8%, pkg/fingerprint 98.1%)

golangci-lint run -c .golangci.yaml ./pkg/fingerprint/...
# 0 issues.

make qualify
# Pass (touched packages pkg/recipe, pkg/fingerprint, recipes all ok).

# End-to-end — h200 now resolves the identical deployment floor as h100:
aicr recipe --service bcm --accelerator h200 --os ubuntu --intent training -o /tmp/h200.yaml
# deployment phase: operator-health, expected-resources, gpu-operator-version,
# check-nvidia-smi + Deployment.gpu-operator.version >= v24.6.0
# (byte-identical to the --accelerator h100 deployment phase)

Risk Assessment

Low — Additive enum value + parallel Hopper-floor overlay + one SKU-detection pattern. No breaking changes; existing h100 recipes are unaffected; h200-any.yaml only matches accelerator: h200 queries. H200 hardware now auto-detects to h200 on the snapshot path and inherits the same deployment floor as H100.

Rollout notes: None. Existing recipes that use --accelerator h100 for H200 hardware continue to work; users can opt into the more-accurate --accelerator h200 at their convenience.

Checklist

Tests pass locally (./pkg/recipe/..., ./pkg/fingerprint/..., and make qualify for the touched packages)
Linter passes (yamllint + golangci-lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (criteria_test.go parse + Get cases, gpu_sku_test.go H200 cases, extensibility tests swapped to mi300x)
I updated docs for user-facing behavior (all enum-enumerating doc pages audited per .claude/CLAUDE.md)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-05-28T17:32:55Z

🌿 Preview your docs: https://nvidia-preview-feat-h200-accelerator-1086.docs.buildwithfern.com/aicr

coderabbitai · 2026-05-28T17:36:50Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

This PR adds h200 as a supported GPU accelerator type across the AICR codebase. Changes include: updating the OpenAPI schema (api/aicr/v1/server.yaml) to include h200 in query parameter and schema enums; implementing h200 support in the Go criteria type system (pkg/recipe/criteria.go) with a new constant, parsing logic, and getter updates; extending unit tests to cover h200 parsing and adjust registry-based tests; updating fingerprint GPU SKU parsing and related tests; updating documentation, CLI help text, issue templates, and adding an h200 RecipeMetadata overlay.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Address recipe gaps surfaced during BCM / H200 NVL validation #1086 — The PR directly implements h200 accelerator support in code (CriteriaAcceleratorH200 constant, parsing, enums, tests, and documentation) to address the feature request.

Possibly related PRs

NVIDIA/aicr#999 — Overlaps at accelerator parsing in pkg/recipe/criteria.go; both PRs modify parsing/registry behavior.

Suggested labels

area/tests, size/L

Suggested reviewers

mchmarny
lalitadithya
xdu31

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: promoting h200 to a first-class accelerator type with full enum registration, detection, and overlay support.
Description check	✅ Passed	The description comprehensively explains the change, motivation, implementation details, testing, and risk assessment, all directly related to the h200 accelerator promotion.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/doc.go`:
- Around line 29-30: The package docs are missing CriteriaAcceleratorH200 in the
"Accelerator types for GPU selection" list; add CriteriaAcceleratorH200 to that
bullet list in pkg/recipe/doc.go so the documented accelerator list (the
comments showing values like h100, h200, gb200, ...) matches the updated
field/query docs and includes h200 alongside the other CriteriaAccelerator*
entries.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0ee64d10-17bd-4920-9e70-2facff9a017d

📥 Commits

Reviewing files that changed from the base of the PR and between ae6c948 and 73564c1.

📒 Files selected for processing (20)

.github/ISSUE_TEMPLATE/bug_report.yml
api/aicr/v1/server.yaml
docs/README.md
docs/contributor/api-server-extending.md
docs/contributor/api-server.md
docs/contributor/cli.md
docs/contributor/data.md
docs/contributor/validations.md
docs/user/api-reference.md
docs/user/cli-reference.md
pkg/api/doc.go
pkg/cli/recipe.go
pkg/fingerprint/doc.go
pkg/fingerprint/types.go
pkg/recipe/criteria.go
pkg/recipe/criteria_registry_parse_test.go
pkg/recipe/criteria_registry_test.go
pkg/recipe/criteria_test.go
pkg/recipe/doc.go
pkg/server/doc.go

coderabbitai · 2026-05-28T19:21:53Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T20:08:05Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T20:16:14Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T20:35:09Z

Actionable comments posted: 0

mchmarny

Model PR for adding a new enum value. The audit follows the CLAUDE.md "new enum" rule across every surface (Go const + parse + Get list, OpenAPI spec ×5 blocks, every doc page that enumerates accelerators, issue template dropdown, godoc on six packages), h200-any.yaml is byte-parallel with h100-any.yaml so the Hopper deployment floor carries over with zero divergence risk, and the GH200/H200 substring disambiguation in gpu_sku.go is exactly the right defense with negative tests to lock it in. Swapping the OriginExternal extensibility test value from h200 → mi300x (clearly non-NVIDIA) preserves the original test intent.

Note: PR is currently blocked only on in-progress CI (tests/E2E, Tier 1 deploy matrix still running at review time) — substance is approved.

LGTM 🚢

mchmarny · 2026-05-28T20:36:06Z

+	// first-class accelerator enum, so match it explicitly before the "H200"
+	// rule and leave it unresolved ("") rather than mislabeling it as h200 —
+	// same reason "GB200" precedes "B200" above.
+	{"GH200", ""},


Nice catch on the GH200 substring trap — Grace Hopper would otherwise silently mislabel as h200. The comment explicitly anchoring it to the existing GB200→B200 precedent makes the ordering invariant clear to future contributors, and the negative test cases (NVIDIA GH200, NVIDIA GH200 480GB) lock it in.
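
The ordering invariant discussed here can be sketched as a first-match-wins substring table. This is an illustration under stated assumptions, not the contents of pkg/fingerprint/gpu_sku.go: gpuSKURegistry and the `{"GH200", ""}` entry appear in the PR, but the skuRule type, the remaining entries, the resolveSKU helper, the case folding, and the empty-string result for unmatched names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// skuRule maps a ProductName substring to an accelerator value.
// An empty accelerator means "recognized, but deliberately left unresolved".
type skuRule struct {
	pattern     string
	accelerator string
}

// gpuSKURegistry is evaluated top to bottom and the first match wins, so a
// longer pattern must precede any shorter pattern it contains.
var gpuSKURegistry = []skuRule{
	{"GB200", "gb200"}, // must precede "B200"
	{"B200", "b200"},
	{"GH200", ""}, // must precede "H200"; Grace Hopper is not an h200
	{"H200", "h200"},
	{"H100", "h100"},
}

// resolveSKU is a hypothetical lookup. It returns "" both for names that
// match no rule and for names matched by an intentionally unresolved rule.
func resolveSKU(productName string) string {
	name := strings.ToUpper(productName)
	for _, r := range gpuSKURegistry {
		if strings.Contains(name, r.pattern) {
			return r.accelerator
		}
	}
	return ""
}

func main() {
	for _, name := range []string{
		"NVIDIA H200 NVL",
		"NVIDIA GH200 480GB",
		"NVIDIA GB200",
		"NVIDIA H100 80GB HBM3",
	} {
		fmt.Printf("%-24s -> %q\n", name, resolveSKU(name))
	}
}
```

If the GH200 and H200 rows were swapped, "NVIDIA GH200 480GB" would match the H200 rule first, which is the silent mislabel the negative tests guard against.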

Adds CriteriaAcceleratorH200 to the criteria registry so users can pass `--accelerator h200` and have the recipe metadata reflect the hardware. H200 is the same Hopper generation as H100 (same R570/R580 driver line, same gpu-operator support floor), so its deployment-phase floor mirrors H100's.

Updates every surface that enumerates accelerator types per the "Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in) for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
  contributor/api-server.md,contributor/cli.md,contributor/data.md,
  contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
  server/doc.go}: godoc enumerations

Wires up the two consumer paths the enum alone left incomplete:

- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the snapshot -> fingerprint -> recipe path resolves real H200 hardware ("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0) so an `--accelerator h200` recipe inherits the same deployment-phase floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the model->accelerator mapping and valid-values tables

Validated against a real H200 NVL cluster: GFD / DRA correctly identify the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.

End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu --intent training` resolves the identical deployment floor as the h100 equivalent and produces a recipe with criteria.accelerator: h200.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out from PR NVIDIA#1089).

yuanchen8911 requested review from a team as code owners May 28, 2026 17:32

yuanchen8911 added the enhancement, area/recipes, area/docs, needs-triage, and area/cli labels May 28, 2026

github-actions Bot added the area/ci, area/api, and size/M labels and removed the area/recipes label May 28, 2026

coderabbitai Bot reviewed May 28, 2026

Comment thread pkg/recipe/doc.go

yuanchen8911 marked this pull request as draft May 28, 2026 18:53

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 73564c1 to 5777f12 Compare May 28, 2026 19:16

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 5777f12 to 971fb99 Compare May 28, 2026 20:00

github-actions Bot added the area/recipes label May 28, 2026

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 971fb99 to 2e4d05e Compare May 28, 2026 20:03

mchmarny assigned yuanchen8911 May 28, 2026

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 2e4d05e to d09d8ba Compare May 28, 2026 20:08

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch 2 times, most recently from 74ff416 to 10a43cf Compare May 28, 2026 20:27

yuanchen8911 marked this pull request as ready for review May 28, 2026 20:29

yuanchen8911 requested a review from mchmarny May 28, 2026 20:30

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 10a43cf to 2b06bd0 Compare May 28, 2026 20:31

mchmarny approved these changes May 28, 2026

yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 2b06bd0 to 6dff4c4 Compare May 28, 2026 20:44

yuanchen8911 enabled auto-merge (squash) May 28, 2026 20:45

yuanchen8911 merged commit ba787a5 into NVIDIA:main May 28, 2026
118 checks passed

mchmarny mentioned this pull request May 28, 2026

Address recipe gaps surfaced during BCM / H200 NVL validation #1086

Closed

4 tasks

This was referenced May 29, 2026

feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102

Merged

feat(recipe): per-field union merge for validation phase checks #1103

Merged

2 participants
