
feat(recipe): register h200 as first-class accelerator type #1091

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/h200-accelerator-1086
May 28, 2026

Contributor

@yuanchen8911 yuanchen8911 commented May 28, 2026

Summary

Add h200 as a built-in accelerator type so users with H200 hardware can pass --accelerator h200 and have the recipe metadata reflect what's actually deployed. H200 is the same Hopper generation as H100 (same R570/R580 driver line, same gpu-operator support floor, NVML auto-detects everything), so it shares H100's deployment-phase floor via a parallel h200-any.yaml criteria-wildcard overlay.

Motivation / Context

Fixes: checkbox 4 of #1086
Related: #1089 (the BCM overlay-gaps PR that landed checkboxes 1–3 of #1086)

Validated on a real H200 NVL cluster during the BCM service overlay work:

  • 2× NVIDIA H200 NVL (141GB HBM3e each), Hopper, compute 9.0
  • GFD labels and DRA ResourceSlice correctly identify the device as NVIDIA H200 NVL
  • The H100 overlay chain produces correct hydrated values when applied to H200 hardware — no functional divergence at the component-values layer

Before this PR:

  • Recipe metadata claimed accelerator: h100 for an H200 cluster — wrong on inspection, breaks reproducibility
  • If we ever want H200-specific tuning later (MIG profiles, gds.enabled gated on H200's larger HBM, or NVL-form-factor topology), there's no enum key to hang it on

Test infrastructure was already partially in place: pkg/recipe/criteria_registry_parse_test.go registered h200 via OriginExternal as the extensibility test value. This PR promotes h200 to first-class and swaps that test value to mi300x (a clearly non-built-in accelerator that retains the extensibility-path semantics).

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Snapshot / fingerprint (pkg/fingerprint)
  • Docs/examples (docs/)

Implementation Notes

Surfaces touched per the .claude/CLAUDE.md "Adding a new enum value" audit rule:

Core Go:

  • pkg/recipe/criteria.go — new const CriteriaAcceleratorH200, parse case, GetCriteriaAcceleratorTypes() list, godoc enumeration
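
The three Go-side pieces named above (constant, parse path, enumerating getter) can be sketched as follows. This is a minimal illustration under assumptions, not the contents of pkg/recipe/criteria.go: `CriteriaAcceleratorH200` and `GetCriteriaAcceleratorTypes` are names from this PR, while the `CriteriaAcceleratorType` type name, the other constants, the getter's return type, and `parseAccelerator` are hypothetical. The PR mentions a "parse case", which suggests a switch; the loop below is a simplification that derives parsing from the list.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAcceleratorType is an assumed name for the enum's string type.
type CriteriaAcceleratorType string

const (
	CriteriaAcceleratorH100  CriteriaAcceleratorType = "h100"
	CriteriaAcceleratorH200  CriteriaAcceleratorType = "h200" // the value this PR adds
	CriteriaAcceleratorB200  CriteriaAcceleratorType = "b200"
	CriteriaAcceleratorGB200 CriteriaAcceleratorType = "gb200"
)

// GetCriteriaAcceleratorTypes lists built-in values (abbreviated; return type assumed).
func GetCriteriaAcceleratorTypes() []CriteriaAcceleratorType {
	return []CriteriaAcceleratorType{
		CriteriaAcceleratorH100,
		CriteriaAcceleratorH200,
		CriteriaAcceleratorB200,
		CriteriaAcceleratorGB200,
	}
}

// parseAccelerator is a hypothetical parser: case-insensitive and
// whitespace-tolerant, so "H200" and "h200" resolve to the same value.
func parseAccelerator(s string) (CriteriaAcceleratorType, error) {
	want := CriteriaAcceleratorType(strings.ToLower(strings.TrimSpace(s)))
	for _, t := range GetCriteriaAcceleratorTypes() {
		if t == want {
			return t, nil
		}
	}
	return "", fmt.Errorf("unknown accelerator %q", s)
}

func main() {
	for _, in := range []string{"h200", "H200", "mi300x"} {
		got, err := parseAccelerator(in)
		fmt.Printf("%q -> %q (err=%v)\n", in, got, err)
	}
}
```

Deriving the parser from the getter keeps the two from drifting apart, which is the kind of mismatch the enum audit rule is meant to catch.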

Snapshot detection (the auto-detection half of the enum):

  • pkg/fingerprint/gpu_sku.go — add the H200 ProductName pattern to gpuSKURegistry so the snapshot → fingerprint → recipe path resolves real H200 hardware (NVIDIA H200 NVL, …141GB HBM3e) to h200 instead of unknown-sku. Without this, only the manual --accelerator h200 path worked; snapshot-driven recipe generation (the path the wrong-metadata problem originates from) never produced h200.
  • pkg/fingerprint/gpu_sku_test.go — H200 NVL + 141GB HBM3e cases
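
The ProductName matching can be pictured as an ordered, first-match-wins substring table. The sketch below is an assumption about the shape of `gpuSKURegistry`, not its actual definition: the `skuPattern` type, the `resolveAccelerator` helper, the case folding, and the empty-string result for unmatched names are hypothetical. The `{"GH200", ""}` entry and its placement before `H200` follow the snippet quoted in the review further down this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// skuPattern pairs a ProductName substring with the accelerator it resolves to.
// An empty accelerator means "recognized, but intentionally left unresolved".
type skuPattern struct {
	substr      string
	accelerator string
}

// gpuSKURegistry is scanned in order and the first match wins, so a name that
// contains a shorter pattern (GB200 vs B200, GH200 vs H200) must come first.
var gpuSKURegistry = []skuPattern{
	{"GB200", "gb200"},
	{"B200", "b200"},
	{"GH200", ""}, // Grace Hopper: not h200, so stop here without resolving
	{"H200", "h200"},
	{"H100", "h100"},
}

// resolveAccelerator returns the accelerator for a product name, or "" when
// nothing resolves (how the real code reports unknown SKUs is not shown here).
func resolveAccelerator(productName string) string {
	upper := strings.ToUpper(productName)
	for _, p := range gpuSKURegistry {
		if strings.Contains(upper, p.substr) {
			return p.accelerator
		}
	}
	return ""
}

func main() {
	for _, name := range []string{"NVIDIA H200 NVL", "NVIDIA GH200 480GB", "NVIDIA B200"} {
		fmt.Printf("%-20s -> %q\n", name, resolveAccelerator(name))
	}
}
```

Swapping the `GH200` and `H200` rows would make Grace Hopper parts resolve to h200, which is why the ordering is treated as an invariant and covered by negative tests.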

Deployment-phase floor (so h200 truly mirrors h100):

  • recipes/overlays/h200-any.yaml — new criteria-wildcard overlay mirroring h100-any.yaml (4 standard checks: operator-health, expected-resources, gpu-operator-version, check-nvidia-smi; plus gpu-operator >= v24.6.0, the Hopper floor). Without it, an --accelerator h200 recipe matched no h100-* overlay and landed on bare base, silently dropping the deployment validation floor every h100 recipe gets. Mirrors the per-accelerator wildcard pattern from feat(recipe): deliver deployment-phase floor at per-accelerator wildcards #1001 (h100-any / b200-any / gb200-any / rtx-pro-6000-any).
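
The fallback described above (an h200 query matching no h100-* overlay and landing on bare base) can be illustrated with a toy matcher. This is a minimal sketch under stated assumptions: the `Criteria` struct, its field names, the literal "any" wildcard spelling, and the `Matches` method are hypothetical and are not the recipe engine's real types or the overlay file's schema.

```go
package main

import "fmt"

// Criteria is a hypothetical, simplified view of an overlay's match keys.
type Criteria struct {
	Service, Accelerator, OS, Intent string
}

// wildcard is an assumed spelling for "matches every value".
const wildcard = "any"

// fieldMatches accepts a query value when the overlay field is a wildcard
// or names exactly that value.
func fieldMatches(overlay, query string) bool {
	return overlay == wildcard || overlay == query
}

// Matches reports whether an overlay's criteria accept the query.
func (o Criteria) Matches(q Criteria) bool {
	return fieldMatches(o.Service, q.Service) &&
		fieldMatches(o.Accelerator, q.Accelerator) &&
		fieldMatches(o.OS, q.OS) &&
		fieldMatches(o.Intent, q.Intent)
}

// Per-accelerator wildcard overlays: only the accelerator is pinned.
var (
	h100Any = Criteria{Service: wildcard, Accelerator: "h100", OS: wildcard, Intent: wildcard}
	h200Any = Criteria{Service: wildcard, Accelerator: "h200", OS: wildcard, Intent: wildcard}
)

func main() {
	q := Criteria{Service: "bcm", Accelerator: "h200", OS: "ubuntu", Intent: "training"}
	fmt.Println("h100-any matches:", h100Any.Matches(q))
	fmt.Println("h200-any matches:", h200Any.Matches(q))
}
```

Under this model, an h200 query picks up the deployment floor only if an overlay pinned to h200 exists, and an h100 query is never affected by it.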

Tests:

  • pkg/recipe/criteria_test.go — parse table covers h200 and H200 (uppercase); TestGetCriteriaAcceleratorTypes expected list updated
  • pkg/recipe/criteria_registry_parse_test.go and criteria_registry_test.go: h200 was the OriginExternal extensibility example; swapped to mi300x (AMD, clearly non-built-in) so the tests retain their original intent
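
Why the extensibility tests needed a different example value can be seen in a toy registry. Only `OriginExternal` is a name from this PR; `OriginBuiltin`, the `registry` type, its methods, and the rule that registering an existing built-in is rejected are all assumptions made for illustration, not the behavior of the real criteria registry.

```go
package main

import "fmt"

// Origin records where an accelerator value came from.
type Origin int

const (
	OriginBuiltin  Origin = iota // hypothetical name for compiled-in values
	OriginExternal               // named in the PR: values registered at runtime
)

// registry is a hypothetical stand-in for the criteria registry.
type registry map[string]Origin

// newRegistry seeds the built-ins; h200 is among them after this PR.
func newRegistry() registry {
	return registry{"h100": OriginBuiltin, "h200": OriginBuiltin}
}

// register adds an external value. Rejecting built-ins is an assumed rule.
func (r registry) register(value string) error {
	if o, ok := r[value]; ok && o == OriginBuiltin {
		return fmt.Errorf("%q is already built in", value)
	}
	r[value] = OriginExternal
	return nil
}

// lookup returns the origin of a value and whether it is known at all.
func (r registry) lookup(value string) (Origin, bool) {
	o, ok := r[value]
	return o, ok
}

func main() {
	r := newRegistry()
	fmt.Println("register mi300x:", r.register("mi300x"))
	fmt.Println("register h200:  ", r.register("h200"))
}
```

Once h200 is built in, it can no longer stand in for "a value the engine only knows through registration", so a value such as mi300x keeps the test exercising the external path.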

API surface:

  • api/aicr/v1/server.yaml — all 5 enum blocks (accelerator + gpu alias, on snapshot/recipe endpoints)

Docs:

  • docs/README.md (glossary)
  • docs/user/cli-reference.md, docs/user/api-reference.md
  • docs/contributor/api-server.md, docs/contributor/cli.md, docs/contributor/data.md, docs/contributor/validations.md, docs/contributor/api-server-extending.md
  • .claude/skills/analyzing-snapshots/SKILL.md — model→accelerator mapping table + valid-values list

Godoc:

  • pkg/recipe/doc.go, pkg/cli/recipe.go, pkg/api/doc.go, pkg/server/doc.go, pkg/fingerprint/doc.go, pkg/fingerprint/types.go

Issue templates:

  • .github/ISSUE_TEMPLATE/bug_report.yml (GPU type dropdown — 2 occurrences)

Intentionally not changed:

  • types.go top-level "e.g., h100, b200" example comment — e.g. is illustrative, not enumerative
  • docs/integrator/components/nodewright.md — that page lists NodeWright's current component support (h100, gb200), which is component-specific scope and would be a false claim to extend
  • Various e.g., h100, gb200 example references in docs/user/cli-reference.md and elsewhere — illustrative, not enumerative
  • Concrete service-bound h200-* leaf overlays (e.g. h200-bcm-training) — deferred until there's a real per-service tuning delta; the h200-any.yaml floor plus the accelerator-agnostic service overlays cover today's needs

Testing

yamllint -c .yamllint.yaml recipes/overlays/h200-any.yaml api/aicr/v1/server.yaml .github/ISSUE_TEMPLATE/bug_report.yml
# Clean.

go test -count=1 ./pkg/recipe/... ./pkg/fingerprint/...
# ok (pkg/recipe 90.8%, pkg/fingerprint 98.1%)

golangci-lint run -c .golangci.yaml ./pkg/fingerprint/...
# 0 issues.

make qualify
# Pass (touched packages pkg/recipe, pkg/fingerprint, recipes all ok).

# End-to-end — h200 now resolves the identical deployment floor as h100:
aicr recipe --service bcm --accelerator h200 --os ubuntu --intent training -o /tmp/h200.yaml
# deployment phase: operator-health, expected-resources, gpu-operator-version,
# check-nvidia-smi + Deployment.gpu-operator.version >= v24.6.0
# (byte-identical to the --accelerator h100 deployment phase)

Risk Assessment

  • Low — Additive enum value + parallel Hopper-floor overlay + one SKU-detection pattern. No breaking changes; existing h100 recipes are unaffected; h200-any.yaml only matches accelerator: h200 queries. H200 hardware now auto-detects to h200 on the snapshot path and inherits the same deployment floor as H100.

Rollout notes: None. Existing recipes that use --accelerator h100 for H200 hardware continue to work; users can opt into the more-accurate --accelerator h200 at their convenience.

Checklist

  • Tests pass locally (./pkg/recipe/..., ./pkg/fingerprint/..., and make qualify for the touched packages)
  • Linter passes (yamllint + golangci-lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (criteria_test.go parse + Get cases, gpu_sku_test.go H200 cases, extensibility tests swapped to mi300x)
  • I updated docs for user-facing behavior (all enum-enumerating doc pages audited per .claude/CLAUDE.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented May 28, 2026

📝 Walkthrough


This PR adds h200 as a supported GPU accelerator type across the AICR codebase. Changes include: updating the OpenAPI schema (api/aicr/v1/server.yaml) to include h200 in query parameter and schema enums; implementing h200 support in the Go criteria type system (pkg/recipe/criteria.go) with a new constant, parsing logic, and getter updates; extending unit tests to cover h200 parsing and adjust registry-based tests; updating fingerprint GPU SKU parsing and related tests; updating documentation, CLI help text, issue templates, and adding an h200 RecipeMetadata overlay.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#999 — Overlaps at accelerator parsing in pkg/recipe/criteria.go; both PRs modify parsing/registry behavior.

Suggested labels

area/tests, size/L

Suggested reviewers

  • mchmarny
  • lalitadithya
  • xdu31
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: promoting h200 to a first-class accelerator type with full enum registration, detection, and overlay support. |
| Description check | ✅ Passed | The description comprehensively explains the change, motivation, implementation details, testing, and risk assessment, all directly related to the h200 accelerator promotion. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/doc.go`:
- Around line 29-30: The package docs are missing CriteriaAcceleratorH200 in the
"Accelerator types for GPU selection" list; add CriteriaAcceleratorH200 to that
bullet list in pkg/recipe/doc.go so the documented accelerator list (the
comments showing values like h100, h200, gb200, ...) matches the updated
field/query docs and includes h200 alongside the other CriteriaAccelerator*
entries.

📥 Commits

Reviewing files that changed from the base of the PR and between ae6c948 and 73564c1.

📒 Files selected for processing (20)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/api-server-extending.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/data.md
  • docs/contributor/validations.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/api/doc.go
  • pkg/cli/recipe.go
  • pkg/fingerprint/doc.go
  • pkg/fingerprint/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_registry_parse_test.go
  • pkg/recipe/criteria_registry_test.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/server/doc.go

Comment thread pkg/recipe/doc.go
@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 18:53
@yuanchen8911 yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 73564c1 to 5777f12 Compare May 28, 2026 19:16
@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0


@yuanchen8911 yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 2e4d05e to d09d8ba Compare May 28, 2026 20:08

@yuanchen8911 yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch 2 times, most recently from 74ff416 to 10a43cf Compare May 28, 2026 20:27
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 20:29
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 20:30
@yuanchen8911 yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 10a43cf to 2b06bd0 Compare May 28, 2026 20:31

Member

@mchmarny mchmarny left a comment


Model PR for adding a new enum value. The audit follows the CLAUDE.md "new enum" rule across every surface (Go const + parse + Get list, OpenAPI spec ×5 blocks, every doc page that enumerates accelerators, issue template dropdown, godoc on six packages). h200-any.yaml is byte-parallel with h100-any.yaml, so the Hopper deployment floor carries over with zero divergence risk, and the GH200/H200 substring disambiguation in gpu_sku.go is exactly the right defense, with negative tests to lock it in. Swapping the OriginExternal extensibility test value from h200 to mi300x (clearly non-NVIDIA) preserves the original test intent.

Note: PR is currently blocked only on in-progress CI (tests/E2E, Tier 1 deploy matrix still running at review time) — substance is approved.

LGTM 🚢

// first-class accelerator enum, so match it explicitly before the "H200"
// rule and leave it unresolved ("") rather than mislabeling it as h200 —
// same reason "GB200" precedes "B200" above.
{"GH200", ""},
Member


Nice catch on the GH200 substring trap: Grace Hopper would otherwise silently mislabel as h200. The comment explicitly anchoring it to the existing GB200/B200 precedent makes the ordering invariant clear to future contributors, and the negative test cases (NVIDIA GH200, NVIDIA GH200 480GB) lock it in.

Adds CriteriaAcceleratorH200 to the criteria registry so users can
pass `--accelerator h200` and have the recipe metadata reflect the
hardware. H200 is the same Hopper generation as H100 (same R570/R580
driver line, same gpu-operator support floor), so its deployment-phase
floor mirrors H100's.

Updates every surface that enumerates accelerator types per the
"Adding a new enum value" audit rule in .claude/CLAUDE.md:

- pkg/recipe/criteria.go: new const, parse case, AllCriteriaAcceleratorTypes
- pkg/recipe/criteria_test.go: parse + Get tests cover h200
- pkg/recipe/criteria_registry_{,parse_}test.go: swap h200 (now built-in)
  for mi300x as the extensibility-test example value
- api/aicr/v1/server.yaml: all 5 enum blocks
- .github/ISSUE_TEMPLATE/bug_report.yml: GPU type dropdown entries
- docs/{README.md,user/cli-reference.md,user/api-reference.md,
  contributor/api-server.md,contributor/cli.md,contributor/data.md,
  contributor/validations.md,contributor/api-server-extending.md}
- pkg/{api,cli/recipe.go,fingerprint/{doc.go,types.go},recipe/doc.go,
  server/doc.go}: godoc enumerations

Wires up the two consumer paths the enum alone left incomplete:

- pkg/fingerprint/gpu_sku.go: add the H200 ProductName pattern so the
  snapshot -> fingerprint -> recipe path resolves real H200 hardware
  ("NVIDIA H200 NVL", "...141GB HBM3e") to h200 instead of unknown-sku
- recipes/overlays/h200-any.yaml: new criteria-wildcard overlay mirroring
  h100-any.yaml (4 standard deployment checks + gpu-operator >= v24.6.0)
  so an `--accelerator h200` recipe inherits the same deployment-phase
  floor as H100 rather than landing on bare base
- .claude/skills/analyzing-snapshots/SKILL.md: add h200 to the
  model->accelerator mapping and valid-values tables

Validated against a real H200 NVL cluster: GFD / DRA correctly identify
the device as "NVIDIA H200 NVL", 141GB HBM3e, Hopper, compute 9.0.
End-to-end: `aicr recipe --accelerator h200 --service bcm --os ubuntu
--intent training` resolves the identical deployment floor as the h100
equivalent and produces a recipe with criteria.accelerator: h200.

Addresses checkbox 4 of NVIDIA#1086 (the H200 registration item carved out
from PR NVIDIA#1089).
@yuanchen8911 yuanchen8911 force-pushed the feat/h200-accelerator-1086 branch from 2b06bd0 to 6dff4c4 Compare May 28, 2026 20:44
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) May 28, 2026 20:45
@yuanchen8911 yuanchen8911 merged commit ba787a5 into NVIDIA:main May 28, 2026
118 checks passed