Skip to content

feat(recipe): per-field union merge for validation phase checks#1103

Merged
mchmarny merged 5 commits into
mainfrom
feat/recipe-validation-union-merge
May 29, 2026
Merged

feat(recipe): per-field union merge for validation phase checks#1103
mchmarny merged 5 commits into
mainfrom
feat/recipe-validation-union-merge

Conversation

@mchmarny
Copy link
Copy Markdown
Member

@mchmarny mchmarny commented May 29, 2026

Summary

Switch validation.{phase} merge from per-phase pointer replace to per-field union, matching the semantics already used at the top level for constraints and componentRefs. Leaves can now add or override checks and phase-level constraints without restating the entire inherited block.

This is a merge-semantics correction, not a two-file patch. External overlay authors and downstream tenant overlays that depend on the new per-field rules — including the nil-vs-empty contract for clearing inherited lists — pick up the corrected behavior automatically.

Motivation / Context

RecipeMetadataSpec.Merge was the only list-shaped field with replace semantics — if a leaf overlay declared a deployment: block, the leaf's block fully overwrote the inherited one. This was a silent footgun: future leaves that add a deployment: block (to set one constraint or one extra check) would silently drop every inherited check unless the author remembered to re-state them. The five *-any.yaml wildcard overlays carried YAML comments warning about this; comments are a weak guard, so the merge itself is now safe by construction.

Fixes: #1000
Related: #1001 (which added the warning comments that this PR removes), #969 (which exposed the issue)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)
  • Other: recipe overlays (recipes/overlays/*.yaml)

Implementation Notes

Per-field merge rules in mergeValidationPhase:

  • Checks — three-way semantics on the overlay field:
    • nil (field omitted in YAML) → inherit the base list unchanged.
    • non-nil empty (checks: []) → explicitly clear the inherited list. Required by Slurm leaves (h100-eks-ubuntu-training-slurm, h100-gke-cos-training-slurm) to drop the K8s-native nccl-all-reduce-bw check that bypasses slurmd on Slurm-managed clusters.
    • non-empty → deduplicated union with base; base entries first, then overlay-only entries appended.
  • Constraints — same nil-vs-empty rule. Non-empty unions by Name; overlay wins on same-name (analogous to top-level RecipeMetadataSpec.Constraints).
  • NodeSelection — overlay replaces wholesale when non-nil; otherwise base preserved. Full struct replace because per-field semantics aren't obvious and the struct is small.
  • Timeout, Infrastructure — overlay-wins-if-non-empty.

The nil-vs-empty distinction is preserved end-to-end by yaml.v3: YAML checks: [] decodes as a non-nil empty slice (clear), while an omitted or null key decodes as nil (inherit).

A new mergeValidationPhase helper in pkg/recipe/validation.go encapsulates this. The existing cloneValidationPhase was refactored to share a new cloneNodeSelection helper. The Merge site in metadata.go now calls mergeValidationPhase for all four phase pointers in one line each.

The five *-any.yaml wildcard overlays (b200, gb200, h100, h200, rtx-pro-6000) had their "Per-phase replace semantics" warning comments replaced with a positive description of the new union semantics. Two rtx-pro-6000-eks-ubuntu-inference-* leaves had their inline comments updated similarly. Docs in docs/contributor/data.md and docs/integrator/recipe-development.md now describe both the per-field union and the nil-vs-empty clear contract.

Testing

make qualify

Results:

  • All Go tests pass (with -race).
  • golangci-lint clean (0 issues on ./pkg/recipe/...).
  • Chainsaw e2e tests: 22 passed / 0 failed.
  • Vulnerability scan: clean.

Test coverage in pkg/recipe/metadata_test.go (TestMergeValidationConfig):

  • Union-merge core: phase_checks_union_deduplicates_preserving_order, phase_constraints_union_with_overlay_overriding_same_name (slice-order asserted via reflect.DeepEqual).
  • NodeSelection: overlay-wins-when-set and inherited-when-overlay-omits-it.
  • Scalar fields: phase_scalar_fields_overlay_wins_when_set and phase_scalar_fields_inherited_when_overlay_omits_them.
  • Nil-vs-empty contract: phase_explicit_empty_checks_clears_inherited, phase_explicit_empty_constraints_clears_inherited, phase_nil_checks_inherits_from_base.
  • Alias safety: union_merge_does_not_mutate_source_slices (mutates the source-derived index 1 under base-first union).

Resolution-level regression guard in pkg/recipe/metadata_store_test.go:

  • TestSlurmLeavesClearInheritedPerformancePhase — loads the real metadata store and resolves h100-eks-ubuntu-training-slurm and h100-gke-cos-training-slurm through BuildRecipeResult, asserting performance.checks and performance.constraints resolve to empty so FilterEntriesByValidation skips the phase.

Coverage:

  • pkg/recipe: 91.3% → 91.9% (+0.6%)
  • mergeValidationPhase: 97.7%
  • Merge: 97.4%

Risk Assessment

  • Medium — Touches merge semantics that affect every recipe rendering a validation.<phase> block. Existing leaves that re-state the full inherited list still produce identical resolved output. Leaves that rely on checks: [] to clear inherited lists (Slurm leaves) keep their previous behavior because the nil-vs-empty contract is preserved.

Rollout notes: No migration needed for in-repo overlays. External/tenant overlays that previously relied on the per-phase pointer-replace semantics to wholesale override should:

  • Continue to work unchanged if they restated the full inherited list (union dedupes).
  • Migrate from "restate everything to override one value" to declaring only the override (cleaner, less duplication).
  • Continue to use checks: [] / constraints: [] to explicitly clear inherited lists — that contract is preserved.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Switch validation.{phase} merge from per-phase pointer replace to
per-field union, matching the semantics already used at the top level.
Leaves can now add or override checks/constraints without restating the
entire inherited block.

Per-field rules:
- Checks: deduplicated union, preserving order (base then overlay-only)
- Constraints: union by name, overlay wins on same-name
- NodeSelection: overlay replaces wholesale when non-nil
- Timeout, Infrastructure: overlay-wins-if-non-empty

Removes the "per-phase replace" warning comments from the five *-any.yaml
wildcard overlays and the two rtx-pro-6000 leaves that referenced them.

Closes #1000
@mchmarny mchmarny requested review from a team as code owners May 29, 2026 17:53
@mchmarny mchmarny added the enhancement New feature or request label May 29, 2026
@mchmarny mchmarny self-assigned this May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f4e64c89-b659-4608-9bff-cfa62288859a

📥 Commits

Reviewing files that changed from the base of the PR and between 1a08074 and e37bdb6.

📒 Files selected for processing (5)
  • docs/contributor/data.md
  • docs/integrator/recipe-development.md
  • pkg/recipe/metadata_store_test.go
  • pkg/recipe/metadata_test.go
  • pkg/recipe/validation.go

📝 Walkthrough

Walkthrough

This PR implements per-field union merge semantics for spec.validation.<phase> blocks in recipe overlays, replacing the prior per-phase wholesale replacement. It introduces a new mergeValidationPhase helper function that unions Checks and Constraints by name, applies overlay-wins semantics for scalar timeout/infrastructure fields, and replaces NodeSelection wholesale when the overlay provides it. The metadata merge orchestration is updated to call this helper for all four validation phases. Comprehensive tests validate all merge path combinations and mutation safety. Documentation in contributor and integrator guides explains the new behavior, and inline comments across hardware overlay files reflect the changed semantics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/aicr#1091: Related changes to validation-phase overlay merge semantics and overlay comments that interact with deployment validation floors.
  • NVIDIA/aicr#1046: Changes to RecipeMetadataSpec.Merge behavior that affect how EKS RTX overlays inherit/merge validation fields.

Suggested labels

area/tests

Suggested reviewers

  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: switching validation phase checks from per-phase replacement to per-field union merge semantics, which is the core objective of this PR.
Description check ✅ Passed The PR description is well-detailed and comprehensively covers the motivation, implementation, testing, and risk assessment for the per-field union merge feature, directly addressing the changeset.
Linked Issues check ✅ Passed The PR fully addresses issue #1000 by implementing per-field union merge for validation phases with correct semantics for checks (deduplicated union), constraints (union by name with overlay precedence), NodeSelection (overlay wholesale replace), and timeout/infrastructure (overlay wins when non-empty). All tests and documentation updates are present.
Out of Scope Changes check ✅ Passed All changes directly support the primary objective of implementing per-field union merge for validation phases: new merge logic in validation.go, metadata.go updates, comprehensive test coverage, documentation updates, and overlay comment clarifications are all in scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/recipe-validation-union-merge

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 1275-1285: The test mutates base.Validation (the merged result) at
index 0 which is base-derived under base-first union semantics and thus won't
catch aliasing from source; instead mutate elements that are source-derived
(e.g., index 1 or the last element) so the mutation would leak if the merge
aliased source. Change the two mutation targets from
base.Validation.Deployment.Checks[0] and
base.Validation.Deployment.Constraints[0].Value to a source-derived index (e.g.,
base.Validation.Deployment.Checks[1] and
base.Validation.Deployment.Constraints[1].Value or use len()-1) and keep the
assertions checking source.Deployment.Checks[...] and
source.Deployment.Constraints[...] to verify the source remained unchanged.
- Around line 1119-1138: The test currently converts
base.Validation.Deployment.Constraints into a map (byName) which hides duplicate
names and loses order; instead assert the slice itself: check len(got) equals
expected (3), then assert each element in got at specific indexes has the
expected Name and Value (e.g. got[0].Name == "Deployment.gpu-operator.version"
and got[0].Value == ">= v25.10.1", etc.), and optionally keep the map assertions
as a secondary check for content; update the test around variables got and
base.Validation.Deployment.Constraints to validate uniqueness and ordering
directly rather than relying solely on byName.

In `@pkg/recipe/validation.go`:
- Around line 129-155: The second loop that appends overlay-only constraints
should use the resolved overlay winner from overlayByName instead of the raw
loop variable, so duplicate names in overlay respect the "later overlay entry
wins" rule; update the loop that iterates overlay.Constraints (which currently
does out.Constraints = append(out.Constraints, c)) to look up
overlayByName[c.Name] and append that value (using the existing overlayByName
map and seen map) so both the base-override and overlay-only paths consistently
use the resolved overlay entry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c22545ff-2b68-4dcc-bdd0-3acbbf35845b

📥 Commits

Reviewing files that changed from the base of the PR and between 8396a5c and bdc2d5e.

📒 Files selected for processing (12)
  • docs/contributor/data.md
  • docs/integrator/recipe-development.md
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go
  • pkg/recipe/validation.go
  • recipes/overlays/b200-any.yaml
  • recipes/overlays/gb200-any.yaml
  • recipes/overlays/h100-any.yaml
  • recipes/overlays/h200-any.yaml
  • recipes/overlays/rtx-pro-6000-any.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml

Comment thread pkg/recipe/metadata_test.go
Comment thread pkg/recipe/metadata_test.go Outdated
Comment thread pkg/recipe/validation.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Coverage Report ✅

Metric Value
Coverage 77.1%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.1%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/recipe 91.87% (+0.53%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/recipe/metadata.go 97.36% (+0.33%) 265 (-4) 258 (-3) 7 (-1) 👍
github.com/NVIDIA/aicr/pkg/recipe/validation.go 95.65% (+32.02%) 69 (+47) 66 (+52) 3 (-5) 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

mchmarny and others added 3 commits May 29, 2026 11:20
- mergeValidationPhase: append overlayByName[c.Name] in the overlay-only
  loop so intra-overlay duplicate-name constraints respect the same
  last-wins rule the base-override branch uses.
- TestMergeValidationConfig: assert constraints slice order via
  reflect.DeepEqual instead of map lookups; map-based assertions hide
  ordering regressions and duplicate-name leaks.
- TestMergeValidationConfig alias-safety: mutate the source-derived
  index (1) instead of the base-derived index (0). Under base-first
  union semantics, base occupies index 0 — mutating it only catches
  base aliasing, not source aliasing (the regression being guarded).
dims
dims previously approved these changes May 29, 2026
Copy link
Copy Markdown
Collaborator

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cross-review by Claude Code + Codex + CodeRabbit

One confirmed blocking regression, reproduced end-to-end against the pre-PR base (8396a5c2) vs head (bdc2d5e2): the per-field union merge can no longer honor an explicit empty list. The two H100 Slurm leaves declare performance.checks: [] / constraints: [] to clear the inherited K8s-native nccl-all-reduce-bw check, but under union they resolve to performance.checks = [nccl-all-reduce-bw] and now run a Kubernetes-native NCCL test on a Slurm cluster.

See the four inline comments for specifics:

  1. Root cause + required fix — pkg/recipe/validation.go (preserve nil-vs-empty: explicit [] = clear).
  2. Required regression test — pkg/recipe/metadata_test.go (resolve both Slurm leaves, assert cleared).
  3. Docs follow-up — docs/contributor/data.md (document the explicit-empty clear).
  4. Impact framing — pkg/recipe/metadata.go (merge-semantics change also affects external/tenant overlays).

The additive union/override/ordering logic, alias safety (fresh allocation + cloneNodeSelection, guarded by the new no-mutate test), and the cloneValidationPhase refactor are all correct — the only gap is the explicit-empty-clear case. Recommend blocking until the fix lands with a resolution-level test.

Comment thread pkg/recipe/validation.go Outdated
Comment thread pkg/recipe/metadata_test.go
Comment thread docs/contributor/data.md
Comment thread pkg/recipe/metadata.go
Per #1000 review: mergeValidationPhase must distinguish nil (field
omitted → inherit) from a non-nil empty slice (YAML `checks: []` →
explicit clear). The first pass collapsed both into "inherit", which
regressed h100-eks-ubuntu-training-slurm and h100-gke-cos-training-slurm
— both declare `performance.checks: []` to drop the K8s-native
nccl-all-reduce-bw check that bypasses slurmd on a Slurm-managed
cluster. Under the broken merge, those leaves would re-inherit the
NCCL check and run it on a Slurm cluster (false validation).

This is a merge-semantics correction, not a two-file patch — external
overlay authors relying on `checks: []` to clear inherited checks
also depended on the now-restored behavior.

- mergeValidationPhase: treat overlay.Checks/Constraints == nil as
  inherit; treat non-nil len-0 as explicit clear; non-empty unions
  with base as before. yaml.v3 preserves this nil-vs-empty distinction
  (`checks: []` → non-nil empty; omitted/null → nil).
- Add regression unit tests for the explicit-clear and nil-inherit paths.
- Add TestSlurmLeavesClearInheritedPerformancePhase that resolves the
  two Slurm leaves end-to-end through BuildRecipeResult and asserts
  their performance phase has no checks/constraints (so
  FilterEntriesByValidation skips the phase).
- Document the nil-vs-empty contract in docs/contributor/data.md and
  docs/integrator/recipe-development.md so future overlay authors
  know how to clear inherited content.
@github-actions github-actions Bot added size/XL and removed size/L labels May 29, 2026
@mchmarny mchmarny requested review from dims and yuanchen8911 May 29, 2026 19:03
@mchmarny mchmarny enabled auto-merge (squash) May 29, 2026 19:11
Copy link
Copy Markdown
Collaborator

@dims dims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mchmarny mchmarny merged commit 76647b7 into main May 29, 2026
120 checks passed
@mchmarny mchmarny deleted the feat/recipe-validation-union-merge branch May 29, 2026 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(recipe): per-field union merge for validation phase checks

3 participants