feat(recipe): per-field union merge for validation phase checks#1103
Conversation
Switch validation.{phase} merge from per-phase pointer replace to
per-field union, matching the semantics already used at the top level.
Leaves can now add or override checks/constraints without restating the
entire inherited block.
Per-field rules:
- Checks: deduplicated union, preserving order (base then overlay-only)
- Constraints: union by name, overlay wins on same-name
- NodeSelection: overlay replaces wholesale when non-nil
- Timeout, Infrastructure: overlay-wins-if-non-empty
Removes the "per-phase replace" warning comments from the five *-any.yaml
wildcard overlays and the two rtx-pro-6000 leaves that referenced them.
Closes #1000
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR implements per-field union merge semantics for Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 1275-1285: The test mutates base.Validation (the merged result) at
index 0 which is base-derived under base-first union semantics and thus won't
catch aliasing from source; instead mutate elements that are source-derived
(e.g., index 1 or the last element) so the mutation would leak if the merge
aliased source. Change the two mutation targets from
base.Validation.Deployment.Checks[0] and
base.Validation.Deployment.Constraints[0].Value to a source-derived index (e.g.,
base.Validation.Deployment.Checks[1] and
base.Validation.Deployment.Constraints[1].Value or use len()-1) and keep the
assertions checking source.Deployment.Checks[...] and
source.Deployment.Constraints[...] to verify the source remained unchanged.
- Around line 1119-1138: The test currently converts
base.Validation.Deployment.Constraints into a map (byName) which hides duplicate
names and loses order; instead assert the slice itself: check len(got) equals
expected (3), then assert each element in got at specific indexes has the
expected Name and Value (e.g. got[0].Name == "Deployment.gpu-operator.version"
and got[0].Value == ">= v25.10.1", etc.), and optionally keep the map assertions
as a secondary check for content; update the test around variables got and
base.Validation.Deployment.Constraints to validate uniqueness and ordering
directly rather than relying solely on byName.
In `@pkg/recipe/validation.go`:
- Around line 129-155: The second loop that appends overlay-only constraints
should use the resolved overlay winner from overlayByName instead of the raw
loop variable, so duplicate names in overlay respect the "later overlay entry
wins" rule; update the loop that iterates overlay.Constraints (which currently
does out.Constraints = append(out.Constraints, c)) to look up
overlayByName[c.Name] and append that value (using the existing overlayByName
map and seen map) so both the base-override and overlay-only paths consistently
use the resolved overlay entry.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c22545ff-2b68-4dcc-bdd0-3acbbf35845b
📒 Files selected for processing (12)
docs/contributor/data.mddocs/integrator/recipe-development.mdpkg/recipe/metadata.gopkg/recipe/metadata_test.gopkg/recipe/validation.gorecipes/overlays/b200-any.yamlrecipes/overlays/gb200-any.yamlrecipes/overlays/h100-any.yamlrecipes/overlays/h200-any.yamlrecipes/overlays/rtx-pro-6000-any.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yamlrecipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
- mergeValidationPhase: append overlayByName[c.Name] in the overlay-only loop so intra-overlay duplicate-name constraints respect the same last-wins rule the base-override branch uses. - TestMergeValidationConfig: assert constraints slice order via reflect.DeepEqual instead of map lookups; map-based assertions hide ordering regressions and duplicate-name leaks. - TestMergeValidationConfig alias-safety: mutate the source-derived index (1) instead of the base-derived index (0). Under base-first union semantics, base occupies index 0 — mutating it only catches base aliasing, not source aliasing (the regression being guarded).
There was a problem hiding this comment.
Cross-review by Claude Code + Codex + CodeRabbit
One confirmed blocking regression, reproduced end-to-end against the pre-PR base (8396a5c2) vs head (bdc2d5e2): the per-field union merge can no longer honor an explicit empty list. The two H100 Slurm leaves declare performance.checks: [] / constraints: [] to clear the inherited K8s-native nccl-all-reduce-bw check, but under union they resolve to performance.checks = [nccl-all-reduce-bw] and now run a Kubernetes-native NCCL test on a Slurm cluster.
See the four inline comments for specifics:
- Root cause + required fix —
pkg/recipe/validation.go(preserve nil-vs-empty: explicit[]= clear). - Required regression test —
pkg/recipe/metadata_test.go(resolve both Slurm leaves, assert cleared). - Docs follow-up —
docs/contributor/data.md(document the explicit-empty clear). - Impact framing —
pkg/recipe/metadata.go(merge-semantics change also affects external/tenant overlays).
The additive union/override/ordering logic, alias safety (fresh allocation + cloneNodeSelection, guarded by the new no-mutate test), and the cloneValidationPhase refactor are all correct — the only gap is the explicit-empty-clear case. Recommend blocking until the fix lands with a resolution-level test.
Per #1000 review: mergeValidationPhase must distinguish nil (field omitted → inherit) from a non-nil empty slice (YAML `checks: []` → explicit clear). The first pass collapsed both into "inherit", which regressed h100-eks-ubuntu-training-slurm and h100-gke-cos-training-slurm — both declare `performance.checks: []` to drop the K8s-native nccl-all-reduce-bw check that bypasses slurmd on a Slurm-managed cluster. Under the broken merge, those leaves would re-inherit the NCCL check and run it on a Slurm cluster (false validation). This is a merge-semantics correction, not a two-file patch — external overlay authors relying on `checks: []` to clear inherited checks also depended on the now-restored behavior. - mergeValidationPhase: treat overlay.Checks/Constraints == nil as inherit; treat non-nil len-0 as explicit clear; non-empty unions with base as before. yaml.v3 preserves this nil-vs-empty distinction (`checks: []` → non-nil empty; omitted/null → nil). - Add regression unit tests for the explicit-clear and nil-inherit paths. - Add TestSlurmLeavesClearInheritedPerformancePhase that resolves the two Slurm leaves end-to-end through BuildRecipeResult and asserts their performance phase has no checks/constraints (so FilterEntriesByValidation skips the phase). - Document the nil-vs-empty contract in docs/contributor/data.md and docs/integrator/recipe-development.md so future overlay authors know how to clear inherited content.
Summary
Switch
validation.{phase}merge from per-phase pointer replace to per-field union, matching the semantics already used at the top level forconstraintsandcomponentRefs. Leaves can now add or override checks and phase-level constraints without restating the entire inherited block.This is a merge-semantics correction, not a two-file patch. External overlay authors and downstream tenant overlays that depend on the new per-field rules — including the nil-vs-empty contract for clearing inherited lists — pick up the corrected behavior automatically.
Motivation / Context
RecipeMetadataSpec.Mergewas the only list-shaped field with replace semantics — if a leaf overlay declared adeployment:block, the leaf's block fully overwrote the inherited one. This was a silent footgun: future leaves that add adeployment:block (to set one constraint or one extra check) would silently drop every inherited check unless the author remembered to re-state them. The five*-any.yamlwildcard overlays carried YAML comments warning about this; comments are a weak guard, so the merge itself is now safe by construction.Fixes: #1000
Related: #1001 (which added the warning comments that this PR removes), #969 (which exposed the issue)
Type of Change
Component(s) Affected
pkg/recipe)docs/,examples/)recipes/overlays/*.yaml)Implementation Notes
Per-field merge rules in
mergeValidationPhase:Checks— three-way semantics on the overlay field:nil(field omitted in YAML) → inherit the base list unchanged.checks: []) → explicitly clear the inherited list. Required by Slurm leaves (h100-eks-ubuntu-training-slurm,h100-gke-cos-training-slurm) to drop the K8s-nativenccl-all-reduce-bwcheck that bypasses slurmd on Slurm-managed clusters.Constraints— same nil-vs-empty rule. Non-empty unions byName; overlay wins on same-name (analogous to top-levelRecipeMetadataSpec.Constraints).NodeSelection— overlay replaces wholesale when non-nil; otherwise base preserved. Full struct replace because per-field semantics aren't obvious and the struct is small.Timeout,Infrastructure— overlay-wins-if-non-empty.The nil-vs-empty distinction is preserved end-to-end by
yaml.v3: YAMLchecks: []decodes as a non-nil empty slice (clear), while an omitted ornullkey decodes as nil (inherit).A new
mergeValidationPhasehelper inpkg/recipe/validation.goencapsulates this. The existingcloneValidationPhasewas refactored to share a newcloneNodeSelectionhelper. TheMergesite inmetadata.gonow callsmergeValidationPhasefor all four phase pointers in one line each.The five
*-any.yamlwildcard overlays (b200, gb200, h100, h200, rtx-pro-6000) had their "Per-phase replace semantics" warning comments replaced with a positive description of the new union semantics. Twortx-pro-6000-eks-ubuntu-inference-*leaves had their inline comments updated similarly. Docs indocs/contributor/data.mdanddocs/integrator/recipe-development.mdnow describe both the per-field union and the nil-vs-empty clear contract.Testing
Results:
-race).golangci-lintclean (0 issues on./pkg/recipe/...).Test coverage in
pkg/recipe/metadata_test.go(TestMergeValidationConfig):phase_checks_union_deduplicates_preserving_order,phase_constraints_union_with_overlay_overriding_same_name(slice-order asserted viareflect.DeepEqual).NodeSelection: overlay-wins-when-set and inherited-when-overlay-omits-it.phase_scalar_fields_overlay_wins_when_setandphase_scalar_fields_inherited_when_overlay_omits_them.phase_explicit_empty_checks_clears_inherited,phase_explicit_empty_constraints_clears_inherited,phase_nil_checks_inherits_from_base.union_merge_does_not_mutate_source_slices(mutates the source-derived index 1 under base-first union).Resolution-level regression guard in
pkg/recipe/metadata_store_test.go:TestSlurmLeavesClearInheritedPerformancePhase— loads the real metadata store and resolvesh100-eks-ubuntu-training-slurmandh100-gke-cos-training-slurmthroughBuildRecipeResult, assertingperformance.checksandperformance.constraintsresolve to empty soFilterEntriesByValidationskips the phase.Coverage:
pkg/recipe: 91.3% → 91.9% (+0.6%)mergeValidationPhase: 97.7%Merge: 97.4%Risk Assessment
validation.<phase>block. Existing leaves that re-state the full inherited list still produce identical resolved output. Leaves that rely onchecks: []to clear inherited lists (Slurm leaves) keep their previous behavior because the nil-vs-empty contract is preserved.Rollout notes: No migration needed for in-repo overlays. External/tenant overlays that previously relied on the per-phase pointer-replace semantics to wholesale override should:
checks: []/constraints: []to explicitly clear inherited lists — that contract is preserved.Checklist
make testwith-race)make lint)git commit -S)