Skip to content

fix(validators): reorder validation phases and fix stale paths and documentation issues#1201

Open
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/validate-phase-order-manifest-paths
Open

fix(validators): reorder validation phases and fix stale paths and documentation issues#1201
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/validate-phase-order-manifest-paths

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented Jun 5, 2026

Summary

Two validator fixes surfaced while running aicr validate --phase all on a single-node EKS H100 cluster after a gpu-operator upgrade:

  1. Run the performance phase last — reorder the default phase order from deployment → performance → conformance to deployment → conformance → performance.
  2. Fix stale Equivalent: kubectl apply -f … hint paths in conformance check artifacts and evidence script (docs/conformance/cncf/manifests/ and pkg/evidence/scripts/manifests/ don't exist).

Motivation / Context

On a single GPU node (8×H100), --phase all had conformance dra-support fail with Unschedulable: cannot allocate all claims. Root cause: the performance phase ran immediately before conformance; its inference-perf benchmark uses all 8 GPUs (8 vLLM workers via DRA) and the DynamoGraphDeployment teardown releases those ResourceClaims asynchronously — so dra-support's 1-GPU test pod couldn't schedule. "Worked before" only because earlier runs were standalone phases. DRA itself was verified healthy (a standalone ResourceClaim pod allocated an H100 + ran nvidia-smi). Running performance last frees GPUs for conformance and keeps a flaky perf phase from blocking conformance under --fail-fast.

The hint paths are cosmetic ("equivalent kubectl" text); the checks build resources in-code via the K8s API, not from those files.

Changes

Phase order → deployment, conformance, performance (kept consistent across every surface):

  • pkg/validator/phases.go PhaseOrder (+ rationale comment) and pkg/validator/validator_test.go TestPhaseOrder.
  • pkg/client/v1: SDK facade — ValidateState method godoc, the package-level overview godoc, and TestValidateState_PhaseSelection "unset runs all phases" (was a CI-breaking stale assertion).
  • pkg/validator/v1/job_plan.go Plan() literal (test-only helper; removes latent drift).
  • Docs: docs/contributor/validator.md (order + why-last rationale), docs/user/validation.md (intro, "When to validate" table, "equivalent to" line), docs/design/002-validatorv2-adr.md (pipeline list + execution diagram).
  • Not changed (deliberate): pkg/evidence/attestation/AllPhases stays frozen — it's the canonical output order for reproducible attestation bytes; reordering it would change provenance output. Added a comment documenting that it's intentionally independent of execution PhaseOrder.

Stale manifest hint paths → pkg/evidence/cncf/scripts/manifests/:

  • dra_support_check.godra-gpu-test.yaml; pod_autoscaling_check.gohpa-gpu-test.yaml.
  • cluster_autoscaling_check.go: referenced a non-existent hpa-gpu-scale-test.yaml; since no cluster-autoscaling manifest exists and hpa-gpu-test.yaml is the pod-autoscaling manifest, the misleading apply hint is removed — the artifact now states the resources are built in-code (HPA + GPU deployment + Karpenter NodePool).
  • collect-evidence.sh: fixed all three **Test manifest:** text paths (dra / gang-scheduling / hpa) to the real pkg/evidence/cncf/scripts/manifests/ location.

Type of Change

  • Bug fix (conformance false-negative on single-node clusters + misleading/stale hint paths)

Component(s) Affected

  • Validator (pkg/validator, pkg/validator/v1, validators/conformance)
  • SDK facade (pkg/client/v1)
  • Docs + evidence tooling

Testing

go test ./pkg/client/v1 ./pkg/validator ./pkg/validator/v1 ./pkg/evidence/attestation ./validators/conformance   # ok
golangci-lint run -c .golangci.yaml ./pkg/client/v1/... ./pkg/validator/... ./pkg/evidence/attestation/... ./validators/conformance/...   # 0 issues
env -u GITLAB_TOKEN make test   # pass

Validated end-to-end on EKS H100 (aicr3, post gpu-operator v26.3.2 upgrade): reordered aicr validate --phase all passed all three phases — deployment 4/4, conformance 11/11 (incl. dra-support), performance 1/1 (107,350 tok/s / 1,939 ms). The reorder resolved the contention exactly as intended.

Risk Assessment

  • Low — phase-order slice change + artifact/string/doc edits; no control-flow or deployment-topology change. Attestation output order intentionally unchanged.

Checklist

  • Tests pass locally (incl. the previously-failing pkg/client/v1 assertion)
  • Linter passes (0 issues)
  • Docs updated (contributor + user + ADR, all order references aligned)
  • Commit cryptographically signed (-S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3ec6e336-56cd-4f49-9d16-3ff40e959726

📥 Commits

Reviewing files that changed from the base of the PR and between f0cc6cb and 1bdc29f.

📒 Files selected for processing (16)
  • docs/contributor/validator.md
  • docs/design/002-validatorv2-adr.md
  • docs/user/validation.md
  • pkg/client/v1/aicr.go
  • pkg/client/v1/aicr_test.go
  • pkg/evidence/attestation/types.go
  • pkg/evidence/cncf/scripts/collect-evidence.sh
  • pkg/evidence/cncf/scripts/manifests/dra-gpu-test.yaml
  • pkg/evidence/cncf/scripts/manifests/gang-scheduling-test.yaml
  • pkg/evidence/cncf/scripts/manifests/hpa-gpu-test.yaml
  • pkg/validator/phases.go
  • pkg/validator/v1/job_plan.go
  • pkg/validator/validator_test.go
  • validators/conformance/cluster_autoscaling_check.go
  • validators/conformance/dra_support_check.go
  • validators/conformance/pod_autoscaling_check.go

📝 Walkthrough

Walkthrough

This PR reorders validator phases to execute deployment → conformance → performance instead of deployment → performance → conformance. The change updates PhaseOrder and job planning, adjusts tests asserting phase order, revises ADR and user/contributor docs to match, clarifies attestation comments, and consolidates GPU test manifest paths to pkg/evidence/cncf/scripts/manifests/ in evidence scripts and conformance checks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

area/tests, documentation

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the two main changes: reordering validation phases and fixing stale documentation paths.
Description check ✅ Passed The description comprehensively explains the rationale, changes, and testing for both the phase reordering and stale path fixes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/evidence/cncf/scripts/collect-evidence.sh`:
- Around line 245-246: Update the three "Test manifest" references that still
point to the stale path by replacing the prefix
"pkg/evidence/scripts/manifests/" with the consolidated
"pkg/evidence/cncf/scripts/manifests/"; search for the literal string "**Test
manifest:**" in pkg/evidence/cncf/scripts/collect-evidence.sh and change the
manifest path in each occurrence (including the one shown with dra-gpu-test.yaml
and the two other occurrences) so all emitted "Test manifest" lines use the
unified pkg/evidence/cncf/scripts/manifests/... path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3ce80e71-0f3c-46f9-9606-08dae1d564ac

📥 Commits

Reviewing files that changed from the base of the PR and between 2d5259b and e8ac7d6.

📒 Files selected for processing (7)
  • docs/contributor/validator.md
  • pkg/evidence/cncf/scripts/collect-evidence.sh
  • pkg/validator/phases.go
  • pkg/validator/validator_test.go
  • validators/conformance/cluster_autoscaling_check.go
  • validators/conformance/dra_support_check.go
  • validators/conformance/pod_autoscaling_check.go

Comment thread pkg/evidence/cncf/scripts/collect-evidence.sh
@yuanchen8911 yuanchen8911 force-pushed the fix/validate-phase-order-manifest-paths branch 2 times, most recently from ba40b7a to f0cc6cb Compare June 5, 2026 00:49
@github-actions github-actions Bot added size/M and removed size/S labels Jun 5, 2026
@yuanchen8911 yuanchen8911 changed the title fix(validators): run performance phase last; fix stale conformance manifest hint paths fix(validators): reorder validation phases and fix stale paths and documentation issues Jun 5, 2026
…nifest hint paths

Reorder PhaseOrder from deployment→performance→conformance to
deployment→conformance→performance. The performance phase's inference-perf
benchmark saturates every GPU on the node and tears its DynamoGraphDeployment
(and DRA ResourceClaims) down asynchronously; running it before conformance
starved conformance's GPU-needing checks -- notably dra-support, whose 1-GPU
test pod failed 'Unschedulable: cannot allocate all claims' on single-node
clusters during 'aicr validate --phase all'. Running performance last frees
the GPUs for conformance and keeps a flaky perf phase from blocking
conformance under --fail-fast. DRA itself was verified healthy (standalone
GPU claim allocated); the failure was purely GPU contention.

Also fix stale 'Equivalent: kubectl apply -f' hint paths in the conformance
check artifacts: docs/conformance/cncf/manifests/ does not exist; the test
manifests live under pkg/evidence/cncf/scripts/manifests/. Repoint
dra-support, pod-autoscaling, and cluster-autoscaling hints (the latter
referenced a non-existent hpa-gpu-scale-test.yaml -> the real hpa-gpu-test.yaml)
and fix the same stale path in collect-evidence.sh. Cosmetic only -- the
checks build their resources in-code via the K8s API, not from these files.
@yuanchen8911 yuanchen8911 force-pushed the fix/validate-phase-order-manifest-paths branch from f0cc6cb to 1bdc29f Compare June 5, 2026 00:55
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 5, 2026 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant