fix(validators): reorder validation phases and fix stale paths and documentation issues#1201
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (16)
📝 WalkthroughWalkthroughThis PR reorders validator phases to execute deployment → conformance → performance instead of deployment → performance → conformance. The change updates PhaseOrder and job planning, adjusts tests asserting phase order, revises ADR and user/contributor docs to match, clarifies attestation comments, and consolidates GPU test manifest paths to pkg/evidence/cncf/scripts/manifests/ in evidence scripts and conformance checks. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/evidence/cncf/scripts/collect-evidence.sh`:
- Around line 245-246: Update the three "Test manifest" references that still
point to the stale path by replacing the prefix
"pkg/evidence/scripts/manifests/" with the consolidated
"pkg/evidence/cncf/scripts/manifests/"; search for the literal string "**Test
manifest:**" in pkg/evidence/cncf/scripts/collect-evidence.sh and change the
manifest path in each occurrence (including the one shown with dra-gpu-test.yaml
and the two other occurrences) so all emitted "Test manifest" lines use the
unified pkg/evidence/cncf/scripts/manifests/... path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3ce80e71-0f3c-46f9-9606-08dae1d564ac
📒 Files selected for processing (7)
docs/contributor/validator.mdpkg/evidence/cncf/scripts/collect-evidence.shpkg/validator/phases.gopkg/validator/validator_test.govalidators/conformance/cluster_autoscaling_check.govalidators/conformance/dra_support_check.govalidators/conformance/pod_autoscaling_check.go
ba40b7a to
f0cc6cb
Compare
…nifest hint paths Reorder PhaseOrder from deployment→performance→conformance to deployment→conformance→performance. The performance phase's inference-perf benchmark saturates every GPU on the node and tears its DynamoGraphDeployment (and DRA ResourceClaims) down asynchronously; running it before conformance starved conformance's GPU-needing checks -- notably dra-support, whose 1-GPU test pod failed 'Unschedulable: cannot allocate all claims' on single-node clusters during 'aicr validate --phase all'. Running performance last frees the GPUs for conformance and keeps a flaky perf phase from blocking conformance under --fail-fast. DRA itself was verified healthy (standalone GPU claim allocated); the failure was purely GPU contention. Also fix stale 'Equivalent: kubectl apply -f' hint paths in the conformance check artifacts: docs/conformance/cncf/manifests/ does not exist; the test manifests live under pkg/evidence/cncf/scripts/manifests/. Repoint dra-support, pod-autoscaling, and cluster-autoscaling hints (the latter referenced a non-existent hpa-gpu-scale-test.yaml -> the real hpa-gpu-test.yaml) and fix the same stale path in collect-evidence.sh. Cosmetic only -- the checks build their resources in-code via the K8s API, not from these files.
f0cc6cb to
1bdc29f
Compare
Summary
Two validator fixes surfaced while running
aicr validate --phase allon a single-node EKS H100 cluster after a gpu-operator upgrade:deployment → performance → conformancetodeployment → conformance → performance.Equivalent: kubectl apply -f …hint paths in conformance check artifacts and evidence script (docs/conformance/cncf/manifests/andpkg/evidence/scripts/manifests/don't exist).Motivation / Context
On a single GPU node (8×H100),
--phase allhad conformancedra-supportfail withUnschedulable: cannot allocate all claims. Root cause: the performance phase ran immediately before conformance; its inference-perf benchmark uses all 8 GPUs (8 vLLM workers via DRA) and the DynamoGraphDeployment teardown releases those ResourceClaims asynchronously — sodra-support's 1-GPU test pod couldn't schedule. "Worked before" only because earlier runs were standalone phases. DRA itself was verified healthy (a standalone ResourceClaim pod allocated an H100 + rannvidia-smi). Running performance last frees GPUs for conformance and keeps a flaky perf phase from blocking conformance under--fail-fast.The hint paths are cosmetic ("equivalent kubectl" text); the checks build resources in-code via the K8s API, not from those files.
Changes
Phase order →
deployment, conformance, performance(kept consistent across every surface):pkg/validator/phases.goPhaseOrder(+ rationale comment) andpkg/validator/validator_test.goTestPhaseOrder.pkg/client/v1: SDK facade —ValidateStatemethod godoc, the package-level overview godoc, andTestValidateState_PhaseSelection"unset runs all phases" (was a CI-breaking stale assertion).pkg/validator/v1/job_plan.goPlan()literal (test-only helper; removes latent drift).docs/contributor/validator.md(order + why-last rationale),docs/user/validation.md(intro, "When to validate" table, "equivalent to" line),docs/design/002-validatorv2-adr.md(pipeline list + execution diagram).pkg/evidence/attestation/AllPhasesstays frozen — it's the canonical output order for reproducible attestation bytes; reordering it would change provenance output. Added a comment documenting that it's intentionally independent of executionPhaseOrder.Stale manifest hint paths →
pkg/evidence/cncf/scripts/manifests/:dra_support_check.go→dra-gpu-test.yaml;pod_autoscaling_check.go→hpa-gpu-test.yaml.cluster_autoscaling_check.go: referenced a non-existenthpa-gpu-scale-test.yaml; since no cluster-autoscaling manifest exists andhpa-gpu-test.yamlis the pod-autoscaling manifest, the misleading apply hint is removed — the artifact now states the resources are built in-code (HPA + GPU deployment + Karpenter NodePool).collect-evidence.sh: fixed all three**Test manifest:**text paths (dra / gang-scheduling / hpa) to the realpkg/evidence/cncf/scripts/manifests/location.Type of Change
Component(s) Affected
pkg/validator,pkg/validator/v1,validators/conformance)pkg/client/v1)Testing
Validated end-to-end on EKS H100 (aicr3, post gpu-operator v26.3.2 upgrade): reordered
aicr validate --phase allpassed all three phases — deployment 4/4, conformance 11/11 (incl.dra-support), performance 1/1 (107,350 tok/s / 1,939 ms). The reorder resolved the contention exactly as intended.Risk Assessment
Checklist
pkg/client/v1assertion)-S)