Add opt-in component readiness gates (--readiness-hooks)#1110
Add opt-in component readiness gates (--readiness-hooks)#1110atif1996 wants to merge 13 commits into
Conversation
|
🌿 Preview your docs: https://nvidia-preview-feat-904-readiness-hooks.docs.buildwithfern.com/aicr |
Coverage Report ✅
Coverage BadgeMerging this branch changes the coverage (1 decrease, 5 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR implements optional per-component readiness gates for Kubernetes deployments using Chainsaw tests. It introduces a new Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/chainsawgate/runner/runner.go`:
- Around line 163-186: LoadBundleDir currently performs blocking filesystem I/O
without a context; change its signature to LoadBundleDir(ctx context.Context,
dir string) (map[string]string, error), update the caller in cmd/gate/main.go to
pass the application context created by signal.NotifyContext, and inside
LoadBundleDir check ctx.Err() at the start and after each blocking call
(os.ReadDir and os.ReadFile) returning a wrapped
context.Canceled/DeadlineExceeded error if canceled; keep the same file-reading
logic but thread the ctx through and wrap returned errors consistently using the
existing errors package.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 49fa3b81-4572-4dcc-8ac6-718e194240e8
📒 Files selected for processing (40)
.github/workflows/on-tag.yamlcmd/gate/Dockerfilecmd/gate/main.godocs/contributor/data.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mdpkg/bundler/bundler.gopkg/bundler/config/config.gopkg/bundler/deployer/argocd/argocd.gopkg/bundler/deployer/argocd/argocd_test.gopkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/application.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/values.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/Chart.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/application.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/values.yamlpkg/bundler/deployer/argocdhelm/argocdhelm.gopkg/bundler/deployer/argocdhelm/argocdhelm_test.gopkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator-readiness.yamlpkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator.yamlpkg/bundler/deployer/helm/helm.gopkg/bundler/deployer/helm/templates/deploy.sh.tmplpkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.shpkg/bundler/deployer/helm/testdata/manifest_only/deploy.shpkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.shpkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.shpkg/bundler/deployer/helm/testdata/nodewright_present/deploy.shpkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.shpkg/bundler/deployer/localformat/writer.gopkg/bundler/deployer/localformat/writer_test.gopkg/bundler/readiness.gopkg/bundler/readiness_test.gopkg/chainsawgate/runner/runner.gopkg/chainsawgate/runner/runner_test.gopkg/chainsawgate/runner/state.gopkg/chainsawgate/runner/state_test.gopkg/cli/bundle.gopkg/defaults/timeouts.gorecipes/components/gpu-operator/readiness.yaml
njhensley
left a comment
There was a problem hiding this comment.
Solid, well-structured feature — k8s-free runner, a pure/tested state machine, deterministic rendering, complete supply-chain wiring, thorough docs. Two things should gate merge: (1) the gate Job isn't re-run on redeploy and breaks helm upgrade (silent stale-pass on same version, immutable-field hard fail on version bump — see inline), and (2) the 90-min single-attempt no-restart Job is too fragile for the target clusters. (3) the static CRD RBAC should be resolved or documented before this is called "generic."
One coverage note: cmd/gate has no main_test.go (the run/loop/progressLine contract is untested) and there's no e2e/KWOK exercising the readiness path — only the version smoke test in the PR description. The runner/state packages are well covered. (For the shipped gpu-operator adopter the read-only RBAC is actually exactly sufficient — assert-only, cluster-scoped, existing namespace — so this is a test-coverage gap, not an RBAC-correctness problem.)
yuanchen8911
left a comment
There was a problem hiding this comment.
Reviewed the --readiness-hooks feature. Solid design overall — fail-closed flux/helmfile rejection and the derived helm --timeout are nice. Two blockers and a coverage-gate gap to resolve before merge; details inline.
- Critical: gate image tag missing
vprefix → ImagePullBackOff on release builds. - Major: readiness Job never re-runs on redeploy and breaks
helm upgradeon image/arg change. - Major: new exported functions at 0% coverage (
ReadinessHooks,WithReadinessHooks,cmd/gate). - Minor:
*-readiness)glob over-matches release names; inaccurate exit-code comment.
A few open questions (gate RBAC scope, chainsaw assert-on-empty semantics, elapsed-time accounting) are noted inline as well.
Normalize aicr-gate image tags to the v-prefixed release contract, make the gate Job re-run on deploy (helm hooks / Argo Replace=true), broaden scoped RBAC for NVIDIA CRDs and CRD definitions without Secrets access, add Job backoffLimit, validate readiness.yaml at bundle time, and anchor stability-window timing after evaluation. Extract gate manifest rendering into pkg/bundler/gatemanifest to keep deployer golden tests importable without a cycle. Add synctest-based cmd/gate loop coverage and expand bundler/config tests.
|
Follow-up commit b6a1eff addresses review items #1–#6 and #9:
Manifest rendering extracted to Deferred (unchanged): #7 chainsaw Scope note: deploy-time gate remains one-shot by design — unchanged manifests do not re-sync on cluster scale-up; ongoing convergence is operator reconciliation + monitoring. |
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
cmd/gate/main.go (1)
53-53: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueFix exit-code comment accuracy.
The comment states SIGTERM yields exit code
143, but line 157 returnsexitSignal(130) for both SIGINT and SIGTERM. The comment should reflect that both signals yield 130, or the code should distinguish the signals if different exit codes are desired.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/gate/main.go` at line 53, The comment for exitSignal on the variable exitSignal (currently "128 + SIGINT; SIGTERM caller can read $? as 143") is inaccurate relative to how signals are handled; either update the comment to state both SIGINT and SIGTERM currently produce exit code 130, or modify the signal handling logic around the code that returns exitSignal (the signal handling/return path) to distinguish SIGTERM and set 143 when SIGTERM is received; locate the variable exitSignal and the signal handling/return logic (where exitSignal is used/returned) and implement one of these two fixes so the comment and behavior match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh`:
- Around line 272-285: Update the comment that explains readiness gate wait
behavior to clarify that both --wait and --timeout are intentionally set in
COMPONENT_WAIT_ARGS; change the ambiguous sentence that currently says "only
--timeout is needed here" to explicitly state that "Helm --wait blocks on hook
completion (--wait-for-jobs is not needed), so we set --wait --timeout here" and
keep the rest of the explanation about COMPONENT_HELM_TIMEOUT and
COMPONENT_MAX_RETRIES intact so readers see why COMPONENT_WAIT_ARGS includes
both flags.
In `@pkg/bundler/gatemanifest/manifest.go`:
- Around line 67-72: Update the ClusterRole RBAC rules so readiness checks
aren't limited to the nvidia.com group: change the rule that currently lists
apiGroups: ["nvidia.com"] / resources: ["*"] / verbs: ["get","list","watch"] to
allow reading CRDs in any apiGroup (e.g., set apiGroups: ["*"] for the
CRD-related rule or add a separate rule granting get/list/watch on
customresourcedefinitions across all apiGroups), and add support for an optional
per-component override (convention: readiness-rbac.yaml) so components can
supply templated RBAC if they need narrower scopes; refer to the ClusterRole
definition and the apiGroups/resources/verbs fields when making the change.
- Around line 56-86: The ClusterRole and ClusterRoleBinding names (the manifest
blocks for ClusterRole and ClusterRoleBinding that currently use the single
token %[1]s and reference the ServiceAccount name) collide across namespaces;
update the name fields for ClusterRole and ClusterRoleBinding (and the
ServiceAccount subject name/roleRef.name usage) to include a per-deployment
qualifier such as {{ .Release.Namespace }} or a bundle-instance ID (e.g., append
"-{{ .Release.Namespace }}" or similar) so the ClusterRole, ClusterRoleBinding,
and roleRef.subjects[0].name/roleRef.name all use the same fully qualified name
and avoid cross-namespace overwrites; ensure roleRef and subject entries still
match the new qualified name and preserve existing formatting token usage (the
%[1]s placeholder) where appropriate.
In `@pkg/bundler/readiness_test.go`:
- Line 171: The call to recipe.SetDataProvider(badLayered) fails because
SetDataProvider doesn't exist; replicate the same fix you applied earlier at
lines 103-105 by replacing SetDataProvider with the correct method used there
(e.g., recipe.SetProvider(badLayered)) or the equivalent setter you used
previously for the recipe type so the test binds badLayered via the existing
API.
---
Duplicate comments:
In `@cmd/gate/main.go`:
- Line 53: The comment for exitSignal on the variable exitSignal (currently "128
+ SIGINT; SIGTERM caller can read $? as 143") is inaccurate relative to how
signals are handled; either update the comment to state both SIGINT and SIGTERM
currently produce exit code 130, or modify the signal handling logic around the
code that returns exitSignal (the signal handling/return path) to distinguish
SIGTERM and set 143 when SIGTERM is received; locate the variable exitSignal and
the signal handling/return logic (where exitSignal is used/returned) and
implement one of these two fixes so the comment and behavior match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 23903644-227a-4288-ad42-30bd058a5d42
📒 Files selected for processing (19)
cmd/gate/main.gocmd/gate/main_test.gopkg/bundler/config/config_test.gopkg/bundler/deployer/argocd/argocd_test.gopkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocdhelm/argocdhelm_test.gopkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/helm/templates/deploy.sh.tmplpkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.shpkg/bundler/deployer/helm/testdata/manifest_only/deploy.shpkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.shpkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.shpkg/bundler/deployer/helm/testdata/nodewright_present/deploy.shpkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.shpkg/bundler/gatemanifest/manifest.gopkg/bundler/gatemanifest/manifest_test.gopkg/bundler/readiness.gopkg/bundler/readiness_test.gopkg/defaults/timeouts.go
Normalize aicr-gate image tags to the v-prefixed release contract, make the gate Job re-run on deploy (helm hooks / Argo Replace=true), broaden scoped RBAC for NVIDIA CRDs and CRD definitions without Secrets access, add Job backoffLimit, validate readiness.yaml at bundle time, and anchor stability-window timing after evaluation. Extract gate manifest rendering into pkg/bundler/gatemanifest to keep deployer golden tests importable without a cycle. Add synctest-based cmd/gate loop coverage and expand bundler/config tests.
b6a1eff to
52721eb
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/bundler/deployer/argocdhelm/argocdhelm.go (1)
244-264:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve the delegated generator’s structured error code
argocdGen.Generatecan return a coded*errors.StructuredError; wrapping it witherrors.Wrap(errors.ErrCodeInternal, ...)forces the error toINTERNALand masks the originalCode. Useerrors.PropagateOrWrapso coded errors are preserved as-is.Suggested fix
if _, genErr := argocdGen.Generate(ctx, tmpDir); genErr != nil { - return nil, errors.Wrap(errors.ErrCodeInternal, "failed to generate base Argo CD output", genErr) + return nil, errors.PropagateOrWrap(genErr, errors.ErrCodeInternal, "failed to generate base Argo CD output") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/bundler/deployer/argocdhelm/argocdhelm.go` around lines 244 - 264, The current if-block wrapping argocdGen.Generate's error uses errors.Wrap which forces any returned *errors.StructuredError to INTERNAL and loses its original Code; change the error handling in the if _, genErr := argocdGen.Generate(ctx, tmpDir); genErr != nil { ... } block to use errors.PropagateOrWrap so that if genErr is a structured/coded error its Code is preserved and otherwise it is wrapped with the provided internal code and message; update the return call in that block to call errors.PropagateOrWrap(genErr, errors.ErrCodeInternal, "failed to generate base Argo CD output") (keeping the same context/message and variable names such as argocdGen.Generate and genErr).
♻️ Duplicate comments (2)
cmd/gate/main.go (1)
53-53:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix exit-code comment.
The comment states "SIGTERM caller can read $? as 143", but the shared
ctx.Done()return path (line 157) returnsexitSignal(130) for both SIGINT and SIGTERM. Either correct the comment to reflect the actual exit code, or distinguish the signals if 143 is required for SIGTERM.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/gate/main.go` at line 53, The comment for exitSignal is incorrect because the code returns exitSignal (130) for both SIGINT and SIGTERM via the ctx.Done() return path; either update the comment to state that 130 is used for both signals (referencing the exitSignal variable) or implement distinct handling so SIGINT sets exitSignal=130 and SIGTERM sets exitSignal=143 before the ctx.Done() return; locate the signal handling logic / main function where ctx is created and where exitSignal is read (the ctx.Done() return path) and apply the chosen fix and corresponding comment update.pkg/chainsawgate/runner/runner.go (1)
144-153:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winByte-slice truncation can split a UTF-8 rune.
"..." + out[len(out)-maxMsgLen:]slices by bytes and may cut mid-rune, producing invalid UTF-8 in the message. Same pattern exists instate.go(truncate) andcmd/gate/main.go(shortMsg). Use a rune-aware trim.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/chainsawgate/runner/runner.go` around lines 144 - 153, The message truncation slices the string by bytes and can cut a UTF-8 rune; update the truncation logic used here (the block handling out := buf.String() in runner.go) and the similar helpers truncate (state.go) and shortMsg (cmd/gate/main.go) to be rune-aware: convert the string to runes (or iterate with utf8.DecodeRuneInString) and if its rune count exceeds maxMsgLen return "..." + last maxMsgLen runes (or otherwise trim safely), ensuring you preserve valid UTF-8 in the returned Message values (ComponentResult.Message) and keep the same "..." prefix semantics for trimmed outputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/on-tag.yaml:
- Around line 214-216: Add a short explanatory comment above the permissions
block that documents why this job explicitly sets permissions.contents: read and
permissions.packages: write (e.g., to restrict repository contents access while
allowing package publishing) so the static analyzer (zizmor) and reviewers
understand the intentional deviation from implicit permission inheritance;
locate the permissions block containing "contents: read" and "packages: write"
and insert a concise one-line comment clarifying intent.
In `@docs/integrator/recipe-development.md`:
- Line 406: The doc text incorrectly states Helm blocks via `--wait-for-jobs`;
update the Helm wording to reflect the current behavior by replacing
`--wait-for-jobs` with `--wait` in the sentence describing the helm deployer
(the paragraph mentioning `--readiness-hooks`, the `NNN-<name>-readiness/`
folder and the `Job` that runs the `gate` CLI); keep the rest of the guidance
(Argo CD behavior and the note about `spec.timeouts.assert` vs gate per-test
timeout) unchanged.
In `@docs/user/cli-reference.md`:
- Around line 1402-1403: Update the Helm readiness-gate wording to stop claiming
reliance on `--wait-for-jobs`: in the `helm` / `deploy.sh` description remove
the reference to `--wait-for-jobs` and state that `helm upgrade --install
--wait` (with the bundler deriving Helm's `--timeout` from the gate's
`--max-wait` plus buffer) is sufficient to cover readiness hooks; keep the note
that the gate owns the deadline and Helm does not preempt it.
In `@pkg/chainsawgate/runner/runner.go`:
- Around line 110-124: The loop that iterates over bundle entries (for key,
testYAML := range bundle) doesn't honor context cancellation and may continue
creating compDir, writing chainsaw-test.yaml and calling runComponentFn(ctx,
opts.Timeout, opts.Namespace, compDir) after ctx is canceled; modify the loop to
check ctx.Done() (or ctx.Err()) at the start (or before heavy work and before
invoking runComponentFn) and break/return early when canceled, ensuring
components map is left in a consistent state and no further os.MkdirAll /
os.WriteFile / runComponentFn calls occur after cancellation.
---
Outside diff comments:
In `@pkg/bundler/deployer/argocdhelm/argocdhelm.go`:
- Around line 244-264: The current if-block wrapping argocdGen.Generate's error
uses errors.Wrap which forces any returned *errors.StructuredError to INTERNAL
and loses its original Code; change the error handling in the if _, genErr :=
argocdGen.Generate(ctx, tmpDir); genErr != nil { ... } block to use
errors.PropagateOrWrap so that if genErr is a structured/coded error its Code is
preserved and otherwise it is wrapped with the provided internal code and
message; update the return call in that block to call
errors.PropagateOrWrap(genErr, errors.ErrCodeInternal, "failed to generate base
Argo CD output") (keeping the same context/message and variable names such as
argocdGen.Generate and genErr).
---
Duplicate comments:
In `@cmd/gate/main.go`:
- Line 53: The comment for exitSignal is incorrect because the code returns
exitSignal (130) for both SIGINT and SIGTERM via the ctx.Done() return path;
either update the comment to state that 130 is used for both signals
(referencing the exitSignal variable) or implement distinct handling so SIGINT
sets exitSignal=130 and SIGTERM sets exitSignal=143 before the ctx.Done()
return; locate the signal handling logic / main function where ctx is created
and where exitSignal is read (the ctx.Done() return path) and apply the chosen
fix and corresponding comment update.
In `@pkg/chainsawgate/runner/runner.go`:
- Around line 144-153: The message truncation slices the string by bytes and can
cut a UTF-8 rune; update the truncation logic used here (the block handling out
:= buf.String() in runner.go) and the similar helpers truncate (state.go) and
shortMsg (cmd/gate/main.go) to be rune-aware: convert the string to runes (or
iterate with utf8.DecodeRuneInString) and if its rune count exceeds maxMsgLen
return "..." + last maxMsgLen runes (or otherwise trim safely), ensuring you
preserve valid UTF-8 in the returned Message values (ComponentResult.Message)
and keep the same "..." prefix semantics for trimmed outputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c6db6dd4-a839-4e39-8039-8d1bb83efc97
📒 Files selected for processing (44)
.github/workflows/on-tag.yamlcmd/gate/Dockerfilecmd/gate/main.gocmd/gate/main_test.godocs/contributor/data.mddocs/integrator/recipe-development.mddocs/user/cli-reference.mdpkg/bundler/bundler.gopkg/bundler/config/config.gopkg/bundler/config/config_test.gopkg/bundler/deployer/argocd/argocd.gopkg/bundler/deployer/argocd/argocd_test.gopkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/application.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/values.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/Chart.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/application.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/values.yamlpkg/bundler/deployer/argocdhelm/argocdhelm.gopkg/bundler/deployer/argocdhelm/argocdhelm_test.gopkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator-readiness.yamlpkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator.yamlpkg/bundler/deployer/helm/helm.gopkg/bundler/deployer/helm/templates/deploy.sh.tmplpkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.shpkg/bundler/deployer/helm/testdata/manifest_only/deploy.shpkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.shpkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.shpkg/bundler/deployer/helm/testdata/nodewright_present/deploy.shpkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.shpkg/bundler/deployer/localformat/writer.gopkg/bundler/deployer/localformat/writer_test.gopkg/bundler/gatemanifest/manifest.gopkg/bundler/gatemanifest/manifest_test.gopkg/bundler/readiness.gopkg/bundler/readiness_test.gopkg/chainsawgate/runner/runner.gopkg/chainsawgate/runner/runner_test.gopkg/chainsawgate/runner/state.gopkg/chainsawgate/runner/state_test.gopkg/cli/bundle.gopkg/defaults/timeouts.gorecipes/components/gpu-operator/readiness.yaml
💤 Files with no reviewable changes (1)
- recipes/components/gpu-operator/readiness.yaml
Address two open review threads on PR #1110. Major: the readiness-gate ClusterRole and ClusterRoleBinding were named with only the component name, so two bundles deploying the same component to different namespaces (the multi-tenant --app-name case, #1011) would overwrite each other's cluster-scoped objects and the binding subject would point at a single namespace. Suffix the three cluster-scoped name references (ClusterRole, ClusterRoleBinding, roleRef) with the resolved namespace; the namespaced ServiceAccount, subject, ConfigMap, and Job keep the bare component name. Medium: the gate exec'd `chainsaw test` with no pinned --config, leaving namespace/resource cleanup to the base image's chainsaw defaults. Bake a chainsaw Configuration (cleanup.skipDelete: true) into the aicr-gate image, thread an optional ConfigPath through the runner as --config, and add a --chainsaw-config flag defaulting to the in-image path. skipDelete matches the read-only RBAC (no delete verbs). Regenerate the argocd and argocdhelm readiness goldens.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/bundler/gatemanifest/manifest.go (1)
109-125:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
backoffLimit: 6can materially delay Argo CD marking the Job “Degraded” (failed) until Kubernetes exhausts Job retries—confirm the gate timeout behavior.Argo CD’s Job health follows the Kubernetes Job lifecycle and marks the Job Degraded when it reaches a failed terminal state (e.g., after
spec.backoffLimitis exceeded). WithrestartPolicy: Never, if thegatecontainer exits non-zero when--max-waitexpires, each timeout likely counts as a failed pod attempt, so terminal Job failure (and thus Argo CD degradation) may take roughly(backoffLimit + 1) * --max-waitbefore surfacing. If that delay is undesirable for persistent readiness failures, lowerbackoffLimit(and/or adjust the gate so a--max-waitexpiry doesn’t consume Job retries).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/bundler/gatemanifest/manifest.go` around lines 109 - 125, The Job manifest sets spec.backoffLimit (currently 6) which can delay Argo CD marking the Job degraded; update the template to reduce spec.backoffLimit to a lower value (or make it configurable) and/or change the gate container behavior so that a --max-wait expiry does not exit non-zero and consume retries; look for the backoffLimit entry and the gate container args (notably --max-wait and restartPolicy: Never) and either lower backoffLimit or alter the gate's exit logic to avoid treating timeout expiries as failed attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/bundler/gatemanifest/manifest.go`:
- Around line 109-125: The Job manifest sets spec.backoffLimit (currently 6)
which can delay Argo CD marking the Job degraded; update the template to reduce
spec.backoffLimit to a lower value (or make it configurable) and/or change the
gate container behavior so that a --max-wait expiry does not exit non-zero and
consume retries; look for the backoffLimit entry and the gate container args
(notably --max-wait and restartPolicy: Never) and either lower backoffLimit or
alter the gate's exit logic to avoid treating timeout expiries as failed
attempts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: cfca7231-99b9-42b3-8a08-cb7c963f13d1
📒 Files selected for processing (10)
cmd/gate/Dockerfilecmd/gate/chainsaw-config.yamlcmd/gate/main.gocmd/gate/main_test.gopkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yamlpkg/bundler/gatemanifest/manifest.gopkg/bundler/gatemanifest/manifest_test.gopkg/chainsawgate/runner/runner.gopkg/chainsawgate/runner/runner_test.go
|
@atif1996 this PR now has merge conflicts with |
|
Need to review more yet, but I am liking it so far. I feel like might want to invert the flag tbh, |
…ates Adds a --readiness-hooks bundle flag (default off) that, when a component ships a readiness.yaml chainsaw test, emits a NNN-<name>-readiness Helm hook chart running the gate CLI as a Job to gate the deploy on component readiness. Prototypes the convention with gpu-operator (ClusterPolicy.status.state=ready). deploy.sh coordinates helm --timeout above the gate's --max-wait and forces --wait --wait-for-jobs for readiness folders, since helm --wait alone does not block on bare Jobs. Non-helm deployers fail clearly when the flag is set. Default bundle output is unchanged.
Ports the chainsaw-test evaluation library from chainsaw-gate into pkg/chainsawgate/runner (Evaluate, RunComponent, LoadBundleDir, and the pure ComputeReadyState/ApplyDeadline/FailingSummary stability-window and deadline state machine). The package stays Kubernetes-API-free so both the gate CLI and the chainsaw-gate controller can depend on it. Adds cmd/gate, the standalone polling gate CLI that drives the runner, refactored to AICR conventions: pkg/errors with codes instead of fmt.Errorf, and slog for output instead of fmt.Println/Fprintf. Image build + CI wiring (ghcr.io/nvidia/aicr-gate with chainsaw baked in) and docs/BOM land in a follow-up commit on this branch.
Make the gate's --max-wait the single deadline knob and derive the readiness folder's helm --timeout from it (max-wait + buffer), so the invariant helm --timeout > gate --max-wait can never silently drift. Helm cannot wait indefinitely (--wait/--wait-for-jobs is bounded by --timeout; --timeout 0 is not infinite), so a derived backstop is required. - Centralize gate timings in pkg/defaults (exec/poll/stability/max-wait + helm-timeout buffer); max-wait raised to 90m to cover gpu-operator's operand rollout, which can exceed an hour on a large cluster. - readiness.go renders the Job's gate args from those constants. - helm.go derives deployTemplateData.ReadinessHelmTimeout = max-wait + buffer; deploy.sh.tmpl consumes it (no second hand-tuned timeout literal). - Consolidate the redundant *-readiness) case blocks in deploy.sh.tmpl. - Drop CRD / chainsaw-gate-project references from the migrated gate CLI and runner comments; AICR drives the gate from a Job, no CRD or controller. - Assert rendered gate args match the defaults constants; regenerate goldens.
Extend --readiness-hooks beyond the helm deployer to the argocd and argocd-helm deployers. The gate is emitted as a local-helm folder right after the component's primary chart, so its Argo CD Application inherits the next sync-wave. Argo CD then blocks that wave on the gate Job via its built-in batch/Job health (Progressing -> Healthy on success, Degraded on failure) -- the analog of helm --wait-for-jobs. No custom health Lua and no ClusterPolicy watch: readiness logic stays encapsulated in the chainsaw Test the Job runs. - argocd/argocdhelm Generators: forward ComponentReadiness through localformat.Write (argocdhelm delegates to argocd). - argocdhelm: strip the -readiness folder suffix when resolving the parent component's override key (mirrors existing -pre/-post handling). - bundler: narrow the fail-closed guard to flux/helmfile only. - cli: document the supported deployers on --readiness-hooks. See #904.
Add cmd/gate/Dockerfile and wire the readiness-gate image into the on-tag release pipeline alongside the validator images. - Dockerfile embeds the chainsaw binary by copying it out of the official, cosign-attested ghcr.io/kyverno/chainsaw image (gate execs `chainsaw test`). CHAINSAW_VERSION is the single knob, sourced from .settings.yaml (testing_tools.chainsaw, currently v0.2.15) by CI via the load-versions action so the in-cluster gate runs the same chainsaw AICR validates with. - on-tag.yaml: new build-gate job (native per-arch build), gate folded into the docker-manifest create/verify loops, the image-vuln-scan matrix, the attest job, and the release summary. Built and smoke-tested locally: gate compiles, chainsaw reports 0.2.15. See #904.
- cli-reference: add --readiness-hooks to the bundle flag table and a "Readiness Gates" section covering the gate Job, the gate image, and how each supported deployer (helm via --wait-for-jobs, argocd/argocd-helm via sync-wave + built-in Job health) blocks on it. - recipe-development: add "Adding a Component Readiness Gate" describing the recipes/components/<name>/readiness.yaml Chainsaw-test convention. - data.md: list readiness.yaml in the component directory layout. See #904.
Normalize aicr-gate image tags to the v-prefixed release contract, make the gate Job re-run on deploy (helm hooks / Argo Replace=true), broaden scoped RBAC for NVIDIA CRDs and CRD definitions without Secrets access, add Job backoffLimit, validate readiness.yaml at bundle time, and anchor stability-window timing after evaluation. Extract gate manifest rendering into pkg/bundler/gatemanifest to keep deployer golden tests importable without a cycle. Add synctest-based cmd/gate loop coverage and expand bundler/config tests.
Address two open review threads on PR #1110. Major: the readiness-gate ClusterRole and ClusterRoleBinding were named with only the component name, so two bundles deploying the same component to different namespaces (the multi-tenant --app-name case, #1011) would overwrite each other's cluster-scoped objects and the binding subject would point at a single namespace. Suffix the three cluster-scoped name references (ClusterRole, ClusterRoleBinding, roleRef) with the resolved namespace; the namespaced ServiceAccount, subject, ConfigMap, and Job keep the bare component name. Medium: the gate exec'd `chainsaw test` with no pinned --config, leaving namespace/resource cleanup to the base image's chainsaw defaults. Bake a chainsaw Configuration (cleanup.skipDelete: true) into the aicr-gate image, thread an optional ConfigPath through the runner as --config, and add a --chainsaw-config flag defaulting to the in-image path. skipDelete matches the read-only RBAC (no delete verbs). Regenerate the argocd and argocdhelm readiness goldens.
…comment The helm readiness-gate folder runs with `helm upgrade --install --wait`, not `--wait-for-jobs`: the gate Job is a post-install/post-upgrade hook and `--wait` blocks on hook completion regardless of `--wait-for-jobs`. Update the recipe-development and cli-reference docs to match. The exitSignal comment claimed SIGTERM yields 143, but signal.NotifyContext collapses SIGINT/SIGTERM into one cancellation and loop always returns 130, so the process exits 130 for either signal. Correct the comment.
Add TruncHead/TruncTail helpers that back off to a UTF-8 rune boundary so the three message-trimming sites (RunComponent failure tail, state.go truncate, cmd/gate shortMsg) never split a multi-byte rune. All three now route through the shared helpers. Evaluate now checks ctx.Err() between components so a SIGINT/SIGTERM or caller deadline stops the loop instead of spawning more chainsaw execs; the gate loop maps that cancellation error to exit 130 (interrupt) rather than exit 2 (config error).
The helm deploy.sh applies gate wait-semantics (derived --timeout, forced --wait, MAX_RETRIES=1) to every release matching its static "*-readiness)" glob. A component literally named "<x>-readiness" would silently inherit that behavior even with --readiness-hooks off, since the glob is name-driven not gate-driven. Reject such names at bundle time in localformat.Write, making the glob exact-by-construction. Unlike the -pre/-post collision guards, the suffix is reserved unconditionally.
Add an explanatory comment for the build-gate job's permissions (zizmor nitpick): contents: read for checkout, packages: write to push the per-arch aicr-gate image to GHCR.
a23f6fb to
4cf9724
Compare
Summary
Add an opt-in
--readiness-hooksbundle flag that emits per-component readiness gates so a deploy blocks on component-specific readiness signals (e.g.gpu-operator'sClusterPolicyreachingstatus.state: ready), not just on the chart's own resources reporting Ready. Supported on thehelm,argocd, andargocd-helmdeployers.Motivation / Context
Helm
--waitand Argo CD's native health can't assess operator-specific readiness that lives in a custom resource — an operator Deployment going Available only means the process started, not that its operands rolled out. This wires a Chainsaw-test-driven gate (thegateCLI, run as a post-component Job) into the bundle so the deploy converges on the real signal.Fixes: #904
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)cmd/gate,pkg/chainsawgate/runner,.github/workflows/on-tag.yaml)Implementation Notes
recipes/components/<name>/readiness.yaml(a ChainsawTest); the bundler discovers it by path — no registry field.gpu-operatoris the first adopter (assertsClusterPolicystateready).pkg/chainsawgate/runner(the chainsaw-test evaluation + stability-window/deadline state machine, Kubernetes-API-free) andcmd/gate(the standalone polling gate) are ported from thechainsaw-gaterepo so AICR is the single source of truth.helm:deploy.shruns the readiness folder withhelm upgrade --install --wait --wait-for-jobs; Helm's--timeoutis derived by the bundler from the gate's--max-wait+ buffer so the gate owns the deadline.argocd/argocd-helm: the readiness gate is emitted in the next sync-wave after its component and blocks on Argo CD's built-inbatch/Jobhealth (Progressing → Healthy / Degraded). No custom health Lua.flux/helmfile: explicitly rejected when--readiness-hooksis set.cmd/gate/Dockerfilebuildsghcr.io/nvidia/aicr-gate, baking in the Chainsaw version pinned in.settings.yaml(so the in-cluster gate runs the same Chainsaw AICR validates with). Wired intoon-tag.yamlfor multi-arch build, manifest, vuln-scan, and SLSA attestation alongside the validator images.--readiness-hooksis not exposed via the bundle API (parity with how the feature was scoped).container-images.mdis component-scoped and auto-generated; the gate image andreadiness.yamldon't feed it (same as the validator images), somake bom-docsis a no-op here.Testing
pkg/bundler/readiness_test.go,pkg/chainsawgate/runner/{runner,state}_test.go, golden + sync-wave-ordering tests for the argocd / argocd-helm deployers, and localformat writer tests.ghcr.io/nvidia/aicr-gatebuilt and smoke-tested locally: gate compiles, embedded chainsaw reports the pinned0.2.15.Risk Assessment
Rollout notes: Feature is off unless
--readiness-hooksis passed; components without areadiness.yamlare unaffected. Theaicr-gateimage publishes on the next tag. No migration required.Checklist
make testwith-race)make lint)git commit -S)