Skip to content

Add opt-in component readiness gates (--readiness-hooks)#1110

Open
atif1996 wants to merge 13 commits into
mainfrom
feat/904-readiness-hooks
Open

Add opt-in component readiness gates (--readiness-hooks)#1110
atif1996 wants to merge 13 commits into
mainfrom
feat/904-readiness-hooks

Conversation

@atif1996
Copy link
Copy Markdown
Contributor

Summary

Add an opt-in --readiness-hooks bundle flag that emits per-component readiness gates so a deploy blocks on component-specific readiness signals (e.g. gpu-operator's ClusterPolicy reaching status.state: ready), not just on the chart's own resources reporting Ready. Supported on the helm, argocd, and argocd-helm deployers.

Motivation / Context

Helm --wait and Argo CD's native health can't assess operator-specific readiness that lives in a custom resource — an operator Deployment going Available only means the process started, not that its operands rolled out. This wires a Chainsaw-test-driven gate (the gate CLI, run as a post-component Job) into the bundle so the deploy converges on the real signal.

Fixes: #904
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: gate CLI + image (cmd/gate, pkg/chainsawgate/runner, .github/workflows/on-tag.yaml)

Implementation Notes

  • Opt-in only. Off by default. A component opts in by shipping recipes/components/<name>/readiness.yaml (a Chainsaw Test); the bundler discovers it by path — no registry field. gpu-operator is the first adopter (asserts ClusterPolicy state ready).
  • Gate library + CLI migrated into AICR. pkg/chainsawgate/runner (the chainsaw-test evaluation + stability-window/deadline state machine, Kubernetes-API-free) and cmd/gate (the standalone polling gate) are ported from the chainsaw-gate repo so AICR is the single source of truth.
  • Deployer blocking.
    • helm: deploy.sh runs the readiness folder with helm upgrade --install --wait --wait-for-jobs; Helm's --timeout is derived by the bundler from the gate's --max-wait + buffer so the gate owns the deadline.
    • argocd / argocd-helm: the readiness gate is emitted in the next sync-wave after its component and blocks on Argo CD's built-in batch/Job health (Progressing → Healthy / Degraded). No custom health Lua.
    • flux / helmfile: explicitly rejected when --readiness-hooks is set.
  • Image. New cmd/gate/Dockerfile builds ghcr.io/nvidia/aicr-gate, baking in the Chainsaw version pinned in .settings.yaml (so the in-cluster gate runs the same Chainsaw AICR validates with). Wired into on-tag.yaml for multi-arch build, manifest, vuln-scan, and SLSA attestation alongside the validator images.
  • CLI-only. --readiness-hooks is not exposed via the bundle API (parity with how the feature was scoped).
  • BOM unaffected. container-images.md is component-scoped and auto-generated; the gate image and readiness.yaml don't feed it (same as the validator images), so make bom-docs is a no-op here.

Testing

make qualify
  • Green with the pinned toolchain (golangci-lint v2.12.2, helmfile v1.5.2): lint clean, all unit + e2e tests pass, no vulnerabilities.
  • New unit coverage: pkg/bundler/readiness_test.go, pkg/chainsawgate/runner/{runner,state}_test.go, golden + sync-wave-ordering tests for the argocd / argocd-helm deployers, and localformat writer tests.
  • ghcr.io/nvidia/aicr-gate built and smoke-tested locally: gate compiles, embedded chainsaw reports the pinned 0.2.15.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Feature is off unless --readiness-hooks is passed; components without a readiness.yaml are unaffected. The aicr-gate image publishes on the next tag. No migration required.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Coverage Report ✅

Metric Value
Coverage 76.5%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.5%25-green)

Merging this branch changes the coverage (1 decrease, 5 increase)

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/cmd/gate 72.37% (+72.37%) 🌟
github.com/NVIDIA/aicr/pkg/bundler 64.22% (+1.23%) 👍
github.com/NVIDIA/aicr/pkg/bundler/config 96.55% (+0.04%) 👍
github.com/NVIDIA/aicr/pkg/bundler/deployer/argocd 81.21% (ø)
github.com/NVIDIA/aicr/pkg/bundler/deployer/argocdhelm 81.79% (ø)
github.com/NVIDIA/aicr/pkg/bundler/deployer/helm 84.04% (ø)
github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat 73.37% (-0.11%) 👎
github.com/NVIDIA/aicr/pkg/bundler/gatemanifest 85.00% (+85.00%) 🌟
github.com/NVIDIA/aicr/pkg/chainsawgate/runner 77.78% (+77.78%) 🌟
github.com/NVIDIA/aicr/pkg/cli 66.33% (ø)
github.com/NVIDIA/aicr/pkg/defaults 100.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/cmd/gate/main.go 72.37% (+72.37%) 76 (+76) 55 (+55) 21 (+21) 🌟
github.com/NVIDIA/aicr/pkg/bundler/bundler.go 72.78% (+0.04%) 632 (+12) 460 (+9) 172 (+3) 👍
github.com/NVIDIA/aicr/pkg/bundler/config/config.go 95.69% (+0.06%) 209 (+3) 200 (+3) 9 👍
github.com/NVIDIA/aicr/pkg/bundler/deployer/argocd/argocd.go 81.21% (ø) 165 134 31
github.com/NVIDIA/aicr/pkg/bundler/deployer/argocdhelm/argocdhelm.go 81.79% (ø) 390 319 71
github.com/NVIDIA/aicr/pkg/bundler/deployer/helm/helm.go 84.04% (ø) 94 79 15
github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/writer.go 73.61% (-0.37%) 216 (+20) 159 (+14) 57 (+6) 👎
github.com/NVIDIA/aicr/pkg/bundler/gatemanifest/manifest.go 85.00% (+85.00%) 20 (+20) 17 (+17) 3 (+3) 🌟
github.com/NVIDIA/aicr/pkg/bundler/readiness.go 85.71% (+85.71%) 35 (+35) 30 (+30) 5 (+5) 🌟
github.com/NVIDIA/aicr/pkg/chainsawgate/runner/runner.go 68.25% (+68.25%) 63 (+63) 43 (+43) 20 (+20) 🌟
github.com/NVIDIA/aicr/pkg/chainsawgate/runner/state.go 100.00% (+100.00%) 27 (+27) 27 (+27) 0 🌟
github.com/NVIDIA/aicr/pkg/cli/bundle.go 52.41% (ø) 187 98 89
github.com/NVIDIA/aicr/pkg/defaults/timeouts.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements optional per-component readiness gates for Kubernetes deployments using Chainsaw tests. It introduces a new aicr-gate CLI that evaluates readiness tests until achieving stability or a deadline, extends the bundler to discover and render per-component readiness manifests, integrates with all supported deployers (Helm, Argo CD, Argo CD-Helm) to emit standalone readiness Job charts, updates Helm deploy scripts for readiness-specific timeout handling, provides comprehensive test coverage, updates documentation with usage guidance and examples, and modifies CI to build, publish, scan, and attest the gate container image. The implementation follows a clear flow: configuration and defaults, gate CLI/runner/state computation, manifest synthesis, bundler collection and wiring, local format emission, deployer integration, deploy script customization, test coverage, documentation, and CI pipeline updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • njhensley
  • mchmarny
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/904-readiness-hooks

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/chainsawgate/runner/runner.go`:
- Around line 163-186: LoadBundleDir currently performs blocking filesystem I/O
without a context; change its signature to LoadBundleDir(ctx context.Context,
dir string) (map[string]string, error), update the caller in cmd/gate/main.go to
pass the application context created by signal.NotifyContext, and inside
LoadBundleDir check ctx.Err() at the start and after each blocking call
(os.ReadDir and os.ReadFile) returning a wrapped
context.Canceled/DeadlineExceeded error if canceled; keep the same file-reading
logic but thread the ctx through and wrap returned errors consistently using the
existing errors package.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 49fa3b81-4572-4dcc-8ac6-718e194240e8

📥 Commits

Reviewing files that changed from the base of the PR and between 01948eb and 4c2958d.

📒 Files selected for processing (40)
  • .github/workflows/on-tag.yaml
  • cmd/gate/Dockerfile
  • cmd/gate/main.go
  • docs/contributor/data.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/config/config.go
  • pkg/bundler/deployer/argocd/argocd.go
  • pkg/bundler/deployer/argocd/argocd_test.go
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/application.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/Chart.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/application.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/values.yaml
  • pkg/bundler/deployer/argocdhelm/argocdhelm.go
  • pkg/bundler/deployer/argocdhelm/argocdhelm_test.go
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator-readiness.yaml
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator.yaml
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/readiness.go
  • pkg/bundler/readiness_test.go
  • pkg/chainsawgate/runner/runner.go
  • pkg/chainsawgate/runner/runner_test.go
  • pkg/chainsawgate/runner/state.go
  • pkg/chainsawgate/runner/state_test.go
  • pkg/cli/bundle.go
  • pkg/defaults/timeouts.go
  • recipes/components/gpu-operator/readiness.yaml

Comment thread pkg/chainsawgate/runner/runner.go
Copy link
Copy Markdown
Member

@njhensley njhensley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-structured feature — k8s-free runner, a pure/tested state machine, deterministic rendering, complete supply-chain wiring, thorough docs. Two things should gate merge: (1) the gate Job isn't re-run on redeploy and breaks helm upgrade (silent stale-pass on same version, immutable-field hard fail on version bump — see inline), and (2) the 90-min single-attempt no-restart Job is too fragile for the target clusters. (3) the static CRD RBAC should be resolved or documented before this is called "generic."

One coverage note: cmd/gate has no main_test.go (the run/loop/progressLine contract is untested) and there's no e2e/KWOK exercising the readiness path — only the version smoke test in the PR description. The runner/state packages are well covered. (For the shipped gpu-operator adopter the read-only RBAC is actually exactly sufficient — assert-only, cluster-scoped, existing namespace — so this is a test-coverage gap, not an RBAC-correctness problem.)

Comment thread pkg/bundler/readiness.go Outdated
Comment thread pkg/bundler/readiness.go Outdated
Comment thread pkg/bundler/readiness.go Outdated
Comment thread cmd/gate/main.go Outdated
Comment thread pkg/chainsawgate/runner/runner.go Outdated
Comment thread pkg/bundler/readiness.go
Comment thread pkg/chainsawgate/runner/runner.go Outdated
Comment thread cmd/gate/main.go Outdated
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
Comment thread cmd/gate/Dockerfile
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the --readiness-hooks feature. Solid design overall — fail-closed flux/helmfile rejection and the derived helm --timeout are nice. Two blockers and a coverage-gate gap to resolve before merge; details inline.

  • Critical: gate image tag missing v prefix → ImagePullBackOff on release builds.
  • Major: readiness Job never re-runs on redeploy and breaks helm upgrade on image/arg change.
  • Major: new exported functions at 0% coverage (ReadinessHooks, WithReadinessHooks, cmd/gate).
  • Minor: *-readiness) glob over-matches release names; inaccurate exit-code comment.

A few open questions (gate RBAC scope, chainsaw assert-on-empty semantics, elapsed-time accounting) are noted inline as well.

Comment thread pkg/bundler/readiness.go
Comment thread pkg/bundler/readiness.go Outdated
Comment thread pkg/bundler/config/config.go
Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
Comment thread cmd/gate/main.go Outdated
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few issues.

atif1996 added a commit that referenced this pull request Jun 1, 2026
Normalize aicr-gate image tags to the v-prefixed release contract,
make the gate Job re-run on deploy (helm hooks / Argo Replace=true),
broaden scoped RBAC for NVIDIA CRDs and CRD definitions without
Secrets access, add Job backoffLimit, validate readiness.yaml at
bundle time, and anchor stability-window timing after evaluation.

Extract gate manifest rendering into pkg/bundler/gatemanifest to keep
deployer golden tests importable without a cycle. Add synctest-based
cmd/gate loop coverage and expand bundler/config tests.
@atif1996
Copy link
Copy Markdown
Contributor Author

atif1996 commented Jun 1, 2026

Follow-up commit b6a1eff addresses review items #1#6 and #9:

Manifest rendering extracted to pkg/bundler/gatemanifest so deployer golden tests can import without an import cycle. Goldens regenerated; make qualify green.

Deferred (unchanged): #7 chainsaw --config, #8 multi-bundle RBAC name collision (#1011), #10#15 low/minor.

Scope note: deploy-time gate remains one-shot by design — unchanged manifests do not re-sync on cluster scale-up; ongoing convergence is operator reconciliation + monitoring.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
cmd/gate/main.go (1)

53-53: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Fix exit-code comment accuracy.

The comment states SIGTERM yields exit code 143, but line 157 returns exitSignal (130) for both SIGINT and SIGTERM. The comment should reflect that both signals yield 130, or the code should distinguish the signals if different exit codes are desired.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/gate/main.go` at line 53, The comment for exitSignal on the variable
exitSignal (currently "128 + SIGINT; SIGTERM caller can read $? as 143") is
inaccurate relative to how signals are handled; either update the comment to
state both SIGINT and SIGTERM currently produce exit code 130, or modify the
signal handling logic around the code that returns exitSignal (the signal
handling/return path) to distinguish SIGTERM and set 143 when SIGTERM is
received; locate the variable exitSignal and the signal handling/return logic
(where exitSignal is used/returned) and implement one of these two fixes so the
comment and behavior match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh`:
- Around line 272-285: Update the comment that explains readiness gate wait
behavior to clarify that both --wait and --timeout are intentionally set in
COMPONENT_WAIT_ARGS; change the ambiguous sentence that currently says "only
--timeout is needed here" to explicitly state that "Helm --wait blocks on hook
completion (--wait-for-jobs is not needed), so we set --wait --timeout here" and
keep the rest of the explanation about COMPONENT_HELM_TIMEOUT and
COMPONENT_MAX_RETRIES intact so readers see why COMPONENT_WAIT_ARGS includes
both flags.

In `@pkg/bundler/gatemanifest/manifest.go`:
- Around line 67-72: Update the ClusterRole RBAC rules so readiness checks
aren't limited to the nvidia.com group: change the rule that currently lists
apiGroups: ["nvidia.com"] / resources: ["*"] / verbs: ["get","list","watch"] to
allow reading CRDs in any apiGroup (e.g., set apiGroups: ["*"] for the
CRD-related rule or add a separate rule granting get/list/watch on
customresourcedefinitions across all apiGroups), and add support for an optional
per-component override (convention: readiness-rbac.yaml) so components can
supply templated RBAC if they need narrower scopes; refer to the ClusterRole
definition and the apiGroups/resources/verbs fields when making the change.
- Around line 56-86: The ClusterRole and ClusterRoleBinding names (the manifest
blocks for ClusterRole and ClusterRoleBinding that currently use the single
token %[1]s and reference the ServiceAccount name) collide across namespaces;
update the name fields for ClusterRole and ClusterRoleBinding (and the
ServiceAccount subject name/roleRef.name usage) to include a per-deployment
qualifier such as {{ .Release.Namespace }} or a bundle-instance ID (e.g., append
"-{{ .Release.Namespace }}" or similar) so the ClusterRole, ClusterRoleBinding,
and roleRef.subjects[0].name/roleRef.name all use the same fully qualified name
and avoid cross-namespace overwrites; ensure roleRef and subject entries still
match the new qualified name and preserve existing formatting token usage (the
%[1]s placeholder) where appropriate.

In `@pkg/bundler/readiness_test.go`:
- Line 171: The call to recipe.SetDataProvider(badLayered) fails because
SetDataProvider doesn't exist; replicate the same fix you applied earlier at
lines 103-105 by replacing SetDataProvider with the correct method used there
(e.g., recipe.SetProvider(badLayered)) or the equivalent setter you used
previously for the recipe type so the test binds badLayered via the existing
API.

---

Duplicate comments:
In `@cmd/gate/main.go`:
- Line 53: The comment for exitSignal on the variable exitSignal (currently "128
+ SIGINT; SIGTERM caller can read $? as 143") is inaccurate relative to how
signals are handled; either update the comment to state both SIGINT and SIGTERM
currently produce exit code 130, or modify the signal handling logic around the
code that returns exitSignal (the signal handling/return path) to distinguish
SIGTERM and set 143 when SIGTERM is received; locate the variable exitSignal and
the signal handling/return logic (where exitSignal is used/returned) and
implement one of these two fixes so the comment and behavior match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 23903644-227a-4288-ad42-30bd058a5d42

📥 Commits

Reviewing files that changed from the base of the PR and between 4c2958d and b6a1eff.

📒 Files selected for processing (19)
  • cmd/gate/main.go
  • cmd/gate/main_test.go
  • pkg/bundler/config/config_test.go
  • pkg/bundler/deployer/argocd/argocd_test.go
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocdhelm/argocdhelm_test.go
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/gatemanifest/manifest.go
  • pkg/bundler/gatemanifest/manifest_test.go
  • pkg/bundler/readiness.go
  • pkg/bundler/readiness_test.go
  • pkg/defaults/timeouts.go

Comment thread pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
Comment thread pkg/bundler/gatemanifest/manifest.go Outdated
Comment thread pkg/bundler/gatemanifest/manifest.go
Comment thread pkg/bundler/readiness_test.go Outdated
atif1996 added a commit that referenced this pull request Jun 2, 2026
Normalize aicr-gate image tags to the v-prefixed release contract,
make the gate Job re-run on deploy (helm hooks / Argo Replace=true),
broaden scoped RBAC for NVIDIA CRDs and CRD definitions without
Secrets access, add Job backoffLimit, validate readiness.yaml at
bundle time, and anchor stability-window timing after evaluation.

Extract gate manifest rendering into pkg/bundler/gatemanifest to keep
deployer golden tests importable without a cycle. Add synctest-based
cmd/gate loop coverage and expand bundler/config tests.
@atif1996 atif1996 force-pushed the feat/904-readiness-hooks branch from b6a1eff to 52721eb Compare June 2, 2026 04:17
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/bundler/deployer/argocdhelm/argocdhelm.go (1)

244-264: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the delegated generator’s structured error code

argocdGen.Generate can return a coded *errors.StructuredError; wrapping it with errors.Wrap(errors.ErrCodeInternal, ...) forces the error to INTERNAL and masks the original Code. Use errors.PropagateOrWrap so coded errors are preserved as-is.

Suggested fix
 	if _, genErr := argocdGen.Generate(ctx, tmpDir); genErr != nil {
-		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to generate base Argo CD output", genErr)
+		return nil, errors.PropagateOrWrap(genErr, errors.ErrCodeInternal, "failed to generate base Argo CD output")
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/bundler/deployer/argocdhelm/argocdhelm.go` around lines 244 - 264, The
current if-block wrapping argocdGen.Generate's error uses errors.Wrap which
forces any returned *errors.StructuredError to INTERNAL and loses its original
Code; change the error handling in the if _, genErr := argocdGen.Generate(ctx,
tmpDir); genErr != nil { ... } block to use errors.PropagateOrWrap so that if
genErr is a structured/coded error its Code is preserved and otherwise it is
wrapped with the provided internal code and message; update the return call in
that block to call errors.PropagateOrWrap(genErr, errors.ErrCodeInternal,
"failed to generate base Argo CD output") (keeping the same context/message and
variable names such as argocdGen.Generate and genErr).
♻️ Duplicate comments (2)
cmd/gate/main.go (1)

53-53: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix exit-code comment.

The comment states "SIGTERM caller can read $? as 143", but the shared ctx.Done() return path (line 157) returns exitSignal (130) for both SIGINT and SIGTERM. Either correct the comment to reflect the actual exit code, or distinguish the signals if 143 is required for SIGTERM.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/gate/main.go` at line 53, The comment for exitSignal is incorrect because
the code returns exitSignal (130) for both SIGINT and SIGTERM via the ctx.Done()
return path; either update the comment to state that 130 is used for both
signals (referencing the exitSignal variable) or implement distinct handling so
SIGINT sets exitSignal=130 and SIGTERM sets exitSignal=143 before the ctx.Done()
return; locate the signal handling logic / main function where ctx is created
and where exitSignal is read (the ctx.Done() return path) and apply the chosen
fix and corresponding comment update.
pkg/chainsawgate/runner/runner.go (1)

144-153: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Byte-slice truncation can split a UTF-8 rune.

"..." + out[len(out)-maxMsgLen:] slices by bytes and may cut mid-rune, producing invalid UTF-8 in the message. Same pattern exists in state.go (truncate) and cmd/gate/main.go (shortMsg). Use a rune-aware trim.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/chainsawgate/runner/runner.go` around lines 144 - 153, The message
truncation slices the string by bytes and can cut a UTF-8 rune; update the
truncation logic used here (the block handling out := buf.String() in runner.go)
and the similar helpers truncate (state.go) and shortMsg (cmd/gate/main.go) to
be rune-aware: convert the string to runes (or iterate with
utf8.DecodeRuneInString) and if its rune count exceeds maxMsgLen return "..." +
last maxMsgLen runes (or otherwise trim safely), ensuring you preserve valid
UTF-8 in the returned Message values (ComponentResult.Message) and keep the same
"..." prefix semantics for trimmed outputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/on-tag.yaml:
- Around line 214-216: Add a short explanatory comment above the permissions
block that documents why this job explicitly sets permissions.contents: read and
permissions.packages: write (e.g., to restrict repository contents access while
allowing package publishing) so the static analyzer (zizmor) and reviewers
understand the intentional deviation from implicit permission inheritance;
locate the permissions block containing "contents: read" and "packages: write"
and insert a concise one-line comment clarifying intent.

In `@docs/integrator/recipe-development.md`:
- Line 406: The doc text incorrectly states Helm blocks via `--wait-for-jobs`;
update the Helm wording to reflect the current behavior by replacing
`--wait-for-jobs` with `--wait` in the sentence describing the helm deployer
(the paragraph mentioning `--readiness-hooks`, the `NNN-<name>-readiness/`
folder and the `Job` that runs the `gate` CLI); keep the rest of the guidance
(Argo CD behavior and the note about `spec.timeouts.assert` vs gate per-test
timeout) unchanged.

In `@docs/user/cli-reference.md`:
- Around line 1402-1403: Update the Helm readiness-gate wording to stop claiming
reliance on `--wait-for-jobs`: in the `helm` / `deploy.sh` description remove
the reference to `--wait-for-jobs` and state that `helm upgrade --install
--wait` (with the bundler deriving Helm's `--timeout` from the gate's
`--max-wait` plus buffer) is sufficient to cover readiness hooks; keep the note
that the gate owns the deadline and Helm does not preempt it.

In `@pkg/chainsawgate/runner/runner.go`:
- Around line 110-124: The loop that iterates over bundle entries (for key,
testYAML := range bundle) doesn't honor context cancellation and may continue
creating compDir, writing chainsaw-test.yaml and calling runComponentFn(ctx,
opts.Timeout, opts.Namespace, compDir) after ctx is canceled; modify the loop to
check ctx.Done() (or ctx.Err()) at the start (or before heavy work and before
invoking runComponentFn) and break/return early when canceled, ensuring
components map is left in a consistent state and no further os.MkdirAll /
os.WriteFile / runComponentFn calls occur after cancellation.

---

Outside diff comments:
In `@pkg/bundler/deployer/argocdhelm/argocdhelm.go`:
- Around line 244-264: The current if-block wrapping argocdGen.Generate's error
uses errors.Wrap which forces any returned *errors.StructuredError to INTERNAL
and loses its original Code; change the error handling in the if _, genErr :=
argocdGen.Generate(ctx, tmpDir); genErr != nil { ... } block to use
errors.PropagateOrWrap so that if genErr is a structured/coded error its Code is
preserved and otherwise it is wrapped with the provided internal code and
message; update the return call in that block to call
errors.PropagateOrWrap(genErr, errors.ErrCodeInternal, "failed to generate base
Argo CD output") (keeping the same context/message and variable names such as
argocdGen.Generate and genErr).

---

Duplicate comments:
In `@cmd/gate/main.go`:
- Line 53: The comment for exitSignal is incorrect because the code returns
exitSignal (130) for both SIGINT and SIGTERM via the ctx.Done() return path;
either update the comment to state that 130 is used for both signals
(referencing the exitSignal variable) or implement distinct handling so SIGINT
sets exitSignal=130 and SIGTERM sets exitSignal=143 before the ctx.Done()
return; locate the signal handling logic / main function where ctx is created
and where exitSignal is read (the ctx.Done() return path) and apply the chosen
fix and corresponding comment update.

In `@pkg/chainsawgate/runner/runner.go`:
- Around line 144-153: The message truncation slices the string by bytes and can
cut a UTF-8 rune; update the truncation logic used here (the block handling out
:= buf.String() in runner.go) and the similar helpers truncate (state.go) and
shortMsg (cmd/gate/main.go) to be rune-aware: convert the string to runes (or
iterate with utf8.DecodeRuneInString) and if its rune count exceeds maxMsgLen
return "..." + last maxMsgLen runes (or otherwise trim safely), ensuring you
preserve valid UTF-8 in the returned Message values (ComponentResult.Message)
and keep the same "..." prefix semantics for trimmed outputs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c6db6dd4-a839-4e39-8039-8d1bb83efc97

📥 Commits

Reviewing files that changed from the base of the PR and between b6a1eff and 52721eb.

📒 Files selected for processing (44)
  • .github/workflows/on-tag.yaml
  • cmd/gate/Dockerfile
  • cmd/gate/main.go
  • cmd/gate/main_test.go
  • docs/contributor/data.md
  • docs/integrator/recipe-development.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/config/config.go
  • pkg/bundler/config/config_test.go
  • pkg/bundler/deployer/argocd/argocd.go
  • pkg/bundler/deployer/argocd/argocd_test.go
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/application.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/Chart.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/application.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/values.yaml
  • pkg/bundler/deployer/argocdhelm/argocdhelm.go
  • pkg/bundler/deployer/argocdhelm/argocdhelm_test.go
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator-readiness.yaml
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/templates/gpu-operator.yaml
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/gatemanifest/manifest.go
  • pkg/bundler/gatemanifest/manifest_test.go
  • pkg/bundler/readiness.go
  • pkg/bundler/readiness_test.go
  • pkg/chainsawgate/runner/runner.go
  • pkg/chainsawgate/runner/runner_test.go
  • pkg/chainsawgate/runner/state.go
  • pkg/chainsawgate/runner/state_test.go
  • pkg/cli/bundle.go
  • pkg/defaults/timeouts.go
  • recipes/components/gpu-operator/readiness.yaml
💤 Files with no reviewable changes (1)
  • recipes/components/gpu-operator/readiness.yaml

Comment thread .github/workflows/on-tag.yaml
Comment thread docs/integrator/recipe-development.md Outdated
Comment thread docs/user/cli-reference.md Outdated
Comment thread pkg/chainsawgate/runner/runner.go
atif1996 added a commit that referenced this pull request Jun 2, 2026
Address two open review threads on PR #1110.

Major: the readiness-gate ClusterRole and ClusterRoleBinding were named
with only the component name, so two bundles deploying the same component
to different namespaces (the multi-tenant --app-name case, #1011) would
overwrite each other's cluster-scoped objects and the binding subject
would point at a single namespace. Suffix the three cluster-scoped name
references (ClusterRole, ClusterRoleBinding, roleRef) with the resolved
namespace; the namespaced ServiceAccount, subject, ConfigMap, and Job
keep the bare component name.

Medium: the gate exec'd `chainsaw test` with no pinned --config, leaving
namespace/resource cleanup to the base image's chainsaw defaults. Bake a
chainsaw Configuration (cleanup.skipDelete: true) into the aicr-gate
image, thread an optional ConfigPath through the runner as --config, and
add a --chainsaw-config flag defaulting to the in-image path. skipDelete
matches the read-only RBAC (no delete verbs).

Regenerate the argocd and argocdhelm readiness goldens.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/bundler/gatemanifest/manifest.go (1)

109-125: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

backoffLimit: 6 can materially delay Argo CD marking the Job “Degraded” (failed) until Kubernetes exhausts Job retries—confirm the gate timeout behavior.

Argo CD’s Job health follows the Kubernetes Job lifecycle and marks the Job Degraded when it reaches a failed terminal state (e.g., after spec.backoffLimit is exceeded). With restartPolicy: Never, if the gate container exits non-zero when --max-wait expires, each timeout likely counts as a failed pod attempt, so terminal Job failure (and thus Argo CD degradation) may take roughly (backoffLimit + 1) * --max-wait before surfacing. If that delay is undesirable for persistent readiness failures, lower backoffLimit (and/or adjust the gate so a --max-wait expiry doesn’t consume Job retries).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/bundler/gatemanifest/manifest.go` around lines 109 - 125, The Job
manifest sets spec.backoffLimit (currently 6) which can delay Argo CD marking
the Job degraded; update the template to reduce spec.backoffLimit to a lower
value (or make it configurable) and/or change the gate container behavior so
that a --max-wait expiry does not exit non-zero and consume retries; look for
the backoffLimit entry and the gate container args (notably --max-wait and
restartPolicy: Never) and either lower backoffLimit or alter the gate's exit
logic to avoid treating timeout expiries as failed attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/bundler/gatemanifest/manifest.go`:
- Around line 109-125: The Job manifest sets spec.backoffLimit (currently 6)
which can delay Argo CD marking the Job degraded; update the template to reduce
spec.backoffLimit to a lower value (or make it configurable) and/or change the
gate container behavior so that a --max-wait expiry does not exit non-zero and
consume retries; look for the backoffLimit entry and the gate container args
(notably --max-wait and restartPolicy: Never) and either lower backoffLimit or
alter the gate's exit logic to avoid treating timeout expiries as failed
attempts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cfca7231-99b9-42b3-8a08-cb7c963f13d1

📥 Commits

Reviewing files that changed from the base of the PR and between 52721eb and 42bbf99.

📒 Files selected for processing (10)
  • cmd/gate/Dockerfile
  • cmd/gate/chainsaw-config.yaml
  • cmd/gate/main.go
  • cmd/gate/main_test.go
  • pkg/bundler/deployer/argocd/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/deployer/argocdhelm/testdata/readiness_gate/002-gpu-operator-readiness/templates/readiness.yaml
  • pkg/bundler/gatemanifest/manifest.go
  • pkg/bundler/gatemanifest/manifest_test.go
  • pkg/chainsawgate/runner/runner.go
  • pkg/chainsawgate/runner/runner_test.go

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

@atif1996 this PR now has merge conflicts with main. Please rebase to resolve them.

@lockwobr
Copy link
Copy Markdown
Contributor

lockwobr commented Jun 2, 2026

Need to review more yet, but I am liking it so far. I feel like might want to invert the flag tbh, --no-readiness to opt-out instead, feels like something we want!

@atif1996 atif1996 requested a review from njhensley June 3, 2026 02:54
@atif1996 atif1996 requested a review from yuanchen8911 June 3, 2026 02:54
atif1996 added 12 commits June 2, 2026 22:56
…ates

Adds a --readiness-hooks bundle flag (default off) that, when a component
ships a readiness.yaml chainsaw test, emits a NNN-<name>-readiness Helm
hook chart running the gate CLI as a Job to gate the deploy on component
readiness. Prototypes the convention with gpu-operator
(ClusterPolicy.status.state=ready).

deploy.sh coordinates helm --timeout above the gate's --max-wait and
forces --wait --wait-for-jobs for readiness folders, since helm --wait
alone does not block on bare Jobs. Non-helm deployers fail clearly when
the flag is set. Default bundle output is unchanged.
Ports the chainsaw-test evaluation library from chainsaw-gate into
pkg/chainsawgate/runner (Evaluate, RunComponent, LoadBundleDir, and the
pure ComputeReadyState/ApplyDeadline/FailingSummary stability-window and
deadline state machine). The package stays Kubernetes-API-free so both
the gate CLI and the chainsaw-gate controller can depend on it.

Adds cmd/gate, the standalone polling gate CLI that drives the runner,
refactored to AICR conventions: pkg/errors with codes instead of
fmt.Errorf, and slog for output instead of fmt.Println/Fprintf.

Image build + CI wiring (ghcr.io/nvidia/aicr-gate with chainsaw baked
in) and docs/BOM land in a follow-up commit on this branch.
Make the gate's --max-wait the single deadline knob and derive the readiness
folder's helm --timeout from it (max-wait + buffer), so the invariant
helm --timeout > gate --max-wait can never silently drift. Helm cannot wait
indefinitely (--wait/--wait-for-jobs is bounded by --timeout; --timeout 0 is
not infinite), so a derived backstop is required.

- Centralize gate timings in pkg/defaults (exec/poll/stability/max-wait +
  helm-timeout buffer); max-wait raised to 90m to cover gpu-operator's
  operand rollout, which can exceed an hour on a large cluster.
- readiness.go renders the Job's gate args from those constants.
- helm.go derives deployTemplateData.ReadinessHelmTimeout = max-wait + buffer;
  deploy.sh.tmpl consumes it (no second hand-tuned timeout literal).
- Consolidate the redundant *-readiness) case blocks in deploy.sh.tmpl.
- Drop CRD / chainsaw-gate-project references from the migrated gate CLI and
  runner comments; AICR drives the gate from a Job, no CRD or controller.
- Assert rendered gate args match the defaults constants; regenerate goldens.
Extend --readiness-hooks beyond the helm deployer to the argocd and
argocd-helm deployers. The gate is emitted as a local-helm folder right
after the component's primary chart, so its Argo CD Application inherits
the next sync-wave. Argo CD then blocks that wave on the gate Job via its
built-in batch/Job health (Progressing -> Healthy on success, Degraded on
failure) -- the analog of helm --wait-for-jobs. No custom health Lua and
no ClusterPolicy watch: readiness logic stays encapsulated in the chainsaw
Test the Job runs.

- argocd/argocdhelm Generators: forward ComponentReadiness through
  localformat.Write (argocdhelm delegates to argocd).
- argocdhelm: strip the -readiness folder suffix when resolving the parent
  component's override key (mirrors existing -pre/-post handling).
- bundler: narrow the fail-closed guard to flux/helmfile only.
- cli: document the supported deployers on --readiness-hooks.

See #904.
Add cmd/gate/Dockerfile and wire the readiness-gate image into the on-tag
release pipeline alongside the validator images.

- Dockerfile embeds the chainsaw binary by copying it out of the official,
  cosign-attested ghcr.io/kyverno/chainsaw image (gate execs `chainsaw test`).
  CHAINSAW_VERSION is the single knob, sourced from .settings.yaml
  (testing_tools.chainsaw, currently v0.2.15) by CI via the load-versions
  action so the in-cluster gate runs the same chainsaw AICR validates with.
- on-tag.yaml: new build-gate job (native per-arch build), gate folded into
  the docker-manifest create/verify loops, the image-vuln-scan matrix, the
  attest job, and the release summary.

Built and smoke-tested locally: gate compiles, chainsaw reports 0.2.15.
See #904.
- cli-reference: add --readiness-hooks to the bundle flag table and a
  "Readiness Gates" section covering the gate Job, the gate image, and how
  each supported deployer (helm via --wait-for-jobs, argocd/argocd-helm via
  sync-wave + built-in Job health) blocks on it.
- recipe-development: add "Adding a Component Readiness Gate" describing the
  recipes/components/<name>/readiness.yaml Chainsaw-test convention.
- data.md: list readiness.yaml in the component directory layout.

See #904.
Normalize aicr-gate image tags to the v-prefixed release contract,
make the gate Job re-run on deploy (helm hooks / Argo Replace=true),
broaden scoped RBAC for NVIDIA CRDs and CRD definitions without
Secrets access, add Job backoffLimit, validate readiness.yaml at
bundle time, and anchor stability-window timing after evaluation.

Extract gate manifest rendering into pkg/bundler/gatemanifest to keep
deployer golden tests importable without a cycle. Add synctest-based
cmd/gate loop coverage and expand bundler/config tests.
Address two open review threads on PR #1110.

Major: the readiness-gate ClusterRole and ClusterRoleBinding were named
with only the component name, so two bundles deploying the same component
to different namespaces (the multi-tenant --app-name case, #1011) would
overwrite each other's cluster-scoped objects and the binding subject
would point at a single namespace. Suffix the three cluster-scoped name
references (ClusterRole, ClusterRoleBinding, roleRef) with the resolved
namespace; the namespaced ServiceAccount, subject, ConfigMap, and Job
keep the bare component name.

Medium: the gate exec'd `chainsaw test` with no pinned --config, leaving
namespace/resource cleanup to the base image's chainsaw defaults. Bake a
chainsaw Configuration (cleanup.skipDelete: true) into the aicr-gate
image, thread an optional ConfigPath through the runner as --config, and
add a --chainsaw-config flag defaulting to the in-image path. skipDelete
matches the read-only RBAC (no delete verbs).

Regenerate the argocd and argocdhelm readiness goldens.
…comment

The helm readiness-gate folder runs with `helm upgrade --install --wait`,
not `--wait-for-jobs`: the gate Job is a post-install/post-upgrade hook and
`--wait` blocks on hook completion regardless of `--wait-for-jobs`. Update
the recipe-development and cli-reference docs to match.

The exitSignal comment claimed SIGTERM yields 143, but signal.NotifyContext
collapses SIGINT/SIGTERM into one cancellation and loop always returns 130,
so the process exits 130 for either signal. Correct the comment.
Add TruncHead/TruncTail helpers that back off to a UTF-8 rune boundary so
the three message-trimming sites (RunComponent failure tail, state.go
truncate, cmd/gate shortMsg) never split a multi-byte rune. All three now
route through the shared helpers.

Evaluate now checks ctx.Err() between components so a SIGINT/SIGTERM or
caller deadline stops the loop instead of spawning more chainsaw execs;
the gate loop maps that cancellation error to exit 130 (interrupt) rather
than exit 2 (config error).
The helm deploy.sh applies gate wait-semantics (derived --timeout, forced
--wait, MAX_RETRIES=1) to every release matching its static "*-readiness)"
glob. A component literally named "<x>-readiness" would silently inherit
that behavior even with --readiness-hooks off, since the glob is name-driven
not gate-driven. Reject such names at bundle time in localformat.Write,
making the glob exact-by-construction. Unlike the -pre/-post collision
guards, the suffix is reserved unconditionally.
Add an explanatory comment for the build-gate job's permissions (zizmor
nitpick): contents: read for checkout, packages: write to push the
per-arch aicr-gate image to GHCR.
@atif1996 atif1996 force-pushed the feat/904-readiness-hooks branch from a23f6fb to 4cf9724 Compare June 3, 2026 03:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: component readiness via chainsaw post-install hook (replaces #610)

5 participants