fix(kwok): drop recipe suffix from argocd-helm-oci in-cluster repoURL#1047
Conversation
The argocd-helm-oci wrapper script was passing the FULL bundle URL to `helm install --set repoURL=…` (including the per-recipe chart name at the end). That matched the pre-PR-NVIDIA#1032 contract where the parent Application's `source.chart` was hardcoded to `aicr-bundle`. PR NVIDIA#1032 (and NVIDIA#1035's reinforcement) changed the parent App template to expect the parent-namespace-only repoURL and to append .Chart.Name itself via the separate `source.chart` field. The wrapper script wasn't updated to match. Result on every PR with argocd-helm-oci Tier-1 KWOK coverage: the parent App resolves to `oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>`, the OCI artifact lookup 404s, gpu-operator-post's Application can never sync, and the whole stack times out on `GitOps sync timeout strike 1/3`. The failure was masked on `main` because the most-recent KWOK Cluster Validation run on `main` (#26469449378 at 0d3e62d, success) ran *before* PR NVIDIA#1035 merged. After NVIDIA#1035 / NVIDIA#1036 / NVIDIA#1038 all landed on main, no fresh KWOK run has triggered on `main` yet — but the next one will fail the same way every open PR's argocd-helm-oci Tier-1 jobs are currently failing. Fix is a one-line drop of the per-recipe suffix from OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle. The flux branch keeps the per-recipe suffix because flux's OCIRepository CR consumes the FULL artifact URL (recipe segment included). Updated the surrounding comment to point at the post-NVIDIA#1032 contract so the next reader understands the asymmetry. End-to-end check (verified from PR NVIDIA#1030's debug artifact at b3f2296): repo-server log shows `registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>: not found`, caused by the same double-append. With the recipe suffix dropped, Argo's resolution `<repoURL>/<chart>:<tag>` aligns with the pushed artifact at `oci://…/aicr/<recipe>:<tag>`. Refs PR NVIDIA#1030 (where this surfaced), PR NVIDIA#1032 (contract change), PR NVIDIA#1035 (parent App template enforcement).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR updates the argocd-helm-oci bundling logic in Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codex caught a real regression — thank you. The The fix is to make the assignment per-lane, matching the existing ```bash And update the log line at 777 so each lane prints the URL it'll actually dereference (full path for Amending the PR with this fix shortly. |
|
Amended in 0bb6b56 to make
The |
|
Follow-up PR opened: #1048. Restores the per-recipe segment for the |
PR NVIDIA#1047 / NVIDIA#1048 tried to fix the argocd-helm-oci KWOK failures by adjusting how the wrapper script passes `--set repoURL=…`, but the root cause is in the bundler's parent App template, not the wrapper. # Why the previous fix did not hold `parentAppTemplate` (pkg/bundler/deployer/argocdhelm/argocdhelm.go) rendered a Helm-OCI shape: source: repoURL: <Values.repoURL> # e.g., oci://reg/path chart: <.Chart.Name> # e.g., my-bundle targetRevision: <tag> The design assumed Argo CD would resolve this as `<repoURL>/<chart>:<tag>`. But Argo CD's `oci://` scheme triggers the **native OCI** code path, where `repoURL` is the FULL artifact reference and `source.chart` is silently ignored. CI confirms it: HEAD /v2/aicr/manifests/0.0.0-<sha>-argocd-helm-oci (artifact actually lives at /v2/aicr/<recipe>/manifests/...) The Argo CD user guide splits the two shapes explicitly: - Helm OCI charts (no `oci://`, `chart:` set): https://argo-cd.readthedocs.io/en/stable/user-guide/helm/ - Native OCI artifacts (`oci://` repoURL, `path:` set): https://argo-cd.readthedocs.io/en/stable/user-guide/oci/ # Fix: dispatch parent App shape on repoURL scheme Conditional template: - `oci://...` repoURL → native OCI shape: append `.Chart.Name` to repoURL, set `path: "."`, drop `chart:`. Mirrors the path-based child template at injectValuesIntoSingleSource. - HTTPS / HTTP repoURL → Helm chart repository shape: keep repoURL as-is, emit `chart: .Chart.Name`, drop `path:`. Preserves the pure-Helm install path the README still advertises (ChartMuseum, GitHub Pages, etc.). # Tests - TestHelmTemplate_RewrittenSources / TestHelmTemplate_PathBasedChildRepoURL updated for the native-OCI shape (parent.repoURL == namespace + "/" + chart, chart == "", path == "."). - TestHelmTemplate_RendersWithHelmRepoRepoURL added: verifies the HTTPS-mode rendering produces repoURL + chart with no path field. - TestGenerate_CustomChartName now asserts chart name in repoURL (was: source.chart). - TestHelmTemplate_AppNameOverride heuristic switched from `source.chart \!= ""` to `source.path == "."`. # Goldens testdata/{helm_and_manifest_only,mixed_component}/templates/aicr-stack.yaml regenerated. # Doc comments Stale `source.chart` / `repoURL/chart:tag` references in parentAppTemplate, Generator.ChartName, DefaultChartName, and the "Publication backends and the path field" block updated to describe the dual-mode contract and cite the Argo CD docs split. # KWOK wrapper script The PR NVIDIA#1047 comment block in kwok/scripts/validate-scheduling.sh referenced the now-wrong source.chart contract; updated to cite the new parentAppTemplate design. The conditional itself (per-lane assignment from NVIDIA#1048) is unchanged. Closes the argocd-helm-oci CI breakage on main introduced by PR NVIDIA#1047.
Summary
Fix the KWOK matrix wrapper script so the
argocd-helm-ocideployer tier passes the parent-namespace-only repoURL that the Argo CD parent Application template (PR #1032 / #1035) actually expects.Motivation / Context
PR #1032 ("configurable parent Application name") and PR #1035 ("assemble full OCI artifact in path-based child apps") changed the argocd-helm parent Application template to append
.Chart.Nameitself via thesource.chartfield. The wrapper script inkwok/scripts/validate-scheduling.shwas not updated to match — it kept passing the full bundle URL (including the per-recipe chart name) as--set repoURL=…. The parent App template then appended.Chart.Nameon top, producing a doubled recipe segment:The OCI artifact actually lives at
oci://…/aicr/<recipe>:<tag>(single segment), so the doubled form 404s. Argo CD's repo-server logs the resolved digest lookup failure:The
argocd-helm-ociparent Application never syncs,gpu-operator-post(a child of the parent App) never reconciles, and the matrix job times out onGitOps sync timeout strike 1/3.Blast radius: every PR with Tier-1 KWOK coverage on
argocd-helm-ocifails this way — currently 10+ unrelated services (aks, aks-inference, aks-training, eks-inference, gke-cos, gke-cos-inference, gke-cos-training, kind-inference, lke-inference, oke-inference). Theargocd-oci,flux-oci, and plainhelmdeployers are unaffected because their resolution paths don't go throughhelm install --set repoURL=…against the parent App template.Fixes the
argocd-helm-ociCI breakage currently blocking PR #1046 (rtx-pro-6000 EKS overlays) and any other open PR.Fixes: N/A (no dedicated tracking issue)
Related: #1032, #1035, #1046
Type of Change
Component(s) Affected
kwok/scripts/validate-scheduling.sh)Implementation Notes
Single change to the
argocd-helm-ocibranch ofgenerate_bundle():Plus an explanatory comment block citing PR #1032 / #1035, and a log-line tweak so the printed pull URL shows the full path (
${in_cluster_repo}/${recipe}:${tag}) — i.e., the URL Argo CD will actually dereference — instead of the truncated parent-namespace URL.Why not also change the
flux-ocibranch (line 844):The
flux-ocibranch stays on…/aicr/${recipe}because Flux's source-controller doesn't go through the same parent-App template path. The Flux deployer doesn't bakerepoURLinto the bundle — the OCIRepository CR is applied at deploy time with the full path baked in directly. Its current value already matches the pushed artifact ataicr/${recipe}:${tag}and Flux works correctly today. Touching it would break a working path.Testing
Both pass. Full
make qualifywas skipped because this is an infra-only change to a CI shell script — Go tests, e2e, vulnerability scan, and yamllint cannot regress from a bash one-liner. PerCLAUDE.local.md's doc-only/infra-only carve-out, scoped checks are the appropriate gate here.Real validation will come from CI itself: the
argocd-helm-ocimatrix jobs in this PR's run should turn green where the rtx-pro-6000 PR (#1046) currently shows 10+ failures, confirming the contract is restored.Risk Assessment
aicrback withaicr/${recipe}on line 758).Rollout notes: No runtime impact — affects only the KWOK CI matrix test driver. After merge, every open PR's
argocd-helm-ociTier-1 jobs should pass when rebased onto the new main.Checklist
bash -n+shellcheck— infra-only change, fullmake qualifynot applicable)git commit -S)