Skip to content

fix(kwok): drop recipe suffix from argocd-helm-oci in-cluster repoURL#1047

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-helm-oci-repourl-1032
May 26, 2026
Merged

fix(kwok): drop recipe suffix from argocd-helm-oci in-cluster repoURL#1047
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-helm-oci-repourl-1032

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Fix the KWOK matrix wrapper script so the argocd-helm-oci deployer tier passes the parent-namespace-only repoURL that the Argo CD parent Application template (PR #1032 / #1035) actually expects.

Motivation / Context

PR #1032 ("configurable parent Application name") and PR #1035 ("assemble full OCI artifact in path-based child apps") changed the argocd-helm parent Application template to append .Chart.Name itself via the source.chart field. The wrapper script in kwok/scripts/validate-scheduling.sh was not updated to match — it kept passing the full bundle URL (including the per-recipe chart name) as --set repoURL=…. The parent App template then appended .Chart.Name on top, producing a doubled recipe segment:

oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>
                                                       ^^^^^^^^ ^^^^^^^^
                                                       passed in   appended
                                                       repoURL     by Argo

The OCI artifact actually lives at oci://…/aicr/<recipe>:<tag> (single segment), so the doubled form 404s. Argo CD's repo-server logs the resolved digest lookup failure:

failed to resolve revision "0.0.0-<sha>-argocd-helm-oci":
cannot get digest for revision 0.0.0-<sha>-argocd-helm-oci:
registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:0.0.0-<sha>-argocd-helm-oci: not found

The argocd-helm-oci parent Application never syncs, gpu-operator-post (a child of the parent App) never reconciles, and the matrix job times out on GitOps sync timeout strike 1/3.

Blast radius: every PR with Tier-1 KWOK coverage on argocd-helm-oci fails this way — currently 10+ unrelated services (aks, aks-inference, aks-training, eks-inference, gke-cos, gke-cos-inference, gke-cos-training, kind-inference, lke-inference, oke-inference). The argocd-oci, flux-oci, and plain helm deployers are unaffected because their resolution paths don't go through helm install --set repoURL=… against the parent App template.

Fixes the argocd-helm-oci CI breakage currently blocking PR #1046 (rtx-pro-6000 EKS overlays) and any other open PR.

Fixes: N/A (no dedicated tracking issue)
Related: #1032, #1035, #1046

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Other: KWOK deployer matrix test wrapper (kwok/scripts/validate-scheduling.sh)

Implementation Notes

Single change to the argocd-helm-oci branch of generate_bundle():

- OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr/${recipe}"
+ OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr"

Plus an explanatory comment block citing PR #1032 / #1035, and a log-line tweak so the printed pull URL shows the full path (${in_cluster_repo}/${recipe}:${tag}) — i.e., the URL Argo CD will actually dereference — instead of the truncated parent-namespace URL.

Why not also change the flux-oci branch (line 844):
The flux-oci branch stays on …/aicr/${recipe} because Flux's source-controller doesn't go through the same parent-App template path. The Flux deployer doesn't bake repoURL into the bundle — the OCIRepository CR is applied at deploy time with the full path baked in directly. Its current value already matches the pushed artifact at aicr/${recipe}:${tag} and Flux works correctly today. Touching it would break a working path.

Testing

# Shell syntax check
bash -n kwok/scripts/validate-scheduling.sh

# Static analysis
shellcheck kwok/scripts/validate-scheduling.sh
# Only pre-existing SC1091 info on line 101 (unrelated `source` directive)

Both pass. Full make qualify was skipped because this is an infra-only change to a CI shell script — Go tests, e2e, vulnerability scan, and yamllint cannot regress from a bash one-liner. Per CLAUDE.local.md's doc-only/infra-only carve-out, scoped checks are the appropriate gate here.

Real validation will come from CI itself: the argocd-helm-oci matrix jobs in this PR's run should turn green where the rtx-pro-6000 PR (#1046) currently shows 10+ failures, confirming the contract is restored.

Risk Assessment

  • Low — Isolated single-line change to CI test infrastructure, easy to revert (replace aicr back with aicr/${recipe} on line 758).

Rollout notes: No runtime impact — affects only the KWOK CI matrix test driver. After merge, every open PR's argocd-helm-oci Tier-1 jobs should pass when rebased onto the new main.

Checklist

  • Tests pass locally (bash -n + shellcheck — infra-only change, full make qualify not applicable)
  • Linter passes (shellcheck, no new warnings)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — this is a contract realignment between the wrapper script and the existing parent App template)
  • I updated docs if user-facing behavior changed (N/A — CI internal)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The argocd-helm-oci wrapper script was passing the FULL bundle URL to
`helm install --set repoURL=…` (including the per-recipe chart name
at the end). That matched the pre-PR-NVIDIA#1032 contract where the parent
Application's `source.chart` was hardcoded to `aicr-bundle`.

PR NVIDIA#1032 (and NVIDIA#1035's reinforcement) changed the parent App template
to expect the parent-namespace-only repoURL and to append .Chart.Name
itself via the separate `source.chart` field. The wrapper script
wasn't updated to match. Result on every PR with argocd-helm-oci
Tier-1 KWOK coverage: the parent App resolves to
`oci://registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>`,
the OCI artifact lookup 404s, gpu-operator-post's Application can
never sync, and the whole stack times out on
`GitOps sync timeout strike 1/3`.

The failure was masked on `main` because the most-recent KWOK
Cluster Validation run on `main` (#26469449378 at 0d3e62d, success)
ran *before* PR NVIDIA#1035 merged. After NVIDIA#1035 / NVIDIA#1036 / NVIDIA#1038 all landed
on main, no fresh KWOK run has triggered on `main` yet — but the
next one will fail the same way every open PR's argocd-helm-oci
Tier-1 jobs are currently failing.

Fix is a one-line drop of the per-recipe suffix from
OCI_IN_CLUSTER_REF in the argocd-helm-oci branch of generate_bundle.
The flux branch keeps the per-recipe suffix because flux's
OCIRepository CR consumes the FULL artifact URL (recipe segment
included). Updated the surrounding comment to point at the
post-NVIDIA#1032 contract so the next reader understands the asymmetry.

End-to-end check (verified from PR NVIDIA#1030's debug artifact at
b3f2296): repo-server log shows
`registry.aicr-registry.svc.cluster.local:5000/aicr/<recipe>/<recipe>:<tag>: not found`,
caused by the same double-append. With the recipe suffix dropped,
Argo's resolution `<repoURL>/<chart>:<tag>` aligns with the pushed
artifact at `oci://…/aicr/<recipe>:<tag>`.

Refs PR NVIDIA#1030 (where this surfaced), PR NVIDIA#1032 (contract change),
PR NVIDIA#1035 (parent App template enforcement).
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 26, 2026 22:20
@github-actions github-actions Bot added size/S and removed area/ci labels May 26, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: caee614f-2b43-4e81-a7a0-44949216f9c5

📥 Commits

Reviewing files that changed from the base of the PR and between 91a4552 and 52fb835.

📒 Files selected for processing (1)
  • kwok/scripts/validate-scheduling.sh

📝 Walkthrough

Walkthrough

This PR updates the argocd-helm-oci bundling logic in kwok/scripts/validate-scheduling.sh. The OCI_IN_CLUSTER_REF variable is changed from a recipe-specific OCI URL to a base repository URL by removing the /<recipe> suffix. The corresponding log message is updated to clarify that the parent Argo CD App appends the per-chart name (.Chart.Name) to resolve the final artifact reference.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: removing the recipe suffix from the argocd-helm-oci in-cluster repoURL, which is the primary modification in the changeset.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the bug, root cause, impact, implementation details, testing, and risk assessment for the argocd-helm-oci repoURL fix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@mchmarny mchmarny enabled auto-merge (squash) May 26, 2026 22:28
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Codex caught a real regression — thank you. The OCI_IN_CLUSTER_REF assignment at line 769 lives inside the shared argocd-oci|argocd-helm-oci) case branch (label at line 722), so the change strips the per-recipe segment for both lanes. With this PR as-is, argocd-oci would start failing — the root app resolves repoURL = oci://…/aicr to oci://…/aicr:\${tag}, but the artifact actually lives at oci://…/aicr/\${recipe}:\${tag}.

The fix is to make the assignment per-lane, matching the existing [[ \"\$DEPLOYER\" == \"argocd-helm-oci\" ]] conditionals at lines 742 and 763:

```bash
if [[ "$DEPLOYER" == "argocd-helm-oci" ]]; then
OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr"
else
OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr/${recipe}"
fi
```

And update the log line at 777 so each lane prints the URL it'll actually dereference (full path for argocd-oci, parent + .Chart.Name for argocd-helm-oci).

Amending the PR with this fix shortly.

@mchmarny mchmarny merged commit 45f3f1a into NVIDIA:main May 26, 2026
66 of 101 checks passed
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Amended in 0bb6b56 to make OCI_IN_CLUSTER_REF per-lane:

  • argocd-oci keeps the full aicr/${recipe} form (its root nvidia-stack app references the artifact by full path).
  • argocd-helm-oci uses the parent-only aicr form so the parent App template can append .Chart.Name itself.

The log_info line at 777 is also split per-lane so each prints the URL it will actually dereference. Verified bash -n and shellcheck clean (only the pre-existing SC1091 info on the unrelated line 101 source directive).

@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Follow-up PR opened: #1048. Restores the per-recipe segment for the argocd-oci lane (which this PR's case-branch-wide change accidentally stripped along with argocd-helm-oci). Codex confirmed the per-lane conditional reads correctly.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
PR NVIDIA#1047 / NVIDIA#1048 tried to fix the argocd-helm-oci KWOK failures by
adjusting how the wrapper script passes `--set repoURL=…`, but the
root cause is in the bundler's parent App template, not the wrapper.

# Why the previous fix did not hold

`parentAppTemplate` (pkg/bundler/deployer/argocdhelm/argocdhelm.go)
rendered a Helm-OCI shape:

    source:
      repoURL: <Values.repoURL>     # e.g., oci://reg/path
      chart:   <.Chart.Name>        # e.g., my-bundle
      targetRevision: <tag>

The design assumed Argo CD would resolve this as
`<repoURL>/<chart>:<tag>`. But Argo CD's `oci://` scheme triggers
the **native OCI** code path, where `repoURL` is the FULL artifact
reference and `source.chart` is silently ignored. CI confirms it:

    HEAD /v2/aicr/manifests/0.0.0-<sha>-argocd-helm-oci
    (artifact actually lives at /v2/aicr/<recipe>/manifests/...)

The Argo CD user guide splits the two shapes explicitly:

  - Helm OCI charts (no `oci://`, `chart:` set):
    https://argo-cd.readthedocs.io/en/stable/user-guide/helm/
  - Native OCI artifacts (`oci://` repoURL, `path:` set):
    https://argo-cd.readthedocs.io/en/stable/user-guide/oci/

# Fix: dispatch parent App shape on repoURL scheme

Conditional template:

  - `oci://...` repoURL → native OCI shape: append `.Chart.Name` to
    repoURL, set `path: "."`, drop `chart:`. Mirrors the path-based
    child template at injectValuesIntoSingleSource.

  - HTTPS / HTTP repoURL → Helm chart repository shape: keep repoURL
    as-is, emit `chart: .Chart.Name`, drop `path:`. Preserves the
    pure-Helm install path the README still advertises (ChartMuseum,
    GitHub Pages, etc.).

# Tests

  - TestHelmTemplate_RewrittenSources / TestHelmTemplate_PathBasedChildRepoURL
    updated for the native-OCI shape (parent.repoURL == namespace +
    "/" + chart, chart == "", path == ".").
  - TestHelmTemplate_RendersWithHelmRepoRepoURL added: verifies the
    HTTPS-mode rendering produces repoURL + chart with no path field.
  - TestGenerate_CustomChartName now asserts chart name in repoURL
    (was: source.chart).
  - TestHelmTemplate_AppNameOverride heuristic switched from
    `source.chart \!= ""` to `source.path == "."`.

# Goldens

testdata/{helm_and_manifest_only,mixed_component}/templates/aicr-stack.yaml
regenerated.

# Doc comments

Stale `source.chart` / `repoURL/chart:tag` references in
parentAppTemplate, Generator.ChartName, DefaultChartName, and the
"Publication backends and the path field" block updated to describe
the dual-mode contract and cite the Argo CD docs split.

# KWOK wrapper script

The PR NVIDIA#1047 comment block in kwok/scripts/validate-scheduling.sh
referenced the now-wrong source.chart contract; updated to cite the
new parentAppTemplate design. The conditional itself (per-lane
assignment from NVIDIA#1048) is unchanged.

Closes the argocd-helm-oci CI breakage on main introduced by PR NVIDIA#1047.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants