Skip to content

fix(kwok): make argocd OCI repoURL per-lane (follow-up to #1047)#1048

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-oci-repourl-conditional
May 26, 2026
Merged

fix(kwok): make argocd OCI repoURL per-lane (follow-up to #1047)#1048
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-oci-repourl-conditional

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Follow-up to #1047 — restore the per-recipe segment for the argocd-oci lane, which was incorrectly stripped together with argocd-helm-oci.

Motivation / Context

PR #1047 fixed the argocd-helm-oci repoURL contract (PR #1032 / #1035) by removing the per-recipe segment from OCI_IN_CLUSTER_REF. But the assignment lives inside the shared argocd-oci|argocd-helm-oci) case branch (line 722), so the change applied to both lanes. Codex flagged this on the post-merge review:

  • argocd-helm-oci (intended target): parent App template appends .Chart.Name itself → parent-only repoURL is correct → currently green.
  • argocd-oci (collateral damage): root nvidia-stack app references the artifact by full path → it now tries oci://…/aicr:<tag>, but the artifact lives at oci://…/aicr/<recipe>:<tag> → 404.

This PR makes the assignment per-lane:

if [[ "$DEPLOYER" == "argocd-helm-oci" ]]; then
    OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr"
else
    OCI_IN_CLUSTER_REF="oci://registry.aicr-registry.svc.cluster.local:5000/aicr/${recipe}"
fi

The log_info line at 777 is split per-lane too so each prints the URL it will actually dereference.

Fixes: N/A (no dedicated tracking issue)
Related: #1047, #1032, #1035

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Other: KWOK deployer matrix test wrapper (kwok/scripts/validate-scheduling.sh)

Implementation Notes

Matches the existing per-lane conditional pattern in the same branch:

  • Line 742: [[ "$DEPLOYER" == "argocd-helm-oci" ]] → adjust tag format
  • Line 763: [[ "$DEPLOYER" == "argocd-helm-oci" ]] && deployer_arg="argocd-helm" → adjust deployer arg
  • New (this PR): [[ "$DEPLOYER" == "argocd-helm-oci" ]] → adjust in-cluster repoURL form

The explanatory comment block is replaced with a per-lane explanation citing both contracts.

The flux-oci branch at line ~844 is untouched (still aicr/${recipe}) — Flux deploys via OCIRepository CR with the full path baked in, doesn't go through any repoURL template indirection.

Testing

bash -n kwok/scripts/validate-scheduling.sh   # pass
shellcheck kwok/scripts/validate-scheduling.sh # only pre-existing SC1091 info on line 101

Both pass. Full make qualify was skipped because this is an infra-only change to a CI shell script — Go tests, e2e, vulnerability scan, and yamllint cannot regress from a bash conditional. Per CLAUDE.local.md's doc-only/infra-only carve-out, scoped checks are the appropriate gate here.

CI itself is the real validation: after this lands and PRs rebase onto main, both argocd-oci and argocd-helm-oci Tier-1 matrix jobs should be green.

Risk Assessment

Rollout notes: No runtime impact. After merge, every open PR's argocd-oci Tier-1 jobs should clear; argocd-helm-oci jobs continue passing (the PR #1047 fix is preserved).

Checklist

  • Tests pass locally (bash -n + shellcheck — infra-only change, full make qualify not applicable)
  • Linter passes (shellcheck, no new warnings)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — contract realignment)
  • I updated docs if user-facing behavior changed (N/A — CI internal)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

PR NVIDIA#1047 stripped the per-recipe segment from OCI_IN_CLUSTER_REF for
both deployer lanes, but the assignment lives inside the shared
`argocd-oci|argocd-helm-oci)` case branch, so plain argocd-oci was
incorrectly switched to the parent-only form too.

Result: argocd-oci's root `nvidia-stack` app dials
oci://…/aicr:<tag> while the artifact is pushed to
oci://…/aicr/<recipe>:<tag>, so the OCI artifact lookup 404s on
every PR. (Caught by Codex review after NVIDIA#1047 merged.)

Make the assignment per-lane to restore the working argocd-oci
contract while keeping the parent-only form for argocd-helm-oci:

  - argocd-oci      -> oci://…/aicr/<recipe>  (full path, unchanged
                       from pre-NVIDIA#1047 behavior; root app references
                       the artifact by full path)
  - argocd-helm-oci -> oci://…/aicr           (parent-only; the
                       parent App template appends .Chart.Name)

The log_info line is split per-lane too so each prints the URL it
will actually dereference.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3a9a1c10-9d5a-4a8d-8bf0-468f688f4a35

📥 Commits

Reviewing files that changed from the base of the PR and between 45f3f1a and 2e3665c.

📒 Files selected for processing (1)
  • kwok/scripts/validate-scheduling.sh

📝 Walkthrough

Walkthrough

This PR updates kwok/scripts/validate-scheduling.sh to conditionally set the Argo CD OCI in-cluster repository reference based on the deployer lane. The argocd-helm-oci lane now uses the parent OCI base path, while argocd-oci uses the recipe-specific path. Log output is adjusted accordingly to reflect the lane-specific URL format that Argo CD will pull from during bundle generation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/aicr#1047: Prior change to adjust OCI_IN_CLUSTER_REF for argocd-helm-oci to use the parent in-cluster OCI path; this PR extends that pattern with deployer-specific handling.

Suggested labels

size/S

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: making the Argo CD OCI repoURL configurable per deployment lane, with explicit reference to the related follow-up PR #1047.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the bug that was introduced, the specific fix implemented in the script, and providing clear motivation and context.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, targeted follow-up to #1047. The per-lane conditional restores the correct in-cluster repoURL form for argocd-oci (full aicr/<recipe> path needed by the nvidia-stack root app) while preserving #1047's parent-namespace-only form for argocd-helm-oci (parent App appends .Chart.Name itself).

Diff matches the existing per-lane conditional pattern already present in the same case branch (tag-format at L742, deployer-arg at L774). Verified that OCI_IN_CLUSTER_REF for the argocd-oci lane is only consumed via --repo to aicr bundle — there's no separate deploy-time consumer to keep in sync. The split log_info lines are accurate for both lanes.

CI: Lint, Security Scan, and CLI E2E green at review time; Tier 1 matrix (the real validation) still in progress — both argocd-oci and argocd-helm-oci jobs across oke*/lke should clear with this fix. LGTM.

@mchmarny mchmarny enabled auto-merge (squash) May 26, 2026 22:54
@mchmarny mchmarny merged commit b0bf12d into NVIDIA:main May 26, 2026
86 of 104 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 26, 2026
PR NVIDIA#1047 / NVIDIA#1048 tried to fix the argocd-helm-oci KWOK failures by
adjusting how the wrapper script passes `--set repoURL=…`, but the
root cause is in the bundler's parent App template, not the wrapper.

# Why the previous fix did not hold

`parentAppTemplate` (pkg/bundler/deployer/argocdhelm/argocdhelm.go)
rendered a Helm-OCI shape:

    source:
      repoURL: <Values.repoURL>     # e.g., oci://reg/path
      chart:   <.Chart.Name>        # e.g., my-bundle
      targetRevision: <tag>

The design assumed Argo CD would resolve this as
`<repoURL>/<chart>:<tag>`. But Argo CD's `oci://` scheme triggers
the **native OCI** code path, where `repoURL` is the FULL artifact
reference and `source.chart` is silently ignored. CI confirms it:

    HEAD /v2/aicr/manifests/0.0.0-<sha>-argocd-helm-oci
    (artifact actually lives at /v2/aicr/<recipe>/manifests/...)

The Argo CD user guide splits the two shapes explicitly:

  - Helm OCI charts (no `oci://`, `chart:` set):
    https://argo-cd.readthedocs.io/en/stable/user-guide/helm/
  - Native OCI artifacts (`oci://` repoURL, `path:` set):
    https://argo-cd.readthedocs.io/en/stable/user-guide/oci/

# Fix: dispatch parent App shape on repoURL scheme

Conditional template:

  - `oci://...` repoURL → native OCI shape: append `.Chart.Name` to
    repoURL, set `path: "."`, drop `chart:`. Mirrors the path-based
    child template at injectValuesIntoSingleSource.

  - HTTPS / HTTP repoURL → Helm chart repository shape: keep repoURL
    as-is, emit `chart: .Chart.Name`, drop `path:`. Preserves the
    pure-Helm install path the README still advertises (ChartMuseum,
    GitHub Pages, etc.).

# Tests

  - TestHelmTemplate_RewrittenSources / TestHelmTemplate_PathBasedChildRepoURL
    updated for the native-OCI shape (parent.repoURL == namespace +
    "/" + chart, chart == "", path == ".").
  - TestHelmTemplate_RendersWithHelmRepoRepoURL added: verifies the
    HTTPS-mode rendering produces repoURL + chart with no path field.
  - TestGenerate_CustomChartName now asserts chart name in repoURL
    (was: source.chart).
  - TestHelmTemplate_AppNameOverride heuristic switched from
    `source.chart \!= ""` to `source.path == "."`.

# Goldens

testdata/{helm_and_manifest_only,mixed_component}/templates/aicr-stack.yaml
regenerated.

# Doc comments

Stale `source.chart` / `repoURL/chart:tag` references in
parentAppTemplate, Generator.ChartName, DefaultChartName, and the
"Publication backends and the path field" block updated to describe
the dual-mode contract and cite the Argo CD docs split.

# KWOK wrapper script

The PR NVIDIA#1047 comment block in kwok/scripts/validate-scheduling.sh
referenced the now-wrong source.chart contract; updated to cite the
new parentAppTemplate design. The conditional itself (per-lane
assignment from NVIDIA#1048) is unchanged.

Closes the argocd-helm-oci CI breakage on main introduced by PR NVIDIA#1047.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants