Skip to content

oetf: resolve spec paths from caller's runfiles repo; restore script bundle#1114

Open
jiaenren wants to merge 2 commits into
mainfrom
jiaenr/fix-oetf-cross-repo-spec-resolution
Open

oetf: resolve spec paths from caller's runfiles repo; restore script bundle#1114
jiaenren wants to merge 2 commits into
mainfrom
jiaenr/fix-oetf-cross-repo-spec-resolution

Conversation

@jiaenren

@jiaenren jiaenren commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Fixes two OETF migration breakages that surfaced in the gitlab-master staging Jenkins pipeline run after MR !6103 merged:

1. Spec-file resolution under bzlmod

runner_fixture._workspace_root() uses TEST_WORKSPACE, which bzlmod always sets to _main regardless of which module the test target lives in. Tests in @osmo_workspace//test/scenarios:* that call self.workflow("test/workflow/X.yaml") resolved against <TEST_SRCDIR>/_main/ — but their YAML data deps land under <TEST_SRCDIR>/osmo_workspace+/. Every external scenario in staging Jenkins was failing with:

FileNotFoundError: spec file not found: .../runfiles/_main/test/workflow/<X>.yaml

Fix: new _caller_runfiles_repo_root() helper walks from the caller's __file__ up to the immediate child of TEST_SRCDIR — that's the runfiles repo dir that actually owns the test. _resolve_spec_path() tries this candidate first for workspace-relative paths, falling back to _workspace_root() for BUILD_WORKSPACE_DIRECTORY / pre-bzlmod compat.

Unit test added: test_runner_fixture.py::CallerRunfilesRepoRootTest covers both _main and dep-module (<name>+) layouts.

2. Script + input bundles dropped during the OETF migration

Several yamls under test/workflow/ (app_cli.yaml, dataset_cli.yaml, parallel_workflow_mount_v2.yaml, etc.) reference localpath: scripts/X.sh or localpath: input/Y. The scripts/ and input/ subdirs existed under validation/workflow/ in the internal repo but weren't moved over with the YAMLs. The yaml-only migration left every script-referencing workflow broken end-to-end (even after fix #1 unblocks the YAML load).

Fix: restore the script bundle alongside the yamls (bit-identical copy from internal validation/workflow/{scripts,input}/) so the public yamls are self-sufficient. test/workflow/BUILD's filegroup already globs scripts/** + input/** with allow_empty=True; no BUILD edit needed.

3. Platform/bucket auto-injection in WorkflowBuilder.submit()

Public yamls use {{platform}} placeholders with KIND-oriented default-values: (cpu-x86); on the internal staging Jenkins they need x86-5090 / agx-orin-jp6 / etc. WorkflowBuilder.submit() now auto-injects platform={fixture.default_platform} and bucket={fixture.default_bucket} as workflow variables when the caller hasn't set them explicitly. The fixture defaults are env-aware (OETF_DEFAULT_PLATFORM, OETF_DEFAULT_BUCKET), so the same scenarios run unchanged across KIND (cpu), staging (x86-5090), and other envs.

The internal-repo follow-up exports OETF_DEFAULT_PLATFORM=x86-5090 in jenkins/vars/oetfTests.groovy so this kicks in for the staging job.

Verification

Local against https://staging.osmo.nvidia.com on isaac-nightly pool with OETF_DEFAULT_PLATFORM=x86-5090:

Test Before After
app-cli FileNotFoundError at 3.4s submitted, ran 106s on staging (app-cli-1621); task-level exit-1 is a separate pre-existing issue
command-validation FileNotFoundError at 3.3s PASS
mount-validation FileNotFoundError at 3.3s PASS

All existing framework unit tests + pylint stay green: bazel test //test/oetf/tests:test_runner_fixture //test/oetf:runner_fixture-pylint.

Pairs with

Internal gitlab-master OSMO follow-up to bump the external submodule + export OETF_DEFAULT_PLATFORM in Jenkins is queued separately.

Summary by CodeRabbit

  • Tests

    • Improved test fixture path handling for Bazel bzlmod execution so workspace-relative spec files resolve correctly.
    • Updated CLI submission testing for localpath: to run from the staging directory and improved failure diagnostics with richer, cleaned CLI output.
    • Enhanced workflow test submissions to automatically apply default platform/bucket when not explicitly provided.
    • Added new CLI end-to-end test scripts and strengthened query/input fixtures, plus a new test suite validating repository discovery.
  • Chores

    • Added/adjusted workflow YAML templates and test input files to drive platform selection via {{platform}}.

…bundle

Two migration-induced breakages on the staging Jenkins OETF pipeline:

1. Spec-file resolution under bzlmod. _workspace_root() uses TEST_WORKSPACE
   which bzlmod always sets to "_main", regardless of which module the test
   target lives in. So tests in @osmo_workspace//test/scenarios:* that call
   self.workflow("test/workflow/X.yaml") resolved against <TEST_SRCDIR>/_main/
   even though their data deps land under <TEST_SRCDIR>/osmo_workspace+/.
   Every external scenario in the staging Jenkins run was failing with
     FileNotFoundError: spec file not found: ...runfiles/_main/test/workflow/X.yaml

   Fix: new _caller_runfiles_repo_root() helper walks from the caller's
   __file__ up to the immediate child of TEST_SRCDIR — that's the runfiles
   repo dir that actually owns the test (e.g. osmo_workspace+ or _main).
   _resolve_spec_path() tries this candidate first for workspace-relative
   paths, falling back to _workspace_root() for the BUILD_WORKSPACE_DIRECTORY
   path (bazel run) and pre-bzlmod compat.

2. Script + input bundles dropped during the OETF migration. Several yamls
   under test/workflow/ (app_cli.yaml, dataset_cli.yaml, parallel_workflow_
   mount_v2.yaml, etc.) reference localpath: scripts/X.sh or localpath:
   input/Y — the scripts/ and input/ subdirs existed under
   validation/workflow/ in the internal repo but weren't moved over with
   the YAMLs. The yaml-only migration left every script-referencing
   workflow broken end-to-end (even after fix #1 unblocks the YAML load).

   Fix: restore the script bundle alongside the yamls (bit-identical copy
   from the internal repo's validation/workflow/{scripts,input}/) so the
   public yamls are self-sufficient. test/workflow/BUILD's filegroup
   already globs scripts/** + input/** with allow_empty=True; no BUILD edit
   needed.

Bonus: WorkflowBuilder.submit() now auto-injects platform={default_platform}
and bucket={default_bucket} as workflow variables when the caller hasn't
set them explicitly. Public yamls use {{platform}} placeholders with KIND-
oriented default-values (cpu-x86); on the internal staging Jenkins they
need x86-5090 / agx-orin-jp6 / etc. Auto-injection from the fixture's env-
aware defaults (OETF_DEFAULT_PLATFORM, OETF_DEFAULT_BUCKET) lets the same
scenarios run unchanged across KIND, staging, and any other env.

Verification (local, against staging.osmo.nvidia.com on isaac-nightly
pool with OETF_DEFAULT_PLATFORM=x86-5090):

  @osmo_workspace//test/scenarios:app-cli            submitted → ran 106s on
                                                     staging (workflow
                                                     app-cli-1621)
  @osmo_workspace//test/scenarios:command-validation PASS
  @osmo_workspace//test/scenarios:mount-validation   PASS

(Some scenarios still fail for unrelated reasons — actual task runtime,
expected-outcome drift — but the migration breakage that took out the
whole suite is gone.)

Unit test added: test_runner_fixture.py::CallerRunfilesRepoRootTest covers
both _main and dep-module (<name>+) layouts. All existing test_runner_
fixture tests + the runner_fixture pylint target still green:

  bazel test //test/oetf/tests:test_runner_fixture //test/oetf:runner_fixture-pylint
@jiaenren jiaenren requested a review from a team as a code owner June 16, 2026 23:38
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds _caller_runfiles_repo_root() to the OETF runner fixture for Bazel bzlmod-aware spec path resolution, enhances CLI submission and error handling, auto-injects platform and bucket variables into WorkflowBuilder.submit(), and introduces a CallerRunfilesRepoRootTest suite. Also adds a collection of new end-to-end CLI integration test scripts covering app, dataset/collection, workflow, and server/client scenarios, plus parameterizes 14 workflow test fixtures to use platform template variables.

Changes

OETF Runner Fixture Enhancements

Layer / File(s) Summary
Caller runfiles repo root helper and spec path resolution
test/oetf/runner_fixture.py
Adds _caller_runfiles_repo_root() that scans the Python call stack and walks the filesystem to locate the caller's runfiles repo root under TEST_SRCDIR. Updates _resolve_spec_path() to try this root first, falling back to _workspace_root() when the candidate does not exist.
CLI submission and error handling enhancements
test/oetf/runner_fixture.py
Updates _submit_via_cli() to run osmo workflow submit with cwd=temp_dir for correct localpath resolution. Enhances _raise_cli_submission_error() to filter repeated "New client available" warnings from stderr and combine cleaned stderr with truncated stdout for richer error context.
WorkflowBuilder.submit() variable auto-injection
test/oetf/runner_fixture.py
Builds a derived args list in submit() and conditionally appends platform= and bucket= from fixture defaults when not already present; the enriched args are forwarded to both CLI and API submission paths.
Unit tests for _caller_runfiles_repo_root
test/oetf/tests/test_runner_fixture.py
Adds CallerRunfilesRepoRootTest covering _main, bzlmod dependency (<name>+), nested directory traversal, and None when TEST_SRCDIR is unset, using exec-based synthetic caller paths.

Workflow Integration Test Scripts and Fixtures

Layer / File(s) Summary
Test input file fixtures
test/workflow/input/folder_input_0/test_file.txt, test/workflow/input/folder_input_1/test_file.txt
Populates two test input files with test_file_content for use in dataset upload/diff tests.
App lifecycle CLI integration test
test/workflow/scripts/app_spec.yaml, test/workflow/scripts/app_cli.sh
Adds a single-task app-cli-test workflow spec and a shell script that creates or updates an app, validates it via osmo app info/show/spec, deletes it, and asserts non-existence.
Dataset and collection lifecycle CLI integration test
test/workflow/scripts/dataset_cli.sh, test/workflow/scripts/dataset_query.txt, test/workflow/scripts/collection_query.txt
Adds query templates and a 270-line script exercising upload, tag/label/metadata, download-diff, update-add/remove, collection create/query, and deletion with existence verification for both datasets and collections.
Workflow CLI smoke-test
test/workflow/scripts/workflow_cli.sh
Exercises osmo commands for profiles, workflow list/query/tag, spec/template retrieval, and pool/resource/task listing with multiple flag variants; exits on first failure.
Server/client TCP echo and hostname scripts
test/workflow/scripts/server.sh, test/workflow/scripts/client.sh, test/workflow/scripts/echo_hostname.sh
Adds a nc-based TCP server on port 24831, a retry-looping client that saves the echo output, and a hostname-printing utility for multi-task workflow tests.
Workflow test YAML platform parameterization
test/workflow/bad_command.yaml, test/workflow/bad_image.yaml, test/workflow/empty_command.yaml, test/workflow/exec_timeout.yaml, test/workflow/host_mount.yaml, test/workflow/invalid_mount.yaml, test/workflow/invalid_src_dest_mount.yaml, test/workflow/multi_arch_containers.yaml, test/workflow/osmo_client_test.yaml, test/workflow/per_group_exec_timeout.yaml, test/workflow/regex_workflow.yaml, test/workflow/resource_type.yaml, test/workflow/serial_workflow_multi_arch.yaml, test/workflow/serial_workflow_update_dataset.yaml
Updates 14 workflow test YAML files to replace hard-coded platform: cpu-x86 with templated platform: {{platform}}, enabling tests to parameterize the execution platform at runtime.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • fernandol-nvidia
  • RyaliNvidia

Poem

🐰 Hoppity-hop through the runfiles tree,
TEST_SRCDIR now points straight to me!
Platforms and buckets auto-inject with grace,
Shell scripts galore run the test race.
From dataset diffs to TCP pings—
This rabbit delights in all the new things! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly describes the main changes: fixing spec path resolution from the caller's runfiles repo under bzlmod, and restoring the script bundle that was missing after migration.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jiaenr/fix-oetf-cross-repo-spec-resolution

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (4)
test/oetf/runner_fixture.py (1)

602-606: 💤 Low value

Consider logging the enriched args for debugging clarity.

Line 605 logs self._args (user-specified args) rather than args (which includes auto-injected platform/bucket). When debugging submission failures, seeing the complete set of variables actually submitted may be more helpful.

💡 Optional diff
             print(
                 f"OETF submitted: {workflow_id} ({workflow_url}) "
                 f"[spec={os.path.basename(self._spec_path)} pool={self._pool} "
-                f"client={self._client} args={list(self._args)}]",
+                f"client={self._client} args={args}]",
                 file=sys.stderr, flush=True,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/oetf/runner_fixture.py` around lines 602 - 606, In the print statement
that logs the OETF workflow submission details around line 605, change the
logged args variable from self._args to args. This will ensure the log includes
the complete set of arguments actually submitted, including any auto-injected
platform and bucket values, rather than just the user-specified arguments. This
provides more complete debugging information when investigating submission
issues.
test/workflow/scripts/app_cli.sh (1)

21-21: 💤 Low value

Consider checking exit codes directly.

ShellCheck suggests checking exit codes directly (e.g., if ! osmo app list; then) instead of testing $? for improved readability.

Also applies to: 38-38, 46-46, 54-54, 62-62, 70-70, 80-80, 88-88, 98-98

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/app_cli.sh` at line 21, Replace the indirect exit code
checks using `$?` with direct command checks for improved readability and
idiomaticity. Instead of executing a command followed by `if [ $? -ne 0 ];
then`, place the command directly in the conditional statement using the
negation operator `!` (e.g., `if ! osmo app list; then`). Apply this change at
all eight locations noted in the comment: lines 21, 38, 46, 54, 62, 70, 80, 88,
and 98, ensuring each command is executed within its respective conditional
statement.

Source: Linters/SAST tools

test/workflow/scripts/client.sh (2)

31-31: 💤 Low value

Optional: Simplify arithmetic expression.

Same as Line 18, the $ prefix on retries is unnecessary in arithmetic context.

♻️ Optional simplification
-  retries=$(($retries - 1))
+  retries=$((retries - 1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/client.sh` at line 31, In the arithmetic expansion
expression for the retries variable decrement operation, remove the unnecessary
dollar sign prefix before the variable name inside the parentheses. The variable
reference inside the arithmetic context $((...)) does not require the $ prefix,
so simplify the expression by removing it from the first occurrence of retries
while keeping the structure of the subtraction operation intact.

18-18: 💤 Low value

Optional: Simplify arithmetic expression.

Shellcheck suggests that the $ prefix on retries is unnecessary in arithmetic context. While the current syntax works correctly, the simpler form is more idiomatic.

♻️ Optional simplification
-  retries=$(($retries - 1))
+  retries=$((retries - 1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/client.sh` at line 18, In the arithmetic expression for
the retries variable assignment, remove the unnecessary `$` prefix from the
variable name inside the arithmetic expansion. Change the expression from
`retries=$(($retries - 1))` to `retries=$((retries - 1))` to use the more
idiomatic shellcheck-compliant syntax where variable expansion with `$` is
optional within arithmetic context.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/workflow/scripts/app_cli.sh`:
- Around line 26-30: The elif branch checking for "created successfully" is
unreachable because it exists inside an if ! block that only executes when the
command fails. Remove the unreachable elif branch (the one checking for "created
successfully" with grep) from inside the failure block. Restructure the logic so
that the success case is handled in the main if block and the failure case is in
the else block, ensuring that when the create command exits with code 0
(success), it properly acknowledges the successful creation before continuing.
- Around line 78-84: The logic in the osmo app info check is inverted. After the
app is deleted, the osmo app info command should fail because the app no longer
exists. Currently the code treats this expected failure as an error with if [ $?
-ne 0 ]. Invert the condition to if [ $? -eq 0 ] so that success (the command
finding the deleted app) is treated as the error condition, matching the correct
pattern used in the app show and app spec checks that follow.

In `@test/workflow/scripts/app_spec.yaml`:
- Around line 1-9: The copyright header in the app_spec.yaml file uses invalid
YAML syntax with ".." delimiters instead of proper YAML comments. Replace the
".." at the beginning (before "Copyright") and the ".." at the end (after the
closing line about strict prohibition) with the "#" prefix to make them valid
YAML comments. Each line in the copyright header block should start with "#"
followed by a space, consistent with proper YAML comment formatting.

In `@test/workflow/scripts/client.sh`:
- Around line 35-36: The client.sh script is missing a line that writes the TCP
echo output to the workflow output directory. Between the existing `cat
tmp/tcp_echo.txt` command and the `sleep 1000` command, add a new line that
redirects the TCP echo file content to the workflow output directory using the
{{output}} placeholder syntax. This ensures the workflow produces the expected
tcp_echo.txt output artifact as required by the parallel_workflow_v2.yaml
configuration.

In `@test/workflow/scripts/dataset_cli.sh`:
- Around line 248-257: The osmo dataset delete command on line 248 does not
validate its exit status before proceeding to check if the dataset was deleted.
Add an exit status check immediately after the osmo dataset delete
${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} command using the same pattern as
the osmo dataset info check below it, so that if the delete operation itself
fails, the script can detect and report this failure rather than letting it
proceed to the post-delete validation checks which may give misleading results.

In `@test/workflow/scripts/server.sh`:
- Around line 9-12: The server.sh script is missing the output artifact creation
step that is required by the workflow definition. After the netcat command (nc
-w 50 -l -p 24831), add a line that copies the /tmp/hello.txt file to the output
artifact directory using the syntax cat /tmp/hello.txt > {{output}}/hello.txt to
ensure the workflow produces the expected hello.txt output artifact.

---

Nitpick comments:
In `@test/oetf/runner_fixture.py`:
- Around line 602-606: In the print statement that logs the OETF workflow
submission details around line 605, change the logged args variable from
self._args to args. This will ensure the log includes the complete set of
arguments actually submitted, including any auto-injected platform and bucket
values, rather than just the user-specified arguments. This provides more
complete debugging information when investigating submission issues.

In `@test/workflow/scripts/app_cli.sh`:
- Line 21: Replace the indirect exit code checks using `$?` with direct command
checks for improved readability and idiomaticity. Instead of executing a command
followed by `if [ $? -ne 0 ]; then`, place the command directly in the
conditional statement using the negation operator `!` (e.g., `if ! osmo app
list; then`). Apply this change at all eight locations noted in the comment:
lines 21, 38, 46, 54, 62, 70, 80, 88, and 98, ensuring each command is executed
within its respective conditional statement.

In `@test/workflow/scripts/client.sh`:
- Line 31: In the arithmetic expansion expression for the retries variable
decrement operation, remove the unnecessary dollar sign prefix before the
variable name inside the parentheses. The variable reference inside the
arithmetic context $((...)) does not require the $ prefix, so simplify the
expression by removing it from the first occurrence of retries while keeping the
structure of the subtraction operation intact.
- Line 18: In the arithmetic expression for the retries variable assignment,
remove the unnecessary `$` prefix from the variable name inside the arithmetic
expansion. Change the expression from `retries=$(($retries - 1))` to
`retries=$((retries - 1))` to use the more idiomatic shellcheck-compliant syntax
where variable expansion with `$` is optional within arithmetic context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d3170a21-fe64-4892-9380-ab9fb5f54efd

📥 Commits

Reviewing files that changed from the base of the PR and between cc7e0bf and 3fce3f1.

📒 Files selected for processing (13)
  • test/oetf/runner_fixture.py
  • test/oetf/tests/test_runner_fixture.py
  • test/workflow/input/folder_input_0/test_file.txt
  • test/workflow/input/folder_input_1/test_file.txt
  • test/workflow/scripts/app_cli.sh
  • test/workflow/scripts/app_spec.yaml
  • test/workflow/scripts/client.sh
  • test/workflow/scripts/collection_query.txt
  • test/workflow/scripts/dataset_cli.sh
  • test/workflow/scripts/dataset_query.txt
  • test/workflow/scripts/echo_hostname.sh
  • test/workflow/scripts/server.sh
  • test/workflow/scripts/workflow_cli.sh

Comment on lines +26 to +30
elif echo "$output" | grep -q "created successfully"; then
echo "[App Create Done]"
else
echo "[App Create Failed] Failed to create app"
exit 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unreachable success case inside failure block.

Lines 26-28 check for "created successfully" inside the if ! block that only executes when the create command fails. A successful create would exit with code 0, skip this entire block, and continue to line 34. This branch is unreachable.

♻️ Proposed fix to remove unreachable branch
         echo "[App Update Done]"
-    elif echo "$output" | grep -q "created successfully"; then
-        echo "[App Create Done]"
     else
         echo "[App Create Failed] Failed to create app"
         exit 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif echo "$output" | grep -q "created successfully"; then
echo "[App Create Done]"
else
echo "[App Create Failed] Failed to create app"
exit 1
elif echo "$output" | grep -q "already exists"; then
echo "[App Update Done]"
else
echo "[App Create Failed] Failed to create app"
exit 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/app_cli.sh` around lines 26 - 30, The elif branch
checking for "created successfully" is unreachable because it exists inside an
if ! block that only executes when the command fails. Remove the unreachable
elif branch (the one checking for "created successfully" with grep) from inside
the failure block. Restructure the logic so that the success case is handled in
the main if block and the failure case is in the else block, ensuring that when
the create command exits with code 0 (success), it properly acknowledges the
successful creation before continuing.

Comment on lines +78 to +84
echo "[App Info Start] ------------------------------------------------"
osmo app info ${APP_NAME}
if [ $? -ne 0 ]; then
echo "[App Info Failed] Failed to get app info"
exit 1
fi
echo "[App Info Done]"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Inverted post-deletion check logic.

After deleting the app, osmo app info should fail because the app no longer exists. The current code treats this expected failure as an error. Lines 87-93 and 97-103 correctly expect failure for app show and app spec.

🐛 Proposed fix to expect failure after deletion
 echo "[App Info Start] ------------------------------------------------"
 osmo app info ${APP_NAME}
-if [ $? -ne 0 ]; then
-    echo "[App Info Failed] Failed to get app info"
-    exit 1
+if [ $? -eq 0 ]; then
+    echo "[Info] Deleted App Info is still available"
+    exit 1
 fi
 echo "[App Info Done]"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo "[App Info Start] ------------------------------------------------"
osmo app info ${APP_NAME}
if [ $? -ne 0 ]; then
echo "[App Info Failed] Failed to get app info"
exit 1
fi
echo "[App Info Done]"
echo "[App Info Start] ------------------------------------------------"
osmo app info ${APP_NAME}
if [ $? -eq 0 ]; then
echo "[Info] Deleted App Info is still available"
exit 1
fi
echo "[App Info Done]"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 79-79: Double quote to prevent globbing and word splitting.

(SC2086)


[style] 80-80: Check exit code directly with e.g. 'if ! mycmd;', not indirectly with $?.

(SC2181)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/app_cli.sh` around lines 78 - 84, The logic in the osmo
app info check is inverted. After the app is deleted, the osmo app info command
should fail because the app no longer exists. Currently the code treats this
expected failure as an error with if [ $? -ne 0 ]. Invert the condition to if [
$? -eq 0 ] so that success (the command finding the deleted app) is treated as
the error condition, matching the correct pattern used in the app show and app
spec checks that follow.

Comment on lines +1 to +9
..
Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

NVIDIA CORPORATION and its licensors retain all intellectual property
and proprietary rights in and to this software, related documentation
and any modifications thereto. Any use, reproduction, disclosure or
distribution of this software and related documentation without an express
license agreement from NVIDIA CORPORATION is strictly prohibited.
..

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix YAML syntax error in copyright header.

The ".." delimiters are not valid YAML syntax and cause a parsing error. YAML comments must use # prefix.

🐛 Proposed fix for copyright header syntax
-..
-  Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
-
-  NVIDIA CORPORATION and its licensors retain all intellectual property
-  and proprietary rights in and to this software, related documentation
-  and any modifications thereto. Any use, reproduction, disclosure or
-  distribution of this software and related documentation without an express
-  license agreement from NVIDIA CORPORATION is strictly prohibited.
-..
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# NVIDIA CORPORATION and its licensors retain all intellectual property
+# and proprietary rights in and to this software, related documentation
+# and any modifications thereto. Any use, reproduction, disclosure or
+# distribution of this software and related documentation without an express
+# license agreement from NVIDIA CORPORATION is strictly prohibited.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
..
Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
NVIDIA CORPORATION and its licensors retain all intellectual property
and proprietary rights in and to this software, related documentation
and any modifications thereto. Any use, reproduction, disclosure or
distribution of this software and related documentation without an express
license agreement from NVIDIA CORPORATION is strictly prohibited.
..
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/app_spec.yaml` around lines 1 - 9, The copyright header
in the app_spec.yaml file uses invalid YAML syntax with ".." delimiters instead
of proper YAML comments. Replace the ".." at the beginning (before "Copyright")
and the ".." at the end (after the closing line about strict prohibition) with
the "#" prefix to make them valid YAML comments. Each line in the copyright
header block should start with "#" followed by a space, consistent with proper
YAML comment formatting.

Comment on lines +35 to +36
cat tmp/tcp_echo.txt
sleep 1000

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing output artifact creation.

Cross-referencing with the workflow definition in parallel_workflow_v2.yaml:34-73, the client script should write the TCP echo output to the workflow output directory. The embedded script includes:

cat tmp/tcp_echo.txt > {{output}}/tcp_echo.txt

between lines 35 and 36.

Without this line, the workflow will not produce the expected tcp_echo.txt output artifact.

📦 Proposed fix to add missing output artifact
 cat tmp/tcp_echo.txt
+cat tmp/tcp_echo.txt > {{output}}/tcp_echo.txt
 sleep 1000
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cat tmp/tcp_echo.txt
sleep 1000
cat tmp/tcp_echo.txt
cat tmp/tcp_echo.txt > {{output}}/tcp_echo.txt
sleep 1000
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/client.sh` around lines 35 - 36, The client.sh script
is missing a line that writes the TCP echo output to the workflow output
directory. Between the existing `cat tmp/tcp_echo.txt` command and the `sleep
1000` command, add a new line that redirects the TCP echo file content to the
workflow output directory using the {{output}} placeholder syntax. This ensures
the workflow produces the expected tcp_echo.txt output artifact as required by
the parallel_workflow_v2.yaml configuration.

Comment on lines +248 to +257
osmo dataset delete ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} --force --all

echo "[Info] Get Deleted Dataset Info"
osmo dataset info ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}}
if [ $? -ne 0 ]; then
echo "[Info] Deleted Dataset doesn't exist"
else
fail=1
echo "[Info] Deleted Dataset still Exists"
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate dataset delete exit status before post-delete checks.

Line 248 runs a destructive delete but never checks its return code. This can mask a failed deletion and make later assertions ambiguous.

Suggested fix
 osmo dataset delete ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} --force --all
+if [ $? -ne 0 ]; then
+    echo "[Delete Failed] Failed to delete dataset"
+    fail=1
+fi
 
 echo "[Info] Get Deleted Dataset Info"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
osmo dataset delete ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} --force --all
echo "[Info] Get Deleted Dataset Info"
osmo dataset info ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}}
if [ $? -ne 0 ]; then
echo "[Info] Deleted Dataset doesn't exist"
else
fail=1
echo "[Info] Deleted Dataset still Exists"
fi
osmo dataset delete ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} --force --all
if [ $? -ne 0 ]; then
echo "[Delete Failed] Failed to delete dataset"
fail=1
fi
echo "[Info] Get Deleted Dataset Info"
osmo dataset info ${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}}
if [ $? -ne 0 ]; then
echo "[Info] Deleted Dataset doesn't exist"
else
fail=1
echo "[Info] Deleted Dataset still Exists"
fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 248-248: Double quote to prevent globbing and word splitting.

(SC2086)


[warning] 248-248: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 248-248: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 248-248: This } is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 248-248: This } is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[info] 251-251: Double quote to prevent globbing and word splitting.

(SC2086)


[warning] 251-251: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 251-251: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 251-251: This } is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 251-251: This } is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[style] 252-252: Check exit code directly with e.g. 'if ! mycmd;', not indirectly with $?.

(SC2181)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/dataset_cli.sh` around lines 248 - 257, The osmo
dataset delete command on line 248 does not validate its exit status before
proceeding to check if the dataset was deleted. Add an exit status check
immediately after the osmo dataset delete
${BUCKET_NAME}/${DATASET_NAME}_{{workflow_id}} command using the same pattern as
the osmo dataset info check below it, so that if the delete operation itself
fails, the script can detect and report this failure rather than letting it
proceed to the post-delete validation checks which may give misleading results.

Comment on lines +9 to +12
set -ex
# Open a TCP server in listening mode using port 24831
# Wait for 50 seconds before closing the connection
nc -w 50 -l -p 24831 < /tmp/hello.txt

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing output artifact creation.

Cross-referencing with the workflow definition in parallel_workflow_v2.yaml:12-33, the server script should also write the output artifact after the netcat command. The embedded script includes:

cat /tmp/hello.txt > {{output}}/hello.txt

Without this line, the workflow will not produce the expected hello.txt output artifact.

📦 Proposed fix to add missing output artifact
 nc -w 50 -l -p 24831 < /tmp/hello.txt
+cat /tmp/hello.txt > {{output}}/hello.txt
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
set -ex
# Open a TCP server in listening mode using port 24831
# Wait for 50 seconds before closing the connection
nc -w 50 -l -p 24831 < /tmp/hello.txt
set -ex
# Open a TCP server in listening mode using port 24831
# Wait for 50 seconds before closing the connection
nc -w 50 -l -p 24831 < /tmp/hello.txt
cat /tmp/hello.txt > {{output}}/hello.txt
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/workflow/scripts/server.sh` around lines 9 - 12, The server.sh script is
missing the output artifact creation step that is required by the workflow
definition. After the netcat command (nc -w 50 -l -p 24831), add a line that
copies the /tmp/hello.txt file to the output artifact directory using the syntax
cat /tmp/hello.txt > {{output}}/hello.txt to ensure the workflow produces the
expected hello.txt output artifact.

…porting

Three more migration-induced staging failures found by sweeping all 26
originally-failing scenarios:

1. **14 workflow yamls hardcoded ``platform: cpu-x86``** in their
   ``resources:`` blocks. The previous commit's WorkflowBuilder
   platform/bucket auto-injection only helps yamls that templated the
   field as ``{{platform}}`` — the hardcoded ones still tried to
   schedule on cpu-x86 and failed on isaac-nightly. Replace
   ``platform: cpu-x86`` → ``platform: {{platform}}`` only in the
   deeply-indented ``resources:`` block (>=4 spaces). Leaves the
   top-level ``default-values: platform: cpu-x86`` literal default
   untouched, so KIND deploys + tests that don't set
   OETF_DEFAULT_PLATFORM continue to default to cpu-x86 as before.

2. **CLI submit cwd was the bazel sandbox**, not the staged temp_dir
   where ``_copy_localpath_files_to_dir`` writes the rewritten spec +
   copied localpath files. The CLI dutifully resolved ``localpath:
   folder_input_0`` against the wrong dir and failed:

     Error message: The localpath input does not exist!

   Pass ``cwd=temp_dir`` to ``subprocess.run`` for the submit invocation
   so the rewritten basename-relative localpaths resolve correctly.

3. **CLI submit error reporting buried real errors under version
   warnings.** The osmo CLI emits a 3-line ``WARNING: New client X.Y.Z
   available...`` block to stderr on every invocation, the previous
   error path captured only ``stderr[:500]`` and stopped, so the actual
   error (``Error message: ...`` on stdout, or 4xx body on stderr)
   never reached the test log. ``_raise_cli_submission_error`` now
   strips the version-warning lines from stderr and includes both
   stdout and stderr (up to 1000 chars each) in the error body — every
   CLI failure now points at the real cause without manual log diving.

Verification (sweep of all 26 originally-failing scenarios on
staging.osmo.nvidia.com / isaac-nightly / OETF_DEFAULT_PLATFORM=
x86-5090):

  24/26 PASS (was 0/26 before the migration fixes)

The 2 remaining failures are NOT migration breakage:
  - ``app-cli`` task-level workflow failure (the workflow submits + runs
    on staging for 105s, then the user task exits 1 — pre-existing test/
    workflow logic issue unrelated to !6103).
  - ``folder-input`` requires the ``team-osmo`` Swift data credential
    registered on the OSMO instance for localpath uploads; Jenkins
    provides this via ``dataStorageCreds``, local sandbox doesn't.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/oetf/runner_fixture.py (1)

601-606: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Debug output shows original args, not the enriched args.

The debug print on line 636 uses self._args (original) instead of the derived args (with auto-injected platform/bucket). This could confuse developers when debugging: the stderr message might show args=[] while the workflow actually received platform=cpu.

🔧 Suggested fix
         print(
             f"OETF submitted: {workflow_id} ({workflow_url}) "
             f"[spec={os.path.basename(self._spec_path)} pool={self._pool} "
-            f"client={self._client} args={list(self._args)}]",
+            f"client={self._client} args={args}]",
             file=sys.stderr, flush=True,
         )

Also applies to: 636-636

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/oetf/runner_fixture.py` around lines 601 - 606, The debug print
statement that outputs the arguments is using the original self._args variable
instead of the enriched args variable that has been populated with auto-injected
platform and bucket values in the preceding lines. Locate the debug print
statement that references self._args on or near line 636 and change it to use
the local args variable instead, so that the debug output accurately reflects
the arguments that will actually be passed to the workflow, including any
platform or bucket values that were automatically appended.
🧹 Nitpick comments (1)
test/oetf/runner_fixture.py (1)

401-411: 💤 Low value

Minor edge case: incomplete warning block could suppress subsequent errors.

If the CLI emits a WARNING: New client line but never outputs the expected curl ... install.sh terminator (e.g., format change or truncated output), skip remains True and all subsequent stderr lines are lost. This could hide actual error messages.

Consider resetting skip after N lines or when encountering an obviously non-warning line:

♻️ Defensive fix (optional)
     cleaned_stderr_lines = []
     skip = False
+    skip_line_count = 0
     for line in raw_stderr.splitlines():
         if line.startswith("WARNING: New client"):
             skip = True
+            skip_line_count = 0
             continue
         if skip:
+            skip_line_count += 1
+            # Safety: reset after 10 lines to avoid losing real errors
+            if skip_line_count > 10:
+                skip = False
             if line.startswith("curl ") and "install.sh" in line:
                 skip = False
             continue
         cleaned_stderr_lines.append(line)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/oetf/runner_fixture.py` around lines 401 - 411, The stderr filtering
logic in the loop that processes raw_stderr has a vulnerability where the skip
flag can remain True indefinitely if the expected curl install.sh terminator
line is never encountered, causing all subsequent error messages to be
suppressed. Add defensive logic to reset the skip flag after a reasonable number
of lines have been skipped (such as a maximum line count threshold) or when an
obviously non-warning line is encountered, ensuring that temporary skip states
do not permanently suppress stderr output from being captured.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/oetf/runner_fixture.py`:
- Around line 601-606: The debug print statement that outputs the arguments is
using the original self._args variable instead of the enriched args variable
that has been populated with auto-injected platform and bucket values in the
preceding lines. Locate the debug print statement that references self._args on
or near line 636 and change it to use the local args variable instead, so that
the debug output accurately reflects the arguments that will actually be passed
to the workflow, including any platform or bucket values that were automatically
appended.

---

Nitpick comments:
In `@test/oetf/runner_fixture.py`:
- Around line 401-411: The stderr filtering logic in the loop that processes
raw_stderr has a vulnerability where the skip flag can remain True indefinitely
if the expected curl install.sh terminator line is never encountered, causing
all subsequent error messages to be suppressed. Add defensive logic to reset the
skip flag after a reasonable number of lines have been skipped (such as a
maximum line count threshold) or when an obviously non-warning line is
encountered, ensuring that temporary skip states do not permanently suppress
stderr output from being captured.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12674653-3fa5-4785-bcee-86659824a8ab

📥 Commits

Reviewing files that changed from the base of the PR and between 3fce3f1 and 15f8d29.

📒 Files selected for processing (15)
  • test/oetf/runner_fixture.py
  • test/workflow/bad_command.yaml
  • test/workflow/bad_image.yaml
  • test/workflow/empty_command.yaml
  • test/workflow/exec_timeout.yaml
  • test/workflow/host_mount.yaml
  • test/workflow/invalid_mount.yaml
  • test/workflow/invalid_src_dest_mount.yaml
  • test/workflow/multi_arch_containers.yaml
  • test/workflow/osmo_client_test.yaml
  • test/workflow/per_group_exec_timeout.yaml
  • test/workflow/regex_workflow.yaml
  • test/workflow/resource_type.yaml
  • test/workflow/serial_workflow_multi_arch.yaml
  • test/workflow/serial_workflow_update_dataset.yaml
✅ Files skipped from review due to trivial changes (6)
  • test/workflow/exec_timeout.yaml
  • test/workflow/multi_arch_containers.yaml
  • test/workflow/serial_workflow_update_dataset.yaml
  • test/workflow/per_group_exec_timeout.yaml
  • test/workflow/bad_image.yaml
  • test/workflow/bad_command.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant