
Conversation

@sbryngelson
Member

@sbryngelson sbryngelson commented Dec 5, 2025

User description

PR Type

Enhancement, Tests


Description

  • Add comprehensive error handling and validation to benchmark execution

    • Early file existence checks
    • Return code validation after case execution
    • Summary file and target data validation
    • Detailed error reporting with failed case tracking
  • Implement per-test timeout mechanism (1 hour) with signal handling

  • Add early abort logic when failure rate exceeds 30% threshold

  • Improve job monitoring in CI workflows with real-time output streaming

  • Reduce benchmark workflow timeout from 1400 to 480 minutes
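
The per-case hardening listed above follows a fail-fast-then-aggregate pattern: check inputs before running, check the exit code, check the outputs, and collect failures rather than crashing mid-run. A minimal, self-contained sketch of that pattern (illustrative only, not the actual toolchain/mfc/bench.py code; run_cases, REQUIRED_FIELDS, and the case dictionary keys are made-up names):

import os, subprocess
import yaml  # assumed available; adjust to however summaries are actually parsed

REQUIRED_FIELDS = ("exec", "grind")  # stand-ins for the exec and simulation grind targets

def run_cases(cases):
    failed = []
    for case in cases:
        # Early file existence check: fail fast instead of deep inside execution
        if not os.path.exists(case["path"]):
            failed.append((case["slug"], "case file not found"))
            continue
        # Return code validation after running the case
        rc = subprocess.run(case["cmd"], check=False).returncode
        if rc != 0:
            failed.append((case["slug"], f"exit code {rc}"))
            continue
        # Summary file and target data validation
        if not os.path.exists(case["summary_path"]):
            failed.append((case["slug"], "summary file not created"))
            continue
        with open(case["summary_path"], encoding="utf-8") as f:
            data = yaml.safe_load(f) or {}
        missing = [k for k in REQUIRED_FIELDS if k not in data]
        if missing:
            failed.append((case["slug"], f"missing target data: {missing}"))
    # Detailed error reporting with failed case tracking
    if failed:
        for slug, reason in failed:
            print(f"FAILED {slug}: {reason}")
        raise RuntimeError(f"{len(failed)} benchmark case(s) failed")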


Diagram Walkthrough

flowchart LR
  A["Benchmark Execution"] --> B["Early File Validation"]
  A --> C["Run Case with Error Handling"]
  C --> D["Check Return Code"]
  D --> E["Validate Summary File"]
  E --> F["Validate Target Data"]
  F --> G["Add to Results"]
  D -->|Failure| H["Track Failed Cases"]
  E -->|Failure| H
  F -->|Failure| H
  H --> I["Report Failed Cases"]
  I --> J["Raise Exception"]
  
  K["Test Execution"] --> L["Set Timeout Alarm"]
  L --> M["Run Test Case"]
  M --> N["Cancel Alarm"]
  M -->|Timeout| O["Raise TestTimeoutError"]
  
  P["Test Loop"] --> Q["Check Failure Rate"]
  Q -->|Rate >= 30%| R["Abort Early"]

File Walkthrough

Relevant files
Error handling
bench.py
Add comprehensive error handling and validation to benchmarks

toolchain/mfc/bench.py

  • Added early validation to check if benchmark case files exist before
    execution
  • Wrapped case execution in try-except block with comprehensive error
    handling
  • Added return code validation after running each benchmark case
  • Added validation for summary file existence and required target data
    fields
  • Implemented failed case tracking and reporting with detailed error
    messages
  • Added success indicator for completed cases
  • Changed exception handling in diff function to log errors instead of
    silently passing
+85/-16 
test.py
Add timeout mechanism and early abort on high failure rate

toolchain/mfc/test/test.py

  • Added signal import for timeout handling
  • Introduced TEST_TIMEOUT_SECONDS constant (3600 seconds/1 hour) and
    timeout thresholds
  • Created TestTimeoutError exception class for timeout scenarios
  • Implemented timeout_handler function using SIGALRM signal
  • Wrapped entire test case execution in try-except-finally block with
    timeout alarm
  • Added early abort logic that triggers when failure rate exceeds 30%
    after 20 completed tests
  • Added detailed error messages for timeout scenarios with log file
    references
+97/-57 
Configuration changes
bench.yml
Improve job monitoring with real-time output and error handling

.github/workflows/bench.yml

  • Updated job name to include interface parameter in display name
  • Reduced timeout-minutes from 1400 to 480 minutes
  • Replaced simple background job execution with comprehensive
    submit_and_monitor function
  • Added job ID extraction and output file monitoring with tail streaming
  • Implemented job status checking via squeue and exit code validation
  • Added error handling for job submission failures and non-zero exit
    codes
  • Improved real-time output visibility during benchmark execution
+79/-5   
test.yml
Update job naming and reduce workflow timeout                       

.github/workflows/test.yml

  • Updated self-hosted job name to display cluster and device information
  • Reduced timeout-minutes from 1400 to 480 minutes
+2/-2     


CodeAnt-AI Description

Add per-test 1-hour timeout, fail-fast on high failure rates, stricter benchmark validation, and improved CI bench monitoring

What Changed

  • Tests now abort and report a clear error (with path to the test log) if an individual test hangs for more than 1 hour.
  • Test runner will stop the whole test run early when at least 20 tests have completed and >=30% have failed, failing fast to surface systemic issues.
  • Benchmark runner validates each case file exists up front, captures per-case exit codes, verifies summary files include required metrics (exec and simulation grind), and accumulates and reports failed cases instead of crashing mid-run.
  • CI benchmark job submission now monitors and streams cluster job output, checks job submission and exit status, and fails the workflow when bench jobs or summaries are missing.
  • Workflow job names now include the device interface in the label and overall job timeouts were reduced from 1400 to 480 minutes to limit excessively long CI runs.
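
The fail-fast rule in the second bullet reduces to a ratio check once a minimum number of cases has finished. A rough sketch of that check, using the constant names mentioned later in this thread (should_abort itself is an illustrative helper, not the PR's code):

MIN_CASES_BEFORE_ABORT = 20     # do not judge the run on too few samples
FAILURE_RATE_THRESHOLD = 0.30   # abort once 30% or more of completed cases have failed

def should_abort(completed: int, failed: int) -> bool:
    if completed < MIN_CASES_BEFORE_ABORT:
        return False
    return (failed / completed) >= FAILURE_RATE_THRESHOLD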

Impact

✅ Fewer hung tests
✅ Shorter CI runs on systemic failures
✅ Clearer benchmark failure logs

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • New Features

    • Real-time SLURM job streaming, automated submit-and-monitor, and parallel benchmark orchestration
    • Dynamic, descriptive CI job names including cluster/device/interface
  • Bug Fixes

    • Improved error detection, tracebacks, clearer failure summaries, and per-case result validation
    • Enforced test timeouts, multi-rank diagnostics, early-abort on high failure rates
  • Chores

    • Reduced CI job timeouts and explicit Phoenix/Frontier matrix variants
    • Updated ignore patterns
  • Behavior Changes

    • Job submissions no longer block waiting for SLURM; one job time limit increased



Note

Adds robust benchmark validation, per-test 1h timeouts with fail-fast, and real-time SLURM job monitoring in CI with shorter timeouts.

  • Benchmarks (toolchain/mfc/bench.py):
    • Validate case files upfront; capture return codes; ensure summary file/fields (exec, simulation grind) exist.
    • Track/report failed cases; improved error messages; safer diff error handling.
  • Tests (toolchain/mfc/test/test.py):
    • Per-test 1h timeout using thread-safe timer; clearer timeout errors.
    • Fail-fast: abort run when ≥20 cases and ≥30% failures.
    • Extra diagnostics for multi-rank post-process; validate Silo/HDF5 outputs.
  • CI Workflows:
    • Bench (.github/workflows/bench.yml): dynamic job names incl. interface; submit-and-monitor both runs; real-time output via new script; strict exit-code handling; artifacts on Frontier.
    • Test (.github/workflows/test.yml): clearer self-hosted job names; expanded matrix; reduced timeout-minutes to 480.
    • New script: .github/scripts/monitor_slurm_job.sh to stream SLURM output, detect completion, and verify exit status.

Written by Cursor Bugbot for commit b30fea3. Configure here.

Copilot AI review requested due to automatic review settings December 5, 2025 16:32
@codeant-ai

codeant-ai bot commented Dec 5, 2025

CodeAnt AI is reviewing your PR.



@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Workflows reduced self-hosted timeouts and made job names include interface; bench step now runs a parallel orchestration script. Added SLURM submit/monitor orchestration scripts. Strengthened bench/test runners with timeouts, early‑abort, traceback logging, and per-case validation. Minor SBATCH header and .gitignore updates.

Changes

Cohort / File(s) Summary
Workflow updates
/.github/workflows/bench.yml, /.github/workflows/test.yml
Reduced self-hosted timeout (1400→480 min); job names now include interface conditionally; bench.yml replaced dual submit calls with a single parallel runner; test.yml replaced generic matrix with explicit Phoenix/Frontier include entries (cluster_name, device, interface, lbl).
SLURM orchestration scripts (new)
/.github/scripts/run_parallel_benchmarks.sh, /.github/scripts/submit_and_monitor_bench.sh, /.github/scripts/monitor_slurm_job.sh
New scripts to submit and monitor SLURM jobs, stream live output with squeue→sacct fallback, determine exit codes, and run PR vs master benchmarks in parallel; verify YAML outputs and propagate failures.
Benchmark runner: stronger error handling
toolchain/mfc/bench.py
Added traceback logging, per-case try/except/finally, case-file and summary validation, failed_cases aggregation and exception on failures, and explicit error handling around diff/speedup logic; minor CLI/string fixes.
Test runner: timeouts & early-abort
toolchain/mfc/test/test.py
Added abort_tests event and constants (MIN_CASES_BEFORE_ABORT, FAILURE_RATE_THRESHOLD, TEST_TIMEOUT_SECONDS); timeout-wrapped PRE/SIM/POST steps, per-silo post-processing (_process_silo_file) and multi-rank debug helper; failure-rate early-abort and detailed per-case logs.
Submit header tweaks
/.github/workflows/phoenix/submit-bench.sh, /.github/workflows/frontier/submit-bench.sh
Phoenix SBATCH walltime increased (03:00→04:00); removed -W wait-for-termination option from SBATCH headers.
Repository ignore patterns
.gitignore
Replaced two entries with new glob patterns: added **isolation_rules/ and **.supercode/, removed older specific entries.
Bootstrap pip logic
toolchain/bootstrap/python.sh
Reworked pip bootstrap: prefer importing pip, then ensurepip, falling back to get-pip.py; update PATH after get-pip.py install.

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Runner as run_parallel_benchmarks.sh
    participant Submit as submit_and_monitor_bench.sh
    participant Monitor as monitor_slurm_job.sh
    participant SLURM as SLURM (sbatch / squeue / sacct / scontrol)
    participant FS as Filesystem (bench outputs / YAMLs)

    GH->>Runner: start parallel bench (device, interface, cluster)
    Runner->>Submit: launch PR bench (pr dir, device, interface, cluster) & background PID
    Runner->>Submit: launch Master bench (master dir, device, interface, cluster) & background PID
    Submit->>SLURM: sbatch bench job
    SLURM-->>Submit: job_id
    Submit->>Monitor: monitor_slurm_job.sh job_id output_file
    Monitor->>FS: wait for output file (retry/backoff)
    FS-->>Monitor: file appears
    Monitor->>FS: tail -f while job present in squeue
    loop job running
      Monitor->>SLURM: squeue check
      SLURM-->>Monitor: running
      Monitor->>FS: stream appended lines
    end
    alt job no longer in squeue
      Monitor->>SLURM: sacct / scontrol fallback -> terminal state & exitcode
    end
    Monitor->>FS: wait for output file growth to stabilize
    Monitor-->>Submit: return exitcode
    Submit-->>Runner: report exitcode and YAML path
    Runner->>FS: verify PR/master YAMLs exist
    Runner-->>GH: final status (non-zero if any failure or missing YAML)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect thread-safety and abort coordination in toolchain/mfc/test/test.py.
  • Validate summary parsing, failed_cases aggregation, and exception raising in toolchain/mfc/bench.py.
  • Review SLURM monitoring fallbacks, retry/backoff, file-stabilization, and exit‑code parsing in .github/scripts/monitor_slurm_job.sh.
  • Verify submit/parallel orchestration and YAML checks in .github/scripts/submit_and_monitor_bench.sh and .github/scripts/run_parallel_benchmarks.sh.
  • Confirm workflow matrix includes, conditional interface naming, and reduced timeouts in workflow files.

Suggested labels

Review effort 4/5, size:XXL

Poem

🐇 I hop through logs where SLURM carrots hide,
I tail the streams and keep both benches side-by-side.
Timeouts and tracebacks, I sniff and then report,
YAMLs gathered neatly in my little fort.
CI hums—two runs, one rabbit, one report.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 23.08%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check: ❓ Inconclusive. The title 'benchmarks hardening!' is vague and generic, using informal language that doesn't clearly describe the specific changes (error handling, timeouts, CI monitoring). Replace with a more specific title like 'Add benchmark validation, test timeouts, and SLURM job monitoring' to clearly convey the main improvements.
✅ Passed checks (1 passed)
  • Description check: ✅ Passed. The PR description is comprehensive and well-structured, covering error handling, timeouts, failure rate thresholds, and CI improvements with clear diagrams and file-by-file walkthroughs.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5cb016 and 66176e7.

📒 Files selected for processing (1)
  • toolchain/bootstrap/python.sh (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/bootstrap/python.sh (1)
toolchain/util.sh (4)
  • warn (14-14)
  • log (11-11)
  • ok (13-13)
  • error (15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (cpu)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (gpu-omp)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-omp)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Build & Publish
🔇 Additional comments (1)
toolchain/bootstrap/python.sh (1)

43-77: Solid multi-step pip bootstrap with graceful fallback.

The cascading approach (check availability → ensurepip → get-pip.py) is a reasonable hardening strategy. The logic correctly:

  • Attempts installation only when needed (no venv and no pip)
  • Prefers ensurepip (standard library) over external download
  • Falls back to get-pip.py only if necessary
  • Updates PATH appropriately for user-site installations

The conditional PATH update (lines 72–75) is correctly scoped to only the get-pip.py path, since ensurepip installs to the system/venv location already in PATH. The existence check on line 73 prevents silent failures if the user-base directory doesn't exist.

The error handling with exit 1 statements and messaging via ok() and error() calls aligns with the codebase conventions. The suppression of ensurepip stderr (line 50) enables graceful fallback without cluttering logs.

Both python3 -m ensurepip --upgrade and python3 -m site --user-base are fully supported in Python 3.11+. The fallback to get-pip.py appropriately handles rare cases where ensurepip is disabled by Linux distributions.



@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Early file existence check only validates the case path but not writability/creation of the summary and log paths, which could still fail later and be reported as generic errors; consider validating bench output directory and file paths up front.

# Validate case file exists early
if not os.path.exists(case.path):
    raise MFCException(f"Benchmark case file not found: {case.path}")
Timeout Handling

Using SIGALRM for per-test timeout may not work on non-Unix or in threads; also ensure long-running subprocesses are terminated on timeout to avoid orphaned jobs.

TEST_TIMEOUT_SECONDS = 3600

class TestTimeoutError(MFCException):
    pass

def timeout_handler(signum, frame):
    raise TestTimeoutError("Test case exceeded 1 hour timeout")
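
One thread-safe alternative to SIGALRM, hinted at here, is to enforce the limit at the subprocess level so the child is also terminated on expiry. A hedged sketch (not the PR's code; cmd and log_path are placeholders):

import subprocess

TEST_TIMEOUT_SECONDS = 3600  # same 1-hour limit as above

def run_with_timeout(cmd, log_path):
    try:
        # subprocess.run() kills the direct child and raises TimeoutExpired when
        # the timeout elapses, and it works from any thread.
        return subprocess.run(cmd, timeout=TEST_TIMEOUT_SECONDS, check=False).returncode
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"Test exceeded {TEST_TIMEOUT_SECONDS}s; see log at {log_path}") from exc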
Monitoring Robustness

Job monitoring relies on parsing squeue/scontrol output and tailing a single file; failures before file creation or differing SLURM formats could cause false negatives—consider stronger checks and timeouts.

- name: Bench (Master v. PR)
  run: |
    set -e

    # Function to submit and monitor
    submit_and_monitor() {
      local dir=$1
      local device=$2
      local interface=$3
      local cluster=$4

      cd "$dir"

      # Submit job
      submit_output=$(bash .github/workflows/$cluster/submit-bench.sh \
        .github/workflows/$cluster/bench.sh $device $interface 2>&1)

      job_id=$(echo "$submit_output" | grep -oP 'Submitted batch job \K[0-9]+' || echo "")
      job_slug="bench-$device-$interface"
      output_file="${job_slug}.out"

      if [ -z "$job_id" ]; then
        echo "ERROR: Failed to submit job"
        echo "$submit_output"
        return 1
      fi

      echo "Submitted batch job $job_id"
      echo "Monitoring output file: $output_file"

      # Wait for file to appear (check job status if it takes a while)
      echo "Waiting for job to start..."
      while [ ! -f "$output_file" ]; do
        # Check if job failed to start
        if ! squeue -j "$job_id" &>/dev/null && [ ! -f "$output_file" ]; then
          echo "ERROR: Job $job_id finished without creating output file"
          return 1
        fi
        sleep 5
      done

      echo "=== Streaming output for job $job_id ==="
      # Stream output while job runs
      tail -f "$output_file" &
      tail_pid=$!

      # Wait for job to complete (will wait up to GitHub Actions timeout)
      while squeue -j "$job_id" &>/dev/null; do
        sleep 5
      done

      # Stop tailing
      kill $tail_pid 2>/dev/null || true

      echo ""
      echo "=== Final output ==="
      cat "$output_file"

      # Check exit status
      exit_code=$(scontrol show job "$job_id" 2>/dev/null | grep -oP 'ExitCode=\K[0-9]+' || echo "0:0")
      if [ "$exit_code" != "0" ] && [ "$exit_code" != "0:0" ]; then
        echo "ERROR: Job $job_id failed with exit code $exit_code"
        return 1
      fi
    }

    # Run both jobs with monitoring
    (submit_and_monitor pr ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}) &
    pr_pid=$!

    (submit_and_monitor master ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}) &
    master_pid=$!

    wait $pr_pid && pr_exit=$? || pr_exit=$?
    wait $master_pid && master_exit=$? || master_exit=$?

    if [ $pr_exit -ne 0 ] || [ $master_exit -ne 0 ]; then
      exit 1
    fi

@codeant-ai codeant-ai bot added the size:L This PR changes 100-499 lines, ignoring generated files label Dec 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces "benchmarking hardening" features to improve the robustness and reliability of MFC's testing and benchmarking infrastructure. The changes add timeout mechanisms, early failure detection, enhanced error handling, and better job monitoring for SLURM-based workflows.

Key changes:

  • Added per-test timeout (1 hour) and early abort on high failure rates (30% after 20 tests) to the test suite
  • Enhanced benchmark validation with comprehensive error checking and failed case tracking
  • Improved GitHub Actions workflows with better job monitoring and reduced timeout (from 1400 to 480 minutes)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
toolchain/mfc/test/test.py Added signal-based timeout handler (1 hour per test) and early abort logic when failure rate exceeds 30% after 20 tests
toolchain/mfc/bench.py Added comprehensive validation (file existence, return codes, summary validation) and improved error reporting with failed case tracking
.github/workflows/test.yml Updated job name formatting to include interface and reduced timeout from 1400 to 480 minutes
.github/workflows/bench.yml Reduced timeout to 480 minutes and added detailed SLURM job monitoring with output streaming and exit code validation

@codeant-ai
Copy link

codeant-ai bot commented Dec 5, 2025

CodeAnt AI finished reviewing your PR.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
.github/workflows/bench.yml (2)

105-164: Consider extracting submit_and_monitor to a standalone script.

This ~60-line function inline in YAML is hard to test and maintain. Extracting it to .github/scripts/monitor_slurm_job.sh would improve reusability and testability. This was also noted in a previous review.


146-149: Transient squeue failures could cause premature exit.

If squeue fails transiently (network blip, scheduler hiccup), the loop exits immediately even though the job may still be running. Consider adding a brief grace period or retry.

             # Wait for job to complete (will wait up to GitHub Actions timeout)
-            while squeue -j "$job_id" &>/dev/null; do
+            squeue_failures=0
+            while true; do
+              if squeue -j "$job_id" &>/dev/null; then
+                squeue_failures=0
+              else
+                squeue_failures=$((squeue_failures + 1))
+                # Allow a few transient failures before concluding job is done
+                if [ $squeue_failures -ge 3 ]; then
+                  break
+                fi
+              fi
               sleep 5
             done
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)

337-339: Remove extraneous f prefix from strings without placeholders.

-                cons.print(f"[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
-                cons.print(f"[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
+                cons.print("[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
+                cons.print("[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
toolchain/mfc/bench.py (1)

112-128: cons.unindent() may be skipped if validation fails without explicit continue.

If an exception is raised inside the try block (line 124-126), cons.unindent() at line 128 is still called. However, if validation_failed is True (line 112), you call cons.unindent() at 114 then continue, skipping line 128—which is correct. The logic is sound but fragile.

Consider using a finally block to guarantee unindent:

         try:
+            # ... existing try block code ...
+        except Exception as e:
+            cons.print(f"[bold red]ERROR[/bold red]: Unexpected error running {case.slug}: {e}")
+            failed_cases.append(case.slug)
+        finally:
             cons.unindent()
-
-        cons.unindent()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c380cd and 8c01b99.

📒 Files selected for processing (4)
  • .github/workflows/bench.yml (3 hunks)
  • .github/workflows/test.yml (1 hunks)
  • toolchain/mfc/bench.py (5 hunks)
  • toolchain/mfc/test/test.py (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Add and update focused tests for changes; ensure no regressions in golden test outputs without clear justification

Applied to files:

  • toolchain/mfc/bench.py
🧬 Code graph analysis (2)
toolchain/mfc/bench.py (3)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • system (34-41)
  • file_load_yaml (65-70)
toolchain/mfc/state.py (1)
  • ARG (85-93)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
toolchain/mfc/test/test.py (6)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • file_write (44-54)
  • find (165-174)
toolchain/mfc/test/case.py (1)
  • compute_tolerance (248-276)
toolchain/mfc/state.py (2)
  • ARG (85-93)
  • items (33-34)
toolchain/mfc/packer/pack.py (3)
  • has_NaNs (90-96)
  • save (46-88)
  • find (34-35)
toolchain/mfc/packer/tol.py (1)
  • compare (50-97)
toolchain/mfc/test/cases.py (1)
  • modify_example_case (1035-1052)
🪛 Ruff (0.14.7)
toolchain/mfc/bench.py

42-42: Avoid specifying long messages outside the exception class

(TRY003)


124-124: Do not catch blind exception: Exception

(BLE001)


137-137: Avoid specifying long messages outside the exception class

(TRY003)


211-211: Do not catch blind exception: Exception

(BLE001)

toolchain/mfc/test/test.py

36-36: Unused function argument: signum

(ARG001)


36-36: Unused function argument: frame

(ARG001)


37-37: Avoid specifying long messages outside the exception class

(TRY003)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


228-228: Avoid specifying long messages outside the exception class

(TRY003)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Avoid specifying long messages outside the exception class

(TRY003)


256-256: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


283-283: Avoid specifying long messages outside the exception class

(TRY003)


295-299: Avoid specifying long messages outside the exception class

(TRY003)


338-338: f-string without any placeholders

Remove extraneous f prefix

(F541)


339-339: f-string without any placeholders

Remove extraneous f prefix

(F541)


340-342: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Build & Publish
🔇 Additional comments (12)
.github/workflows/test.yml (3)

96-96: Dynamic job naming is clear and informative.

The conditional formatting nicely handles the interface suffix, making job names like "Georgia Tech | Phoenix (gpu-acc)" or "Oak Ridge | Frontier (cpu)" more descriptive for debugging CI failures.


100-100: Significant timeout reduction from 1400 to 480 minutes.

This is a substantial reduction (from ~23 hours to 8 hours). Verify this is sufficient for the longest-running test configurations, particularly on Frontier with GPU tests.


101-115: Matrix expansion and runner configuration look correct.

The new matrix properly covers CPU and GPU variants across both clusters. The runs-on configuration correctly uses the phoenix group with cluster-specific labels.

.github/workflows/bench.yml (2)

26-26: Job naming and timeout updates align with test.yml.

Consistent naming pattern and timeout reduction across both workflows improves maintainability.

Also applies to: 76-76


173-174: Exit status capture pattern has subtle issues.

The pattern wait $pid && var=$? || var=$? always sets var to 0 when the process succeeds because && short-circuits. However, since you check both values at the end, this works correctly for the failure case.

toolchain/mfc/test/test.py (2)

294-301: Good use of try/except/finally for alarm cleanup.

The finally block ensures the alarm is always cancelled, preventing spurious timeouts on subsequent tests. The exception chaining with from exc preserves the original traceback.


26-37: Timeout constants and exception class are well-defined.

The 1-hour timeout and 30% failure threshold are reasonable defaults. Consider making these configurable via command-line arguments for flexibility in different environments.
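
If that suggestion is picked up, the wiring is small. A hypothetical argparse sketch (the flag names below are invented for illustration, not existing MFC options):

import argparse

parser = argparse.ArgumentParser()
# Defaults mirror the hard-coded values discussed above
parser.add_argument("--test-timeout", type=int, default=3600,
                    help="per-test timeout in seconds")
parser.add_argument("--abort-threshold", type=float, default=0.30,
                    help="failure rate that triggers early abort")
args = parser.parse_args()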

toolchain/mfc/bench.py (5)

40-42: Good early validation of benchmark case files.

Failing fast with a clear error message is better than failing deep in the execution with a cryptic error.


64-128: Robust error handling with comprehensive validation.

The try/except block properly handles:

  • Non-zero return codes
  • Missing summary files
  • Missing target data
  • Unexpected exceptions

The structure ensures failed_cases is always updated on failure. However, there's a potential issue with cons.unindent() placement.


124-126: Broad exception catch is acceptable here for robustness.

In benchmark orchestration code, catching all exceptions ensures one failing case doesn't abort the entire benchmark run. The error is logged with context, and the case is tracked as failed.


131-137: Clear failure summary with actionable output.

Listing all failed cases at the end and raising with the count helps quickly identify systemic issues. Good practice for CI diagnostics.


211-213: Improved error logging in diff() replaces silent pass.

This is a significant improvement—previously, speedup computation failures were silently ignored. Now they're logged with context and properly set the error flag.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
toolchain/mfc/test/test.py (3)

202-204: SIGALRM in worker threads is still a critical issue.

This has been flagged in previous reviews but appears unaddressed. The test scheduler uses WorkerThread (see sched.py), and signal.signal()/signal.alarm() only work in the main thread—calling from worker threads raises ValueError: signal only works in main thread.

Consider one of these approaches:

  1. Guard with a thread check and skip timeout in worker threads
  2. Use threading.Timer with a flag-based approach
  3. Move timeout handling to the subprocess level with subprocess.run(timeout=...)
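
A minimal sketch of option 2, the threading.Timer flag-based approach (illustrative only; run_with_soft_timeout is a made-up helper, not code from this PR):

import threading

def run_with_soft_timeout(work, timeout_s=3600):
    timed_out = threading.Event()
    # Timer threads can be started from worker threads, unlike signal.alarm()
    timer = threading.Timer(timeout_s, timed_out.set)
    timer.start()
    try:
        result = work()   # note: this cannot interrupt a call that blocks forever
    finally:
        timer.cancel()    # always cancel so the flag cannot fire after we return
    if timed_out.is_set():
        raise TimeoutError(f"work exceeded {timeout_s}s")
    return result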

264-266: Directory existence check still missing.

This was flagged in previous reviews. os.listdir() will raise FileNotFoundError if the silo_hdf5/p0 directory doesn't exist (which may happen for test cases that don't produce silo outputs).

+            silo_dir = os.path.join(case.get_dirpath(), 'silo_hdf5', 'p0')
+            if not os.path.isdir(silo_dir):
+                # No silo outputs for this case; skip post-processing check
+                pass
+            else:
-            for silo_filepath in os.listdir(os.path.join(case.get_dirpath(), 'silo_hdf5', 'p0')):
+                for silo_filepath in os.listdir(silo_dir):

342-344: Exception propagation from worker task may break scheduler.

As noted in previous reviews, raising MFCException here propagates through sched.sched, potentially leaving worker threads in an undefined state. Consider using a shared abort flag that the scheduler checks between tasks for a cleaner shutdown.
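
A small sketch of that shared-flag idea, reusing the abort_tests event name that later commits introduce (handle_case and the helpers here are simplified stand-ins, not the scheduler's real API):

import threading

abort_tests = threading.Event()

def handle_case(case, run_one, failure_rate_too_high):
    # Skip quietly if another worker already flagged a systemic failure
    if abort_tests.is_set():
        return None
    ok = run_one(case)
    if not ok and failure_rate_too_high():
        abort_tests.set()  # flag it; the main thread reports the aggregate failure after the scheduler drains
    return ok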

🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)

340-341: Remove unnecessary f-string prefixes.

These strings have no placeholders, so the f prefix is extraneous.

-                    cons.print(f"[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
-                    cons.print(f"[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
+                    cons.print("[bold red]This suggests a systemic issue (bad build, broken environment, etc.)[/bold red]")
+                    cons.print("[bold red]Aborting remaining tests to fail fast.[/bold red]\n")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c01b99 and 2e1eb87.

📒 Files selected for processing (1)
  • toolchain/mfc/test/test.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
toolchain/mfc/test/test.py (2)
toolchain/mfc/common.py (1)
  • MFCException (31-32)
toolchain/mfc/state.py (1)
  • ARG (85-93)
🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

36-36: Unused function argument: signum

(ARG001)


36-36: Unused function argument: frame

(ARG001)


37-37: Avoid specifying long messages outside the exception class

(TRY003)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


228-228: Avoid specifying long messages outside the exception class

(TRY003)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Avoid specifying long messages outside the exception class

(TRY003)


256-256: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


283-283: Avoid specifying long messages outside the exception class

(TRY003)


295-299: Avoid specifying long messages outside the exception class

(TRY003)


340-340: f-string without any placeholders

Remove extraneous f prefix

(F541)


341-341: f-string without any placeholders

Remove extraneous f prefix

(F541)


342-344: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
toolchain/mfc/test/test.py (2)

26-38: LGTM on timeout and abort constants.

The timeout infrastructure is well-structured. The TestTimeoutError properly extends MFCException, and the constants are reasonable (1-hour timeout, 30% failure threshold after 20 tests).

Note: The unused signum and frame parameters in timeout_handler are required by the signal handler signature, so the Ruff warnings can be safely ignored.


294-301: Good timeout exception handling with cleanup.

The finally block ensures the alarm is always cancelled, preventing lingering signals. The re-raised exception provides helpful context including the log file path.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="toolchain/mfc/test/test.py">

<violation number="1" location="toolchain/mfc/test/test.py:203">
P0: `signal.signal()` and `signal.alarm()` cannot be called from worker threads - they only work in the main thread. Since `_handle_case` is executed via `sched.sched()` which uses `WorkerThread`, this will raise `ValueError: signal only works in main thread`. Consider using `threading.Timer` or a timeout wrapper that works across threads instead.</violation>
</file>

<file name=".github/workflows/bench.yml">

<violation number="1" location=".github/workflows/bench.yml:159">
P2: Slurm reports exit codes as `X:Y`, but this regex only captures the first component. If a job finishes with `0:9` (success but terminated by signal), `$exit_code` becomes `0` and the failure check is bypassed, so signaled jobs are treated as success. Capture the full `X:Y` pair (or parse both fields) and ensure any non-`0:0` result fails the workflow.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (all 2 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="toolchain/mfc/bench.py">

<violation number="1" location="toolchain/mfc/bench.py:127">
P2: The `finally` block will cause double `cons.unindent()` calls. Lines 81 and 89 still have explicit `cons.unindent()` before `continue`, but since `finally` always executes (even on `continue`), the unindent will be called twice in those error paths. Remove the `cons.unindent()` calls at lines 81 and 89, similar to how it was removed for the `validation_failed` case.</violation>
</file>

<file name=".github/scripts/monitor_slurm_job.sh">

<violation number="1" location=".github/scripts/monitor_slurm_job.sh:57">
P2: Exit code defaults to `0:0` (success) if `scontrol` or `grep` fails. This could mask real job failures when job information is unavailable. Consider failing explicitly if exit code cannot be determined.</violation>
</file>

Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
toolchain/mfc/bench.py (2)

58-83: Fix double cons.unindent() on error paths to avoid printer stack underflow

In the non-zero return-code and missing-summary branches you call cons.unindent() and then continue, but cons.unindent() is also executed in the finally block. This will unindent twice for failing cases and can corrupt the printer’s indentation stack or raise underflow errors.

Rely on the finally block for unindent and drop the inner calls:

-            if rc != 0:
-                cons.print(f"[bold red]ERROR[/bold red]: Case {case.slug} failed with exit code {rc}")
-                cons.print(f"[bold red]      Check log at: {log_filepath}[/bold red]")
-                failed_cases.append(case.slug)
-                cons.unindent()
-                continue
+            if rc != 0:
+                cons.print(f"[bold red]ERROR[/bold red]: Case {case.slug} failed with exit code {rc}")
+                cons.print(f"[bold red]      Check log at: {log_filepath}[/bold red]")
+                failed_cases.append(case.slug)
+                continue
@@
-            if not os.path.exists(summary_filepath):
-                cons.print(f"[bold red]ERROR[/bold red]: Summary file not created for {case.slug}")
-                cons.print(f"[bold red]      Expected: {summary_filepath}[/bold red]")
-                failed_cases.append(case.slug)
-                cons.unindent()
-                continue
+            if not os.path.exists(summary_filepath):
+                cons.print(f"[bold red]ERROR[/bold red]: Summary file not created for {case.slug}")
+                cons.print(f"[bold red]      Expected: {summary_filepath}[/bold red]")
+                failed_cases.append(case.slug)
+                continue

Also applies to: 124-129


1-10: Move traceback import to the module top to satisfy lint and avoid repeated imports

Pylint is flagging import traceback inside the except block (C0415 Import outside toplevel). Since this code path is not particularly hot, it’s simpler and clearer to import at the top-level and just use it in the handler.

Suggested change:

-import os, sys, uuid, subprocess, dataclasses, typing, math
+import os, sys, uuid, subprocess, dataclasses, typing, math, traceback
@@
-            except Exception as e:
-                import traceback
+            except Exception as e:
                 cons.print(
                     f"[bold red]ERROR[/bold red]: Failed to compute speedup for {target.name} in {slug}: {e}\n"
                     f"{traceback.format_exc()}"
                 )

Also applies to: 211-217

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e1eb87 and f79f4d1.

📒 Files selected for processing (5)
  • .github/scripts/monitor_slurm_job.sh (1 hunks)
  • .github/workflows/bench.yml (3 hunks)
  • .github/workflows/test.yml (1 hunks)
  • toolchain/mfc/bench.py (5 hunks)
  • toolchain/mfc/test/test.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
.github/workflows/bench.yml (1)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
toolchain/mfc/test/test.py (2)
toolchain/mfc/common.py (4)
  • MFCException (31-32)
  • file_write (44-54)
  • find (165-174)
  • does_command_exist (127-138)
toolchain/mfc/state.py (1)
  • ARG (85-93)
toolchain/mfc/bench.py (2)
toolchain/mfc/common.py (3)
  • MFCException (31-32)
  • system (34-41)
  • file_load_yaml (65-70)
toolchain/mfc/printer.py (1)
  • unindent (19-24)
🪛 actionlint (1.7.9)
.github/workflows/test.yml

105-105: label "gt" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


109-109: label "gt" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


113-113: label "gt" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


118-118: label "frontier" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


122-122: label "frontier" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


126-126: label "frontier" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-slim", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 GitHub Actions: Lint Toolchain
toolchain/mfc/test/test.py

[error] 230-230: pylint: R1702 Too many nested blocks (6/5).

toolchain/mfc/bench.py

[error] 212-212: pylint: C0415 Import outside toplevel ('traceback').

🪛 Ruff (0.14.7)
toolchain/mfc/test/test.py

184-186: Avoid specifying long messages outside the exception class

(TRY003)


233-233: Abstract raise to an inner function

(TRY301)


233-233: Avoid specifying long messages outside the exception class

(TRY003)


238-238: Abstract raise to an inner function

(TRY301)


238-238: Avoid specifying long messages outside the exception class

(TRY003)


246-246: Avoid specifying long messages outside the exception class

(TRY003)


250-250: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


261-261: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


294-294: Avoid specifying long messages outside the exception class

(TRY003)


301-301: Avoid specifying long messages outside the exception class

(TRY003)


304-304: Avoid specifying long messages outside the exception class

(TRY003)


307-307: Avoid specifying long messages outside the exception class

(TRY003)


319-323: Avoid specifying long messages outside the exception class

(TRY003)

toolchain/mfc/bench.py

42-42: Avoid specifying long messages outside the exception class

(TRY003)


124-124: Do not catch blind exception: Exception

(BLE001)


137-137: Avoid specifying long messages outside the exception class

(TRY003)


211-211: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Build & Publish
🔇 Additional comments (4)
.github/workflows/bench.yml (1)

25-27: Nice extraction of submit/monitor logic and parallel PR/master runs

The new submit_and_monitor helper plus .github/scripts/monitor_slurm_job.sh integration gives you much clearer control over submission errors, job IDs, and exit-code handling, while preserving parallel PR/master execution and avoiding set -e pitfalls in the wait logic. The dynamic job name and reduced timeout also look reasonable.

Also applies to: 76-77, 100-143

.github/workflows/test.yml (1)

95-101: Matrix reshaping and dynamic self-hosted job naming look consistent

Explicit Phoenix/Frontier variants, the lbl-driven labels, and the dynamic name format align cleanly with the later build/test steps and log handling. The reduced timeout is also appropriate for CI hardening, assuming your self-hosted runners are configured with these labels.

Also applies to: 104-129

toolchain/mfc/test/test.py (2)

26-37: Early-abort mechanism via abort_tests event is structured cleanly

The combination of:

  • global abort_tests event,
  • per-case pre-check in handle_case,
  • failure-rate computation gated on a minimum number of completed tests, and
  • a post-scheduler check in test() that surfaces a clear aggregated failure,

gives you a deterministic, thread-safe way to fast‑fail runs with systemic breakage without throwing from worker threads. This is a solid improvement over raising directly inside _handle_case.

Also applies to: 179-187, 333-373


30-41: Timer-based timeout only detects timeouts after case.run returns; add note and fix pylint nesting error

Using threading.Timer with a flag is safe in worker threads, but it can't interrupt a blocked case.run call — if case.run hangs (e.g., stuck MPI job), the worker thread will remain blocked and the timeout will only be noticed after (or if) case.run returns. That means this mechanism detects slow tests but does not strictly bound wall-clock time for truly hung runs.

Two concrete suggestions:

  1. Document/verify behavior of case.run
    Confirm (and, if possible, document) that case.run itself uses a timeout-capable mechanism (e.g., subprocess with timeout= or equivalent), or that hung simulations are otherwise impossible/handled elsewhere; otherwise the 1‑hour timeout expectation is misleading.

  2. Resolve pylint R1702 ("too many nested blocks") cheaply
    Given this function is already structurally complex and you're explicitly disabling other size checks, the quickest unblock is to extend the existing pylint disable:

-# pylint: disable=too-many-locals, too-many-branches, too-many-statements, trailing-whitespace
-def _handle_case(case: TestCase, devices: typing.Set[int]):
+# pylint: disable=too-many-locals, too-many-branches, too-many-statements, too-many-nested-blocks, trailing-whitespace
+def _handle_case(case: TestCase, devices: typing.Set[int]):

If you prefer not to disable, you'd need to extract parts of _handle_case (e.g., golden-file comparison and test_all post-processing) into helpers to reduce nesting.

@sbryngelson sbryngelson requested a review from Copilot December 5, 2025 17:20
@sbryngelson
Copy link
Member Author

/improve

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

@codecov

codecov bot commented Dec 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.16%. Comparing base (8d9a83b) to head (66176e7).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1078   +/-   ##
=======================================
  Coverage   44.16%   44.16%           
=======================================
  Files          71       71           
  Lines       20197    20197           
  Branches     1970     1970           
=======================================
  Hits         8921     8921           
  Misses      10148    10148           
  Partials     1128     1128           

☔ View full report in Codecov by Sentry.

@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI is running Incremental review



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87fcd74 and f468fe5.

📒 Files selected for processing (3)
  • .github/scripts/run_parallel_benchmarks.sh (1 hunks)
  • .github/scripts/submit_and_monitor_bench.sh (1 hunks)
  • .github/workflows/bench.yml (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
.github/scripts/submit_and_monitor_bench.sh (3)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
.github/workflows/frontier/submit.sh (1)
  • usage (5-7)
toolchain/mfc/bench.py (1)
  • bench (19-76)
.github/scripts/run_parallel_benchmarks.sh (2)
toolchain/mfc/bench.py (1)
  • bench (19-76)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (cpu)
  • GitHub Check: Oak Ridge | Frontier (gpu-omp)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Georgia Tech | Phoenix (gpu-omp)
  • GitHub Check: Build & Publish
🔇 Additional comments (5)
.github/scripts/run_parallel_benchmarks.sh (1)

1-78: Script structure and exit code capture look sound.

The parallel job orchestration and exit code capture pattern (if ! wait "$pid"; then exit=$?; fi) is correct. The file path expectations (lines 60-61) for YAML files in pr/ and master/ directories should align with the centralized orchestration model.

Confirm that submit_and_monitor_bench.sh creates output files in the expected locations (pr/bench-&lt;device&gt;-&lt;interface&gt;.yaml and master/bench-&lt;device&gt;-&lt;interface&gt;.yaml) matching the paths referenced on lines 60–61.

.github/workflows/bench.yml (3)

26-26: Dynamic job naming with conditional interface suffix looks good.

The formatted job name string now conditionally appends the interface: ${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }}. This improves job labeling in the UI.


76-76: Timeout reduction to 480 minutes documented and aligned with PR objectives.

The timeout has been reduced from 1400 to 480 minutes as part of hardening. Verify that this is adequate for all benchmark configurations (especially Frontier GPU jobs on lines 65–72) by checking historical run times.


100-101: Centralized orchestration script integration is clean.

The bench step now delegates to a single script, replacing inline SLURM monitoring logic. The parameter order (device, interface, cluster) must match the expectations in run_parallel_benchmarks.sh.

Confirm that script parameters on line 101 match the positional argument order expected by .github/scripts/run_parallel_benchmarks.sh (currently: device, interface, cluster).

.github/scripts/submit_and_monitor_bench.sh (1)

24-26: Job ID extraction is portable and robust.

The sed pattern 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' is POSIX-portable and correctly extracts the job ID from sbatch output. The job slug construction and output file naming are clear.
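
Purely for illustration, the same extraction expressed with Python's re module (the script itself keeps the sed pipeline):

import re

sbatch_output = "Submitted batch job 123456"
match = re.search(r"Submitted batch job (\d+)", sbatch_output)
job_id = match.group(1) if match else None
print(job_id)  # -> 123456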

Comment on lines 41 to 50
# Verify the YAML output file was created
yaml_file="${job_slug}.yaml"
if [ ! -f "$yaml_file" ]; then
echo "[$dir] ERROR: Expected output file not found: $yaml_file"
echo "[$dir] Directory contents:"
ls -la *.yaml 2>/dev/null || echo " No YAML files found"
exit 1
fi

echo "[$dir] Verified output file exists: $yaml_file ($(stat -f%z "$yaml_file" 2>/dev/null || stat -c%s "$yaml_file" 2>/dev/null) bytes)"
Contributor

⚠️ Potential issue | 🟠 Major

YAML file existence check is incomplete — gap vs. PR objectives.

The script verifies the YAML output file exists (lines 41–48) but does not validate its contents. According to the PR objectives, the hardening should include "Check summary file existence and required target-data fields (execution and simulation metrics such as grind)". The current validation only checks for file presence, not schema or required fields.

Consider parsing the YAML and checking for required fields (e.g., execution time, simulation metrics) to align with the stated objectives.

Would you like me to generate a helper function to validate the YAML structure and required fields? This could be added to the script or extracted as a reusable utility.
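
As a starting point, here is a minimal sketch of such a validator in Python (assumes PyYAML is available; the top-level cases key and the exec/grind field names are placeholders, not necessarily the real benchmark schema):

import sys
import yaml  # assumes PyYAML is available in the CI environment

REQUIRED_FIELDS = ("exec", "grind")   # placeholder metric names

def validate_bench_yaml(path):
    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f)
    cases = (data or {}).get("cases")
    if not isinstance(cases, dict) or not cases:
        sys.exit(f"{path}: no benchmark cases found")
    for name, case in cases.items():
        missing = [k for k in REQUIRED_FIELDS if k not in (case or {})]
        if missing:
            sys.exit(f"{path}: case '{name}' is missing fields: {missing}")
    print(f"{path}: OK ({len(cases)} cases)")

if __name__ == "__main__":
    validate_bench_yaml(sys.argv[1])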

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/scripts/submit_and_monitor_bench.sh (1)

44-54: Output validation is appropriately scoped; cross-platform stat flags are functional but inelegant.

File existence check for the YAML output is correct (lines 46–51). Line 49's fallback directory listing with error suppression aids debugging.

Line 53 uses conditional stat flags (-f%z for macOS, -c%s for Linux) to report file size. While this works, it's a common cross-platform portability pattern that can be simplified:

# Simpler alternative using wc
echo "[$dir] Verified output file exists: $yaml_file ($(wc -c < "$yaml_file") bytes)"

This is a minor style point; the current approach is functional.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f468fe5 and d9e06f6.

📒 Files selected for processing (3)
  • .github/scripts/run_parallel_benchmarks.sh (1 hunks)
  • .github/scripts/submit_and_monitor_bench.sh (1 hunks)
  • .github/workflows/bench.yml (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/bench.yml
🧰 Additional context used
🧬 Code graph analysis (1)
.github/scripts/run_parallel_benchmarks.sh (3)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
toolchain/mfc/bench.py (1)
  • bench (19-76)
.github/workflows/frontier/submit.sh (1)
  • usage (5-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-acc)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (cpu)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
.github/scripts/run_parallel_benchmarks.sh (1)

1-81: Solid parallel orchestrator with robust error handling and exit-code tracking.

The script correctly launches both benchmark jobs in parallel, captures exit codes reliably using the if ! wait pattern, and validates output files with helpful debugging output. Strict error handling via set -euo pipefail and || true guards in the ls fallbacks ensure the script doesn't mask real failures.

.github/scripts/submit_and_monitor_bench.sh (1)

17-41: Correctly captures script directory before changing workdir — critical fix in place.

Line 18 captures SCRIPT_DIR before the cd "$dir" on line 21, then line 40 uses "${SCRIPT_DIR}/monitor_slurm_job.sh" to reference the monitoring script from the PR location. This ensures the script works correctly whether benchmarking the PR or master checkout, since monitor_slurm_job.sh only exists in the PR branch. This resolves the critical path-resolution issue flagged in the previous review.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/scripts/submit_and_monitor_bench.sh (1)

46-56: YAML validation still incomplete — gap vs. PR objectives persists.

This duplicates a concern from the previous review: the script only validates file existence but does not check required fields. Per PR objectives, hardening should include "Check summary file existence and required target-data fields (execution and simulation metrics such as grind)".

Consider parsing the YAML and asserting that required fields exist (e.g., execution, simulation_metrics, or similar keys defined by the benchmark output schema).

🧹 Nitpick comments (2)
.github/scripts/submit_and_monitor_bench.sh (2)

40-40: Missing error handling for monitoring script return code.

The monitor_slurm_job.sh script is invoked but its exit status is not checked. If monitoring fails, the script continues and may produce misleading success messages. With set -euo pipefail, the script should exit on failure, but explicit error handling here would be clearer.

Consider capturing the return code and providing context if monitoring fails:

-# Use the monitoring script from PR (where this script lives)
-bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file"
+# Use the monitoring script from PR (where this script lives)
+if ! bash "${SCRIPT_DIR}/monitor_slurm_job.sh" "$job_id" "$output_file"; then
+  echo "[$dir] ERROR: Monitoring script failed for job $job_id"
+  exit 1
+fi

24-25: Optional: Pre-validate submit script path for clarity.

The relative path .github/workflows/$cluster/submit-bench.sh assumes the directory structure exists in $dir. Adding an early existence check could catch configuration errors sooner:

 # Submit job
+submit_script=".github/workflows/$cluster/submit-bench.sh"
+if [ ! -f "$submit_script" ]; then
+  echo "[$dir] ERROR: Submit script not found: $submit_script"
+  exit 1
+fi
-submit_output=$(bash .github/workflows/$cluster/submit-bench.sh \
+submit_output=$(bash "$submit_script" \
   .github/workflows/$cluster/bench.sh "$device" "$interface" 2>&1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e17eb27 and db77fb9.

📒 Files selected for processing (1)
  • .github/scripts/submit_and_monitor_bench.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
.github/scripts/submit_and_monitor_bench.sh (3)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
.github/workflows/frontier/submit.sh (1)
  • usage (5-7)
toolchain/mfc/bench.py (1)
  • bench (19-76)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Oak Ridge | Frontier (gpu-omp)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
.github/scripts/submit_and_monitor_bench.sh (2)

17-18: LGTM: Script directory handling correctly addresses prior critical path issue.

The use of SCRIPT_DIR (line 17–18) before the cd "$dir" (line 21) and its reference at line 40 properly resolves the previous critical issue where .github/scripts/monitor_slurm_job.sh would fail on master branch. This is a solid fix.

Also applies to: 40-40


58-58: stat format strings are already portable; current syntax is correct.

The unquoted format syntax stat -f%z and stat -c%s works fine on both BSD and GNU systems for simple format specifiers like %z and %s. Quoting is only necessary when format strings contain spaces, tabs, or shell-special characters. While adding quotes (stat -f '%z' and stat -c '%s') is reasonable for defensive programming and readability, it is not required to fix a portability issue—the current code is portable as-is.

Likely an incorrect or invalid review comment.

@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI is running Incremental review



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
.github/scripts/monitor_slurm_job.sh (3)

137-137: Consider portability of stat command.

Line 137 uses stat -c%s (GNU), which will fail on BSD/macOS, where stat -f%z is needed instead. While this script targets SLURM clusters (typically Linux), past reviewers flagged this as a portability concern. If the script may run on heterogeneous systems, consider a portable wrapper or helper function as suggested in prior reviews.

If portability across BSD/Linux is required, you can wrap the file-size check:

get_file_size() {
  stat -c%s "$1" 2>/dev/null || stat -f%z "$1" 2>/dev/null || echo -1
}

# Then later:
size=$(get_file_size "$output_file")

56-75: Sophisticated tail multiplexing is correct but complex; ensure edge cases are tested.

The exec 3< <(stdbuf -oL -eL tail -f ...) pattern with non-blocking reads (line 67, timeout 0.1s) and burst-read capping (lines 72–74) is a clever approach to stream output while monitoring job status without blocking. However, this sophistication introduces multiple tail process references (lines 57, 130, 153) and cleanup responsibilities. The drain logic (lines 115–126) with a 10k-line safety limit helps prevent infinite loops, but edge cases (e.g., rapid log spikes, stdbuf buffering delays) should be validated in testing.

Consider testing scenarios where:

  • Output file grows faster than the 0.1s read timeout can consume
  • The drain loop hits the 10k-line safety limit during a large flush
  • stdbuf buffering delays output availability during job completion

Verify that no output is lost or truncated in these scenarios by comparing final file size against streamed lines.

Also applies to: 115-126


84-95: Exit code retrieval handles simple jobs well; consider filtering for array jobs.

Lines 84 and 171 use head -n1 to extract state/exit code from sacct. For non-array jobs or single-step jobs, this works correctly. However, if benchmarks use SLURM job arrays or multi-step jobs, sacct may return multiple lines, and head -n1 could select the wrong entry (e.g., an array element step instead of the parent job). Past reviewers suggested filtering by terminal state or by parent job ID before extraction.

If job arrays are out of scope for this monitoring script, this is fine. Otherwise, consider the more selective sacct query suggested in earlier reviews:

# Filter for terminal states before extracting
sacct_line=$(sacct -j "$job_id" --format=JobID,State,ExitCode --noheader --parsable2 2>/dev/null | \
  awk -F'|' '$2 ~ /COMPLETED|FAILED|CANCELLED|TIMEOUT|OUT_OF_MEMORY/ {print; exit}')
exit_code=$(echo "$sacct_line" | awk -F'|' '{print $3}')

Confirm whether benchmarks use SLURM job arrays. If yes, test with an array job to verify the script correctly reports the parent job's exit status, not a sub-step's exit code.

Also applies to: 171-174

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db77fb9 and 09cda88.

📒 Files selected for processing (2)
  • .github/scripts/monitor_slurm_job.sh (1 hunks)
  • .github/workflows/phoenix/submit-bench.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Oak Ridge | Frontier (gpu-acc)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Georgia Tech | Phoenix (gpu-omp)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-acc)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
.github/workflows/phoenix/submit-bench.sh (1)

45-45: Good alignment with benchmarking hardening objectives.

The SLURM job time limit increase (03:00:00 → 04:00:00) provides necessary headroom for extended benchmark case execution while staying within the reduced GitHub workflow timeout (480 minutes). This complements the suite-level timeout reductions.

.github/scripts/monitor_slurm_job.sh (1)

5-13: Well-implemented hardening with comprehensive error handling and retry logic.

The script successfully incorporates past feedback: exponential backoff for transient failures (lines 36–48), explicit failure when exit codes cannot be determined (lines 178–182), and dual-fallback exit code resolution (scontrol → sacct, lines 163–175). The cleanup trap and set -euo pipefail prevent orphaned processes and catch unexpected errors. Job completion detection now properly guards against both immediate squeue failures and false positives via sacct.

Also applies to: 26-50, 159-191

@codeant-ai

codeant-ai bot commented Dec 10, 2025

CodeAnt AI is running Incremental review



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/scripts/run_parallel_benchmarks.sh (1)

7-14: Input validation is adequate but could be explicit.

The script validates argument count but doesn't explicitly validate that device, interface, and cluster are non-empty. Consider adding a check to prevent silent failures if arguments are passed as empty strings.

 if [ $# -ne 3 ]; then
     echo "Usage: $0 <device> <interface> <cluster>"
     exit 1
 fi

 device="$1"
 interface="$2"
 cluster="$3"
+
+if [ -z "$device" ] || [ -z "$interface" ] || [ -z "$cluster" ]; then
+    echo "ERROR: device, interface, and cluster cannot be empty"
+    exit 1
+fi
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a77897 and 8e164b6.

📒 Files selected for processing (1)
  • .github/scripts/run_parallel_benchmarks.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-omp)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-omp)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-acc)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-acc)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Build & Publish
🔇 Additional comments (1)
.github/scripts/run_parallel_benchmarks.sh (1)

64-80: File verification is robust with good debugging output.

The verification logic correctly checks for YAML file existence and provides helpful debugging information (directory listing, log tail) if files are missing. The use of || true to prevent script failure on ls errors is appropriate.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e164b6 and cfad099.

📒 Files selected for processing (1)
  • .github/scripts/run_parallel_benchmarks.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Build & Publish
🔇 Additional comments (2)
.github/scripts/run_parallel_benchmarks.sh (2)

25-25: Good: set -o pipefail correctly applied in subshells.

The prior critical issue regarding lost exit codes in pipelines has been properly addressed by adding set -o pipefail at the start of each subshell. This ensures the pipeline returns the exit status of submit_and_monitor_bench.sh, leaving the while-loop output prefixing intact.

Also applies to: 29-29


63-80: Verify log file naming convention.

Line 78 assumes the master benchmark log file uses the extension .out (e.g., master/bench-${device}-${interface}.out). Confirm this filename matches the actual output generated by submit_and_monitor_bench.sh.

@codeant-ai

codeant-ai bot commented Dec 10, 2025

CodeAnt AI is running Incremental review



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/scripts/run_parallel_benchmarks.sh (1)

39-51: Critical: Exit code from wait is still lost due to ! operator (unresolved from prior review).

The pattern if ! wait "$pr_pid"; then pr_exit=$? captures the exit status of the ! operator (which is 0 after successful inversion), not the actual exit code of the background process. This causes the error-handling logic to always report both jobs as successful, even if they fail.

The first critical issue (pipefail in subshells) was correctly addressed on lines 25–29, but this second issue remains unfixed. The prior review flagged this at lines 39–51, and it is still present.

Apply this diff to capture exit codes directly:

 # Wait and capture exit codes reliably
 pr_exit=0
 master_exit=0
 
-if ! wait "$pr_pid"; then
-  pr_exit=$?
-  echo "PR job exited with code: $pr_exit"
-else
-  echo "PR job completed successfully"
-fi
+wait "$pr_pid"
+pr_exit=$?
+if [ "$pr_exit" -ne 0 ]; then
+  echo "PR job exited with code: $pr_exit"
+else
+  echo "PR job completed successfully"
 fi
 
-if ! wait "$master_pid"; then
-  master_exit=$?
-  echo "Master job exited with code: $master_exit"
-else
-  echo "Master job completed successfully"
-fi
+wait "$master_pid"
+master_exit=$?
+if [ "$master_exit" -ne 0 ]; then
+  echo "Master job exited with code: $master_exit"
+else
+  echo "Master job completed successfully"
+fi
🧹 Nitpick comments (1)
.gitignore (1)

94-95: Clarify the gitignore pattern syntax for .supercode/.

Line 95 uses **.supercode/, which matches prefix.supercode/ (any path ending with .supercode/). However, given the naming convention (starting with .), the intent is likely to match hidden directories .supercode/ at any depth. The pattern should be **/.supercode/ instead.

Apply this diff to clarify the pattern:

 **isolation_rules/
-**.supercode/
+**/.supercode/
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfad099 and 9dd52cc.

📒 Files selected for processing (2)
  • .github/scripts/run_parallel_benchmarks.sh (1 hunks)
  • .gitignore (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
.github/scripts/run_parallel_benchmarks.sh (3)
.github/workflows/phoenix/submit-bench.sh (1)
  • usage (5-7)
.github/workflows/frontier/submit-bench.sh (1)
  • usage (5-7)
toolchain/mfc/bench.py (1)
  • bench (19-76)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Oak Ridge | Frontier (gpu-acc)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Georgia Tech | Phoenix (cpu)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Oak Ridge | Frontier (cpu)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Georgia Tech | Phoenix (gpu-acc)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu-acc)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-omp)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (cpu)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu-acc)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: Build & Publish
🔇 Additional comments (3)
.github/scripts/run_parallel_benchmarks.sh (3)

1-24: Setup and argument handling look good.

The shebang, set -euo pipefail, argument validation, and SCRIPT_DIR calculation are all correct and follow best practices.


25-31: Subshell pipefail handling is correct.

The set -o pipefail added to both subshells (lines 25 and 29) correctly addresses the prior critical issue where exit codes from submit_and_monitor_bench.sh were lost in pipelines. This ensures the background processes' exit statuses propagate correctly.


53-88: File verification logic is sound, but depends on fixing the exit code bug.

The output verification (checking for YAML files) and detailed error reporting (directory listings and log tails) are well-designed. However, this entire error-handling path is unreachable due to the broken exit code capture in lines 39–51: since both pr_exit and master_exit will always be 0, the check on line 54 will never trigger even if the jobs actually fail.

Once the exit code handling is corrected, this section will work as intended.

@codeant-ai

codeant-ai bot commented Dec 10, 2025

CodeAnt AI is running Incremental review



@sbryngelson sbryngelson merged commit b220ebf into MFlowCode:master Dec 11, 2025
35 of 36 checks passed

Labels

Review effort 3/5 · size:L (This PR changes 100–499 lines, ignoring generated files)

Development

Successfully merging this pull request may close these issues.

1 participant