
ci: add CI-Ready GPU Brev launchable for E2E tests #1456

Open
ksapru wants to merge 3 commits into NVIDIA:main from ksapru:feat/gpu-ci-launchable

Conversation

Contributor

@ksapru ksapru commented Apr 3, 2026

Summary

This PR introduces a CI-Ready GPU Brev launchable (scripts/brev-launchable-ci-gpu.sh), mirroring the architecture of the CPU launchable to avoid repeated, expensive setup in our GPU E2E tests. By pre-baking Docker, Node.js, the OpenShell CLI, and the NVIDIA Container Toolkit, and by pre-pulling Ollama models (qwen3:0.6b / qwen2.5:0.5b), the launchable eliminates 5-10 minutes of bootstrapping per CI run. This PR also mitigates CI flakiness by bumping default Vitest timeouts from 5s to 30s on heavy integration tests so that they run reliably on VM instances.

Related Issue

Fixes #1328

Changes

  • New GPU Startup Script: Added scripts/brev-launchable-ci-gpu.sh with robust GPU passthrough validation, graceful apt-lock handling, and reliable asynchronous Ollama polling defaults.
  • GPU Test Suite Enablement: Updated test/e2e/brev-e2e.test.js to run the GPU E2E suite conditionally when TEST_SUITE === "gpu".
  • Workflow Pipeline: Modified .github/workflows/nightly-e2e.yaml to integrate the GPU-specific launch pipeline path.
  • Test Timeout Fixes: Increased Vitest timeouts for test/install-preflight.test.js, test/nemoclaw-cli-recovery.test.js, and test/onboard.test.js from 5,000ms to 30,000ms to eliminate arbitrary timeout failures during heavy local and remote executions.
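The graceful apt-lock handling mentioned above could look like the following minimal sketch; the `apt_retry` name, attempt counts, and delays are illustrative assumptions, not the script's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Retry a command while apt/dpkg locks are held (e.g. by unattended-upgrades
# on a freshly booted VM), instead of failing the whole provisioning run.
apt_retry() {
  local max_attempts="${MAX_ATTEMPTS:-5}" delay="${RETRY_DELAY:-10}" attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' failed after ${max_attempts} attempts" >&2
      return 1
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s..."
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: wrap lock-sensitive package operations.
# apt_retry sudo apt-get update -y
# apt_retry sudo apt-get install -y docker.io
```

Because the retry loop wraps an arbitrary command, the same helper can guard every apt-get call in the startup script without duplicating the backoff logic.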

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes.

Doc Changes

  • Follows the style guide.
  • New pages include the SPDX license header and frontmatter.
  • Cross-references and links verified.

Signed-off-by: Krish Sapru ksapru@bu.edu

Summary by CodeRabbit

  • Tests

    • Added a GPU test option and a test that runs against an ephemeral GPU instance, validating PASS/FAIL output.
  • Chores

    • Switched GPU E2E to ephemeral GPU VMs with automated provisioning and teardown.
    • Added startup orchestration with runtime, Docker/GPU validation, optional model pre-pull, and readiness signaling.
    • Improved failure handling to collect and upload remote logs/artifacts.

Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

Switches GPU E2E from a persistent self-hosted runner to an ephemeral ubuntu-latest runner that provisions a temporary Brev GPU instance, runs a startup script to prepare the VM (Docker, NVIDIA toolkit, Ollama, repo build), polls a readiness sentinel, executes remote GPU tests, collects logs/artifacts, and always deletes the instance.

Changes

Cohort / File(s) Summary
Workflow GPU Integration
.github/workflows/nightly-e2e.yaml
Replaces self-hosted GPU runner with ubuntu-latest; installs Brev CLI, creates a Brev GPU instance with t4 flavor and startup script, polls /var/run/nemoclaw-launchable-ready, runs tests via brev exec with exported env vars (e.g., OLLAMA_MODEL=qwen3:0.6b), copies remote logs via brev scp on failure, and always deletes the instance in teardown.
GPU VM Provisioning Script
scripts/brev-launchable-ci-gpu.sh
Adds new startup script that bootstraps a CI-ready GPU VM under strict mode: apt retries, Docker install/config, Node 22 install, OpenShell CLI install, repo clone/build, optional Docker image pre-pulls, NVIDIA Container Toolkit setup and GPU validation, Ollama install + model pull, readiness sentinel creation and logging.
E2E Test Suite
test/e2e/brev-e2e.test.js
Adds gpu to TEST_SUITE options and a conditional Vitest it.runIf(TEST_SUITE === "gpu") test that invokes remote test/e2e/test-gpu-e2e.sh, asserting "PASS" and absence of /FAIL:/ with a 900,000 ms timeout.
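The PASS/FAIL assertion the Vitest test applies to the remote output can be mirrored in shell; this is an illustrative sketch (the `check_gpu_e2e_output` helper is an assumption, not the test's actual code):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mirror of the test's assertions: the captured output must contain "PASS"
# and must not contain any "FAIL:" marker.
check_gpu_e2e_output() {
  local out="$1"
  if ! grep -q "PASS" <<<"$out"; then
    echo "assertion failed: no PASS in output" >&2
    return 1
  fi
  if grep -q "FAIL:" <<<"$out"; then
    echo "assertion failed: found FAIL: in output" >&2
    return 1
  fi
}

# Usage sketch against the remote suite (script path taken from the PR):
# out=$(brev exec "$INSTANCE_NAME" -- bash test/e2e/test-gpu-e2e.sh 2>&1)
# check_gpu_e2e_output "$out"
```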

Sequence Diagram(s)

sequenceDiagram
    participant GHA as GitHub Actions (nightly-e2e)
    participant Brev as Brev CLI
    participant GPU as GPU Instance (ephemeral)
    participant Startup as Startup Script
    participant Tests as GPU E2E Test

    GHA->>GHA: Install Brev CLI
    GHA->>Brev: brev create (GPU instance, startup script, env)
    Brev->>GPU: Provision instance & run startup script
    GPU->>Startup: Execute provisioning (Docker, NVIDIA toolkit, Node, Ollama, repo)
    Startup->>GPU: Create readiness sentinel (/var/run/nemoclaw-launchable-ready)
    GHA->>GPU: Poll readiness (bounded retries)
    GPU-->>GHA: Ready
    GHA->>Brev: brev exec (run test/e2e/test-gpu-e2e.sh)
    Brev->>GPU: Execute tests remotely
    GPU->>Tests: Run GPU E2E suite (emit logs)
    Tests-->>GHA: Return results (PASS/FAIL) via brev exec
    GHA->>Brev: brev scp logs/artifacts (on failure or teardown)
    GHA->>Brev: brev delete (always teardown)
    GHA->>GHA: Upload artifacts to GitHub

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰
I hopped a cloud to spin a GPU bright,
With Brev I planted seeds that boot at night.
Ollama warmed and models softly sang,
Tests leapt through hoops and left the logs on-hand.
Ephemeral fields now hum with CI light.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Out of Scope Changes check (⚠️ Warning): The PR includes timeout increases to test files (30,000ms) not mentioned in the linked issue, which may be related improvements but are out of scope. Resolution: Verify whether the Vitest timeout changes to install-preflight.test.js, nemoclaw-cli-recovery.test.js, and onboard.test.js are necessary for GPU CI stability or if they should be a separate PR.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title accurately captures the main change: adding a GPU-ready Brev launchable for CI E2E tests, which is the primary objective of this PR.
  • Linked Issues check (✅ Passed): All primary coding objectives from issue #1328 are met: GPU startup script created, NVIDIA Container Toolkit installed, Ollama pre-installed with test models, and nightly-e2e.yaml workflow integrated.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

224-248: ⚠️ Potential issue | 🟠 Major

Delete the VM after failure logs are copied.

On failures, this always() step runs before both brev scp steps. That leaves nothing to copy, so the artifact uploads are empty right when diagnostics are needed most. Move teardown below the log-copy steps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 224 - 248, The teardown step
"Tear down GPU instance" currently runs before the failure log copy steps,
causing logs to be deleted; move the "Tear down GPU instance" step (the block
that runs `brev delete e2e-gpu-nightly-${{ github.run_id }}` with if: always())
to after the "Copy install log on failure", "Upload install log on failure", and
"Copy test log on failure" steps so that the `brev scp` and upload actions can
run first and collect artifacts, keeping the teardown step's if: always()
behavior but ensuring it executes last.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 181-185: The "Install Brev CLI" step currently extracts the
archive directly to /usr/local/bin (and sets chmod) which can fail due to
permissions; change the step to either use sudo when writing to /usr/local/bin
or unpack into a writable user directory (e.g., $HOME/.local/bin) and ensure
that directory is created and added to PATH before subsequent steps; update the
step that references brev so it uses the updated install location (references:
the "Install Brev CLI" step and the target path /usr/local/bin or
$HOME/.local/bin and the brev binary name).
- Around line 193-195: The --startup-script argument passed to the brev create
command currently uses a remote URL which Brev CLI no longer accepts; update the
command that builds the instance (the brev create invocation with --name
"$INSTANCE_NAME" and --flavor "t4") to use the `@filepath` form by replacing the
URL value with `@scripts/brev-launchable-ci-gpu.sh` so the CLI reads the local
script file content instead of a URL.

In `@scripts/brev-launchable-ci-gpu.sh`:
- Around line 295-300: The until loop polling Ollama can hang indefinitely; add
a bounded timeout (e.g., MAX_WAIT_SECS) and check elapsed time or a counter
inside the loop that waits for http://localhost:11434, and if the timeout is
reached, kill the background process (use OLLAMA_PID), log a clear error, and
exit non‑zero so the CI fails fast; update the block that starts "ollama serve
>/dev/null 2>&1 &" and the subsequent loop that uses curl to implement this
timeout and cleanup.
- Around line 243-271: The GPU validation currently only logs warnings when
nvidia-smi is missing or the docker GPU test fails, allowing the startup to
continue; change the validation in the block that runs nvidia-smi and the docker
run to treat failures as fatal: when nvidia-smi is not found or returns
non-zero, or when docker run --gpus all nvidia/cuda:12.2.0-base nvidia-smi
fails, call an error handler (e.g., use error or fatal logging) and exit
non-zero (avoid creating the ready sentinel) instead of warn; update the code
around the nvidia-smi checks and the docker run command to exit with non-zero
status on failure so a broken GPU runtime is not marked ready.

In `@test/e2e/brev-e2e.test.js`:
- Around line 648-656: The test's GPU branch is never exercising GPU because
beforeAll always uses the CPU bootstrap (scripts/brev-launchable-ci-cpu.sh /
brev search cpu) and later asserts gpuEnabled: false; update the setup in the
beforeAll (or the bootstrap helper invoked there) to conditionally request a GPU
when TEST_SUITE === "gpu" — e.g., switch to the GPU launch script or a brev
search that returns a GPU launchable and ensure the resulting runtime's
gpuEnabled flag is true (references: TEST_SUITE, beforeAll,
scripts/brev-launchable-ci-cpu.sh, brev search cpu, gpuEnabled) so the "GPU E2E
suite passes on remote VM" test actually runs on a GPU environment.
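The two shell-script fixes described above (a bounded Ollama readiness poll and fatal GPU validation) could be sketched as follows. The function names, `MAX_WAIT_SECS` default, and CUDA image tag are illustrative assumptions taken from the review comments, not the script's actual code:

```shell
#!/usr/bin/env bash
set -euo pipefail

fatal() { echo "FATAL: $*" >&2; exit 1; }

# Bounded poll: wait for an HTTP endpoint (e.g. the Ollama API on
# http://localhost:11434) for at most MAX_WAIT_SECS instead of looping forever.
wait_for_endpoint() {
  local url="$1" max_wait="${MAX_WAIT_SECS:-120}" interval="${POLL_INTERVAL:-2}"
  local elapsed=0
  until curl -fsS "$url" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$max_wait" ]; then
      echo "ERROR: ${url} not ready after ${max_wait}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Fatal GPU validation: exit non-zero on any failure so a broken GPU runtime
# never reaches the step that creates the readiness sentinel.
validate_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 || fatal "nvidia-smi not found"
  nvidia-smi >/dev/null || fatal "nvidia-smi reported an error"
  docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi >/dev/null ||
    fatal "Docker GPU passthrough test failed"
}

# Usage sketch: start Ollama in the background, fail fast if it never comes up.
# ollama serve >/dev/null 2>&1 &
# OLLAMA_PID=$!
# wait_for_endpoint http://localhost:11434 || { kill "$OLLAMA_PID" 2>/dev/null; exit 1; }
```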

---

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 224-248: The teardown step "Tear down GPU instance" currently runs
before the failure log copy steps, causing logs to be deleted; move the "Tear
down GPU instance" step (the block that runs `brev delete e2e-gpu-nightly-${{
github.run_id }}` with if: always()) to after the "Copy install log on failure",
"Upload install log on failure", and "Copy test log on failure" steps so that
the `brev scp` and upload actions can run first and collect artifacts, keeping
the teardown step's if: always() behavior but ensuring it executes last.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7119745b-c573-4f21-9c3e-b69f28d691f5

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01cf and e6994ae.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js

@ksapru ksapru force-pushed the feat/gpu-ci-launchable branch from b7eef04 to 3095400 on April 4, 2026 16:00
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

224-249: ⚠️ Potential issue | 🔴 Critical

Critical: Teardown runs before log collection, preventing artifact retrieval.

The "Tear down GPU instance" step (line 224-227) uses if: always() and runs before the log copy steps (lines 228-234, 243-249). When tests fail:

  1. Teardown executes and deletes the instance
  2. Log copy steps attempt brev scp on a deleted instance
  3. Logs are lost despite || true preventing step failure

The log copy steps must execute before the instance is deleted.

🐛 Fix: Reorder steps so log collection precedes teardown
          brev exec "$INSTANCE_NAME" -- bash -c \
            "cd ~/NemoClaw && \
             export NEMOCLAW_NON_INTERACTIVE=${NEMOCLAW_NON_INTERACTIVE} && \
             export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=${NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE} && \
             export NEMOCLAW_SANDBOX_NAME=${NEMOCLAW_SANDBOX_NAME} && \
             export NEMOCLAW_RECREATE_SANDBOX=${NEMOCLAW_RECREATE_SANDBOX} && \
             export NEMOCLAW_PROVIDER=${NEMOCLAW_PROVIDER} && \
             export OLLAMA_MODEL=qwen3:0.6b && \
             bash test/e2e/test-gpu-e2e.sh"

-     - name: Tear down GPU instance
-       if: always()
-       run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
-
      - name: Copy install log on failure
        if: failure()
        env:
          INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
        run: |
          brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-install.log /tmp/nemoclaw-gpu-e2e-install.log || true

      - name: Upload install log on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: gpu-e2e-install-log
          path: /tmp/nemoclaw-gpu-e2e-install.log
          if-no-files-found: ignore

      - name: Copy test log on failure
        if: failure()
        env:
          INSTANCE_NAME: e2e-gpu-nightly-${{ github.run_id }}
        run: |
          brev scp "$INSTANCE_NAME":/tmp/nemoclaw-gpu-e2e-test.log /tmp/nemoclaw-gpu-e2e-test.log || true

      - name: Upload test log on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: gpu-e2e-test-log
          path: /tmp/nemoclaw-gpu-e2e-test.log
          if-no-files-found: ignore
+
+     - name: Tear down GPU instance
+       if: always()
+       run: brev delete e2e-gpu-nightly-${{ github.run_id }} || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 224 - 249, The teardown step
"Tear down GPU instance" currently runs before the log collection steps and
deletes the instance too early; move the "Tear down GPU instance" step to after
the "Copy install log on failure", "Upload install log on failure", and "Copy
test log on failure" steps so that brev scp can copy logs before deletion,
keeping its if: always() so it still runs on success/failure but ensuring its
position is last; verify the ENV INSTANCE_NAME usage in the copy steps remains
unchanged and preserve the upload step "Upload install log on failure" between
the copy and teardown steps.
🧹 Nitpick comments (1)
.github/workflows/nightly-e2e.yaml (1)

199-210: Polling timeout may be tight for GPU provisioning.

The polling loop waits a maximum of 10 minutes (20 iterations × 30s) for the instance to be ready. GPU instance provisioning plus the startup script execution (which installs Docker, NVIDIA Container Toolkit, and pre-pulls Ollama models) may exceed this window, especially during peak demand or cold starts.

Consider increasing the iteration count or adding visibility into why readiness failed:

♻️ Suggested improvement
          echo "Waiting for readiness sentinel..."
          export READY=0
-         for i in {1..20}; do
+         for i in {1..40}; do
+            echo "Poll attempt $i..."
             if brev exec "$INSTANCE_NAME" -- cat /var/run/nemoclaw-launchable-ready >/dev/null 2>&1; then
               READY=1
               break
             fi
-            sleep 30
+            sleep 15
          done

          if [ $READY -eq 0 ]; then
             echo "Instance did not become ready in time."
+            echo "Attempting to retrieve startup logs for diagnostics..."
+            brev exec "$INSTANCE_NAME" -- tail -100 /var/log/cloud-init-output.log 2>/dev/null || true
             exit 1
          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/nightly-e2e.yaml around lines 199 - 210, The current
polling loop uses for i in {1..20} with 30s sleeps and checks
/var/run/nemoclaw-launchable-ready via brev exec "$INSTANCE_NAME" and may time
out during GPU provisioning; increase the iteration count (e.g., to 40 or more)
or lengthen the sleep to allow more time, and add diagnostic logging when a
check fails (include the iteration number, output/stderr from brev exec
"$INSTANCE_NAME" and the absence of /var/run/nemoclaw-launchable-ready) so
failures are visible before exiting; ensure the READY variable logic remains
unchanged (set READY=1 on success and exit nonzero if still 0 after the loop).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 224-249: The teardown step "Tear down GPU instance" currently runs
before the log collection steps and deletes the instance too early; move the
"Tear down GPU instance" step to after the "Copy install log on failure",
"Upload install log on failure", and "Copy test log on failure" steps so that
brev scp can copy logs before deletion, keeping its if: always() so it still
runs on success/failure but ensuring its position is last; verify the ENV
INSTANCE_NAME usage in the copy steps remains unchanged and preserve the upload
step "Upload install log on failure" between the copy and teardown steps.

---

Nitpick comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 199-210: The current polling loop uses for i in {1..20} with 30s
sleeps and checks /var/run/nemoclaw-launchable-ready via brev exec
"$INSTANCE_NAME" and may time out during GPU provisioning; increase the
iteration count (e.g., to 40 or more) or lengthen the sleep to allow more time,
and add diagnostic logging when a check fails (include the iteration number,
output/stderr from brev exec "$INSTANCE_NAME" and the absence of
/var/run/nemoclaw-launchable-ready) so failures are visible before exiting;
ensure the READY variable logic remains unchanged (set READY=1 on success and
exit nonzero if still 0 after the loop).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2445429-0f1d-49a3-abc3-b9e5e2a0b0dd

📥 Commits

Reviewing files that changed from the base of the PR and between b7eef04 and f170256.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/brev-launchable-ci-gpu.sh
  • test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/brev-e2e.test.js
  • scripts/brev-launchable-ci-gpu.sh

@wscurran wscurran added labels on Apr 4, 2026: Platform: Brev (Support for Brev deployment), CI/CD (Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions), Local Models (Running NemoClaw with local models), and enhancement: testing (Use this label to identify requests to improve NemoClaw test coverage).
@wscurran
Contributor

wscurran commented Apr 4, 2026

✨ Thanks for submitting this pull request, which proposes a way to improve CI reliability by adding a GPU-ready Brev launchable and increasing test timeouts for smoother E2E execution.




Labels

  • CI/CD: Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions.
  • enhancement: testing: Use this label to identify requests to improve NemoClaw test coverage.
  • Local Models: Running NemoClaw with local models.
  • Platform: Brev: Support for Brev deployment.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci: create Flavor 2 Brev launchable — CI-Ready GPU

2 participants