
ci: use T4 + xlarge runners — re-enable CUDA/HIP, de-serialize Metal #112

Merged
robtaylor merged 2 commits into main from feat/ci-runner-allocation on Jun 5, 2026

@robtaylor

Contributor

Summary

Puts the two new gpu-eda team-plan runners to work alongside the free self-hosted macos-runner-1:

  • tesla4-runner — 4 vCPU + 1 NVIDIA T4 (GitHub-hosted)
  • macos-runner-xlarge — M2 Pro, 5-core CPU / 8-core GPU, 14 GB RAM and 14 GB storage (GitHub-hosted)

What changes

CUDA + HIP CI back online (on the T4, every push). Both jobs had been disabled with if: ${{ false }} and pinned to the offline nvidia-runner-1 since 2026-05-01, so CUDA has had zero CI coverage since then. They now run on tesla4-runner:

  • CUDA Tests: native on the T4.
  • HIP Tests (NVIDIA backend): the job builds with hipcc + HIP_PLATFORM=nvidia, so the T4 validates the HIP code path too. Native AMD/HIP still needs an AMD/ROCm runner (future).
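A rough sketch of how the two re-enabled jobs might look in the workflow file. Only the runner label, the job names, and HIP_PLATFORM=nvidia come from the description above. The job ids, the checkout step, the nvidia-smi sanity check, and the test scripts are hypothetical placeholders, not the repository's actual steps.

```yaml
jobs:
  cuda-tests:                      # hypothetical job id
    name: CUDA Tests
    # Before this PR (per the description): `if: ${{ false }}` and
    # `runs-on: nvidia-runner-1`, which was offline.
    runs-on: tesla4-runner
    steps:
      - uses: actions/checkout@v4
      - name: Show GPU             # assumes nvidia-smi is on the runner image
        run: nvidia-smi
      - name: Run CUDA tests
        run: ./ci/run-cuda-tests.sh    # hypothetical script

  hip-tests:                       # hypothetical job id
    name: HIP Tests (NVIDIA backend)
    runs-on: tesla4-runner
    env:
      HIP_PLATFORM: nvidia         # hipcc targets the NVIDIA backend
    steps:
      - uses: actions/checkout@v4
      - name: Run HIP tests
        run: ./ci/run-hip-tests.sh     # hypothetical script
```

With no job-level `if:` left, both jobs are eligible to run whenever the workflow itself triggers.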

This directly restores backend coverage and unblocks #104 (CUDA/HIP sim-timing parity).

Metal de-serialized (xlarge, gated to main/label). The three Metal jobs used to serialize on the single self-hosted runner (the reason we stacked #91, #110 and #111). The two light jobs (Metal Tests, JTAG Minimal Cosim) now have a conditional runs-on:

runs-on: ${{ (github.ref == 'refs/heads/main' || contains(github.event.pull_request.labels.*.name, 'ci:metal-xl')) && 'macos-runner-xlarge' || 'macos-runner-1' }}

  • Routine PR pushes: stay on free macos-runner-1 (full coverage, no cost).
  • main / a ci:metal-xl-labelled PR: offload to the billed macos-runner-xlarge, running in parallel with the disk-heavy MCU SoC Metal Simulation job — which stays pinned to macos-runner-1 because xlarge has only 14 GB storage (big designs won't fit).
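The routing described in the two bullets above could be laid out roughly as follows. The job ids are hypothetical and steps are omitted. The runs-on expression is the one quoted above, split across lines for readability, on the assumption that the folded scalar (`>-`) yields the same single-line expression.

```yaml
jobs:
  metal-tests:                     # hypothetical job id; JTAG Minimal Cosim would use the same expression
    name: Metal Tests
    # `cond && A || B` acts as a ternary here because A is a non-empty string.
    # On push events there is no pull_request payload, so the contains() check
    # should evaluate to false and only the ref comparison matters.
    runs-on: >-
      ${{ (github.ref == 'refs/heads/main'
      || contains(github.event.pull_request.labels.*.name, 'ci:metal-xl'))
      && 'macos-runner-xlarge' || 'macos-runner-1' }}

  mcu-soc-metal-sim:               # hypothetical job id
    name: MCU SoC Metal Simulation
    # Disk-heavy job: always pinned to the self-hosted runner.
    runs-on: macos-runner-1
```

Note that the default pull_request activity types are opened, synchronize, and reopened, so a label added after the last push would presumably take effect only on the next push or a manual re-run, unless the trigger also lists labeled.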

Concurrency group added (cancel-in-progress per ref) so rapid pushes don't pile up on the self-hosted / billed runners.
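A minimal sketch of a workflow-level concurrency group with cancel-in-progress per ref. The exact group key used in this PR is not shown in the description, so the one below is an assumption based on the common workflow-plus-ref pattern.

```yaml
# Top level of the workflow file, alongside `on:` and `jobs:`.
concurrency:
  # Assumed key: one group per workflow per ref, so a new push to the same
  # branch or PR cancels the run still in flight for that ref.
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

Runs on different refs fall into different groups and do not cancel each other.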

Cost posture

T4 runs on every push (GPU coverage was the big gap); the billed macOS xlarge is gated to main/label so routine PRs cost nothing extra on macOS. Added the ci:metal-xl label for opting a PR into xlarge.

⚠️ Confirm before merge

  • Runner label strings. I used the names exactly as given: tesla4-runner and macos-runner-xlarge. These must match the runner names registered in org settings. Note GitHub's standard Apple-Silicon larger-runner labels are macos-latest-xlarge / macos-15-xlarge — if macos-runner-xlarge isn't a custom larger-runner you named, the gated jobs won't schedule. Easy to adjust if so.
  • First CUDA/HIP run may be red. These haven't built in CI since May and main has moved ~45 commits; the first run is diagnostic, not a regression from this PR. They aren't required status checks, so they won't block merges.

Stacking

Stacked on #111 (base feat/bidir-tristate-readback). Cascade: #91 → #110 → #111 → #112. Each auto-retargets toward main as the one below merges.

Registered both labels in .github/actionlint.yaml; actionlint clean (remaining findings are pre-existing shellcheck).
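For reference, registering custom runner labels with actionlint typically looks like the sketch below. Any entries the repository's file already contains are not shown in this PR description and are left out here.

```yaml
# .github/actionlint.yaml
self-hosted-runner:
  # Labels listed here are accepted by actionlint's runs-on check instead of
  # being reported as unknown runner labels.
  labels:
    - tesla4-runner
    - macos-runner-xlarge
```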

robtaylor force-pushed the feat/ci-runner-allocation branch from a72c2e1 to d3c0b46 on June 5, 2026 01:32
robtaylor force-pushed the feat/bidir-tristate-readback branch from 3d63663 to df3fcc4 on June 5, 2026 11:01
robtaylor force-pushed the feat/ci-runner-allocation branch from 8a744d5 to 0e4849a on June 5, 2026 11:01
robtaylor force-pushed the feat/bidir-tristate-readback branch from df3fcc4 to 66fd50c on June 5, 2026 13:00
robtaylor force-pushed the feat/ci-runner-allocation branch from 0e4849a to f7642db on June 5, 2026 13:00
Base automatically changed from feat/bidir-tristate-readback to main June 5, 2026 15:46
robtaylor added 2 commits June 5, 2026 16:48
The gpu-eda team plan adds two GitHub-hosted larger runners: tesla4-runner
(4 vCPU + 1 NVIDIA T4) and macos-runner-xlarge (M2 Pro, 14 GB). Put them to
work alongside the free self-hosted macos-runner-1.

- CUDA Tests + HIP Tests (NVIDIA backend): un-gate (`if: false` removed) and
  move from the offline nvidia-runner-1 to tesla4-runner, every push. CUDA
  has had no CI coverage since 2026-05-01; the T4 runs CUDA natively and the
  HIP-on-NVIDIA codepath (HIP_PLATFORM=nvidia). Native AMD/HIP still needs an
  AMD runner.
- Metal Tests + JTAG Minimal Cosim: conditional runs-on — free self-hosted
  macos-runner-1 on routine PR pushes (full coverage, no cost), offloading to
  the billed macos-runner-xlarge on `main` or a `ci:metal-xl`-labelled PR so
  they run in parallel with the disk-heavy MCU SoC Metal job. That job stays
  pinned to macos-runner-1 (xlarge has only 14 GB storage).
- Add a workflow-level concurrency group (cancel-in-progress per ref) so
  rapid pushes don't pile up on the self-hosted / billed runners.
- Register tesla4-runner + macos-runner-xlarge in .github/actionlint.yaml.

Co-developed-by: Claude Code v2.1.162 (claude-opus-4-8)

Drop the `branches: [main, staged-aig-release]` filter on the pull_request
trigger. It filtered by *base* branch, so PRs stacked on other feature
branches got no CI until they cascaded down to a main base. Plain
`pull_request:` runs CI on every PR regardless of base. The push trigger
keeps its branch filter (we only want push-CI on main/staged-aig-release,
not on every feature-branch push — PRs cover those).

Co-developed-by: Claude Code v2.1.162 (claude-opus-4-8)
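The trigger change described in this commit message would leave the `on:` block looking roughly like this. It is a sketch reconstructed from the message, not copied from the workflow file.

```yaml
on:
  push:
    # Push-triggered CI stays limited to the long-lived branches.
    branches: [main, staged-aig-release]
  # No `branches:` filter: runs for every PR regardless of its base branch,
  # so PRs stacked on feature branches get CI too.
  pull_request:
```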
robtaylor force-pushed the feat/ci-runner-allocation branch from c55ee41 to d06a8ec on June 5, 2026 15:48
robtaylor merged commit d0e25a8 into main on Jun 5, 2026
15 checks passed
robtaylor deleted the feat/ci-runner-allocation branch on June 5, 2026 21:04