RunsOn v2 → v3 migration #322

@mmcky

RunsOn v2 → v3 migration

RunsOn v3 is now a stable production release. This issue tracks migrating our self-hosted GitHub Actions runner stack from v2 to v3 across the affected lecture/course repos.

Why now

v3 exists because AWS is deprecating App Runner (announced April 2026), the service the v2 control plane runs on. RunsOn rebuilt the orchestrator on plain ECS/Fargate, so v2 is on borrowed time until App Runner is eventually shut down.

  • No forced shutdown — v2 keeps running for now; no emergency.
  • v2 support ends 30 July 2026 — after that, bug fixes & improvements land only on v3 (v2 keeps running until AWS actually switches off App Runner).
  • Action: migrate deliberately on our schedule before the deadline, not in a scramble when AWS pulls the plug.

What changes

v3 is a breaking release. RunsOn explicitly says do not upgrade the v2 stack in place — you deploy a fresh v3 stack with a new (separate) GitHub App. v2 and v3 can run side by side, which lets us cut over repo-by-repo by moving the App installation. The runs-on=${{ github.run_id }}/... workflow label stays generic; routing is determined by which App is installed on a repo.

The one in-repo change — the disk label:

Every RunsOn workflow we have uses .../disk=large. In v3 the disk= label is parsed but ignored (no longer provisions a volume) and only emits a deprecation warning. Replacement is volume=:

  • disk=large → volume=80gb
  • custom: volume=100gb:gp3:500mbs:4000iops (size:type:throughput:iops)
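
To sanity-check a volume= label before committing it, the size:type:throughput:iops layout quoted above can be parsed locally. This is a minimal sketch, not RunsOn's own parser: the `VolumeSpec` and `parse_volume_label` names are hypothetical, and treating the trailing fields as optional is an assumption based only on the two forms shown in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VolumeSpec:
    """Fields of a volume= label, in size:type:throughput:iops order."""
    size_gb: int
    volume_type: Optional[str] = None
    throughput_mbs: Optional[int] = None
    iops: Optional[int] = None


def _strip_unit(text: str, unit: str) -> int:
    """Turn '<number><unit>' (e.g. '500mbs') into the integer number."""
    number = text[: -len(unit)]
    if not text.endswith(unit) or not number.isdigit():
        raise ValueError(f"expected <number>{unit}, got {text!r}")
    return int(number)


def parse_volume_label(label: str) -> VolumeSpec:
    """Parse e.g. 'volume=80gb' or 'volume=100gb:gp3:500mbs:4000iops'."""
    key, _, value = label.partition("=")
    if key != "volume" or not value:
        raise ValueError(f"not a volume label: {label!r}")
    parts = value.split(":")
    if len(parts) > 4:
        raise ValueError(f"too many fields in {label!r}")
    spec = VolumeSpec(size_gb=_strip_unit(parts[0], "gb"))
    # Assumption: fields after the size may be omitted from the right.
    if len(parts) > 1:
        spec.volume_type = parts[1]
    if len(parts) > 2:
        spec.throughput_mbs = _strip_unit(parts[2], "mbs")
    if len(parts) > 3:
        spec.iops = _strip_unit(parts[3], "iops")
    return spec


print(parse_volume_label("volume=100gb:gp3:500mbs:4000iops"))
```

A malformed label such as `volume=large` raises `ValueError` here, which is the kind of typo worth catching before a GPU job boots on the wrong disk.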

⚠️ Gotcha: if a repo is pointed at a v3 stack without updating the label, the runner silently boots with the image's default root volume instead of "large". GPU build jobs (JAX caches, conda envs, datasets) can then fail with "no space left on device". So the label edit must happen together with the stack cutover for each repo: not before (v2 won't understand volume=) and not after (v3 ignores disk= and falls back to the default root volume).
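
Because the label edit and the cutover have to land together, it may help to script the rewrite and then assert that no disk= label survives. The sketch below works on workflow text only. The function names are hypothetical, and the example label (with its `family=` and `image=` segments) is illustrative rather than copied from any of our workflow files.

```python
import re
from typing import List

# Matches any disk=<value> label segment, e.g. the tail of ".../disk=large".
DISK_LABEL = re.compile(r"\bdisk=[A-Za-z0-9_-]+")


def migrate_disk_label(workflow_text: str, replacement: str = "volume=80gb") -> str:
    """Replace every disk=large segment with the v3 volume= label."""
    return re.sub(r"\bdisk=large\b", replacement, workflow_text)


def leftover_disk_labels(workflow_text: str) -> List[str]:
    """Return any disk= labels still present (v3 would ignore these)."""
    return DISK_LABEL.findall(workflow_text)


# Illustrative label; the segments before disk= are assumptions.
before = (
    "runs-on: runs-on=${{ github.run_id }}"
    "/family=g4dn.2xlarge/image=quantecon_ubuntu2404/disk=large"
)
after = migrate_disk_label(before)
print(after)
print(leftover_disk_labels(after))
```

Running `leftover_disk_labels` over each file under .github/workflows in the cutover PR gives a quick check that none of the four workflows was missed.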

Custom images (e.g. quantecon_ubuntu2404) are unaffected by v3 — image defs in .github/runs-on.yml stay valid — but the AMI id is account+region-specific, so deploy the v3 stack in the same AWS account/region so the AMI still resolves.

Affected repos

| Repo | RunsOn workflows | Image |
|---|---|---|
| lecture-jax | ci, cache, publish, collab | quantecon_ubuntu2404 / ubuntu24-gpu-x64 |
| lecture-python.myst | ci, cache, publish, collab | same |
| lecture-python-programming | ci, cache, publish | same |
| lecture-stats | ci, cache, publish, collab | same |
| iuj_feb_2026, scipy_tutorial_2026 | course repos | same |

Not affected (no RunsOn): lecture-python-intro (uses ubuntu-latest + legacy quantecon-large), lecture-python-advanced.myst (ubuntu-latest).

Migration plan — pilot on lecture-stats

lecture-stats is the ideal test ground: it's slated for deprecation (low risk) yet exercises the full matrix — GPU family (g4dn.2xlarge), the custom AMI image, the RunsOn default GPU image (ubuntu24-gpu-x64 in collab.yml), and the disk label across all four workflow types. If it passes, the bigger repos are essentially copy-paste.

  1. Deploy the v3 stack in AWS (CloudFormation one-click, ~10 min), same account/region as v2.
  2. Register + install the new GitHub App on lecture-stats only.
  3. Combined cutover on lecture-stats: switch the four workflow labels from disk=large to volume=80gb in the same change that moves the repo to the v3 App; confirm the AMI resolves.
  4. Run all four workflows green and confirm disk headroom is adequate.
  5. Roll out to lecture-jax, lecture-python.myst, lecture-python-programming (then course repos).
  6. Decommission the v2 stack once everything is cut over, before 30 July 2026.
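
For step 4, "disk headroom" can be made concrete by measuring free space on the runner at the end of a job. This is a minimal sketch using only the standard library. The 20 GiB threshold is a placeholder, not a measured requirement for our builds, and the function names are hypothetical.

```python
import shutil


def disk_headroom_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_headroom(path: str = "/", min_free_gib: float = 20.0) -> bool:
    """True if at least `min_free_gib` GiB is free (threshold is a placeholder)."""
    return disk_headroom_gib(path) >= min_free_gib


if __name__ == "__main__":
    free = disk_headroom_gib("/")
    print(f"free space on /: {free:.1f} GiB")
    if not has_headroom("/"):
        raise SystemExit("disk headroom below threshold")
```

Run as a final workflow step, a non-zero exit would flag a runner that booted with the default root volume rather than the 80 GB one.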

AWS v3 install (CloudFormation path)

The v3 built-in CF template assumes github.com + RunsOn's embedded networking — correct for us. (GHES / existing-VPC would need the Terraform path; not our case.)

  • Prereqs: AWS CloudFormation perms (same account+region as v2); GitHub org admin on QuantEcon; existing RunsOn license key (plain or SSM ref — reuse ours, no trial needed).
  • Launch CloudFormation stack, key params:
    • GitHub org: QuantEcon
    • License key (or SSM ref)
    • Email for cost & alert reports
    • Environment name: e.g. v3 (distinct from v2, since they run side by side)
    • AppSize: small/medium/high/xhigh preset (replaces v2 CPU/mem/queue knobs) — small/medium is plenty for our repo count
    • Optional hardening: EnableWAF, EnableAdminRoutes
  • RunsOn auto-creates a private GitHub App (org-only; creds stay in our AWS account).
  • When the stack completes, open the RunsOnEntryPoint HTTPS URL from CloudFormation Outputs → click "Register app" → install on lecture-stats.
  • Stack creates: 1 VPC, 2 public + 2 private subnets, EC2 security group, API Gateway + Lambda ingress, 1 ECS/Fargate worker service. No App Runner.
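
The RunsOnEntryPoint URL can also be read programmatically from the stack's outputs. The sketch below assumes the response has the shape of CloudFormation's DescribeStacks result (for example the JSON printed by `aws cloudformation describe-stacks`). The stack name and URL in the example are placeholders, and `stack_output` is a hypothetical helper.

```python
from typing import Any, Dict


def stack_output(response: Dict[str, Any], key: str) -> str:
    """Return the OutputValue for `key` from a DescribeStacks-shaped response."""
    for stack in response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == key:
                return output["OutputValue"]
    raise KeyError(f"output {key!r} not found")


# Placeholder response; a real one comes from CloudFormation once the stack completes.
example_response = {
    "Stacks": [
        {
            "StackName": "runs-on-v3",  # hypothetical stack name
            "Outputs": [
                {
                    "OutputKey": "RunsOnEntryPoint",
                    "OutputValue": "https://example.invalid",  # placeholder URL
                },
            ],
        }
    ]
}

print(stack_output(example_response, "RunsOnEntryPoint"))
```

A missing key raises `KeyError`, which usually means the stack has not finished creating yet.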

Checklist

  • Deploy v3 CloudFormation stack (same AWS account/region as v2)
  • Register v3 GitHub App; install on lecture-stats
  • lecture-stats: disk=large → volume=80gb across ci/cache/publish/collab
  • Confirm quantecon_ubuntu2404 AMI resolves under v3
  • Validate all four lecture-stats workflows (green + disk headroom)
  • Roll out: lecture-jax
  • Roll out: lecture-python.myst
  • Roll out: lecture-python-programming
  • Roll out: course repos (iuj_feb_2026, scipy_tutorial_2026)
  • Decommission v2 stack (before 30 July 2026)

/cc @mmcky
