OCPEDGE-2737: Enable aarch64 native KVM support for agent-based installation #1908

Open
fonta-rh wants to merge 8 commits into openshift-metal3:master from fonta-rh:aarch64-native-kvm

Conversation

@fonta-rh

@fonta-rh fonta-rh commented Jun 9, 2026

Contributor

Summary

Enable IPI and agent-based installation (ABI) on native aarch64 KVM hosts (e.g. AWS Graviton bare metal). This PR contains five independent fixes that together unblock the full deployment path on aarch64:

  1. CPU model fix — Narrow the {% if is_aarch64 %} conditional in the VM template so native KVM falls through to host-passthrough instead of the emulation-only cortex-a57
  2. Hardcoded x86_64 references — Replace x86_64 strings with $(uname -m) / ${ARCH} across agent scripts and RHCOS boot image resolution
  3. CDROM bus fix — Set target.bus explicitly on all virt-xml CDROM attachment calls (scsi on aarch64, sata on x86_64) so that virt-xml does not default to the USB bus, which the virt machine type does not provide
  4. Go tarball arch fix — Patch metal3-dev-env Ansible defaults after checkout to use {{ GOARCH }} instead of hardcoded amd64 in the Go download URL, so the correct arm64 tarball is fetched on aarch64 hosts
  5. CDROM eject on aarch64 — After VMs boot from the agent ISO, eject the CDROM media on aarch64 hosts. AAVMF firmware on the virt machine type does not reliably honor boot_order for SCSI CDROM vs virtio disk, causing VMs to reboot into the ISO instead of from disk after the agent writes the OS

Changes

| File | Fix |
| --- | --- |
| 02_configure_host.sh | Narrow CPU model conditional for native aarch64 KVM |
| agent/06_agent_create_cluster.sh | Add CDROM_BUS variable; set target.bus on all 5 virt-xml CDROM lines; add eject_agent_iso() function for aarch64 |
| agent/common.sh | PXE boot filename: x86_64 → $(uname -m) |
| agent/iscsi_utils.sh | iSCSI boot filenames (DHCP bootp + iPXE) |
| agent/07_agent_add_extraworker_nodes.sh | Extra worker node ISO filename |
| agent/iso_no_registry.sh | OVE ISO cleanup exclusion pattern |
| agent/01_agent_requirements.sh | oc-mirror download URL |
| rhcos.sh | RHCOS format key lookup (.architectures.x86_64 → host arch) |
| 01_install_requirements.sh | Sed patch for metal3-dev-env Go tarball arch |

Test plan

  • Full OCP 4.22 fencing-IPI deployment on AWS c7g.metal (Graviton3, native aarch64 KVM) — 2m+1a arbiter, 50m41s, 35/35 COs Available
  • Both nodes + arbiter Ready, etcd stable at revision 8
  • Verified x86_64 deployments unaffected (all substitutions evaluate to x86_64 / sata on x86 hosts; Go sed patch is no-op when arch matches; CDROM eject is gated on aarch64)
  • 02_configure_host.sh: <os>, <features>, VNC conditionals still fire correctly (only CPU block narrowed)
  • Go 1.24.10 arm64 installed via patched metal3-dev-env, checksum verified
  • ABI end-to-end on aarch64 with CDROM eject fix (pending)

Supersedes #1910.

🤖 Generated with Claude Code

@openshift-ci openshift-ci Bot requested review from andfasano and mkowalski June 9, 2026 09:04
@openshift-ci

openshift-ci Bot commented Jun 9, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekhiggins for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci Bot commented Jun 9, 2026


Hi @fonta-rh. Thanks for your PR.

I'm waiting for an openshift-metal3 member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 9, 2026
Replace hardcoded x86_64 architecture strings with $(uname -m) or
${ARCH} so that agent-based installation scripts work on aarch64
hosts (e.g. AWS Graviton bare metal).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
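
A minimal sketch of the substitution pattern this commit describes, assuming the value of `uname -m` is what the artifact names embed. The helper names and the `agent.<arch>.iso` file name are illustrative assumptions, not the exact identifiers used in the scripts:

```shell
# Resolve the architecture once; callers may override ARCH (assumption).
ARCH="${ARCH:-$(uname -m)}"

# Hypothetical helper: build an arch-specific ISO name instead of
# hardcoding x86_64 in the string.
agent_iso_name() {
  local arch="${1:-$ARCH}"
  printf 'agent.%s.iso\n' "$arch"
}

# Hypothetical helper: build an arch-keyed lookup path, in the style of
# the .architectures.x86_64 key mentioned for rhcos.sh.
rhcos_arch_path() {
  local arch="${1:-$ARCH}"
  printf '.architectures.%s\n' "$arch"
}

agent_iso_name
rhcos_arch_path
```

On an x86_64 host both helpers produce the same strings as the old hardcoded values, which is why the change is expected to be neutral there.
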
The metal3-dev-env baremetalvm.xml.j2 template hardcodes cortex-a57
for all aarch64 VMs. This CPU model only works under QEMU emulation;
native aarch64 KVM (e.g. AWS Graviton bare metal) requires
host-passthrough.

Narrow the CPU-section conditional from `{% if is_aarch64 %}` to
`{% if is_aarch64 and libvirt_domain_type == 'qemu' %}` so that
native KVM falls through to host-passthrough. The other three
is_aarch64 blocks (<os>, <features>, VNC) are unaffected — the sed
targets only the CPU block by matching its adjacent HTML comment line.

Tested on AWS c7g.metal (Graviton3) with OCP 4.22.0-rc.5 — full
fencing-IPI deployment with Pacemaker/STONITH operational.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
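
The narrowing could be sketched roughly as follows. This is not the sed expression from the PR: the template content and the marker comment are hypothetical stand-ins, the sketch assumes the marker sits on the line directly before the CPU conditional, and it relies on GNU sed for `-i`:

```shell
# Hypothetical stand-in for baremetalvm.xml.j2; the real template differs.
template="$(mktemp)"
cat > "$template" <<'EOF'
{% if is_aarch64 %}
<os firmware="efi"/>
{% endif %}
<!-- hypothetical CPU marker -->
{% if is_aarch64 %}
<cpu mode="custom"><model>cortex-a57</model></cpu>
{% else %}
<cpu mode="host-passthrough"/>
{% endif %}
EOF

narrow_cpu_conditional() {
  local file="$1"
  # Skip when the conditional is already narrowed, so reruns are no-ops.
  if grep -q "is_aarch64 and libvirt_domain_type == 'qemu'" "$file"; then
    return 0
  fi
  # Only the line right after the marker comment is rewritten; other
  # is_aarch64 blocks are left alone.
  sed -i "/hypothetical CPU marker/{n;s/{% if is_aarch64 %}/{% if is_aarch64 and libvirt_domain_type == 'qemu' %}/;}" "$file"
}

narrow_cpu_conditional "$template"
narrow_cpu_conditional "$template"   # second call changes nothing
cat "$template"
```

Anchoring on the neighbouring comment rather than on the conditional text itself is what keeps the other `is_aarch64` blocks intact.
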
@fonta-rh fonta-rh force-pushed the aarch64-native-kvm branch from 15e7ec5 to 15c417f Compare June 9, 2026 10:27
On aarch64 (virt machine type), virt-xml defaults the CDROM bus to USB
when no bus is specified. The virt machine type has no USB controller,
causing "USB is disabled for this domain" errors when attaching the
agent ISO at step 06.

Set target.bus explicitly on all five virt-xml CDROM attachment calls:
sata on x86_64 (q35 default), scsi on aarch64 (matching the bus
already configured by 02_configure_host.sh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
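
As an illustration of the bus selection, here is a small sketch. `CDROM_BUS` is the variable name from the PR; the helper functions, the domain name, and the ISO path are assumptions, and the command is printed rather than executed so the sketch runs without libvirt:

```shell
# Pick the CDROM bus from the architecture: scsi on aarch64, sata otherwise.
cdrom_bus_for_arch() {
  case "${1:-$(uname -m)}" in
    aarch64) echo scsi ;;
    *)       echo sata ;;
  esac
}

CDROM_BUS="$(cdrom_bus_for_arch)"

# Print a virt-xml attachment command with an explicit target.bus.
# Domain name and ISO path are placeholders.
print_attach_cmd() {
  local domain="$1" iso="$2" bus="${3:-$CDROM_BUS}"
  printf 'virt-xml %s --add-device --disk %s,device=cdrom,target.bus=%s\n' \
    "$domain" "$iso" "$bus"
}

print_attach_cmd example_master_0 /tmp/agent.iso
```

The exact virt-xml arguments in the script may differ; the point is only that the bus is always passed explicitly instead of being left to the tool's default.
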
@fonta-rh fonta-rh changed the title NO-JIRA: Fix aarch64 CPU model for native KVM deployments NO-JIRA: Enable aarch64 native KVM support for agent-based installation Jun 9, 2026
@fonta-rh fonta-rh changed the title NO-JIRA: Enable aarch64 native KVM support for agent-based installation OCPEDGE-2737: Enable aarch64 native KVM support for agent-based installation Jun 9, 2026
@fonta-rh

fonta-rh commented Jun 9, 2026

Contributor Author

/jira refresh

fonta-rh and others added 2 commits June 11, 2026 11:11
The pinned metal3-dev-env hardcodes `linux-amd64` in the go_tarball
template variable. The 01_install_requirements.sh script already
detects the host architecture and passes GOARCH as an Ansible extra
var, but the template ignores it.

Add a sed patch (alongside the existing Ansible version patch) to
make go_tarball use the GOARCH variable, so the correct Go binary
is downloaded on aarch64 hosts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
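
A rough sketch of such a patch, under stated assumptions: the `go_tarball` line below is a hypothetical stand-in for the metal3-dev-env default, the helper names are illustrative, and GNU sed is assumed for `-i`. The grep check makes the patch a no-op once the hardcoded string is gone:

```shell
# Map a `uname -m` value to the Go architecture name.
goarch_for() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical stand-in for the Ansible defaults file.
defaults="$(mktemp)"
echo 'go_tarball: "go{{ go_version }}.linux-amd64.tar.gz"' > "$defaults"

patch_go_tarball() {
  local file="$1"
  # Patch only while the hardcoded arch is still present.
  if grep -q 'linux-amd64' "$file"; then
    sed -i 's/linux-amd64/linux-{{ GOARCH }}/' "$file"
  fi
}

patch_go_tarball "$defaults"
patch_go_tarball "$defaults"   # already patched, nothing to do
cat "$defaults"
echo "GOARCH on an aarch64 host: $(goarch_for aarch64)"
```
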
On aarch64 (virt machine type), AAVMF firmware does not reliably honor
boot_order for SCSI CDROM vs virtio disk. After the agent-based installer
writes the OS to disk and reboots, VMs boot back into the installation
ISO instead of from disk, causing an installing-pending-user-action timeout.

Fix: eject CDROM media after VMs have booted from the ISO. The CoreOS
live agent runs entirely in RAM, so the ISO is not needed after boot.
When the agent triggers a reboot after image write, the empty CDROM is
skipped and the VM boots from disk.

Gated on aarch64 only — x86_64 OVMF handles boot_order correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
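
One way the eject step might look, as a sketch only. It is not the PR's eject_agent_iso() implementation; the function names are hypothetical, and it assumes the CDROM target can be read from the Device and Target columns of `virsh domblklist --details`:

```shell
# Read `virsh domblklist <domain> --details` output on stdin and print
# the target name (third column) of every cdrom device.
cdrom_targets() {
  awk '$2 == "cdrom" { print $3 }'
}

# Eject the media from every CDROM of a domain, leaving the empty drive.
eject_cdroms() {
  local domain="$1" target
  virsh domblklist "$domain" --details | cdrom_targets |
    while read -r target; do
      virsh change-media "$domain" "$target" --eject
    done
}

# Usage (requires libvirt and a running domain), gated on aarch64:
#   if [[ "$(uname -m)" == "aarch64" ]]; then eject_cdroms example_master_0; fi
```

Ejecting the media rather than detaching the device keeps the domain definition unchanged while removing the ISO as a boot candidate.
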
Comment thread 01_install_requirements.sh Outdated
Comment thread 02_configure_host.sh
# under qemu emulation; native KVM requires host-passthrough. Narrow the
# CPU conditional so native aarch64 falls through to host-passthrough,
# without affecting the <os> or <features> sections that must still fire.
TEMPLATE="${VM_SETUP_PATH}/roles/libvirt/templates/baremetalvm.xml.j2"

Member

Same

@dtantsur

Member

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2026
fonta-rh and others added 2 commits June 12, 2026 14:23
90 seconds was not enough for AAVMF firmware to enumerate SCSI
devices and boot the ISO on the virt machine type. VMs were still
in UEFI initialization when the CDROM was ejected, leaving them
with no bootable media.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the fixed 5-minute sleep with polling of the assisted-service
REST API. The previous approach ejected the ISO while the agent was
still in the 'installing' phase, reading the CoreOS image from the
ISO to write to disk. This caused "Unable to read from the discovery
media" errors.

The new wait_for_hosts_installed() function polls
/api/assisted-install/v2/clusters/<id>/hosts every 30 seconds until
all hosts reach 'installed' status (disk write complete, about to
reboot). Only then is the CDROM ejected, ensuring AAVMF firmware
boots from disk instead of the ISO on the next boot cycle.

On timeout or error, the CDROM is still ejected (better than leaving
it attached) and wait_for_cluster_ready handles the failure reporting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
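
The polling approach can be sketched as below. This is not the PR's wait_for_hosts_installed(): the function names, the base URL argument, and the assumption that the hosts endpoint returns a JSON array whose elements carry a `status` field are all illustrative, and the host check needs curl and jq:

```shell
# Run a check command every <interval> seconds until it succeeds or
# <timeout> seconds have passed. Returns 0 on success, 1 on timeout.
poll_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Hypothetical check: succeeds only when at least one host is listed and
# every host reports status "installed".
all_hosts_installed() {
  local base_url="$1" cluster_id="$2"
  curl -fsS "${base_url}/api/assisted-install/v2/clusters/${cluster_id}/hosts" |
    jq -e 'length > 0 and all(.[]; .status == "installed")' >/dev/null
}

# Usage (placeholders for URL, cluster id, and timeout):
#   poll_until 3600 30 all_hosts_installed "http://example:8090" "$cluster_id" \
#     || echo "timed out waiting for hosts; ejecting anyway" >&2
```

Keeping the loop separate from the check makes the timeout path explicit, which matches the commit's choice to eject the CDROM even when the wait fails.
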
Guard the Go tarball and CPU model sed patches with grep checks so they
become no-ops once the upstream fixes land in metal3-dev-env:
- metal3-io/metal3-dev-env#1694
- metal3-io/metal3-dev-env#1695

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fonta-rh fonta-rh force-pushed the aarch64-native-kvm branch from 1342b5a to e1ae058 Compare June 12, 2026 15:35
@openshift-ci

openshift-ci Bot commented Jun 12, 2026


@fonta-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-bm | e1ae058 | link | true | /test e2e-metal-ipi-bm |

Full PR test history. Your PR dashboard.

