OCPEDGE-2737: Enable aarch64 native KVM support for agent-based installation #1908
fonta-rh wants to merge 8 commits
Replace hardcoded x86_64 architecture strings with $(uname -m) or
${ARCH} so that agent-based installation scripts work on aarch64
hosts (e.g. AWS Graviton bare metal).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
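A minimal sketch of the pattern this commit describes, assuming a script that previously embedded `x86_64` in a file name. The helper names `host_arch` and `agent_iso_name` are hypothetical and do not come from the PR; only the `${ARCH}` / `$(uname -m)` idea is taken from the commit message.

```shell
# Resolve the architecture once; an explicit ARCH override wins,
# otherwise fall back to what the kernel reports.
host_arch() {
    echo "${ARCH:-$(uname -m)}"
}

# Hypothetical example: build an arch-specific ISO file name instead of
# hardcoding the x86_64 variant.
agent_iso_name() {
    echo "agent.$(host_arch).iso"
}
```

On an x86_64 host with `ARCH` unset, the derived name is the same as the previously hardcoded one, so such a change should be behavior-neutral there.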
The metal3-dev-env baremetalvm.xml.j2 template hardcodes cortex-a57
for all aarch64 VMs. This CPU model only works under QEMU emulation;
native aarch64 KVM (e.g. AWS Graviton bare metal) requires
host-passthrough.
Narrow the CPU-section conditional from `{% if is_aarch64 %}` to
`{% if is_aarch64 and libvirt_domain_type == 'qemu' %}` so that
native KVM falls through to host-passthrough. The other three
is_aarch64 blocks (<os>, <features>, VNC) are unaffected — the sed
targets only the CPU block by matching its adjacent HTML comment line.
Tested on AWS c7g.metal (Graviton3) with OCP 4.22.0-rc.5 — full
fencing-IPI deployment with Pacemaker/STONITH operational.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
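The targeted sed described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual patch: it assumes the CPU block's conditional sits on the line directly after a marker comment, and the default marker text `<!-- CPU -->` is hypothetical (the real comment in `baremetalvm.xml.j2` may differ). The function name `narrow_cpu_conditional` is also made up for this example.

```shell
# Narrow only the conditional that follows the marker comment, leaving
# every other "{% if is_aarch64 %}" block untouched.
narrow_cpu_conditional() {
    local template="$1"
    local marker="${2:-<!-- CPU -->}"   # ASSUMPTION: hypothetical marker text
    local tmp rc
    tmp="$(mktemp)"
    # On a marker match: print it, load the next line (n), then rewrite
    # the condition on that line only.
    sed "/${marker}/{
n
s/{% if is_aarch64 %}/{% if is_aarch64 and libvirt_domain_type == 'qemu' %}/
}" "$template" > "$tmp" && cat "$tmp" > "$template"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Anchoring on the adjacent comment rather than on the conditional text itself is what keeps the `<os>`, `<features>`, and VNC blocks out of scope.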
Force-pushed: 15e7ec5 to 15c417f
On aarch64 (virt machine type), virt-xml defaults the CDROM bus to USB
when no bus is specified. The virt machine type has no USB controller,
causing "USB is disabled for this domain" errors when attaching the
agent ISO at step 06.
Set target.bus explicitly on all five virt-xml CDROM attachment calls:
sata on x86_64 (q35 default), scsi on aarch64 (matching the bus already
configured by 02_configure_host.sh).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
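The bus selection can be sketched roughly as below. The helper names `cdrom_bus_for` and `cdrom_disk_arg` are hypothetical; the PR itself introduces a `CDROM_BUS` variable, whose exact definition is not shown in this conversation.

```shell
# Pick the CDROM bus per architecture: scsi on aarch64 (virt machine
# type), sata elsewhere (q35 default on x86_64).
cdrom_bus_for() {
    case "$1" in
        aarch64) echo "scsi" ;;
        *)       echo "sata" ;;
    esac
}

# Build the --disk argument with an explicit target.bus so virt-xml
# never has to guess a default.
cdrom_disk_arg() {
    local iso="$1" arch="${2:-$(uname -m)}"
    echo "${iso},device=cdrom,target.bus=$(cdrom_bus_for "$arch")"
}

# Illustrative use (needs libvirt tooling and an existing domain):
#   virt-xml "$vm" --add-device --disk "$(cdrom_disk_arg /path/to/agent.iso)"
```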
/jira refresh
The pinned metal3-dev-env hardcodes `linux-amd64` in the go_tarball
template variable. The 01_install_requirements.sh script already detects
the host architecture and passes GOARCH as an Ansible extra var, but the
template ignores it.
Add a sed patch (alongside the existing Ansible version patch) to make
go_tarball use the GOARCH variable, so the correct Go binary is
downloaded on aarch64 hosts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
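A sketch of what such a patch could look like. The shape of the `go_tarball` line is an assumption (something like `go_tarball: "go{{ go_version }}.linux-amd64.tar.gz"`); the actual metal3-dev-env defaults file may be laid out differently. `goarch_for` and `patch_go_tarball` are hypothetical names.

```shell
# Map kernel architecture names to Go's GOARCH values.
goarch_for() {
    case "$1" in
        x86_64)        echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Rewrite the hardcoded suffix on the go_tarball line only, so Ansible
# substitutes the GOARCH extra var at render time.
patch_go_tarball() {
    local defaults_file="$1" tmp rc
    tmp="$(mktemp)"
    sed '/go_tarball/s/linux-amd64/linux-{{ GOARCH }}/' "$defaults_file" > "$tmp" \
        && cat "$tmp" > "$defaults_file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Restricting the substitution to lines containing `go_tarball` avoids touching any unrelated `linux-amd64` string in the same file.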
On aarch64 (virt machine type), AAVMF firmware does not reliably honor
boot_order for SCSI CDROM vs virtio disk. After the agent-based
installer writes the OS to disk and reboots, VMs boot back into the
installation ISO instead of from disk, causing
installing-pending-user-action timeout.
Fix: eject CDROM media after VMs have booted from the ISO. The CoreOS
live agent runs entirely in RAM, so the ISO is not needed after boot.
When the agent triggers a reboot after image write, the empty CDROM is
skipped and the VM boots from disk.
Gated on aarch64 only; x86_64 OVMF handles boot_order correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
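The eject step might look roughly like the sketch below. It is not the PR's `eject_agent_iso()`; the name `eject_cdrom_media` and the optional architecture argument are hypothetical, added so the gating can be exercised on any host. It relies on `virsh domblklist --details` printing Type, Device, Target, and Source columns.

```shell
# Eject media from every CDROM device of a libvirt domain, but only
# when the (possibly overridden) architecture is aarch64.
eject_cdrom_media() {
    local domain="$1" arch="${2:-$(uname -m)}" target
    [ "$arch" = "aarch64" ] || return 0   # x86_64 OVMF honors boot_order
    # Column 2 is the device type, column 3 the target (e.g. sda).
    for target in $(virsh domblklist "$domain" --details \
                        | awk '$2 == "cdrom" { print $3 }'); do
        virsh change-media "$domain" "$target" --eject
    done
}
```

The CDROM device itself stays attached; only its media is removed, which is enough for the firmware to skip it on the next boot.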
# under qemu emulation; native KVM requires host-passthrough. Narrow the
# CPU conditional so native aarch64 falls through to host-passthrough,
# without affecting the <os> or <features> sections that must still fire.
TEMPLATE="${VM_SETUP_PATH}/roles/libvirt/templates/baremetalvm.xml.j2"
/ok-to-test
90 seconds was not enough for AAVMF firmware to enumerate SCSI devices
and boot the ISO on the virt machine type. VMs were still in UEFI
initialization when the CDROM was ejected, leaving them with no
bootable media.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the fixed 5-minute sleep with polling of the assisted-service
REST API. The previous approach ejected the ISO while the agent was
still in the 'installing' phase, reading the CoreOS image from the ISO
to write to disk. This caused "Unable to read from the discovery media"
errors.
The new wait_for_hosts_installed() function polls
/api/assisted-install/v2/clusters/<id>/hosts every 30 seconds until all
hosts reach 'installed' status (disk write complete, about to reboot).
Only then is the CDROM ejected, ensuring AAVMF firmware boots from disk
instead of the ISO on the next boot cycle.
On timeout or error, the CDROM is still ejected (better than leaving it
attached) and wait_for_cluster_ready handles the failure reporting.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
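The polling approach can be sketched as below. This is not the PR's `wait_for_hosts_installed()`; the helpers `hosts_url`, `all_hosts_installed`, `check_hosts`, and `wait_until` are hypothetical, the base URL is whatever the caller supplies, and the sketch assumes the endpoint returns a JSON array of host objects with a `status` field and that `curl` and `jq` are available.

```shell
# Build the hosts endpoint from a base URL and cluster id.
hosts_url() {
    echo "${1}/api/assisted-install/v2/clusters/${2}/hosts"
}

# Read a JSON array of hosts on stdin; succeed only if it is non-empty
# and every host reports status "installed".
all_hosts_installed() {
    jq -e 'length > 0 and all(.[]; .status == "installed")' > /dev/null
}

# One poll: fetch the host list and evaluate it.
check_hosts() {
    curl -fsS "$(hosts_url "$1" "$2")" | all_hosts_installed
}

# Generic poller: wait_until <timeout_s> <interval_s> <command...>
# Returns 0 as soon as the command succeeds, 1 once the deadline passes.
wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}

# Illustrative use: poll every 30 seconds for up to an hour, then eject
# regardless of the outcome, as the commit message describes.
#   wait_until 3600 30 check_hosts "$base_url" "$cluster_id" || true
```

Keeping the poller separate from the check makes the timeout behavior easy to exercise without a running assisted-service.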
Guard the Go tarball and CPU model sed patches with grep checks so they
become no-ops once the upstream fixes land in metal3-dev-env:
- metal3-io/metal3-dev-env#1694
- metal3-io/metal3-dev-env#1695
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
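The guard idiom can be sketched generically as follows. `patch_once` is a hypothetical helper, not code from the PR: it applies a sed expression only while a fixed string that marks the unpatched state is still present in the file.

```shell
# patch_once <file> <needle> <sed-expression>
# Applies the sed expression only if <needle> (a fixed string that
# exists only in the unpatched file) is found; otherwise does nothing.
patch_once() {
    local file="$1" needle="$2" expr="$3" tmp rc
    grep -qF -- "$needle" "$file" || return 0   # already fixed: no-op
    tmp="$(mktemp)"
    sed "$expr" "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Because the needle disappears after the first successful run, or is never present once the upstream fix is in the pinned checkout, repeated invocations leave the file unchanged.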
Force-pushed: 1342b5a to e1ae058
Summary
Enable IPI and agent-based installation (ABI) on native aarch64 KVM hosts (e.g. AWS Graviton bare metal). Five independent fixes that together unblock the full deployment path on aarch64:
- Narrow the `{% if is_aarch64 %}` conditional in the VM template so native KVM falls through to `host-passthrough` instead of the emulation-only `cortex-a57`
- Replace hardcoded `x86_64` strings with `$(uname -m)` / `${ARCH}` across agent scripts and RHCOS boot image resolution
- Set `target.bus` explicitly on all `virt-xml` CDROM attachment calls (`scsi` on aarch64, `sata` on x86_64) to prevent the `virt` machine type from defaulting to USB, for which it has no controller
- Patch `metal3-dev-env` Ansible defaults after checkout to use `{{ GOARCH }}` instead of hardcoded `amd64` in the Go download URL, so the correct arm64 tarball is fetched on aarch64 hosts
- Eject the agent ISO once hosts are installed on aarch64: AAVMF firmware on the `virt` machine type does not reliably honor `boot_order` for SCSI CDROM vs virtio disk, causing VMs to reboot into the ISO instead of from disk after the agent writes the OS

Changes
- `02_configure_host.sh`
- `agent/06_agent_create_cluster.sh`: `CDROM_BUS` variable; set `target.bus` on all 5 `virt-xml` CDROM lines; add `eject_agent_iso()` function for aarch64
- `agent/common.sh`: `x86_64` → `$(uname -m)`
- `agent/iscsi_utils.sh`
- `agent/07_agent_add_extraworker_nodes.sh`
- `agent/iso_no_registry.sh`
- `agent/01_agent_requirements.sh`
- `rhcos.sh`: (`.architectures.x86_64` → host arch)
- `01_install_requirements.sh`

Test plan
- x86_64 hosts: no behavior change expected (values resolve to `x86_64`/`sata` on x86 hosts; Go sed patch is no-op when arch matches; CDROM eject is gated on aarch64)
- `02_configure_host.sh`: `<os>`, `<features>`, VNC conditionals still fire correctly (only CPU block narrowed)

Supersedes #1910.
🤖 Generated with Claude Code